WO2021017383A1 - Method and system for parsing elements of legal document - Google Patents

Method and system for parsing elements of legal document

Info

Publication number
WO2021017383A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
model
training
text
bert model
Prior art date
Application number
PCT/CN2019/126935
Other languages
French (fr)
Chinese (zh)
Inventor
戴威
Original Assignee
北京国双科技有限公司
Application filed by 北京国双科技有限公司
Publication of WO2021017383A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/18: Legal services; Handling legal documents

Definitions

  • the invention relates to the technical field of legal document processing, in particular to a method and system for analyzing elements of a legal document.
  • Law is one of the products of the development of civilized society.
  • Law generally refers to a special code of conduct that is recognized by society, established by the state through legislation, guaranteed by the coercive power of the state, and whose content stipulates the rights and obligations of the parties; it is universally binding on all members of society.
  • When disputes arise between members of society, the judicial organs file and adjudicate cases in accordance with the law.
  • the embodiments of the present invention provide a method and system for analyzing elements of a legal document to solve the problems of high labor cost, high time cost, low accuracy, and low efficiency in existing manual element extraction.
  • the first aspect of the embodiments of the present invention discloses a method for analyzing elements of a legal document, and the method includes:
  • inputting the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data;
  • the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • the process of training the language model based on sample data to obtain an element analysis model includes:
  • the second training data is used as the input of the second BERT model, and the second BERT model is trained in combination with the preset second loss function until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • the step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes:
  • the actual text and actual sentence are derived from the sample data.
  • performing text replacement and sentence splicing processing on the sample data to obtain the first training data includes:
  • the method further includes:
  • the second aspect of the embodiments of the present invention discloses a legal document element analysis system, the system includes:
  • the obtaining unit is used to obtain the legal document to be analyzed
  • the processing unit is used to perform sentence processing on the legal document to obtain multiple sentences to be parsed;
  • the prediction unit is used to input the sentences to be parsed, one by one, into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • the prediction unit includes:
  • a processing module configured to perform text replacement and sentence splicing processing on the sample data to obtain first training data, wherein the sample data is obtained based on sentence processing on a public legal document;
  • the first training module is configured to use the first training data as the input of the first BERT model and to train the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges;
  • a setting module configured to use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model
  • the second training module is configured to use the second training data as the input of the second BERT model and to train the second BERT model, in combination with a preset second loss function, until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • the first training module includes:
  • a prediction submodule configured to use the first training data as the input of the first BERT model to obtain a text prediction result corresponding to a text replacement position, and to obtain a sentence prediction result corresponding to a sentence splicing position;
  • the error sub-module is used to calculate, using the first sub-loss function, the text error between the actual text at the text replacement position and the text prediction result, and to calculate, using the second sub-loss function, the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result;
  • a training sub-module configured to train the first BERT model in combination with the first training data based on the text error and sentence error until the first BERT model converges
  • the actual text and actual sentence are derived from the sample data.
  • a third aspect of the embodiments of the present invention discloses a storage medium, the storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
  • the fourth aspect of the embodiments of the present invention discloses a legal document element analysis device, including a storage medium and a processor, where the storage medium stores a program and the processor is configured to run the program, and where, when the program runs, it executes the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
  • a method and system for analyzing elements of a legal document are provided.
  • The method obtains a legal document to be analyzed, performs sentence segmentation on the legal document to obtain multiple sentences to be parsed, and inputs the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document.
  • The element analysis model is obtained by training a language model on sample data.
  • In this solution, the element analysis model is obtained by pre-training the language model on a large number of legal documents; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain the elements it contains.
  • Analysis and judgment are then performed on the extracted case elements, so there is no need to manually extract the elements of the case one by one, which saves labor and time costs and improves the accuracy and efficiency of the judgment.
  • Figure 1 is a schematic structural diagram of a Transformer provided by an embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for analyzing elements of a legal document according to an embodiment of the present invention
  • FIG. 3 is a flowchart of obtaining an element analysis model provided by an embodiment of the present invention.
  • FIG. 4 is a flowchart of training the first BERT model provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • The terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Unless otherwise restricted, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.
  • the embodiments of the present invention provide a legal document element analysis method and system.
  • the element analysis model is obtained by pre-training the language model of a large number of legal documents, and the legal documents to be analyzed are subjected to sentence processing to obtain multiple sentences to be analyzed.
  • a sentence to be parsed is used as the input of the element analysis model to obtain the elements of each sentence to be parsed, so as to save labor cost and time cost, and improve the accuracy and efficiency of judgment.
  • The BERT (Bidirectional Encoder Representations from Transformers) model involved in the embodiments of the present invention is a language model proposed by Google and has a strong ability to represent text in the field of natural language processing.
  • the BERT model has a 12-layer Transformer structure.
  • The specific structure of the BERT model is as follows: the input text of the embedding layer is segmented by word, the words are mapped to 768-dimensional vectors using the word-vector mapping weights provided by Google, and the encoding vector Enc is obtained through the 12-layer Transformer structure.
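For illustration only, the following minimal sketch shows how the encoder described above could be instantiated with the Hugging Face transformers library; the checkpoint name `bert-base-chinese`, the example sentence, and the library choice are assumptions, not details stated in the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # 768-dim embeddings, published by Google
encoder = BertModel.from_pretrained("bert-base-chinese")        # 12-layer Transformer encoder

sentence = "原告与被告于2010年登记结婚。"  # a hypothetical sentence to be parsed
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

enc = outputs.last_hidden_state  # the encoding vector Enc, shape (1, sequence_length, 768)
```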
  • FIG. 1 shows a schematic structural diagram of the Transformer.
  • The Transformer includes multi-head attention (Multihead Attention), a residual unit, layer normalization (LayerNorm), and a two-layer fully connected feed-forward network (FFN).
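The sketch below, assuming PyTorch, shows one encoder block with the components listed in FIG. 1 (multi-head attention, residual connections, LayerNorm, and a two-layer feed-forward network); the 3072-dimensional inner FFN size and the GELU activation are assumptions carried over from the standard BERT-base configuration rather than details given in the patent.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder layer: multi-head attention, residual units, LayerNorm, two-layer FFN."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(            # two-layer fully connected feed-forward network
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        attn_out, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.drop(attn_out))       # residual unit + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))    # residual unit + LayerNorm after the FFN
        return x
```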
  • The element analysis model involved in the embodiments of the present invention is constructed separately for different legal fields; that is, for a given legal field, the BERT model is trained using the sample data corresponding to that legal field to obtain the element analysis model corresponding to that field.
  • For example, legal documents in the field of marriage and family affairs published on the legal documents website are used as sample data to train the BERT model, and the element analysis model corresponding to the field of marriage and family affairs is obtained.
  • FIG. 2 shows a flowchart of a method for analyzing elements of a legal document provided by an embodiment of the present invention.
  • the method for analyzing elements of a legal document includes the following steps:
  • Step S201 Obtain the legal document to be analyzed.
  • Step S202 Perform sentence processing on the legal document to obtain multiple sentences to be parsed.
  • Specifically, the Language Technology Platform (LTP) is used to perform sentence segmentation on the legal document to obtain a sentence set containing multiple sentences to be parsed.
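A minimal sketch of step S202 follows; the patent names LTP but not a specific interface, so the pyltp `SentenceSplitter` call and the example document text are assumptions.

```python
from pyltp import SentenceSplitter  # pyltp binding of the Language Technology Platform (assumed interface)

def split_into_sentences(document_text):
    """Split a legal document into the list of sentences to be parsed (step S202)."""
    return list(SentenceSplitter.split(document_text))

document = "原告与被告于2010年登记结婚。婚后双方经常发生争执。现原告诉至法院请求离婚。"  # hypothetical text
sentences = split_into_sentences(document)  # each element is one sentence to be parsed
```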
  • Step S203 Input the sentences to be analyzed into the pre-established element analysis model one by one to perform element analysis to obtain the elements contained in each sentence to be analyzed in the legal document.
  • The legal documents required for training the element analysis model are selected from the data published on the legal documents website, the legal documents are segmented into sentences using LTP to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model.
  • For example, assuming that the legal field corresponding to the element analysis model is marriage and family affairs, legal documents in the field of marriage and family affairs are screened from the legal documents website, sentence segmentation is performed on these documents using LTP to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model corresponding to the field of marriage and family affairs.
  • the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • the types of language models include but are not limited to: ELMo model, GPT model and BERT model.
  • By inputting each sentence to be parsed into the pre-established element analysis model for element analysis, zero or more elements contained in each sentence to be parsed can be obtained.
  • After step S203, the elements contained in each sentence to be parsed are combined, as sketched below.
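The following sketch illustrates the overall flow of steps S201-S203 plus the merging of elements, assuming the trained element analysis model exposes a `predict(sentence)` method returning the (possibly empty) set of element labels for one sentence; the method name and the set-based merge are illustrative assumptions.

```python
def parse_document_elements(document_text, element_model, split_into_sentences):
    """Segment the document, analyze each sentence, and merge the elements found."""
    sentences = split_into_sentences(document_text)        # step S202: sentences to be parsed
    per_sentence = {}
    for sentence in sentences:                             # step S203: one sentence at a time
        per_sentence[sentence] = set(element_model.predict(sentence))  # zero or more elements
    merged_elements = set()
    for elements in per_sentence.values():                 # merge the elements of all sentences
        merged_elements.update(elements)
    return per_sentence, merged_elements
```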
  • In summary, the language model is pre-trained on a large number of legal documents to obtain the element analysis model; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain its elements. Legal judgments are then made on the extracted case elements, so there is no need to manually extract the elements of the case one by one and then make judgments on the manually extracted elements, which saves labor and time costs and improves the accuracy and efficiency of the judgment.
  • Referring to FIG. 3, the process of training a language model on sample data to obtain an element analysis model, involved in step S203 disclosed in FIG. 2 of the above embodiment of the present invention, includes the following steps:
  • Step S301 Perform text replacement and sentence splicing processing on the sample data to obtain first training data.
  • In step S301, the sample data is obtained by performing sentence segmentation on public legal documents.
  • For the sentence segmentation process, refer to the corresponding content of step S203 disclosed in FIG. 2 of the above embodiment of the present invention, which is not repeated here.
  • It should be noted that the random text replacement and sentence splicing mentioned above are only examples; a technician may also specifically select which words are to be replaced with characters and which sentences are to be spliced. Similarly, it is also possible to replace characters every preset number of characters and to splice a sentence every preset number of sentences, which is not specifically limited in the embodiment of the present invention. A sketch of one possible construction follows.
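This sketch assumes "[MASK]" as the preset replacement character and a 50/50 chance that the spliced second sentence is the true next sentence; the 15% replacement rate and these ratios are assumptions taken from the standard BERT pre-training recipe, not figures stated in the patent.

```python
import random

def build_first_training_example(sentences, idx, mask_token="[MASK]", mask_prob=0.15):
    """Build one example of the first training data from a list of sample-data sentences."""
    first = list(sentences[idx])
    masked_positions = []
    for i, ch in enumerate(first):
        if random.random() < mask_prob:          # text replacement with the preset character
            masked_positions.append((i, ch))     # keep the actual text for the first sub-loss function
            first[i] = mask_token
    if random.random() < 0.5 and idx + 1 < len(sentences):
        second, is_next = sentences[idx + 1], 1  # splice the actual next sentence
    else:
        second, is_next = random.choice(sentences), 0  # splice a randomly chosen sentence
    return "".join(first), second, masked_positions, is_next
```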
  • Step S302 Use the first training data as the input of the first BERT model, and combine the preset first loss function and the sample data to train the first BERT model until the first BERT model converges.
  • In step S302, the first training data is used as the input of the first BERT model to predict the text at the text replacement positions and the sentence at the sentence splicing positions, and the prediction results are combined with the actual results to train the first BERT model's ability to judge words and sentences. For example, for a complete sentence, a word in the sentence is randomly replaced with a preset character, and the first BERT model is trained to determine the actual text at the position of the preset character. For a passage composed of multiple sentences, sentence splicing is performed on one of the sentences, and the first BERT model is trained to determine the actual sentence corresponding to the spliced position.
  • Step S303 Use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model.
  • In step S303, the parameters of the embedding layer and the 12-layer Transformer structure in the converged first BERT model are used as the initialization parameters of the embedding layer and the 12-layer Transformer structure in the second BERT model.
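A minimal sketch of this parameter transfer, assuming both models are PyTorch modules sharing the same BERT backbone: the embedding and Transformer weights of the converged first model initialize the second model, while the second model's new classification head keeps its own randomly initialized weights.

```python
def init_second_from_first(first_bert, second_bert):
    """Copy matching backbone parameters from the converged first BERT model into the second."""
    pretrained = first_bert.state_dict()
    target = second_bert.state_dict()
    transferred = {k: v for k, v in pretrained.items()
                   if k in target and v.shape == target[k].shape}
    target.update(transferred)           # embedding layer + 12-layer Transformer parameters
    second_bert.load_state_dict(target)  # the new head's parameters are left untouched
    return second_bert
```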
  • Step S304 Use the second training data as the input of the second BERT model, and train the second BERT model in combination with the preset second loss function until the second BERT model converges to obtain the element analysis model.
  • Specifically, the first 768-dimensional vector of the encoding vector Enc in the second BERT model is selected, and this 768-dimensional vector is connected, through a 768-dimensional fully connected layer, to the number of categories required for element analysis.
  • A weighted sigmoid cross-entropy loss function is used as the second loss function to train the second BERT model.
  • It should be noted that the dimensions of the vectors and fully connected layers mentioned above include but are not limited to 768 dimensions. A sketch of this element head follows; the specific training process is then described in steps A1-A3.
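A minimal sketch of the element head and second loss function described above, assuming PyTorch: the first 768-dimensional vector of Enc feeds a fully connected layer that maps to the number of element categories, trained with a sigmoid cross-entropy loss; the per-class weighting scheme (`pos_weight`) is an assumption.

```python
import torch
import torch.nn as nn

class ElementHead(nn.Module):
    """Fully connected layer mapping the first vector of Enc to the element categories."""
    def __init__(self, num_elements, hidden_size=768, pos_weight=None):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_elements)
        self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # weighted sigmoid cross entropy

    def forward(self, enc, labels=None):
        cls_vec = enc[:, 0, :]                 # first 768-dimensional vector of the encoding vector Enc
        logits = self.classifier(cls_vec)      # one logit per element category
        if labels is None:
            return torch.sigmoid(logits)       # zero or more elements per sentence at inference time
        return self.loss_fn(logits, labels.float())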
  • the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling elements. For example, 800-1000 legal documents are selected from the sample data for sentence processing, and then the elements are labeled to obtain the second training data.
  • A1 For each training sentence in the second training data, input the training sentence into the second BERT model for prediction to obtain the predicted elements contained in the training sentence.
  • A2 Use the second loss function to calculate the error between the predicted elements and the actual elements contained in each training sentence.
  • A3 If the error is less than the threshold, construct the element analysis model based on the current model parameters of the second BERT model. If the error is greater than the threshold, adjust the model parameters of the second BERT model based on the error and continue training the second BERT model on the second training data until the error is less than the threshold, then determine the trained second BERT model as the element analysis model.
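A minimal sketch of the A1-A3 loop, assuming the second BERT model returns a scalar loss when given a batch of labelled training sentences; the optimizer, learning rate, and error threshold shown here are illustrative assumptions.

```python
import torch

def train_second_bert(second_bert, second_training_data, threshold=0.05, lr=2e-5, max_epochs=10):
    """Fine-tune the second BERT model on the element-labelled second training data (A1-A3)."""
    optimizer = torch.optim.AdamW(second_bert.parameters(), lr=lr)
    for _ in range(max_epochs):
        total_error = 0.0
        for batch in second_training_data:        # A1: predict the elements of each training sentence
            loss = second_bert(**batch)           # A2: weighted sigmoid cross-entropy error
            optimizer.zero_grad()
            loss.backward()                       # A3: adjust the model parameters based on the error
            optimizer.step()
            total_error += loss.item()
        if total_error / len(second_training_data) < threshold:
            break                                 # A3: error below the threshold, the model has converged
    return second_bert                            # the converged model is the element analysis model
```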
  • Training a neural network model requires one or a set of initial parameters.
  • The initial parameters of a traditional neural network model are usually random parameters drawn from a normal distribution with a mean of 0 and a small variance.
  • This traditional way of determining initial parameters gives a poor predictive effect on text elements.
  • In the embodiments of the present invention, the parameters of the trained first BERT model are used to initialize the parameters of the second BERT model.
  • These parameters provide sufficient prior information about the legal field for the second BERT model and effectively improve the element prediction accuracy of the element analysis model.
  • The first BERT model is trained on the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is trained on the second training data until it converges, thereby obtaining the element analysis model.
  • The element analysis model is used to analyze the elements of the legal document after sentence segmentation and to obtain the elements contained in each sentence of the legal document, and legal judgments are made on the extracted case elements. There is no need to manually extract the elements of the case one by one and then make legal judgments on the manually extracted elements, which effectively reduces labor and time costs and improves the accuracy and efficiency of judgments.
  • The process of training the first BERT model involved in step S302 disclosed in FIG. 3 of the above embodiment of the present invention is described below.
  • FIG. 4 shows a flowchart of training the first BERT model provided by an embodiment of the present invention, including the following steps:
  • Step S401 Use the first training data as the input of the first BERT model to obtain a text prediction result corresponding to the text replacement position, and obtain a sentence prediction result corresponding to the sentence splicing position.
  • Step S402 Use the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and use the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result.
  • Specifically, the first 768-dimensional vector is selected from the encoding vector Enc, and this 768-dimensional vector is connected, through a 768-dimensional fully connected layer, to the first sub-loss function and the second sub-loss function. It should be noted that the dimensions of the vectors and fully connected layers mentioned above include but are not limited to 768 dimensions.
  • the first sub-loss function includes but is not limited to: a multi-class softmax cross-entropy loss function
  • the second sub-loss function includes but is not limited to: a two-class softmax cross-entropy loss function.
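A minimal sketch of the two sub-loss functions, assuming PyTorch: a multi-class softmax cross-entropy over the vocabulary for the text at the replaced positions, and a two-class softmax cross-entropy for the spliced-sentence judgment; the tensor shapes and the -100 ignore label are illustrative assumptions.

```python
import torch
import torch.nn as nn

mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)  # first sub-loss: multi-class softmax cross entropy
nsp_loss = nn.CrossEntropyLoss()                   # second sub-loss: two-class softmax cross entropy

def first_loss(text_logits, text_labels, sentence_logits, sentence_labels):
    """Compute the text error and the sentence error used to train the first BERT model."""
    # text_logits: (batch, seq_len, vocab_size); text_labels: (batch, seq_len), -100 at unreplaced positions
    text_error = mlm_loss(text_logits.view(-1, text_logits.size(-1)), text_labels.view(-1))
    # sentence_logits: (batch, 2); sentence_labels: (batch,), 1 if the spliced sentence is the real next one
    sentence_error = nsp_loss(sentence_logits, sentence_labels)
    return text_error, sentence_error
```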
  • Step S403 Based on the text error and sentence error, train the first BERT model in combination with the first training data until the first BERT model converges.
  • The actual text and the actual sentence are derived from the sample data; that is, the actual text at the text replacement position and the actual sentence at the sentence splicing position can be obtained from the sample data. If the text error and the sentence error are both smaller than the threshold, the first BERT model has converged. If the text error and the sentence error are both greater than the threshold, the model parameters of the first BERT model are adjusted based on the text error and the sentence error, and the first training data is used to continue training the first BERT model until both the text error and the sentence error are less than the threshold.
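For completeness, a sketch of the convergence rule of step S403, assuming the first BERT model returns the text error and the sentence error for a batch of first training data; the optimizer choice, learning rate, thresholds, and epoch cap are illustrative assumptions.

```python
import torch

def train_first_bert(first_bert, first_training_data, text_threshold=0.1,
                     sentence_threshold=0.1, lr=1e-4, max_epochs=40):
    """Pre-train the first BERT model until both the text error and the sentence error fall below their thresholds."""
    optimizer = torch.optim.AdamW(first_bert.parameters(), lr=lr)
    for _ in range(max_epochs):
        text_error_sum = sentence_error_sum = 0.0
        for batch in first_training_data:
            text_error, sentence_error = first_bert(**batch)
            loss = text_error + sentence_error
            optimizer.zero_grad()
            loss.backward()                 # adjust the model parameters based on both errors
            optimizer.step()
            text_error_sum += text_error.item()
            sentence_error_sum += sentence_error.item()
        n = len(first_training_data)
        if text_error_sum / n < text_threshold and sentence_error_sum / n < sentence_threshold:
            break                           # both errors below their thresholds: the model has converged
    return first_bert
```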
  • Before the element analysis model is obtained, the first BERT model is trained on the first training data, based on the first sub-loss function and the second sub-loss function, until convergence; the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is then trained on the second training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
  • an embodiment of the present invention also provides a system for analyzing elements of a legal document.
  • The system for analyzing elements of a legal document includes an acquiring unit 501, a processing unit 502, and a prediction unit 503.
  • the obtaining unit 501 is configured to obtain a legal document to be analyzed.
  • the processing unit 502 is configured to perform sentence processing on the legal document to obtain multiple sentences to be parsed.
  • For the sentence segmentation process of the legal document, refer to the content corresponding to step S202 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
  • The prediction unit 503 is configured to input the sentences to be parsed into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • For the process of obtaining the sample data, refer to the content corresponding to step S203 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
  • The language model is pre-trained on a large number of legal documents to obtain the element analysis model; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain its elements. Legal judgments are then made on the extracted case elements, so there is no need to manually extract the elements of the case one by one and then make judgments on the manually extracted elements, which saves labor and time costs and improves the accuracy and efficiency of the judgment.
  • The prediction unit 503 includes a processing module 5031, a first training module 5032, a setting module 5033, and a second training module 5034.
  • the processing module 5031 is configured to perform text replacement and sentence splicing processing on the sample data to obtain first training data, where the sample data is obtained based on sentence processing on a public legal document.
  • The processing module 5031 is specifically configured to randomly replace text in the sample data with preset characters and to randomly splice a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
  • The first training module 5032 is configured to use the first training data as the input of the first BERT model and to train the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges.
  • the setting module 5033 is configured to use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model.
  • The second training module 5034 is configured to use the second training data as the input of the second BERT model and to train the second BERT model, in combination with a preset second loss function, until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • For the specific process of training the second BERT model, refer to the content corresponding to step S304 disclosed in FIG. 3 of the above embodiment of the present invention.
  • The first BERT model is trained on the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is trained on the second training data until it converges, thereby obtaining the element analysis model.
  • The element analysis model is used to analyze the elements of the legal document after sentence segmentation and to obtain the elements contained in each sentence of the legal document, and operations such as analysis and legal judgment are performed on the extracted case elements. There is no need to manually extract the elements one by one, which effectively reduces labor and time costs and improves the accuracy and efficiency of judgments.
  • the first training module 5032 includes: a prediction submodule 50321, an error submodule 50322, and a training submodule 50323.
  • the prediction sub-module 50321 is configured to use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and obtain the sentence prediction result corresponding to the sentence splicing position.
  • The error sub-module 50322 is configured to use the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and to use the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result.
  • the training sub-module 50323 is configured to train the first BERT model based on the text error and sentence error in combination with the first training data until the first BERT model converges.
  • For the process of training the first BERT model, refer to the content corresponding to step S403 disclosed in FIG. 4 of the foregoing embodiment of the present invention.
  • the actual text and actual sentence are derived from the sample data.
  • Before the element analysis model is obtained, the first BERT model is trained on the first training data, based on the first sub-loss function and the second sub-loss function, until convergence; the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is then trained on the second training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
  • Referring to FIG. 5 and FIG. 8, a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention is shown; the legal document element analysis system further includes:
  • the merging unit 504 is used to merge the elements contained in each sentence to be parsed.
  • the elements contained in each sentence to be parsed can be combined to obtain a set of elements of the legal document to be parsed to meet different legal requirements.
  • the foregoing various modules may be implemented by a hardware device composed of a processor and a memory. Specifically, each of the foregoing modules is stored in the memory as a program unit, and the processor executes the foregoing program unit stored in the memory to realize the analysis of legal document elements.
  • the processor contains a kernel, which calls the corresponding program unit from the memory.
  • One or more kernels can be set, and the analysis of legal document elements can be realized by adjusting kernel parameters.
  • The memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
  • an embodiment of the present invention provides a processor configured to run a program, wherein the legal document element analysis method is executed when the program is running.
  • an embodiment of the present invention provides a device for analyzing elements of a legal document.
  • the device includes a processor, a memory, and a program stored in the memory and running on the processor.
  • When the processor executes the program, the following steps are implemented: obtain a legal document to be parsed; perform sentence segmentation on the legal document to obtain multiple sentences to be parsed; input the sentences to be parsed into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • The process of obtaining the element analysis model by training the language model on sample data includes: performing text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents; using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges; using the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; and using the second training data as the input of the second BERT model and training the second BERT model, in combination with a preset second loss function, until the second BERT model converges to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • The step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes: using the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and the sentence prediction result corresponding to the sentence splicing position; using the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and using the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and training the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges, where the actual text and the actual sentence are derived from the sample data.
  • Performing text replacement and sentence splicing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly splicing a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
  • After the step of performing element analysis by using the sentences to be parsed as the input of the pre-established element analysis model and obtaining the elements contained in each sentence to be parsed in the legal document, the method further includes: merging the elements contained in each sentence to be parsed.
  • an embodiment of the present invention also provides a storage medium on which a program is stored, and when the program is executed by a processor, the analysis of elements of a legal document is realized.
  • This application also provides a computer program product which, when executed on a data processing device, is suitable for executing a program that initializes the following method steps: obtaining a legal document to be parsed; performing sentence segmentation on the legal document to obtain multiple sentences to be parsed; and inputting the sentences to be parsed into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • The process of obtaining the element analysis model by training the language model on sample data includes: performing text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents; using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges; using the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; and using the second training data as the input of the second BERT model and training the second BERT model, in combination with a preset second loss function, until the second BERT model converges to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • The step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes: using the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and the sentence prediction result corresponding to the sentence splicing position; using the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and using the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and training the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges, where the actual text and the actual sentence are derived from the sample data.
  • Performing text replacement and sentence splicing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly splicing a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
  • After the step of performing element analysis by using the sentences to be parsed as the input of the pre-established element analysis model and obtaining the elements contained in each sentence to be parsed in the legal document, the method further includes: merging the elements contained in each sentence to be parsed.
  • the embodiments of the present invention provide a method and system for analyzing elements of a legal document.
  • The method obtains a legal document to be analyzed, performs sentence segmentation on the legal document to obtain multiple sentences to be parsed, and inputs the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document.
  • The element analysis model is obtained by training a language model on sample data. In this solution, the element analysis model is obtained by pre-training the language model on a large number of legal documents; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain the elements it contains.
  • Judgments are then made on the extracted case elements, so there is no need to manually extract the elements of the case one by one and then perform analysis and legal judgments on the manually extracted elements, thereby saving labor and time costs and improving the accuracy and efficiency of judgments.
  • the embodiments of the present application can be provided as methods, devices, clients, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Technology Law (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Provided in the present invention are a method and system for parsing elements of a legal document. The method comprises: acquiring a legal document to be parsed; performing sentence segmentation processing on the legal document to obtain a plurality of sentences to be parsed; and inputting the sentences to be parsed into a pre-established element analysis model one by one for element analysis, to obtain elements contained in each sentence to be parsed in the legal document, wherein the element parsing model is obtained by means of training a language model on the basis of sample data. In this solution, the element parsing model is obtained by means of pre-training the language model with a large number of legal documents, and sentence segmentation processing is performed on the legal documents that need to be parsed to obtain a plurality of sentences to be parsed, and each sentence to be parsed is used as the input of the element parsing model to obtain elements in each sentence to be parsed, thereby saving on labor and time costs, and improving the accuracy and efficiency of determination.

Description

Method and system for analyzing elements of legal documents
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 30, 2019, with application number 201910695870.8 and the invention title "A method and system for analyzing elements of legal documents", the entire content of which is incorporated herein by reference.
Technical field
The invention relates to the technical field of legal document processing, in particular to a method and system for analyzing elements of a legal document.
Background art
With the development of modern society, law has become one of the products of the development of civilized society. Law generally refers to a special code of conduct that is recognized by society, established by the state through legislation, guaranteed by the coercive power of the state, and whose content stipulates the rights and obligations of the parties; it is universally binding on all members of society. When disputes arise between members of society, the judicial organs file and adjudicate cases in accordance with the law.
When making legal judgments, the most common approach at present is element-based trial: based on the case information, the elements of the case are extracted one by one, and a legal judgment is finally made on the extracted case elements. On the one hand, because the case information contains many kinds of information, manually extracting the elements required for the judgment from this information usually takes a large amount of time and labor. On the other hand, due to the diversity of language, the same conviction element usually has multiple different descriptions and expressions, which affects the accuracy and efficiency of the judgment.
Summary of the invention
In view of this, the embodiments of the present invention provide a method and system for analyzing elements of a legal document to solve the problems of high labor cost, high time cost, low accuracy, and low efficiency in existing manual element extraction.
In order to achieve the foregoing objective, the embodiments of the present invention provide the following technical solutions:
The first aspect of the embodiments of the present invention discloses a method for analyzing elements of a legal document, and the method includes:
obtaining a legal document to be parsed;
performing sentence segmentation on the legal document to obtain multiple sentences to be parsed; and
inputting the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
Preferably, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model on sample data includes:
performing text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents;
using the first training data as the input of a first BERT model and training the first BERT model, in combination with a preset first loss function and the sample data, until the first BERT model converges;
using the converged model parameters of the first BERT model as the initialization model parameters of a second BERT model; and
using second training data as the input of the second BERT model and training the second BERT model, in combination with a preset second loss function, until the second BERT model converges to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
Preferably, the step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes:
using the first training data as the input of the first BERT model to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
using a first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and using a second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and
training the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges;
where the actual text and the actual sentence are derived from the sample data.
Preferably, performing text replacement and sentence splicing on the sample data to obtain the first training data includes:
randomly replacing text in the sample data with preset characters, and randomly splicing a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
Preferably, after the element analysis is performed by using the sentences to be parsed, one by one, as the input of the pre-established element analysis model and the elements contained in each sentence to be parsed in the legal document are obtained, the method further includes:
merging the elements contained in each sentence to be parsed.
The second aspect of the embodiments of the present invention discloses a system for analyzing elements of a legal document, and the system includes:
an obtaining unit, configured to obtain a legal document to be parsed;
a processing unit, configured to perform sentence segmentation on the legal document to obtain multiple sentences to be parsed; and
a prediction unit, configured to input the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
Preferably, when the language model is a BERT model, the prediction unit includes:
a processing module, configured to perform text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents;
a first training module, configured to use the first training data as the input of a first BERT model and to train the first BERT model, in combination with a preset first loss function and the sample data, until the first BERT model converges;
a setting module, configured to use the converged model parameters of the first BERT model as the initialization model parameters of a second BERT model; and
a second training module, configured to use second training data as the input of the second BERT model and to train the second BERT model, in combination with a preset second loss function, until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
Preferably, the first training module includes:
a prediction submodule, configured to use the first training data as the input of the first BERT model to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
an error submodule, configured to use a first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and to use a second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and
a training submodule, configured to train the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges;
where the actual text and the actual sentence are derived from the sample data.
A third aspect of the embodiments of the present invention discloses a storage medium, the storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
A fourth aspect of the embodiments of the present invention discloses a legal document element analysis device, including a storage medium and a processor, where the storage medium stores a program and the processor is configured to run the program, and where, when the program runs, it executes the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
基于上述本发明实施例提供的一种法律文书要素解析方法及系统,该方法为:获取待解析的法律文书。对法律文书进行分句处理,得到多条待解析语句。逐一将待解析语句输入预先建立的要素解析模型进行要素解析,得到法律文书中每条待解析语句包含的要素,其中,要素解析模型由基于样本数据训练语言模型获得。在本方案中,通过海量的法律文书预先训练语言模型得到要素解析模型,将需要解析的法律文书进行分句处理得到多条待解析语句,将每一条待解析语句作为要素解析模型的输入得到每条待解析语句中的要素,根据提取出来的案件要素进行分析和判决等操作,不需要人工将案件中的要素逐一提取出来,从而节约人力成本和时间成本,提高判决的准确性和效率。Based on the above-mentioned embodiment of the present invention, a method and system for analyzing elements of a legal document are provided. The method is to obtain a legal document to be analyzed. Perform sentence processing on legal documents and get multiple sentences to be parsed. One by one, the sentences to be analyzed are input into the pre-established element analysis model for element analysis, and the elements contained in each sentence to be analyzed in the legal document are obtained. The element analysis model is obtained by training the language model based on sample data. In this solution, the element analysis model is obtained by pre-training the language model of a large number of legal documents, and the legal documents that need to be analyzed are subdivided to obtain multiple sentences to be analyzed. Each sentence to be analyzed is used as the input of the element analysis model to obtain each sentence. The elements in the sentence to be parsed are analyzed and judged based on the extracted case elements, and there is no need to manually extract the elements in the case one by one, thereby saving labor and time costs, and improving the accuracy and efficiency of the judgment.
Description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative work.
Figure 1 is a schematic diagram of the Transformer structure provided by an embodiment of the present invention;
Figure 2 is a flowchart of a method for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 3 is a flowchart of obtaining the element analysis model provided by an embodiment of the present invention;
Figure 4 is a flowchart of training the first BERT model provided by an embodiment of the present invention;
Figure 5 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 6 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 7 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 8 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
In this application, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes that element.
As can be seen from the background, the current way of extracting case elements is to extract the elements of a case one by one manually, based on the case information, and then make a legal judgment based on the extracted case elements. On the one hand, because the case information contains many kinds of information, manually extracting the elements required for a judgment usually costs a great deal of time and labor. On the other hand, because of the diversity of language, the same conviction element usually has several different descriptions and expressions, which affects the accuracy and efficiency of the judgment.
Therefore, the embodiments of the present invention provide a method and system for parsing elements of a legal document: an element analysis model is obtained by pre-training a language model on a large number of legal documents, the legal document to be parsed is split into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain the elements of that sentence, so as to save labor and time costs and improve the accuracy and efficiency of judgments.
It should be noted that the BERT (Bidirectional Encoder Representations from Transformers) model involved in the embodiments of the present invention is a language model proposed by Google that has a strong ability to abstract text in the field of natural language processing. The BERT model has a 12-layer Transformer structure. Its specific structure is as follows: the text fed to the input embedding layer is segmented into characters, the characters are mapped to 768-dimensional vectors based on the character-vector mapping weights provided by Google, and the encoding vector Enc is obtained after the 12-layer Transformer structure.
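As an illustration of the encoding pipeline just described (character-level segmentation, a 768-dimensional embedding layer, and 12 Transformer layers producing the encoding vector Enc), the following Python sketch is provided; it is not taken from the patent, and the vocabulary size, positional-embedding scheme and layer hyperparameters are assumptions.

```python
# Minimal sketch of a BERT-style character encoder, assuming a character-id vocabulary.
import torch
import torch.nn as nn

class CharBertEncoder(nn.Module):
    def __init__(self, vocab_size=21128, d_model=768, n_layers=12, n_heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # character -> 768-dimensional vector
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positional embedding (assumption)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # 12 Transformer layers

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)  # embedding layer
        return self.encoder(x)                                  # Enc: (batch, seq_len, 768)
```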
Referring to Figure 1, a schematic diagram of the Transformer structure is shown. In Figure 1, the Transformer includes multi-head attention (Multihead Attention), a residual unit (Residual Unit), layer normalization (LayerNorm), and a two-layer fully connected feed-forward network (FFN).
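To make the block of Figure 1 concrete, a hedged sketch of a single layer (multi-head attention, residual connections, layer normalization, and a two-layer fully connected feed-forward network) is given below; the dropout rate and the post-norm ordering are assumptions rather than details stated in the patent.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One layer as sketched in Figure 1: multi-head attention, residual units, LayerNorm, FFN."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                      # two fully connected layers (FFN)
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)  # multi-head self-attention
        x = self.norm1(x + self.drop(attn_out))                # residual unit + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))             # residual unit + LayerNorm
        return x
```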
It should be noted that the element analysis models involved in the embodiments of the present invention are built for different legal fields; that is, for one type of legal field, the BERT model is trained with the sample data corresponding to that field to obtain the element analysis model corresponding to that field. For example, for the field of marriage and family affairs, the legal documents related to marriage and family affairs on the legal document website are used as sample data to train the BERT model, and the element analysis model corresponding to the field of marriage and family affairs is obtained.
Referring to Figure 2, a flowchart of a method for parsing elements of a legal document provided by an embodiment of the present invention is shown. The method includes the following steps:
Step S201: obtain a legal document to be parsed.
Step S202: split the legal document into sentences to obtain multiple sentences to be parsed.
In the specific implementation of step S202, the Language Technology Platform (LTP) is used to split the legal document into sentences, obtaining a sentence set containing multiple sentences to be parsed.
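A minimal sketch of this sentence-splitting step is shown below. It uses a simple punctuation-based splitter as a stand-in for LTP's sentence splitter; the punctuation set and the example text are illustrative assumptions only.

```python
import re

def split_sentences(document: str):
    """Naive stand-in for LTP sentence splitting: cut after Chinese sentence-final punctuation."""
    pieces = re.split(r'(?<=[。！？；])', document)   # keep each delimiter with its sentence
    return [p.strip() for p in pieces if p.strip()]

# A judgment document string becomes the set of sentences to be parsed.
sentences_to_parse = split_sentences("原告与被告于2010年登记结婚。婚后育有一子。现双方感情破裂。")
```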
Step S203: input the sentences to be parsed one by one into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document.
In the specific implementation of step S203, the legal documents needed for training the element analysis model are selected from the data published on the legal document website and split into sentences using LTP to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model. For example, assuming that the legal field corresponding to the element analysis model is marriage and family affairs, the legal documents in this field are selected from the legal document website, LTP is used to split them into sentences to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model corresponding to the field of marriage and family affairs. When the elements of a legal document in the field of marriage and family affairs need to be parsed, the document is split into sentences and then input into the element analysis model corresponding to this field for element analysis, obtaining the elements contained in each sentence of the document.
It should be noted that the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model. The types of language model include, but are not limited to, the ELMo model, the GPT model and the BERT model.
It should be noted that, by inputting each sentence to be parsed into the pre-established element analysis model for element analysis, zero or more elements contained in each sentence to be parsed can be obtained.
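Since a sentence may contain zero or more elements, the element analysis model behaves as a multi-label classifier at inference time. The sketch below assumes a trained model that returns one logit per element type, together with a hypothetical tokenizer and label list; none of these names come from the patent.

```python
import torch

# Hypothetical element labels for the marriage-and-family domain (illustrative only).
ELEMENT_LABELS = ["婚后育有子女", "存在家庭暴力", "夫妻感情破裂"]

@torch.no_grad()
def parse_elements(model, tokenizer, sentence: str, threshold: float = 0.5):
    """Return the elements whose sigmoid score exceeds the threshold (possibly none)."""
    token_ids = tokenizer(sentence)                    # assumed: sentence -> tensor of character ids
    logits = model(token_ids.unsqueeze(0)).squeeze(0)  # one logit per element type
    scores = torch.sigmoid(logits)
    return [label for label, s in zip(ELEMENT_LABELS, scores) if s.item() >= threshold]
```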
Preferably, after step S203 is performed, the elements contained in each sentence to be parsed are merged.
It should be noted that there are two kinds of requirements for the element parsing results of a legal document: one only requires the elements contained in each sentence of the legal document, and the other requires the elements contained in each sentence of the legal document to be merged to obtain the element set of the legal document.
In the embodiment of the present invention, an element analysis model is obtained by pre-training a language model on a large number of legal documents; the legal document to be parsed is split into sentences to obtain multiple sentences to be parsed; each sentence to be parsed is used as the input of the element analysis model to obtain the elements of that sentence; and legal judgments are made based on the extracted case elements. There is no need to manually extract the elements of a case one by one and then make legal judgments based on the manually extracted elements, which saves labor and time costs and improves the accuracy and efficiency of judgments.
For the process of obtaining the element analysis model by training a language model based on sample data involved in step S203 disclosed in Figure 2 of the above embodiment of the present invention, when the language model is a BERT model, reference may be made to Figure 3, which shows a flowchart of obtaining the element analysis model provided by an embodiment of the present invention, including the following steps:
Step S301: perform text replacement and sentence splicing on the sample data to obtain first training data.
In the specific implementation of step S301, the sample data is obtained by splitting published legal documents into sentences; for the specific process, refer to the content corresponding to step S203 disclosed in Figure 2 of the above embodiment of the present invention, which will not be repeated here.
When performing text replacement and sentence splicing, characters are randomly selected from the sample data and replaced with preset characters, and a second sentence is randomly spliced onto a first sentence in the sample data, where the second sentence either is or is not the next sentence that follows the first sentence. For example, characters in the sample data are randomly replaced with "[MASK]"; for a sentence selected for sentence splicing, there is a 50% probability that its actual next sentence is spliced onto it and a 50% probability that some other sentence is spliced onto it.
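The masking and splicing scheme just described might be prepared as in the sketch below; the 15% masking rate is an assumption added for illustration (the patent only states that characters are replaced at random), while the 50/50 sentence-splicing split follows the description above.

```python
import random

MASK_TOKEN = "[MASK]"

def make_pretraining_example(sentences, idx, mask_prob=0.15):
    """Build one (masked text, original text, is_next) triple from a list of sample sentences."""
    first = list(sentences[idx])
    # Randomly replace characters with the preset [MASK] symbol.
    masked = [MASK_TOKEN if random.random() < mask_prob else ch for ch in first]

    # With probability 0.5 splice the true next sentence, otherwise splice some other sentence.
    if random.random() < 0.5 and idx + 1 < len(sentences):
        second, is_next = sentences[idx + 1], 1
    else:
        second, is_next = random.choice(sentences), 0
    return "".join(masked) + second, sentences[idx] + second, is_next
```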
It should be noted that the random text replacement and sentence splicing described above are only examples; a technician may also specifically select which characters are to be replaced with the preset characters and which sentences are to be spliced. Similarly, characters may be replaced every preset number of characters and a sentence may be spliced every preset number of sentences, which is not specifically limited in the embodiments of the present invention.
Step S302: use the first training data as the input of the first BERT model and, in combination with the preset first loss function and the sample data, train the first BERT model until the first BERT model converges.
It should be noted that, in the specific implementation of step S302, the first training data is used as the input of the first BERT model to predict the text at the text replacement positions and the sentence at the sentence splicing position, and the first BERT model's ability to judge text and sentences is trained on the error between the prediction results and the actual results. For example, for a complete sentence, one character in the sentence is randomly replaced with a preset character, and the first BERT model is trained to judge what the actual character at the preset-character position is. For a passage composed of multiple sentences, sentence splicing is performed on a certain sentence, and the first BERT model is trained to judge what the actual sentence corresponding to the splicing position is.
Step S303: use the model parameters of the converged first BERT model as the initialization model parameters of the second BERT model.
In the specific implementation of step S303, the parameters of the embedding layer and the 12-layer Transformer structure in the converged first BERT model are used as the initialization parameters of the embedding layer and the 12-layer Transformer structure in the second BERT model.
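A hedged sketch of this parameter hand-over is given below, assuming both models expose their embedding layer and Transformer stack as submodules with matching shapes; the attribute names follow the earlier CharBertEncoder sketch and are not taken from the patent.

```python
def init_second_from_first(first_bert, second_bert):
    """Copy the converged embedding-layer and 12-layer Transformer weights into the second model."""
    second_bert.tok_emb.load_state_dict(first_bert.tok_emb.state_dict())
    second_bert.pos_emb.load_state_dict(first_bert.pos_emb.state_dict())
    second_bert.encoder.load_state_dict(first_bert.encoder.state_dict())
    return second_bert
```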
Step S304: use the second training data as the input of the second BERT model and train the second BERT model in combination with the preset second loss function until the second BERT model converges, obtaining the element analysis model.
In the specific implementation of step S304, a 768-dimensional vector is taken after the encoding vector Enc in the second BERT model and connected, through a 768-dimensional fully connected layer, to the number of categories required for element analysis, and a weighted cross-entropy loss function (sigmoid cross entropy loss) is used as the second loss function to train the second BERT model. The specific training process is as shown in A1-A3 below (a code sketch of this step is given after the A1-A3 procedure). It should be noted that the dimensions of the above vector and fully connected layer include, but are not limited to, 768. The second training data is obtained by selecting a preset number of legal documents from the sample data and annotating their elements; for example, 800-1000 legal documents are selected from the sample data, split into sentences and then annotated with elements to obtain the second training data.
A1: for each training sentence in the second training data, input the training sentence into the second BERT model for prediction to obtain the predicted elements contained in that training sentence.
A2: use the second loss function to calculate the error between the predicted elements and the actual elements contained in each training sentence.
A3: if the error is less than a threshold, build the element analysis model based on the current model parameters of the second BERT model; if the error is greater than the threshold, adjust the model parameters of the second BERT model based on the error and continue training the second BERT model on the second training data until the error is less than the threshold, and take the trained second BERT model as the element analysis model.
It should be noted that the content of the above procedure A1-A3 is only an example.
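Putting step S304 and the A1-A3 procedure together, a minimal fine-tuning sketch might look as follows. It takes a 768-dimensional vector after Enc, connects it through a fully connected layer to the number of element categories, and uses a weighted sigmoid cross-entropy loss; the choice of the first position of Enc as the pooled vector, the optimizer, the learning rate and the error threshold are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ElementClassifier(nn.Module):
    """Second BERT model with its head: Enc -> 768-dim vector -> fully connected layer -> logits."""
    def __init__(self, encoder, num_elements, pos_weight=None):
        super().__init__()
        self.encoder = encoder                        # initialized from the converged first model
        self.fc = nn.Linear(768, num_elements)
        # Weighted sigmoid cross-entropy loss as the second loss function.
        self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    def forward(self, token_ids):
        enc = self.encoder(token_ids)                 # (batch, seq_len, 768)
        return self.fc(enc[:, 0, :])                  # one 768-dim vector taken after Enc

def fine_tune(model, loader, epochs=3, error_threshold=0.05, lr=2e-5):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for token_ids, labels in loader:              # labels: multi-hot element annotations
            logits = model(token_ids)                 # A1: predict the elements of each sentence
            loss = model.loss_fn(logits, labels)      # A2: error against the annotated elements
            if loss.item() < error_threshold:         # A3: stop once the error is below the threshold
                return model
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```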
It should be noted that training a neural network model requires one or a series of initial parameters. The initial parameters of a traditional neural network model are usually random values drawn from a zero-mean normal distribution with a small variance, and this way of determining the initial parameters gives poor prediction of text elements. In the embodiment of the present invention, the first BERT model is pre-trained until convergence, and when the second BERT model is trained, the parameters of the trained first BERT model are used to initialize the parameters of the second BERT model, providing the second BERT model with sufficient prior information about the legal field and effectively improving the element prediction accuracy of the element analysis model.
In the embodiment of the present invention, the first BERT model is trained on the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is trained on the second training data until convergence to obtain the element analysis model. The element analysis model is used to perform element analysis on the legal document that has been split into sentences, obtaining the elements contained in each sentence of the legal document, and legal judgments are made based on the extracted case elements. There is no need to manually extract the elements of a case one by one and then make legal judgments based on the manually extracted elements, which effectively reduces labor and time costs and improves the accuracy and efficiency of judgments.
For the process of training the first BERT model involved in step S302 disclosed in Figure 3 of the above embodiment of the present invention, refer to Figure 4, which shows a flowchart of training the first BERT model provided by an embodiment of the present invention, including the following steps:
Step S401: use the first training data as the input of the first BERT model to obtain text prediction results for the text replacement positions and sentence prediction results for the sentence splicing positions.
It should be noted that, for the process of obtaining the first training data, refer to the content corresponding to step S301 disclosed in Figure 3 of the above embodiment of the present invention, which will not be repeated here.
Step S402: use the first sub-loss function to calculate the text error between the actual text at the text replacement positions and the text prediction results, and use the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result.
In the specific implementation of step S402, the first 768-dimensional vector in the encoding vector Enc is taken and connected, through a 768-dimensional fully connected layer, to the first sub-loss function and the second sub-loss function respectively. It should be noted that the dimensions of the above vector and fully connected layer include, but are not limited to, 768.
It should be noted that the first sub-loss function includes, but is not limited to, a multi-class softmax cross-entropy loss function, and the second sub-loss function includes, but is not limited to, a binary softmax cross-entropy loss function.
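The two sub-losses of step S402 can be sketched as follows: a multi-class softmax cross-entropy over the character vocabulary at the replaced positions, and a binary softmax cross-entropy computed from the first-position vector of Enc for the sentence-splicing decision. The head shapes mirror the description above; the use of an ignore index for unmasked positions and the simple sum of the two errors are assumptions.

```python
import torch.nn as nn

class PretrainingHeads(nn.Module):
    """First BERT model heads: masked-character prediction and sentence-splicing prediction."""
    def __init__(self, d_model=768, vocab_size=21128):
        super().__init__()
        self.mlm_fc = nn.Linear(d_model, vocab_size)            # first sub-loss head (multi-class)
        self.nsp_fc = nn.Linear(d_model, 2)                     # second sub-loss head (binary)
        self.mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)  # softmax cross-entropy
        self.nsp_loss = nn.CrossEntropyLoss()

    def forward(self, enc, mlm_labels, is_next):
        # enc: (batch, seq_len, 768); mlm_labels hold the actual character ids at the replaced
        # positions and -100 elsewhere; is_next is 1 when the spliced sentence is the true next one.
        text_error = self.mlm_loss(self.mlm_fc(enc).transpose(1, 2), mlm_labels)
        sentence_error = self.nsp_loss(self.nsp_fc(enc[:, 0, :]), is_next)
        return text_error + sentence_error
```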
步骤S403:基于所述文字误差和句子误差,结合所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛。Step S403: Based on the text error and sentence error, train the first BERT model in combination with the first training data until the first BERT model converges.
在具体实现步骤S403的过程中,所述实际文字和实际句子来源于所述样本数据,即通过所述样本数据可以获得文字替换位置的实际文字和句子拼接位置的实际句子。若所述文字误差和句子误差均小于阈值,则将所述第一BERT收敛。若所述文字误差和句子误差均大于阈值,则基于所述文字误差和句子误差调节所述第一BERT模型的模型参数,使用所述第一训练数据继续训练所述第一BERT模型直至所述文字误差和句子误差均小于阈值。In the process of specifically implementing step S403, the actual text and the actual sentence are derived from the sample data, that is, the actual text at the text replacement position and the actual sentence at the sentence splicing position can be obtained through the sample data. If the text error and sentence error are both smaller than the threshold, the first BERT is converged. If the text error and sentence error are both greater than the threshold, adjust the model parameters of the first BERT model based on the text error and sentence error, and use the first training data to continue training the first BERT model until the Both the text error and the sentence error are less than the threshold.
在本发明实施例中,在获取要素解析模型之前,先基于第一子损失函数和第二子损失函数,通过第一训练数据训练第一BERT模型直至收敛,将收敛后的第一BERT模型的模型参数作为第二BERT模型的初始化模型参数,再基于训练数据训练第二BERT模型直至收敛获得要素解析模型,能提高要素解析的准确性。In the embodiment of the present invention, before obtaining the element analysis model, based on the first sub-loss function and the second sub-loss function, the first BERT model is trained through the first training data until convergence, and the converged first BERT model The model parameters are used as the initialization model parameters of the second BERT model, and then the second BERT model is trained based on the training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
与上述本发明实施例公开的一种法律文书要素解析方法相对应,参考图5,本发明实施例还提供了一种法律文书要素解析系统,所述法律文书要素解析系统包括:获取单元501、处理单元502和预测单元503。Corresponding to the method for analyzing elements of a legal document disclosed in the foregoing embodiment of the present invention, referring to FIG. 5, an embodiment of the present invention also provides a system for analyzing elements of a legal document. The system for analyzing elements of a legal document includes: an acquiring unit 501, The processing unit 502 and the prediction unit 503.
获取单元501,用于获取待解析的法律文书。The obtaining unit 501 is configured to obtain a legal document to be analyzed.
处理单元502,用于对所述法律文书进行分句处理,得到多条待解析语句。对所述法律文书的具体处理过程参见上述本发明实施例图2公开的步骤S202相对应的内容。The processing unit 502 is configured to perform sentence processing on the legal document to obtain multiple sentences to be parsed. For the specific processing process of the legal document, refer to the content corresponding to step S202 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
预测单元503,用于逐一将所述待解析语句输入预先建立的要素解析模型进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素,其中,所述要素解析模型由基于样本数据训练语言模型获得,所述语言模型用于根据预设数量的法律文本进行预训练确定所述要素解析模型的初始化模型参数。获取所述样本数据的过程参见上述本发明实施例图2公开的步骤S203相对应的内容。The prediction unit 503 is configured to input the sentence to be parsed into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is based on The sample data is obtained by training a language model, and the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model. For the process of obtaining the sample data, refer to the content corresponding to step S203 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
在本发明实施例中,通过海量的法律文书预先训练语言模型得到要素解析模型,将需要解析的法律文书进行分句处理得到多条待解析语句,将每一条待解析语句作为要素解析模型的输入得到每条待解析语句中的要素,根据提取出来的案件要素进行法律判决。不需要人工将案件中的要素逐一提取出来,再根据人工提取的要素进行法律判决,节约人力成本和时间成本,提高判决的准确性和效率。In the embodiment of the present invention, the language model is pre-trained through a large number of legal documents to obtain the element analysis model, the legal documents to be analyzed are subjected to sentence processing to obtain multiple sentences to be analyzed, and each sentence to be analyzed is used as the input of the element analysis model Obtain the elements of each sentence to be parsed, and make legal judgments based on the extracted case elements. There is no need to manually extract the elements in the case one by one, and then make a legal judgment based on the manually extracted elements, saving labor and time costs, and improving the accuracy and efficiency of the judgment.
参考图6,示出了本发明实施例提供的一种法律文书要素解析系统的结构框图,当所述语言模型为BERT模型,所述预测单元503包括:处理模块5031、第一训练模块5032、设置模块5033和第二训练模块5034。Referring to FIG. 6, there is shown a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention. When the language model is a BERT model, the prediction unit 503 includes: a processing module 5031, a first training module 5032 Setting module 5033 and second training module 5034.
处理模块5031,用于对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,其中,所述样本数据基于对公开的法律文书进行分句处理获得。The processing module 5031 is configured to perform text replacement and sentence splicing processing on the sample data to obtain first training data, where the sample data is obtained based on sentence processing on a public legal document.
在具体实现中,所述处理模块5031具体用于随机将所述样本数据中的文字替换为预设字符,以及随机为所述样本数据中的第一语句拼接第二语句,其中,所述第二语句为所述第一语句对应的下一句或不是所述第一语句对应的下一句。具体内容参见上述本发明实施例图3公开的步骤S301相对应的内容。In specific implementation, the processing module 5031 is specifically configured to randomly replace text in the sample data with preset characters, and randomly splice a second sentence for the first sentence in the sample data, wherein the first sentence The second sentence is the next sentence corresponding to the first sentence or not the next sentence corresponding to the first sentence. For specific content, refer to the content corresponding to step S301 disclosed in FIG. 3 of the foregoing embodiment of the present invention.
第一训练模块5032,用于将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据,训练所述第一BERT模型直至所述第一BERT模型收敛。The first training module 5032 is configured to use the first training data as the input of the first BERT model, and combine the preset first loss function and the sample data to train the first BERT model until the first BERT The model converges.
设置模块5033,用于将收敛后的所述第一BERT模型的模型参数作为第二BERT模型的初始化模型参数。The setting module 5033 is configured to use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model.
第二训练模块5034,用于将第二训练数据作为所述第二BERT模型的输入,结合预设的第二损失函数训练所述第二BERT模型直至所述第二BERT模型收敛,得到所述要素解析模型,其中,所述第二训练数据通过从所述样本数据中选取预设数量的法律文书进行要素标注获得。具体训练所述第二BERT模型的过程参见上述本发明实施例图3公开的步骤S304相对应的内容。The second training module 5034 is configured to use second training data as the input of the second BERT model, and train the second BERT model in combination with a preset second loss function until the second BERT model converges to obtain the The element analysis model, wherein the second training data is obtained by selecting a preset number of legal documents from the sample data to perform element labeling. For the specific process of training the second BERT model, refer to the content corresponding to step S304 disclosed in FIG. 3 of the above embodiment of the present invention.
在本发明实施例中,通过第一训练数据训练第一BERT模型直至收敛,将收敛的第一BERT模型的模型参数作为第二BERT模型的初始化模型参数,并通过第二训练数据训练第二BERT模型直至收敛,得到要素解析模型。利用要素解析模型对进行分句处理后的法律文书进行要素解析,得到法律文书中每一语句所包含的要素,根据提取出来的案件要素进行分析和法律判决等操作,不需要人工将案件中的要素逐一提取出来,从而能有效降低人力成本和时间成本,提供判决准确性和效率。In the embodiment of the present invention, the first BERT model is trained through the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT is trained through the second training data The model converges, and the element analysis model is obtained. The element analysis model is used to analyze the elements of the legal document after the clause processing, to obtain the elements contained in each sentence in the legal document, and perform operations such as analysis and legal judgments based on the extracted case elements. The elements are extracted one by one, which can effectively reduce labor costs and time costs, and provide judgment accuracy and efficiency.
参考图7,示出了本发明实施例提供的一种法律文书要素解析系统的结构框图,所述第一训练模块5032包括:预测子模块50321、误差子模块50322和训练子模块50323。Referring to FIG. 7, there is shown a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention. The first training module 5032 includes: a prediction submodule 50321, an error submodule 50322, and a training submodule 50323.
预测子模块50321,用于将所述第一训练数据作为所述第一BERT模型的输入,得到对应文字替换位置的文字预测结果,以及得到对应句子拼接位置的句子预测结果。The prediction sub-module 50321 is configured to use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and obtain the sentence prediction result corresponding to the sentence splicing position.
误差子模块50322,用于使用第一子损失函数计算所述文字替换位置的实际文字和所述文字预测结果之间的文字误差,以及使用第二子损失函数计算所述句子拼接位置的实际句子与所述句子预测结果之间的句子误差。The error sub-module 50322 is configured to use the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and to use the second sub-loss function to calculate the actual sentence at the sentence splicing position The sentence error with the sentence prediction result.
训练子模块50323,用于基于所述文字误差和句子误差,结合所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛。训练所述第一BERT模型的过程参见上述本发明实施例图4公开的步骤S403相对应的内容。The training sub-module 50323 is configured to train the first BERT model based on the text error and sentence error in combination with the first training data until the first BERT model converges. For the process of training the first BERT model, refer to the content corresponding to step S403 disclosed in FIG. 4 of the foregoing embodiment of the present invention.
其中,所述实际文字和实际句子来源于所述样本数据。Wherein, the actual text and actual sentence are derived from the sample data.
在本发明实施例中,在获取要素解析模型之前,先基于第一子损失函数和第二子损失函数,通过第一训练数据训练第一BERT模型直至收敛,将收敛后的第一BERT模型的模型参数作为第二BERT模型的初始化模型参数,再基于训练数据训练第二BERT模型直至收敛获得要素解析模型,能提高要素解析的准确性。In the embodiment of the present invention, before obtaining the element analysis model, based on the first sub-loss function and the second sub-loss function, the first BERT model is trained through the first training data until convergence, and the converged first BERT model The model parameters are used as the initialization model parameters of the second BERT model, and then the second BERT model is trained based on the training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
优选的,结合图5,参考图8,示出了本发明实施例提供的一种法律文书要素解析系统的结构框图,所述法律文书要素解析系统还包括:Preferably, referring to FIG. 5 and FIG. 8, a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention is shown, and the legal document element analysis system further includes:
合并单元504,用于合并每条所述待解析语句包含的要素。The merging unit 504 is used to merge the elements contained in each sentence to be parsed.
需要说明的是,对应法律文书要素解析结果有以下两种需要,一种是只需要获得法律文书中每一语句包含的要素,另一种是需要将法律文书中每一语句包含的要素合并,得到该法律文书的要素集合。It should be noted that there are two requirements for the analysis results of the corresponding elements of the legal document. One is to obtain only the elements contained in each sentence in the legal document, and the other is to merge the elements contained in each sentence in the legal document. Get the set of elements of the legal document.
在本发明实施例中,根据实际需求,可以合并每条所述待解析语句包含的要素,得到所述待解析的法律文书的要素集合,以满足不同的法律需求。In the embodiment of the present invention, according to actual needs, the elements contained in each sentence to be parsed can be combined to obtain a set of elements of the legal document to be parsed to meet different legal requirements.
基于上述本发明实施例公开的法律文书要素解析系统,上述各个模块可以通过一种由处理器和存储器构成的硬件设备实现。具体为:上述各个模块作为程序单元存储于存储器中,由处理器执行存储在存储器中的上述程序单元来实现法律文书要素解析。Based on the legal document element analysis system disclosed in the foregoing embodiment of the present invention, the foregoing various modules may be implemented by a hardware device composed of a processor and a memory. Specifically, each of the foregoing modules is stored in the memory as a program unit, and the processor executes the foregoing program unit stored in the memory to realize the analysis of legal document elements.
其中,处理器中包含内核,由内核去存储器中调取相应的程序单元。内核可以设置一个或以上,通过调整内核参数来实现法律文书要素解析。Among them, the processor contains a kernel, which calls the corresponding program unit from the memory. One or more kernels can be set, and the analysis of legal document elements can be realized by adjusting kernel parameters.
存储器可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM),存储器包括至少一个存储芯片。The memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one Memory chip.
进一步的,本发明实施例提供了一种处理器,所述处理器用于运行程序,其中,所述程序运行时执行所述法律文书要素解析方法。Further, an embodiment of the present invention provides a processor configured to run a program, wherein the legal document element analysis method is executed when the program is running.
进一步的,本发明实施例提供了一种法律文书要素解析设备,该设备包括处理器、存储器及存储在存储器上并可在处理器上运行的程序,处理器执行程序时实现以下步骤:获取待解析的法律文书;对所述法律文书进行分句处理,得到多条待解析语句;逐一将所述待解析语句输入预先建立的要素解析模型进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素,其中,所述要素解析模型由基于样本数据训练语言模型获得,所述语言模型用于根据预设数量的法律文本进行预训练确定所述要素解析模型的初始化模型参数。Further, an embodiment of the present invention provides a device for analyzing elements of a legal document. The device includes a processor, a memory, and a program stored in the memory and running on the processor. When the processor executes the program, the following steps are implemented: Analyzed legal documents; perform sentence processing on the legal documents to obtain multiple sentences to be parsed; input the sentences to be parsed into the pre-established element analysis model for element analysis, and obtain each of the sentences in the legal document Elements included in the sentence to be parsed, wherein the element analysis model is obtained by training a language model based on sample data, and the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model.
其中,当所述语言模型为BERT模型,所述由基于样本数据训练语言模型获得要素解析模型的过程包括:对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,其中,所述样本数据基于对公开的法律文书进行分句处理获得;将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据,训练所述第一BERT模型直至所述第一BERT模型收敛;将收敛后的所述第一BERT模型的模型参数作为第二BERT模型的初始化模型参数;将第二训练数据作为所述第二BERT模型的输入,结合预设的第二损失函数训练所述第二BERT模型直至所述第二BERT模型收敛,得到所述要素解析模型,其中,所述第二训练数据通过从所述样本数据中选取预设数量的法律文书进行要素标注获得。Wherein, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model based on sample data includes: performing text replacement and sentence splicing processing on the sample data to obtain the first training data, wherein the The sample data is obtained based on the sentence processing of a public legal document; the first training data is used as the input of the first BERT model, and the first BERT model is trained in combination with the preset first loss function and the sample data Until the first BERT model converges; use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; use the second training data as the input of the second BERT model in combination with presets The second loss function of training the second BERT model until the second BERT model converges to obtain the element analysis model, wherein the second training data selects a preset number of legal documents from the sample data Perform feature labeling.
其中,所述将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据训练所述第一BERT模型直至所述第一BERT模型收敛,包括:将所述第一训练数据作为所述第一BERT模型的输入,得到对应文字替换位置的文字预测结果,以及得到对应句子拼接位置的句子预测结果;使用第一子损失函数计算所述文字替换位置的实际文字和所述文字预测结果之间的文字误差,以及使用第二子损失函数计算所述句子拼接位置的实际句子与所述句子预测结果之间的句子误差;基于所述文字误差和句子误差,结合 所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛;其中,所述实际文字和实际句子来源于所述样本数据。Wherein, the step of using the first training data as the input of the first BERT model and combining the preset first loss function and the sample data to train the first BERT model until the first BERT model converges includes: Use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position, and obtain the sentence prediction result corresponding to the sentence splicing position; use the first sub-loss function to calculate the text replacement position The text error between the actual text and the text prediction result, and the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result using the second sub-loss function; based on the text error and the sentence Error, training the first BERT model in combination with the first training data until the first BERT model converges; wherein the actual text and actual sentence are derived from the sample data.
其中,所述对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,包括:随机将所述样本数据中的文字替换为预设字符,以及随机为所述样本数据中的第一语句拼接第二语句,其中,所述第二语句为所述第一语句对应的下一句或不是所述第一语句对应的下一句。Wherein, performing text replacement and sentence splicing processing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly selecting the first training data in the sample data. The sentence is spliced into a second sentence, wherein the second sentence is the next sentence corresponding to the first sentence or not the next sentence corresponding to the first sentence.
其中,所述逐一将所述待解析语句作为预先建立的要素解析模型的输入进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素之后,还包括:合并每条所述待解析语句包含的要素。Wherein, the step of performing element analysis using the sentences to be parsed as the input of a pre-established element analysis model, and obtaining the elements contained in each sentence to be parsed in the legal document, further includes: merging each sentence Elements contained in the sentence to be parsed.
进一步的,本发明实施例还提供了一种存储介质,其上存储有程序,该程序被处理器执行时实现法律文书要素解析。Further, an embodiment of the present invention also provides a storage medium on which a program is stored, and when the program is executed by a processor, the analysis of elements of a legal document is realized.
本申请还提供了一种计算机程序产品,当在数据处理设备上执行时,适于执行初始化有如下方法步骤的程序:获取待解析的法律文书;对所述法律文书进行分句处理,得到多条待解析语句;逐一将所述待解析语句输入预先建立的要素解析模型进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素,其中,所述要素解析模型由基于样本数据训练语言模型获得,所述语言模型用于根据预设数量的法律文本进行预训练确定所述要素解析模型的初始化模型参数。This application also provides a computer program product, which when executed on a data processing device, is suitable for executing a program that initializes the following method steps: obtaining a legal document to be parsed; performing sentence processing on the legal document to obtain more Sentence to be parsed; input the sentence to be parsed into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is based on the sample The data training language model is obtained, and the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model.
其中,当所述语言模型为BERT模型,所述由基于样本数据训练语言模型获得要素解析模型的过程包括:对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,其中,所述样本数据基于对公开的法律文书进行分句处理获得;将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据,训练所述第一BERT模型直至所述第一BERT模型收敛;将收敛后的所述第一BERT模型的模型参数作为第二BERT模型的初始化模型参数;将第二训练数据作为所述第二BERT模型的输入,结合预设的第二损失函数训练所述第二BERT模型直至所述第二BERT模型收敛,得到所述要素解析模型,其中,所述第二训练数据通过从所述样本数据中选取预设数量的法律文书进行要素标注获得。Wherein, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model based on sample data includes: performing text replacement and sentence splicing processing on the sample data to obtain the first training data, wherein the The sample data is obtained based on the sentence processing of a public legal document; the first training data is used as the input of the first BERT model, and the first BERT model is trained in combination with the preset first loss function and the sample data Until the first BERT model converges; use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; use the second training data as the input of the second BERT model in combination with presets The second loss function of training the second BERT model until the second BERT model converges to obtain the element analysis model, wherein the second training data selects a preset number of legal documents from the sample data Perform feature labeling.
其中,所述将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据训练所述第一BERT模型直至所述第一BERT模型收敛,包括:将所述第一训练数据作为所述第一BERT模型的输入,得到对应文字替换位置的文字预测结果,以及得到对应句子拼接位置的句子预测结果;使用第一子损失函数计算所述文字替换位置的实际文字和所述文字预测结果之间的文字误差,以及使用第二子损失函数计算所述句子拼接位置的实际句子与所述句子预测结果之间的句子误差;基于所述文字误差和句子误差,结合所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛;其中,所述实际文字和实际句子来源于所述样本数据。Wherein, the step of using the first training data as the input of the first BERT model and combining the preset first loss function and the sample data to train the first BERT model until the first BERT model converges includes: Use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position, and obtain the sentence prediction result corresponding to the sentence splicing position; use the first sub-loss function to calculate the text replacement position The text error between the actual text and the text prediction result, and the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result using the second sub-loss function; based on the text error and the sentence Error, training the first BERT model in combination with the first training data until the first BERT model converges; wherein the actual text and actual sentence are derived from the sample data.
其中,所述对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,包括:随机将所述样本数据中的文字替换为预设字符,以及随机为所述样本数据中的第一语句拼接第二语句,其中,所述第二语句为所述第一语句对应的下一句或不是所述第一语句对应的下一句。Wherein, performing text replacement and sentence splicing processing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly selecting the first training data in the sample data. The sentence is spliced into a second sentence, wherein the second sentence is the next sentence corresponding to the first sentence or not the next sentence corresponding to the first sentence.
其中,所述逐一将所述待解析语句作为预先建立的要素解析模型的输入进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素之后,还包括:合并每条所述待解析语句包含的要素。Wherein, the step of performing element analysis using the sentences to be parsed as the input of a pre-established element analysis model, and obtaining the elements contained in each sentence to be parsed in the legal document, further includes: merging each sentence Elements contained in the sentence to be parsed.
综上所述,本发明实施例提供一种法律文书要素解析方法及系统,该方法为:获取待解析的法律文书。对法律文书进行分句处理,得到多条待解析语句。逐一将待解析语句输入预先建立的要素解析模型进行要素解析,得到法律文书中每条待解析语句包含的要素,其中,要素解析模型由基于样本数据训练语言模型获得。在本方案中,通过海量的法律文书预先训练语言模型得到要素解析模型,将需要解析的法律文书进行分句处理得到多条待解析语句,将每一条待解析语句作为要素解析模型的输入得到每条待解析语句中的要素,根据提取出来的案件要素进行法律判决。不需要人工将案件中的要素逐一提取出来,再根据人工提取的要素进行分析和法律判决等操作,从而节约人力成本和时间成本,提高判决的准确性和效率。In summary, the embodiments of the present invention provide a method and system for analyzing elements of a legal document. The method is to obtain a legal document to be analyzed. Perform sentence processing on legal documents and get multiple sentences to be parsed. One by one, the sentences to be analyzed are input into the pre-established element analysis model for element analysis, and the elements contained in each sentence to be analyzed in the legal document are obtained. The element analysis model is obtained by training the language model based on sample data. In this solution, the element analysis model is obtained by pre-training the language model of a large number of legal documents, and the legal documents that need to be analyzed are segmented to obtain multiple sentences to be analyzed. Each sentence to be analyzed is used as the input of the element analysis model to obtain each The elements in the sentence to be parsed shall be judged according to the extracted case elements. There is no need to manually extract the elements of the case one by one, and then perform analysis and legal judgments based on the manually extracted elements, thereby saving labor and time costs, and improving the accuracy and efficiency of judgments.
本领域内的技术人员应明白,本申请的实施例可提供为方法、装置、客户端、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多 个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application can be provided as methods, devices, clients, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。This application is described with reference to flowcharts and/or block diagrams of methods, equipment (systems), and computer program products according to the embodiments of this application. It should be understood that each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to generate a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment are generated It is a device that realizes the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device. The device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, so as to execute on the computer or other programmable equipment. The instructions provide steps for implementing functions specified in a flow or multiple flows in the flowchart and/or a block or multiple blocks in the block diagram.
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
存储器可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。存储器是计算机可读介质的示例。The memory may include non-permanent memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁 磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, Magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media can be used to store information that can be accessed by computing devices. According to the definition in this article, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding parts of the method embodiments. The systems and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A method for analyzing elements of a legal document, characterized in that the method comprises:
    obtaining a legal document to be parsed;
    performing sentence splitting on the legal document to obtain a plurality of sentences to be parsed;
    inputting the sentences to be parsed one by one into a pre-established element analysis model for element analysis, to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is obtained by training a language model based on sample data, and the language model is used to determine initialization model parameters of the element analysis model through pre-training on a preset number of legal texts.
  2. The method according to claim 1, characterized in that, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model based on sample data comprises:
    performing text replacement and sentence splicing on the sample data to obtain first training data, wherein the sample data is obtained by performing sentence splitting on published legal documents;
    taking the first training data as an input of a first BERT model, and training the first BERT model in combination with a preset first loss function and the sample data until the first BERT model converges;
    taking model parameters of the converged first BERT model as initialization model parameters of a second BERT model;
    taking second training data as an input of the second BERT model, and training the second BERT model in combination with a preset second loss function until the second BERT model converges, to obtain the element analysis model, wherein the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling them with elements.
  3. The method according to claim 2, characterized in that taking the first training data as the input of the first BERT model and training the first BERT model in combination with the preset first loss function and the sample data until the first BERT model converges comprises:
    taking the first training data as the input of the first BERT model, to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
    calculating, using a first sub-loss function, a text error between the actual text at the text replacement position and the text prediction result, and calculating, using a second sub-loss function, a sentence error between the actual sentence at the sentence splicing position and the sentence prediction result;
    training the first BERT model based on the text error and the sentence error, in combination with the first training data, until the first BERT model converges;
    wherein the actual text and the actual sentence are derived from the sample data.
  4. The method according to claim 2, characterized in that performing text replacement and sentence splicing on the sample data to obtain the first training data comprises:
    randomly replacing text in the sample data with a preset character, and randomly splicing a second sentence to a first sentence in the sample data, wherein the second sentence either is or is not the next sentence corresponding to the first sentence.
  5. The method according to claim 1, characterized in that, after inputting the sentences to be parsed one by one into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, the method further comprises:
    merging the elements contained in each sentence to be parsed.
  6. A system for analyzing elements of a legal document, characterized in that the system comprises:
    an obtaining unit, configured to obtain a legal document to be parsed;
    a processing unit, configured to perform sentence splitting on the legal document to obtain a plurality of sentences to be parsed;
    a prediction unit, configured to input the sentences to be parsed one by one into a pre-established element analysis model for element analysis, to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is obtained by training a language model based on sample data, and the language model is used to determine initialization model parameters of the element analysis model through pre-training on a preset number of legal texts.
  7. The system according to claim 6, characterized in that, when the language model is a BERT model, the prediction unit comprises:
    a processing module, configured to perform text replacement and sentence splicing on the sample data to obtain first training data, wherein the sample data is obtained by performing sentence splitting on published legal documents;
    a first training module, configured to take the first training data as an input of a first BERT model and train the first BERT model in combination with a preset first loss function and the sample data until the first BERT model converges;
    a setting module, configured to take model parameters of the converged first BERT model as initialization model parameters of a second BERT model;
    a second training module, configured to take second training data as an input of the second BERT model and train the second BERT model in combination with a preset second loss function until the second BERT model converges, to obtain the element analysis model, wherein the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling them with elements.
  8. The system according to claim 7, characterized in that the first training module comprises:
    a prediction submodule, configured to take the first training data as the input of the first BERT model, to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
    an error submodule, configured to calculate, using a first sub-loss function, a text error between the actual text at the text replacement position and the text prediction result, and to calculate, using a second sub-loss function, a sentence error between the actual sentence at the sentence splicing position and the sentence prediction result;
    a training submodule, configured to train the first BERT model based on the text error and the sentence error, in combination with the first training data, until the first BERT model converges;
    wherein the actual text and the actual sentence are derived from the sample data.
  9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute the method for analyzing elements of a legal document according to any one of claims 1 to 5.
  10. A device for analyzing elements of a legal document, characterized by comprising a storage medium and a processor, wherein the storage medium stores a program and the processor is configured to run the program, and wherein, when the program runs, the method for analyzing elements of a legal document according to any one of claims 1 to 5 is executed.
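
The following sketches are illustrative only and are not part of the claims. This first one is a minimal Python sketch of the first-training-data construction described in claim 4: characters in the segmented sample sentences are randomly replaced with a preset mask character, and each first sentence is randomly spliced with either its true next sentence or an unrelated one. The mask token, masking ratio, and function names are assumptions; the application does not fix them.

```python
import random

MASK_TOKEN = "[MASK]"   # the "preset character" of claim 4; the exact token is an assumption
MASK_PROB = 0.15        # masking ratio is an assumption, not specified in the claims

def build_first_training_data(sentences):
    """Build (masked_pair, char_labels, is_next) examples from segmented sample sentences."""
    examples = []
    for i, first in enumerate(sentences[:-1]):
        # Randomly splice either the true next sentence or a random one (claim 4).
        if random.random() < 0.5:
            second, is_next = sentences[i + 1], 1
        else:
            second, is_next = random.choice(sentences), 0

        # Randomly replace characters with the preset mask character, keeping the
        # original characters as labels for the text-prediction task of claim 3.
        masked, labels = [], []
        for ch in first + second:
            if random.random() < MASK_PROB:
                masked.append(MASK_TOKEN)
                labels.append(ch)
            else:
                masked.append(ch)
                labels.append(None)   # position not predicted
        examples.append((masked, labels, is_next))
    return examples
```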
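Next, a hedged sketch of the first-stage training of claims 2 and 3, assuming the HuggingFace transformers library and the public bert-base-chinese checkpoint (neither is named in the application). BertForPreTraining carries a masked-text head and a next-sentence head, so its two cross-entropy terms stand in for the first and second sub-loss functions; after convergence the parameters are saved and reused to initialize the second BERT model.

```python
import torch
from transformers import BertTokenizer, BertForPreTraining, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")     # assumed checkpoint
model = BertForPreTraining.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)         # learning rate is an assumption

def first_stage_step(first_sentence, second_sentence, masked_char_labels, is_next):
    """One update of the first BERT model on a spliced, partially masked sentence pair.

    masked_char_labels must be token ids aligned with the tokenized pair, with -100
    at positions that were not replaced so they are ignored by the loss.
    """
    enc = tokenizer(first_sentence, second_sentence, return_tensors="pt",
                    truncation=True, max_length=512)
    out = model(**enc,
                labels=masked_char_labels,                         # first sub-loss: text error
                next_sentence_label=torch.tensor([is_next]))       # second sub-loss: sentence error
    out.loss.backward()   # out.loss is the sum of the two sub-losses
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# After convergence, the first model's parameters initialize the second BERT model,
# which is then fine-tuned on element-labeled sentences (claim 2).
model.save_pretrained("legal-bert-pretrained")
element_model = BertForSequenceClassification.from_pretrained(
    "legal-bert-pretrained", num_labels=5)   # number of element types is an assumption
```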
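Finally, a sketch of the inference path of claims 1, 5, and 6: split the document into sentences, classify each sentence with the fine-tuned element analysis model, and merge the elements across sentences. The sentence-splitting rule, the model path, and the element label names are all assumptions made for illustration.

```python
import re
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical element types; the application does not enumerate them.
ELEMENT_LABELS = ["claim", "defense", "court_finding", "judgment", "other"]

tokenizer = BertTokenizer.from_pretrained("legal-bert-element-model")   # assumed fine-tuned model path
model = BertForSequenceClassification.from_pretrained(
    "legal-bert-element-model", num_labels=len(ELEMENT_LABELS))
model.eval()

def split_sentences(document: str):
    """Split on common Chinese sentence-ending punctuation (an assumed splitting rule)."""
    return [s.strip() for s in re.split(r"(?<=[。！？；])", document) if s.strip()]

def parse_elements(document: str):
    """Return the document's elements, merged across its sentences (claims 1 and 5)."""
    elements = {}
    for sentence in split_sentences(document):
        enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**enc).logits
        label = ELEMENT_LABELS[int(logits.argmax(dim=-1))]
        elements.setdefault(label, []).append(sentence)
    return elements
```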
PCT/CN2019/126935 2019-07-30 2019-12-20 Method and system for parsing elements of legal document WO2021017383A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910695870.8 2019-07-30
CN201910695870.8A CN112329436A (en) 2019-07-30 2019-07-30 Legal document element analysis method and system

Publications (1)

Publication Number Publication Date
WO2021017383A1 true WO2021017383A1 (en) 2021-02-04

Family

ID=74229390

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126935 WO2021017383A1 (en) 2019-07-30 2019-12-20 Method and system for parsing elements of legal document

Country Status (2)

Country Link
CN (1) CN112329436A (en)
WO (1) WO2021017383A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100095166A1 (en) * 2008-10-10 2010-04-15 Lecroy Corporation Protocol Aware Error Ratio Tester
CN108197163A (en) * 2017-12-14 2018-06-22 上海银江智慧智能化技术有限公司 A kind of structuring processing method based on judgement document
CN109992782A (en) * 2019-04-02 2019-07-09 深圳市华云中盛科技有限公司 Legal documents name entity recognition method, device and computer equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241621B (en) * 2016-12-23 2019-12-10 北京国双科技有限公司 legal knowledge retrieval method and device
CN109447105A (en) * 2018-09-10 2019-03-08 平安科技(深圳)有限公司 Contract audit method, apparatus, computer equipment and storage medium
CN109815331A (en) * 2019-01-07 2019-05-28 平安科技(深圳)有限公司 Construction method, device and the computer equipment of text emotion disaggregated model
CN109766537A (en) * 2019-01-16 2019-05-17 北京未名复众科技有限公司 Study abroad document methodology of composition, device and electronic equipment


Also Published As

Publication number Publication date
CN112329436A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
WO2022121251A1 (en) Method and apparatus for training text processing model, computer device and storage medium
CN111737991B (en) Text sentence breaking position identification method and system, electronic equipment and storage medium
CN112270546A (en) Risk prediction method and device based on stacking algorithm and electronic equipment
CN110428823A (en) Speech understanding device and the speech understanding method for using the device
WO2021143206A1 (en) Single-statement natural language processing method and apparatus, computer device, and readable storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN109389418A (en) Electric service client's demand recognition methods based on LDA model
CN114118065A (en) Chinese text error correction method and device in electric power field, storage medium and computing equipment
KR102409667B1 (en) Method of building training data of machine translation
CN113626608B (en) Semantic-enhancement relationship extraction method and device, computer equipment and storage medium
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
CN112434514B (en) Multi-granularity multi-channel neural network based semantic matching method and device and computer equipment
CN113918710A (en) Text data processing method and device, electronic equipment and readable storage medium
WO2021017383A1 (en) Method and system for parsing elements of legal document
CN112818688B (en) Text processing method, device, equipment and storage medium
CN113257230B (en) Voice processing method and device and computer storage medium
CN115357684A (en) Method and device for determining loss parameters of dialogue generation model
WO2020162240A1 (en) Language model score calculation device, language model creation device, methods therefor, program, and recording medium
CN114429121A (en) Method for extracting emotion and reason sentence pairs of test corpus
CN113849634A (en) Method for improving interpretability of depth model recommendation scheme
CN113283218A (en) Semantic text compression method and computer equipment
CN116882398B (en) Implicit chapter relation recognition method and system based on phrase interaction
US11664010B2 (en) Natural language domain corpus data set creation based on enhanced root utterances
Mitra et al. ICM: Intent and Conversational Mining from Conversation Logs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19940072

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19940072

Country of ref document: EP

Kind code of ref document: A1