WO2021017383A1 - Method and system for parsing elements of legal document - Google Patents

Method and system for parsing elements of legal document

Info

Publication number
WO2021017383A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
model
training
text
bert model
Prior art date
Application number
PCT/CN2019/126935
Other languages
French (fr)
Chinese (zh)
Inventor
戴威
Original Assignee
北京国双科技有限公司
Application filed by 北京国双科技有限公司
Publication of WO2021017383A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00: Administration; Management
    • G06Q10/06: Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00: Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/10: Services
    • G06Q50/18: Legal services; Handling legal documents

Definitions

  • the invention relates to the technical field of legal document processing, in particular to a method and system for analyzing elements of a legal document.
  • Law is one of the products of the development of civilized society.
  • Law generally refers to a special code of conduct that is recognized by society, established by the state through legislation, guaranteed by the coercive power of the state, and whose content stipulates the rights and obligations of the parties; it is universally binding on all members of society.
  • When disputes arise between members of society, the judicial organs file and adjudicate cases in accordance with the law.
  • the embodiments of the present invention provide a method and system for analyzing elements of a legal document to solve the problems of high labor cost, high time cost, low accuracy, and low efficiency in existing manual element extraction.
  • the first aspect of the embodiments of the present invention discloses a method for analyzing elements of a legal document, and the method includes:
  • inputting the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data;
  • the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • the process of training the language model based on sample data to obtain an element analysis model includes:
  • the second training data is used as the input of the second BERT model, and the second BERT model is trained in combination with the preset second loss function until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • the step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes:
  • the actual text and actual sentence are derived from the sample data.
  • performing text replacement and sentence splicing processing on the sample data to obtain the first training data includes:
  • the method further includes:
  • the second aspect of the embodiments of the present invention discloses a legal document element analysis system, the system includes:
  • the obtaining unit is used to obtain the legal document to be analyzed
  • the processing unit is used to perform sentence processing on the legal document to obtain multiple sentences to be parsed;
  • the prediction unit is used to input the sentences to be parsed, one by one, into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • the prediction unit includes:
  • a processing module configured to perform text replacement and sentence splicing processing on the sample data to obtain first training data, wherein the sample data is obtained based on sentence processing on a public legal document;
  • the first training module is configured to use the first training data as the input of the first BERT model and to train the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges;
  • a setting module configured to use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model
  • the second training module is configured to use the second training data as the input of the second BERT model and to train the second BERT model, in combination with a preset second loss function, until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • the first training module includes:
  • a prediction submodule configured to use the first training data as the input of the first BERT model to obtain a text prediction result corresponding to a text replacement position, and to obtain a sentence prediction result corresponding to a sentence splicing position;
  • the error sub-module is used to calculate, using the first sub-loss function, the text error between the actual text at the text replacement position and the text prediction result, and to calculate, using the second sub-loss function, the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result;
  • a training sub-module configured to train the first BERT model in combination with the first training data based on the text error and sentence error until the first BERT model converges
  • the actual text and actual sentence are derived from the sample data.
  • a third aspect of the embodiments of the present invention discloses a storage medium, the storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
  • the fourth aspect of the embodiments of the present invention discloses a legal document element analysis device, including a storage medium and a processor, where the storage medium stores a program and the processor is configured to run the program, and where, when the program runs, it executes the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
  • a method and system for analyzing elements of a legal document are provided.
  • The method obtains a legal document to be analyzed, performs sentence segmentation on the legal document to obtain multiple sentences to be parsed, and inputs the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document.
  • The element analysis model is obtained by training a language model on sample data.
  • In this solution, the element analysis model is obtained by pre-training the language model on a large number of legal documents; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain the elements it contains.
  • Analysis and judgment are then performed on the extracted case elements, so there is no need to manually extract the elements of the case one by one, which saves labor and time costs and improves the accuracy and efficiency of the judgment.
  • Figure 1 is a schematic structural diagram of a Transformer provided by an embodiment of the present invention.
  • FIG. 2 is a flowchart of a method for analyzing elements of a legal document according to an embodiment of the present invention
  • FIG. 3 is a flowchart of obtaining an element analysis model provided by an embodiment of the present invention.
  • FIG. 4 is a flowchart of training the first BERT model provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • FIG. 7 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • FIG. 8 is a schematic structural diagram of a legal document element analysis system provided by an embodiment of the present invention.
  • The terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Unless otherwise restricted, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or device that includes that element.
  • the embodiments of the present invention provide a legal document element analysis method and system.
  • the element analysis model is obtained by pre-training the language model of a large number of legal documents, and the legal documents to be analyzed are subjected to sentence processing to obtain multiple sentences to be analyzed.
  • a sentence to be parsed is used as the input of the element analysis model to obtain the elements of each sentence to be parsed, so as to save labor cost and time cost, and improve the accuracy and efficiency of judgment.
  • The BERT (Bidirectional Encoder Representations from Transformers) model involved in the embodiments of the present invention is a language model proposed by Google and has a strong ability to represent text in the field of natural language processing.
  • the BERT model has a 12-layer Transformer structure.
  • The specific structure of the BERT model is as follows: the input text of the embedding layer is segmented by word, the words are mapped to 768-dimensional vectors using the word-vector mapping weights provided by Google, and the encoding vector Enc is obtained through the 12-layer Transformer structure.
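For illustration only, the following minimal sketch shows how the encoder described above could be instantiated with the Hugging Face transformers library; the checkpoint name `bert-base-chinese`, the example sentence, and the library choice are assumptions, not details stated in the patent.

```python
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")  # 768-dim embeddings, published by Google
encoder = BertModel.from_pretrained("bert-base-chinese")        # 12-layer Transformer encoder

sentence = "原告与被告于2010年登记结婚。"  # a hypothetical sentence to be parsed
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = encoder(**inputs)

enc = outputs.last_hidden_state  # the encoding vector Enc, shape (1, sequence_length, 768)
```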
  • FIG. 1 shows a schematic structural diagram of the Transformer.
  • The Transformer includes multi-head attention (Multihead Attention), a residual unit, layer normalization (LayerNorm), and a two-layer fully connected feed-forward network (FFN).
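The sketch below, assuming PyTorch, shows one encoder block with the components listed in FIG. 1 (multi-head attention, residual connections, LayerNorm, and a two-layer feed-forward network); the 3072-dimensional inner FFN size and the GELU activation are assumptions carried over from the standard BERT-base configuration rather than details given in the patent.

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One encoder layer: multi-head attention, residual units, LayerNorm, two-layer FFN."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(            # two-layer fully connected feed-forward network
            nn.Linear(d_model, d_ff),
            nn.GELU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, padding_mask=None):
        attn_out, _ = self.attn(x, x, x, key_padding_mask=padding_mask)
        x = self.norm1(x + self.drop(attn_out))       # residual unit + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))    # residual unit + LayerNorm after the FFN
        return x
```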
  • The element analysis model involved in the embodiments of the present invention is constructed separately for different legal fields; that is, for a given legal field, the BERT model is trained using the sample data corresponding to that legal field to obtain the element analysis model corresponding to that field.
  • For example, legal documents in the field of marriage and family affairs published on the legal documents website are used as sample data to train the BERT model, and the element analysis model corresponding to the field of marriage and family affairs is obtained.
  • FIG. 2 shows a flowchart of a method for analyzing elements of a legal document provided by an embodiment of the present invention.
  • the method for analyzing elements of a legal document includes the following steps:
  • Step S201 Obtain the legal document to be analyzed.
  • Step S202 Perform sentence processing on the legal document to obtain multiple sentences to be parsed.
  • Specifically, the Language Technology Platform (LTP) is used to perform sentence segmentation on the legal document to obtain a sentence set containing multiple sentences to be parsed.
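A minimal sketch of step S202 follows; the patent names LTP but not a specific interface, so the pyltp `SentenceSplitter` call and the example document text are assumptions.

```python
from pyltp import SentenceSplitter  # pyltp binding of the Language Technology Platform (assumed interface)

def split_into_sentences(document_text):
    """Split a legal document into the list of sentences to be parsed (step S202)."""
    return list(SentenceSplitter.split(document_text))

document = "原告与被告于2010年登记结婚。婚后双方经常发生争执。现原告诉至法院请求离婚。"  # hypothetical text
sentences = split_into_sentences(document)  # each element is one sentence to be parsed
```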
  • Step S203 Input the sentences to be analyzed into the pre-established element analysis model one by one to perform element analysis to obtain the elements contained in each sentence to be analyzed in the legal document.
  • The legal documents required for training the element analysis model are selected from the data published on the legal documents website, the legal documents are segmented into sentences using LTP to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model.
  • For example, assuming that the legal field corresponding to the element analysis model is marriage and family affairs, legal documents in the field of marriage and family affairs are screened from the legal documents website, sentence segmentation is performed on these documents using LTP to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model corresponding to the field of marriage and family affairs.
  • the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • the types of language models include but are not limited to: ELMo model, GPT model and BERT model.
  • By inputting each sentence to be parsed into the pre-established element analysis model for element analysis, zero or more elements contained in each sentence to be parsed can be obtained.
  • After step S203, the elements contained in each sentence to be parsed are combined, as sketched below.
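The following sketch illustrates the overall flow of steps S201-S203 plus the merging of elements, assuming the trained element analysis model exposes a `predict(sentence)` method returning the (possibly empty) set of element labels for one sentence; the method name and the set-based merge are illustrative assumptions.

```python
def parse_document_elements(document_text, element_model, split_into_sentences):
    """Segment the document, analyze each sentence, and merge the elements found."""
    sentences = split_into_sentences(document_text)        # step S202: sentences to be parsed
    per_sentence = {}
    for sentence in sentences:                             # step S203: one sentence at a time
        per_sentence[sentence] = set(element_model.predict(sentence))  # zero or more elements
    merged_elements = set()
    for elements in per_sentence.values():                 # merge the elements of all sentences
        merged_elements.update(elements)
    return per_sentence, merged_elements
```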
  • In summary, the language model is pre-trained on a large number of legal documents to obtain the element analysis model; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain its elements. Legal judgments are then made on the extracted case elements, so there is no need to manually extract the elements of the case one by one and then make judgments on the manually extracted elements, which saves labor and time costs and improves the accuracy and efficiency of the judgment.
  • Referring to FIG. 3, the process of training a language model on sample data to obtain an element analysis model, involved in step S203 disclosed in FIG. 2 of the above embodiment of the present invention, includes the following steps:
  • Step S301 Perform text replacement and sentence splicing processing on the sample data to obtain first training data.
  • In step S301, the sample data is obtained by performing sentence segmentation on public legal documents.
  • For the sentence segmentation process, refer to the corresponding content of step S203 disclosed in FIG. 2 of the above embodiment of the present invention, which is not repeated here.
  • It should be noted that the random text replacement and sentence splicing mentioned above are only examples; a technician may also specifically select which words are to be replaced with characters and which sentences are to be spliced. Similarly, it is also possible to replace characters every preset number of characters and to splice a sentence every preset number of sentences, which is not specifically limited in the embodiment of the present invention. A sketch of one possible construction follows.
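This sketch assumes "[MASK]" as the preset replacement character and a 50/50 chance that the spliced second sentence is the true next sentence; the 15% replacement rate and these ratios are assumptions taken from the standard BERT pre-training recipe, not figures stated in the patent.

```python
import random

def build_first_training_example(sentences, idx, mask_token="[MASK]", mask_prob=0.15):
    """Build one example of the first training data from a list of sample-data sentences."""
    first = list(sentences[idx])
    masked_positions = []
    for i, ch in enumerate(first):
        if random.random() < mask_prob:          # text replacement with the preset character
            masked_positions.append((i, ch))     # keep the actual text for the first sub-loss function
            first[i] = mask_token
    if random.random() < 0.5 and idx + 1 < len(sentences):
        second, is_next = sentences[idx + 1], 1  # splice the actual next sentence
    else:
        second, is_next = random.choice(sentences), 0  # splice a randomly chosen sentence
    return "".join(first), second, masked_positions, is_next
```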
  • Step S302 Use the first training data as the input of the first BERT model, and combine the preset first loss function and the sample data to train the first BERT model until the first BERT model converges.
  • In step S302, the first training data is used as the input of the first BERT model to predict the text at the text replacement positions and the sentence at the sentence splicing positions, and the prediction results are combined with the actual results to train the first BERT model's ability to judge words and sentences. For example, for a complete sentence, a word in the sentence is randomly replaced with a preset character, and the first BERT model is trained to determine the actual text at the position of the preset character. For a passage composed of multiple sentences, sentence splicing is performed on one of the sentences, and the first BERT model is trained to determine the actual sentence corresponding to the spliced position.
  • Step S303 Use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model.
  • In step S303, the parameters of the embedding layer and the 12-layer Transformer structure in the converged first BERT model are used as the initialization parameters of the embedding layer and the 12-layer Transformer structure in the second BERT model.
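A minimal sketch of this parameter transfer, assuming both models are PyTorch modules sharing the same BERT backbone: the embedding and Transformer weights of the converged first model initialize the second model, while the second model's new classification head keeps its own randomly initialized weights.

```python
def init_second_from_first(first_bert, second_bert):
    """Copy matching backbone parameters from the converged first BERT model into the second."""
    pretrained = first_bert.state_dict()
    target = second_bert.state_dict()
    transferred = {k: v for k, v in pretrained.items()
                   if k in target and v.shape == target[k].shape}
    target.update(transferred)           # embedding layer + 12-layer Transformer parameters
    second_bert.load_state_dict(target)  # the new head's parameters are left untouched
    return second_bert
```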
  • Step S304 Use the second training data as the input of the second BERT model, and train the second BERT model in combination with the preset second loss function until the second BERT model converges to obtain the element analysis model.
  • Specifically, the first 768-dimensional vector of the encoding vector Enc in the second BERT model is selected, and this 768-dimensional vector is connected, through a 768-dimensional fully connected layer, to the number of categories required for element analysis.
  • A weighted sigmoid cross-entropy loss function is used as the second loss function to train the second BERT model.
  • It should be noted that the dimensions of the vectors and fully connected layers mentioned above include but are not limited to 768 dimensions. A sketch of this element head follows; the specific training process is then described in steps A1-A3.
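A minimal sketch of the element head and second loss function described above, assuming PyTorch: the first 768-dimensional vector of Enc feeds a fully connected layer that maps to the number of element categories, trained with a sigmoid cross-entropy loss; the per-class weighting scheme (`pos_weight`) is an assumption.

```python
import torch
import torch.nn as nn

class ElementHead(nn.Module):
    """Fully connected layer mapping the first vector of Enc to the element categories."""
    def __init__(self, num_elements, hidden_size=768, pos_weight=None):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_elements)
        self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)  # weighted sigmoid cross entropy

    def forward(self, enc, labels=None):
        cls_vec = enc[:, 0, :]                 # first 768-dimensional vector of the encoding vector Enc
        logits = self.classifier(cls_vec)      # one logit per element category
        if labels is None:
            return torch.sigmoid(logits)       # zero or more elements per sentence at inference time
        return self.loss_fn(logits, labels.float())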
  • the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling elements. For example, 800-1000 legal documents are selected from the sample data for sentence processing, and then the elements are labeled to obtain the second training data.
  • A1 For each training sentence in the second training data, input the training sentence into the second BERT model for prediction to obtain the predicted elements contained in the training sentence.
  • A2 Use the second loss function to calculate the error between the predicted elements and the actual elements contained in each training sentence.
  • A3 If the error is less than the threshold, construct the element analysis model based on the current model parameters of the second BERT model. If the error is greater than the threshold, adjust the model parameters of the second BERT model based on the error and continue training the second BERT model on the second training data until the error is less than the threshold, then determine the trained second BERT model as the element analysis model.
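A minimal sketch of the A1-A3 loop, assuming the second BERT model returns a scalar loss when given a batch of labelled training sentences; the optimizer, learning rate, and error threshold shown here are illustrative assumptions.

```python
import torch

def train_second_bert(second_bert, second_training_data, threshold=0.05, lr=2e-5, max_epochs=10):
    """Fine-tune the second BERT model on the element-labelled second training data (A1-A3)."""
    optimizer = torch.optim.AdamW(second_bert.parameters(), lr=lr)
    for _ in range(max_epochs):
        total_error = 0.0
        for batch in second_training_data:        # A1: predict the elements of each training sentence
            loss = second_bert(**batch)           # A2: weighted sigmoid cross-entropy error
            optimizer.zero_grad()
            loss.backward()                       # A3: adjust the model parameters based on the error
            optimizer.step()
            total_error += loss.item()
        if total_error / len(second_training_data) < threshold:
            break                                 # A3: error below the threshold, the model has converged
    return second_bert                            # the converged model is the element analysis model
```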
  • Training a neural network model requires one or a set of initial parameters.
  • The initial parameters of a traditional neural network model are usually random parameters drawn from a normal distribution with a mean of 0 and a small variance.
  • This traditional way of determining initial parameters gives a poor predictive effect on text elements.
  • In the embodiments of the present invention, the parameters of the trained first BERT model are used to initialize the parameters of the second BERT model.
  • These parameters provide sufficient prior information about the legal field for the second BERT model and effectively improve the element prediction accuracy of the element analysis model.
  • The first BERT model is trained on the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is trained on the second training data until it converges, thereby obtaining the element analysis model.
  • The element analysis model is used to analyze the elements of the legal document after sentence segmentation and to obtain the elements contained in each sentence of the legal document, and legal judgments are made on the extracted case elements. There is no need to manually extract the elements of the case one by one and then make legal judgments on the manually extracted elements, which effectively reduces labor and time costs and improves the accuracy and efficiency of judgments.
  • The process of training the first BERT model involved in step S302 disclosed in FIG. 3 of the above embodiment of the present invention is described below.
  • FIG. 4 shows a flowchart of training the first BERT model provided by an embodiment of the present invention, including the following steps:
  • Step S401 Use the first training data as the input of the first BERT model to obtain a text prediction result corresponding to the text replacement position, and obtain a sentence prediction result corresponding to the sentence splicing position.
  • Step S402 Use the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and use the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result.
  • Specifically, the first 768-dimensional vector is selected from the encoding vector Enc, and this 768-dimensional vector is connected, through a 768-dimensional fully connected layer, to the first sub-loss function and the second sub-loss function. It should be noted that the dimensions of the vectors and fully connected layers mentioned above include but are not limited to 768 dimensions.
  • the first sub-loss function includes but is not limited to: a multi-class softmax cross-entropy loss function
  • the second sub-loss function includes but is not limited to: a two-class softmax cross-entropy loss function.
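A minimal sketch of the two sub-loss functions, assuming PyTorch: a multi-class softmax cross-entropy over the vocabulary for the text at the replaced positions, and a two-class softmax cross-entropy for the spliced-sentence judgment; the tensor shapes and the -100 ignore label are illustrative assumptions.

```python
import torch
import torch.nn as nn

mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)  # first sub-loss: multi-class softmax cross entropy
nsp_loss = nn.CrossEntropyLoss()                   # second sub-loss: two-class softmax cross entropy

def first_loss(text_logits, text_labels, sentence_logits, sentence_labels):
    """Compute the text error and the sentence error used to train the first BERT model."""
    # text_logits: (batch, seq_len, vocab_size); text_labels: (batch, seq_len), -100 at unreplaced positions
    text_error = mlm_loss(text_logits.view(-1, text_logits.size(-1)), text_labels.view(-1))
    # sentence_logits: (batch, 2); sentence_labels: (batch,), 1 if the spliced sentence is the real next one
    sentence_error = nsp_loss(sentence_logits, sentence_labels)
    return text_error, sentence_error
```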
  • Step S403 Based on the text error and sentence error, train the first BERT model in combination with the first training data until the first BERT model converges.
  • The actual text and the actual sentence are derived from the sample data; that is, the actual text at the text replacement position and the actual sentence at the sentence splicing position can be obtained from the sample data. If the text error and the sentence error are both smaller than the threshold, the first BERT model has converged. If the text error and the sentence error are both greater than the threshold, the model parameters of the first BERT model are adjusted based on the text error and the sentence error, and the first training data is used to continue training the first BERT model until both the text error and the sentence error are less than the threshold.
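For completeness, a sketch of the convergence rule of step S403, assuming the first BERT model returns the text error and the sentence error for a batch of first training data; the optimizer choice, learning rate, thresholds, and epoch cap are illustrative assumptions.

```python
import torch

def train_first_bert(first_bert, first_training_data, text_threshold=0.1,
                     sentence_threshold=0.1, lr=1e-4, max_epochs=40):
    """Pre-train the first BERT model until both the text error and the sentence error fall below their thresholds."""
    optimizer = torch.optim.AdamW(first_bert.parameters(), lr=lr)
    for _ in range(max_epochs):
        text_error_sum = sentence_error_sum = 0.0
        for batch in first_training_data:
            text_error, sentence_error = first_bert(**batch)
            loss = text_error + sentence_error
            optimizer.zero_grad()
            loss.backward()                 # adjust the model parameters based on both errors
            optimizer.step()
            text_error_sum += text_error.item()
            sentence_error_sum += sentence_error.item()
        n = len(first_training_data)
        if text_error_sum / n < text_threshold and sentence_error_sum / n < sentence_threshold:
            break                           # both errors below their thresholds: the model has converged
    return first_bert
```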
  • Before the element analysis model is obtained, the first BERT model is trained on the first training data, based on the first sub-loss function and the second sub-loss function, until convergence; the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is then trained on the second training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
  • an embodiment of the present invention also provides a system for analyzing elements of a legal document.
  • The system for analyzing elements of a legal document includes an acquiring unit 501, a processing unit 502, and a prediction unit 503.
  • the obtaining unit 501 is configured to obtain a legal document to be analyzed.
  • the processing unit 502 is configured to perform sentence processing on the legal document to obtain multiple sentences to be parsed.
  • For the sentence segmentation process of the legal document, refer to the content corresponding to step S202 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
  • The prediction unit 503 is configured to input the sentences to be parsed into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • For the process of obtaining the sample data, refer to the content corresponding to step S203 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
  • The language model is pre-trained on a large number of legal documents to obtain the element analysis model; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain its elements. Legal judgments are then made on the extracted case elements, so there is no need to manually extract the elements of the case one by one and then make judgments on the manually extracted elements, which saves labor and time costs and improves the accuracy and efficiency of the judgment.
  • The prediction unit 503 includes a processing module 5031, a first training module 5032, a setting module 5033, and a second training module 5034.
  • the processing module 5031 is configured to perform text replacement and sentence splicing processing on the sample data to obtain first training data, where the sample data is obtained based on sentence processing on a public legal document.
  • The processing module 5031 is specifically configured to randomly replace text in the sample data with preset characters and to randomly splice a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
  • The first training module 5032 is configured to use the first training data as the input of the first BERT model and to train the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges.
  • the setting module 5033 is configured to use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model.
  • The second training module 5034 is configured to use the second training data as the input of the second BERT model and to train the second BERT model, in combination with a preset second loss function, until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • For the specific process of training the second BERT model, refer to the content corresponding to step S304 disclosed in FIG. 3 of the above embodiment of the present invention.
  • The first BERT model is trained on the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is trained on the second training data until it converges, thereby obtaining the element analysis model.
  • The element analysis model is used to analyze the elements of the legal document after sentence segmentation and to obtain the elements contained in each sentence of the legal document, and operations such as analysis and legal judgment are performed on the extracted case elements. There is no need to manually extract the elements one by one, which effectively reduces labor and time costs and improves the accuracy and efficiency of judgments.
  • the first training module 5032 includes: a prediction submodule 50321, an error submodule 50322, and a training submodule 50323.
  • the prediction sub-module 50321 is configured to use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and obtain the sentence prediction result corresponding to the sentence splicing position.
  • The error sub-module 50322 is configured to use the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and to use the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result.
  • the training sub-module 50323 is configured to train the first BERT model based on the text error and sentence error in combination with the first training data until the first BERT model converges.
  • For the process of training the first BERT model, refer to the content corresponding to step S403 disclosed in FIG. 4 of the foregoing embodiment of the present invention.
  • the actual text and actual sentence are derived from the sample data.
  • Before the element analysis model is obtained, the first BERT model is trained on the first training data, based on the first sub-loss function and the second sub-loss function, until convergence; the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is then trained on the second training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
  • Referring to FIG. 5 and FIG. 8, a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention is shown; the legal document element analysis system further includes:
  • the merging unit 504 is used to merge the elements contained in each sentence to be parsed.
  • the elements contained in each sentence to be parsed can be combined to obtain a set of elements of the legal document to be parsed to meet different legal requirements.
  • the foregoing various modules may be implemented by a hardware device composed of a processor and a memory. Specifically, each of the foregoing modules is stored in the memory as a program unit, and the processor executes the foregoing program unit stored in the memory to realize the analysis of legal document elements.
  • the processor contains a kernel, which calls the corresponding program unit from the memory.
  • One or more kernels can be set, and the analysis of legal document elements can be realized by adjusting kernel parameters.
  • The memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
  • an embodiment of the present invention provides a processor configured to run a program, wherein the legal document element analysis method is executed when the program is running.
  • an embodiment of the present invention provides a device for analyzing elements of a legal document.
  • the device includes a processor, a memory, and a program stored in the memory and running on the processor.
  • When the processor executes the program, the following steps are implemented: obtain a legal document to be parsed; perform sentence segmentation on the legal document to obtain multiple sentences to be parsed; input the sentences to be parsed into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • The process of obtaining the element analysis model by training the language model on sample data includes: performing text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents; using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges; using the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; and using the second training data as the input of the second BERT model and training the second BERT model, in combination with a preset second loss function, until the second BERT model converges to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • The step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes: using the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and the sentence prediction result corresponding to the sentence splicing position; using the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and using the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and training the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges, where the actual text and the actual sentence are derived from the sample data.
  • Performing text replacement and sentence splicing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly splicing a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
  • After the step of performing element analysis by using the sentences to be parsed as the input of the pre-established element analysis model and obtaining the elements contained in each sentence to be parsed in the legal document, the method further includes: merging the elements contained in each sentence to be parsed.
  • an embodiment of the present invention also provides a storage medium on which a program is stored, and when the program is executed by a processor, the analysis of elements of a legal document is realized.
  • This application also provides a computer program product which, when executed on a data processing device, is suitable for executing a program that initializes the following method steps: obtaining a legal document to be parsed; performing sentence segmentation on the legal document to obtain multiple sentences to be parsed; and inputting the sentences to be parsed into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
  • The process of obtaining the element analysis model by training the language model on sample data includes: performing text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents; using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges; using the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; and using the second training data as the input of the second BERT model and training the second BERT model, in combination with a preset second loss function, until the second BERT model converges to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
  • The step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes: using the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and the sentence prediction result corresponding to the sentence splicing position; using the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and using the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and training the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges, where the actual text and the actual sentence are derived from the sample data.
  • Performing text replacement and sentence splicing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly splicing a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
  • After the step of performing element analysis by using the sentences to be parsed as the input of the pre-established element analysis model and obtaining the elements contained in each sentence to be parsed in the legal document, the method further includes: merging the elements contained in each sentence to be parsed.
  • the embodiments of the present invention provide a method and system for analyzing elements of a legal document.
  • The method obtains a legal document to be analyzed, performs sentence segmentation on the legal document to obtain multiple sentences to be parsed, and inputs the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document.
  • The element analysis model is obtained by training a language model on sample data. In this solution, the element analysis model is obtained by pre-training the language model on a large number of legal documents; the legal document to be analyzed is segmented into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain the elements it contains.
  • Judgments are then made on the extracted case elements, so there is no need to manually extract the elements of the case one by one and then perform analysis and legal judgments on the manually extracted elements, thereby saving labor and time costs and improving the accuracy and efficiency of judgments.
  • the embodiments of the present application can be provided as methods, devices, clients, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
  • These computer program instructions can also be stored in a computer-readable memory that can direct a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, and the instruction device implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, and the instructions executed on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
  • the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
  • the memory may include non-permanent memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM).
  • Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology.
  • the information can be computer-readable instructions, data structures, program modules, or other data.
  • Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, CD-ROM, digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission media that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory media such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Theoretical Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Strategic Management (AREA)
  • General Physics & Mathematics (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Technology Law (AREA)
  • Entrepreneurship & Innovation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Game Theory and Decision Science (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

Provided in the present invention are a method and system for parsing elements of a legal document. The method comprises: acquiring a legal document to be parsed; performing sentence segmentation processing on the legal document to obtain a plurality of sentences to be parsed; and inputting the sentences to be parsed into a pre-established element analysis model one by one for element analysis, to obtain elements contained in each sentence to be parsed in the legal document, wherein the element parsing model is obtained by means of training a language model on the basis of sample data. In this solution, the element parsing model is obtained by means of pre-training the language model with a large number of legal documents, and sentence segmentation processing is performed on the legal documents that need to be parsed to obtain a plurality of sentences to be parsed, and each sentence to be parsed is used as the input of the element parsing model to obtain elements in each sentence to be parsed, thereby saving on labor and time costs, and improving the accuracy and efficiency of determination.

Description

Method and system for analyzing elements of legal documents
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on July 30, 2019, with application number 201910695870.8 and the invention title "A method and system for analyzing elements of legal documents", the entire content of which is incorporated herein by reference.
Technical field
The invention relates to the technical field of legal document processing, in particular to a method and system for analyzing elements of a legal document.
Background art
With the development of modern society, law has become one of the products of the development of civilized society. Law generally refers to a special code of conduct that is recognized by society, established by the state through legislation, guaranteed by the coercive power of the state, and whose content stipulates the rights and obligations of the parties; it is universally binding on all members of society. When disputes arise between members of society, the judicial organs file and adjudicate cases in accordance with the law.
When making legal judgments, the most common approach at present is element-based trial: based on the case information, the elements of the case are extracted one by one, and a legal judgment is finally made on the extracted case elements. On the one hand, because the case information contains many kinds of information, manually extracting the elements required for the judgment from this information usually takes a large amount of time and labor. On the other hand, due to the diversity of language, the same conviction element usually has multiple different descriptions and expressions, which affects the accuracy and efficiency of the judgment.
Summary of the invention
In view of this, the embodiments of the present invention provide a method and system for analyzing elements of a legal document to solve the problems of high labor cost, high time cost, low accuracy, and low efficiency in existing manual element extraction.
In order to achieve the foregoing objective, the embodiments of the present invention provide the following technical solutions:
The first aspect of the embodiments of the present invention discloses a method for analyzing elements of a legal document, and the method includes:
obtaining a legal document to be parsed;
performing sentence segmentation on the legal document to obtain multiple sentences to be parsed; and
inputting the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
Preferably, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model on sample data includes:
performing text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents;
using the first training data as the input of a first BERT model and training the first BERT model, in combination with a preset first loss function and the sample data, until the first BERT model converges;
using the converged model parameters of the first BERT model as the initialization model parameters of a second BERT model; and
using second training data as the input of the second BERT model and training the second BERT model, in combination with a preset second loss function, until the second BERT model converges to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
Preferably, the step of using the first training data as the input of the first BERT model and training the first BERT model, in combination with the preset first loss function and the sample data, until the first BERT model converges includes:
using the first training data as the input of the first BERT model to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
using a first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and using a second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and
training the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges;
where the actual text and the actual sentence are derived from the sample data.
Preferably, performing text replacement and sentence splicing on the sample data to obtain the first training data includes:
randomly replacing text in the sample data with preset characters, and randomly splicing a second sentence onto a first sentence in the sample data, where the second sentence either is or is not the actual next sentence corresponding to the first sentence.
Preferably, after the element analysis is performed by using the sentences to be parsed, one by one, as the input of the pre-established element analysis model and the elements contained in each sentence to be parsed in the legal document are obtained, the method further includes:
merging the elements contained in each sentence to be parsed.
The second aspect of the embodiments of the present invention discloses a system for analyzing elements of a legal document, and the system includes:
an obtaining unit, configured to obtain a legal document to be parsed;
a processing unit, configured to perform sentence segmentation on the legal document to obtain multiple sentences to be parsed; and
a prediction unit, configured to input the sentences to be parsed, one by one, into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, where the element analysis model is obtained by training a language model on sample data, and the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model.
Preferably, when the language model is a BERT model, the prediction unit includes:
a processing module, configured to perform text replacement and sentence splicing on the sample data to obtain first training data, where the sample data is obtained by sentence segmentation of public legal documents;
a first training module, configured to use the first training data as the input of a first BERT model and to train the first BERT model, in combination with a preset first loss function and the sample data, until the first BERT model converges;
a setting module, configured to use the converged model parameters of the first BERT model as the initialization model parameters of a second BERT model; and
a second training module, configured to use second training data as the input of the second BERT model and to train the second BERT model, in combination with a preset second loss function, until the second BERT model converges, so as to obtain the element analysis model, where the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling their elements.
Preferably, the first training module includes:
a prediction submodule, configured to use the first training data as the input of the first BERT model to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
an error submodule, configured to use a first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and to use a second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result; and
a training submodule, configured to train the first BERT model, in combination with the first training data, based on the text error and the sentence error until the first BERT model converges;
where the actual text and the actual sentence are derived from the sample data.
A third aspect of the embodiments of the present invention discloses a storage medium, the storage medium including a stored program, where, when the program runs, the device on which the storage medium is located is controlled to execute the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
A fourth aspect of the embodiments of the present invention discloses a legal document element analysis device, including a storage medium and a processor, where the storage medium stores a program and the processor is configured to run the program, and where, when the program runs, it executes the method for analyzing elements of a legal document disclosed in the first aspect of the embodiments of the present invention.
基于上述本发明实施例提供的一种法律文书要素解析方法及系统,该方法为:获取待解析的法律文书。对法律文书进行分句处理,得到多条待解析语句。逐一将待解析语句输入预先建立的要素解析模型进行要素解析,得到法律文书中每条待解析语句包含的要素,其中,要素解析模型由基于样本数据训练语言模型获得。在本方案中,通过海量的法律文书预先训练语言模型得到要素解析模型,将需要解析的法律文书进行分句处理得到多条待解析语句,将每一条待解析语句作为要素解析模型的输入得到每条待解析语句中的要素,根据提取出来的案件要素进行分析和判决等操作,不需要人工将案件中的要素逐一提取出来,从而节约人力成本和时间成本,提高判决的准确性和效率。Based on the above-mentioned embodiment of the present invention, a method and system for analyzing elements of a legal document are provided. The method is to obtain a legal document to be analyzed. Perform sentence processing on legal documents and get multiple sentences to be parsed. One by one, the sentences to be analyzed are input into the pre-established element analysis model for element analysis, and the elements contained in each sentence to be analyzed in the legal document are obtained. The element analysis model is obtained by training the language model based on sample data. In this solution, the element analysis model is obtained by pre-training the language model of a large number of legal documents, and the legal documents that need to be analyzed are subdivided to obtain multiple sentences to be analyzed. Each sentence to be analyzed is used as the input of the element analysis model to obtain each sentence. The elements in the sentence to be parsed are analyzed and judged based on the extracted case elements, and there is no need to manually extract the elements in the case one by one, thereby saving labor and time costs, and improving the accuracy and efficiency of the judgment.
Description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description are merely embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from the provided drawings without creative work.
Figure 1 is a schematic diagram of the Transformer structure provided by an embodiment of the present invention;
Figure 2 is a flowchart of a method for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 3 is a flowchart of obtaining the element analysis model provided by an embodiment of the present invention;
Figure 4 is a flowchart of training the first BERT model provided by an embodiment of the present invention;
Figure 5 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 6 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 7 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention;
Figure 8 is a schematic structural diagram of a system for parsing elements of a legal document provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention will be described clearly and completely below in conjunction with the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
In this application, the terms "include", "comprise" and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements that are not explicitly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or device that includes that element.
As can be seen from the background, the current way of extracting case elements is to extract the elements of a case one by one manually, based on the case information, and then make a legal judgment based on the extracted case elements. On the one hand, because the case information contains many kinds of information, manually extracting the elements required for a judgment usually costs a great deal of time and labor. On the other hand, because of the diversity of language, the same conviction element usually has several different descriptions and expressions, which affects the accuracy and efficiency of the judgment.
Therefore, the embodiments of the present invention provide a method and system for parsing elements of a legal document: an element analysis model is obtained by pre-training a language model on a large number of legal documents, the legal document to be parsed is split into multiple sentences to be parsed, and each sentence to be parsed is used as the input of the element analysis model to obtain the elements of that sentence, so as to save labor and time costs and improve the accuracy and efficiency of judgments.
It should be noted that the BERT (Bidirectional Encoder Representations from Transformers) model involved in the embodiments of the present invention is a language model proposed by Google that has a strong ability to abstract text in the field of natural language processing. The BERT model has a 12-layer Transformer structure. Its specific structure is as follows: the text fed to the input embedding layer is segmented into characters, the characters are mapped to 768-dimensional vectors based on the character-vector mapping weights provided by Google, and the encoding vector Enc is obtained after the 12-layer Transformer structure.
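As an illustration of the encoding pipeline just described (character-level segmentation, a 768-dimensional embedding layer, and 12 Transformer layers producing the encoding vector Enc), the following Python sketch is provided; it is not taken from the patent, and the vocabulary size, positional-embedding scheme and layer hyperparameters are assumptions.

```python
# Minimal sketch of a BERT-style character encoder, assuming a character-id vocabulary.
import torch
import torch.nn as nn

class CharBertEncoder(nn.Module):
    def __init__(self, vocab_size=21128, d_model=768, n_layers=12, n_heads=12, max_len=512):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)  # character -> 768-dimensional vector
        self.pos_emb = nn.Embedding(max_len, d_model)      # learned positional embedding (assumption)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads,
                                           dim_feedforward=4 * d_model, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)  # 12 Transformer layers

    def forward(self, token_ids):                           # token_ids: (batch, seq_len)
        positions = torch.arange(token_ids.size(1), device=token_ids.device)
        x = self.tok_emb(token_ids) + self.pos_emb(positions)  # embedding layer
        return self.encoder(x)                                  # Enc: (batch, seq_len, 768)
```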
Referring to Figure 1, a schematic diagram of the Transformer structure is shown. In Figure 1, the Transformer includes multi-head attention (Multihead Attention), a residual unit (Residual Unit), layer normalization (LayerNorm), and a two-layer fully connected feed-forward network (FFN).
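To make the block of Figure 1 concrete, a hedged sketch of a single layer (multi-head attention, residual connections, layer normalization, and a two-layer fully connected feed-forward network) is given below; the dropout rate and the post-norm ordering are assumptions rather than details stated in the patent.

```python
import torch.nn as nn

class TransformerBlock(nn.Module):
    """One layer as sketched in Figure 1: multi-head attention, residual units, LayerNorm, FFN."""
    def __init__(self, d_model=768, n_heads=12, d_ff=3072, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(                      # two fully connected layers (FFN)
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x, attn_mask=None):
        attn_out, _ = self.attn(x, x, x, attn_mask=attn_mask)  # multi-head self-attention
        x = self.norm1(x + self.drop(attn_out))                # residual unit + LayerNorm
        x = self.norm2(x + self.drop(self.ffn(x)))             # residual unit + LayerNorm
        return x
```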
It should be noted that the element analysis models involved in the embodiments of the present invention are built for different legal fields; that is, for one type of legal field, the BERT model is trained with the sample data corresponding to that field to obtain the element analysis model corresponding to that field. For example, for the field of marriage and family affairs, the legal documents related to marriage and family affairs on the legal document website are used as sample data to train the BERT model, and the element analysis model corresponding to the field of marriage and family affairs is obtained.
Referring to Figure 2, a flowchart of a method for parsing elements of a legal document provided by an embodiment of the present invention is shown. The method includes the following steps:
Step S201: obtain a legal document to be parsed.
Step S202: split the legal document into sentences to obtain multiple sentences to be parsed.
In the specific implementation of step S202, the Language Technology Platform (LTP) is used to split the legal document into sentences, obtaining a sentence set containing multiple sentences to be parsed.
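A minimal sketch of this sentence-splitting step is shown below. It uses a simple punctuation-based splitter as a stand-in for LTP's sentence splitter; the punctuation set and the example text are illustrative assumptions only.

```python
import re

def split_sentences(document: str):
    """Naive stand-in for LTP sentence splitting: cut after Chinese sentence-final punctuation."""
    pieces = re.split(r'(?<=[。！？；])', document)   # keep each delimiter with its sentence
    return [p.strip() for p in pieces if p.strip()]

# A judgment document string becomes the set of sentences to be parsed.
sentences_to_parse = split_sentences("原告与被告于2010年登记结婚。婚后育有一子。现双方感情破裂。")
```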
Step S203: input the sentences to be parsed one by one into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document.
In the specific implementation of step S203, the legal documents needed for training the element analysis model are selected from the data published on the legal document website and split into sentences using LTP to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model. For example, assuming that the legal field corresponding to the element analysis model is marriage and family affairs, the legal documents in this field are selected from the legal document website, LTP is used to split them into sentences to obtain sample data, and the language model is trained on the sample data to obtain the element analysis model corresponding to the field of marriage and family affairs. When the elements of a legal document in the field of marriage and family affairs need to be parsed, the document is split into sentences and then input into the element analysis model corresponding to this field for element analysis, obtaining the elements contained in each sentence of the document.
It should be noted that the language model is pre-trained on a preset number of legal texts to determine the initialization model parameters of the element analysis model. The types of language model include, but are not limited to, the ELMo model, the GPT model and the BERT model.
It should be noted that, by inputting each sentence to be parsed into the pre-established element analysis model for element analysis, zero or more elements contained in each sentence to be parsed can be obtained.
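Since a sentence may contain zero or more elements, the element analysis model behaves as a multi-label classifier at inference time. The sketch below assumes a trained model that returns one logit per element type, together with a hypothetical tokenizer and label list; none of these names come from the patent.

```python
import torch

# Hypothetical element labels for the marriage-and-family domain (illustrative only).
ELEMENT_LABELS = ["婚后育有子女", "存在家庭暴力", "夫妻感情破裂"]

@torch.no_grad()
def parse_elements(model, tokenizer, sentence: str, threshold: float = 0.5):
    """Return the elements whose sigmoid score exceeds the threshold (possibly none)."""
    token_ids = tokenizer(sentence)                    # assumed: sentence -> tensor of character ids
    logits = model(token_ids.unsqueeze(0)).squeeze(0)  # one logit per element type
    scores = torch.sigmoid(logits)
    return [label for label, s in zip(ELEMENT_LABELS, scores) if s.item() >= threshold]
```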
Preferably, after step S203 is performed, the elements contained in each sentence to be parsed are merged.
It should be noted that there are two kinds of requirements for the element parsing results of a legal document: one only requires the elements contained in each sentence of the legal document, and the other requires the elements contained in each sentence of the legal document to be merged to obtain the element set of the legal document.
In the embodiment of the present invention, an element analysis model is obtained by pre-training a language model on a large number of legal documents; the legal document to be parsed is split into sentences to obtain multiple sentences to be parsed; each sentence to be parsed is used as the input of the element analysis model to obtain the elements of that sentence; and legal judgments are made based on the extracted case elements. There is no need to manually extract the elements of a case one by one and then make legal judgments based on the manually extracted elements, which saves labor and time costs and improves the accuracy and efficiency of judgments.
For the process of obtaining the element analysis model by training a language model based on sample data involved in step S203 disclosed in Figure 2 of the above embodiment of the present invention, when the language model is a BERT model, reference may be made to Figure 3, which shows a flowchart of obtaining the element analysis model provided by an embodiment of the present invention, including the following steps:
Step S301: perform text replacement and sentence splicing on the sample data to obtain first training data.
In the specific implementation of step S301, the sample data is obtained by splitting published legal documents into sentences; for the specific process, refer to the content corresponding to step S203 disclosed in Figure 2 of the above embodiment of the present invention, which will not be repeated here.
When performing text replacement and sentence splicing, characters are randomly selected from the sample data and replaced with preset characters, and a second sentence is randomly spliced onto a first sentence in the sample data, where the second sentence either is or is not the next sentence that follows the first sentence. For example, characters in the sample data are randomly replaced with "[MASK]"; for a sentence selected for sentence splicing, there is a 50% probability that its actual next sentence is spliced onto it and a 50% probability that some other sentence is spliced onto it.
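The masking and splicing scheme just described might be prepared as in the sketch below; the 15% masking rate is an assumption added for illustration (the patent only states that characters are replaced at random), while the 50/50 sentence-splicing split follows the description above.

```python
import random

MASK_TOKEN = "[MASK]"

def make_pretraining_example(sentences, idx, mask_prob=0.15):
    """Build one (masked text, original text, is_next) triple from a list of sample sentences."""
    first = list(sentences[idx])
    # Randomly replace characters with the preset [MASK] symbol.
    masked = [MASK_TOKEN if random.random() < mask_prob else ch for ch in first]

    # With probability 0.5 splice the true next sentence, otherwise splice some other sentence.
    if random.random() < 0.5 and idx + 1 < len(sentences):
        second, is_next = sentences[idx + 1], 1
    else:
        second, is_next = random.choice(sentences), 0
    return "".join(masked) + second, sentences[idx] + second, is_next
```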
It should be noted that the random text replacement and sentence splicing described above are only examples; a technician may also specifically select which characters are to be replaced with the preset characters and which sentences are to be spliced. Similarly, characters may be replaced every preset number of characters and a sentence may be spliced every preset number of sentences, which is not specifically limited in the embodiments of the present invention.
Step S302: use the first training data as the input of the first BERT model and, in combination with the preset first loss function and the sample data, train the first BERT model until the first BERT model converges.
It should be noted that, in the specific implementation of step S302, the first training data is used as the input of the first BERT model to predict the text at the text replacement positions and the sentence at the sentence splicing position, and the first BERT model's ability to judge text and sentences is trained on the error between the prediction results and the actual results. For example, for a complete sentence, one character in the sentence is randomly replaced with a preset character, and the first BERT model is trained to judge what the actual character at the preset-character position is. For a passage composed of multiple sentences, sentence splicing is performed on a certain sentence, and the first BERT model is trained to judge what the actual sentence corresponding to the splicing position is.
Step S303: use the model parameters of the converged first BERT model as the initialization model parameters of the second BERT model.
In the specific implementation of step S303, the parameters of the embedding layer and the 12-layer Transformer structure in the converged first BERT model are used as the initialization parameters of the embedding layer and the 12-layer Transformer structure in the second BERT model.
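A hedged sketch of this parameter hand-over is given below, assuming both models expose their embedding layer and Transformer stack as submodules with matching shapes; the attribute names follow the earlier CharBertEncoder sketch and are not taken from the patent.

```python
def init_second_from_first(first_bert, second_bert):
    """Copy the converged embedding-layer and 12-layer Transformer weights into the second model."""
    second_bert.tok_emb.load_state_dict(first_bert.tok_emb.state_dict())
    second_bert.pos_emb.load_state_dict(first_bert.pos_emb.state_dict())
    second_bert.encoder.load_state_dict(first_bert.encoder.state_dict())
    return second_bert
```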
Step S304: use the second training data as the input of the second BERT model and train the second BERT model in combination with the preset second loss function until the second BERT model converges, obtaining the element analysis model.
In the specific implementation of step S304, a 768-dimensional vector is taken after the encoding vector Enc in the second BERT model and connected, through a 768-dimensional fully connected layer, to the number of categories required for element analysis, and a weighted cross-entropy loss function (sigmoid cross entropy loss) is used as the second loss function to train the second BERT model. The specific training process is as shown in A1-A3 below (a code sketch of this step is given after the A1-A3 procedure). It should be noted that the dimensions of the above vector and fully connected layer include, but are not limited to, 768. The second training data is obtained by selecting a preset number of legal documents from the sample data and annotating their elements; for example, 800-1000 legal documents are selected from the sample data, split into sentences and then annotated with elements to obtain the second training data.
A1: for each training sentence in the second training data, input the training sentence into the second BERT model for prediction to obtain the predicted elements contained in that training sentence.
A2: use the second loss function to calculate the error between the predicted elements and the actual elements contained in each training sentence.
A3: if the error is less than a threshold, build the element analysis model based on the current model parameters of the second BERT model; if the error is greater than the threshold, adjust the model parameters of the second BERT model based on the error and continue training the second BERT model on the second training data until the error is less than the threshold, and take the trained second BERT model as the element analysis model.
It should be noted that the content of the above procedure A1-A3 is only an example.
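Putting step S304 and the A1-A3 procedure together, a minimal fine-tuning sketch might look as follows. It takes a 768-dimensional vector after Enc, connects it through a fully connected layer to the number of element categories, and uses a weighted sigmoid cross-entropy loss; the choice of the first position of Enc as the pooled vector, the optimizer, the learning rate and the error threshold are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ElementClassifier(nn.Module):
    """Second BERT model with its head: Enc -> 768-dim vector -> fully connected layer -> logits."""
    def __init__(self, encoder, num_elements, pos_weight=None):
        super().__init__()
        self.encoder = encoder                        # initialized from the converged first model
        self.fc = nn.Linear(768, num_elements)
        # Weighted sigmoid cross-entropy loss as the second loss function.
        self.loss_fn = nn.BCEWithLogitsLoss(pos_weight=pos_weight)

    def forward(self, token_ids):
        enc = self.encoder(token_ids)                 # (batch, seq_len, 768)
        return self.fc(enc[:, 0, :])                  # one 768-dim vector taken after Enc

def fine_tune(model, loader, epochs=3, error_threshold=0.05, lr=2e-5):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for token_ids, labels in loader:              # labels: multi-hot element annotations
            logits = model(token_ids)                 # A1: predict the elements of each sentence
            loss = model.loss_fn(logits, labels)      # A2: error against the annotated elements
            if loss.item() < error_threshold:         # A3: stop once the error is below the threshold
                return model
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```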
It should be noted that training a neural network model requires one or a series of initial parameters. The initial parameters of a traditional neural network model are usually random values drawn from a zero-mean normal distribution with a small variance, and this way of determining the initial parameters gives poor prediction of text elements. In the embodiment of the present invention, the first BERT model is pre-trained until convergence, and when the second BERT model is trained, the parameters of the trained first BERT model are used to initialize the parameters of the second BERT model, providing the second BERT model with sufficient prior information about the legal field and effectively improving the element prediction accuracy of the element analysis model.
In the embodiment of the present invention, the first BERT model is trained on the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT model is trained on the second training data until convergence to obtain the element analysis model. The element analysis model is used to perform element analysis on the legal document that has been split into sentences, obtaining the elements contained in each sentence of the legal document, and legal judgments are made based on the extracted case elements. There is no need to manually extract the elements of a case one by one and then make legal judgments based on the manually extracted elements, which effectively reduces labor and time costs and improves the accuracy and efficiency of judgments.
For the process of training the first BERT model involved in step S302 disclosed in Figure 3 of the above embodiment of the present invention, refer to Figure 4, which shows a flowchart of training the first BERT model provided by an embodiment of the present invention, including the following steps:
Step S401: use the first training data as the input of the first BERT model to obtain text prediction results for the text replacement positions and sentence prediction results for the sentence splicing positions.
It should be noted that, for the process of obtaining the first training data, refer to the content corresponding to step S301 disclosed in Figure 3 of the above embodiment of the present invention, which will not be repeated here.
Step S402: use the first sub-loss function to calculate the text error between the actual text at the text replacement positions and the text prediction results, and use the second sub-loss function to calculate the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result.
In the specific implementation of step S402, the first 768-dimensional vector in the encoding vector Enc is taken and connected, through a 768-dimensional fully connected layer, to the first sub-loss function and the second sub-loss function respectively. It should be noted that the dimensions of the above vector and fully connected layer include, but are not limited to, 768.
It should be noted that the first sub-loss function includes, but is not limited to, a multi-class softmax cross-entropy loss function, and the second sub-loss function includes, but is not limited to, a binary softmax cross-entropy loss function.
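The two sub-losses of step S402 can be sketched as follows: a multi-class softmax cross-entropy over the character vocabulary at the replaced positions, and a binary softmax cross-entropy computed from the first-position vector of Enc for the sentence-splicing decision. The head shapes mirror the description above; the use of an ignore index for unmasked positions and the simple sum of the two errors are assumptions.

```python
import torch.nn as nn

class PretrainingHeads(nn.Module):
    """First BERT model heads: masked-character prediction and sentence-splicing prediction."""
    def __init__(self, d_model=768, vocab_size=21128):
        super().__init__()
        self.mlm_fc = nn.Linear(d_model, vocab_size)            # first sub-loss head (multi-class)
        self.nsp_fc = nn.Linear(d_model, 2)                     # second sub-loss head (binary)
        self.mlm_loss = nn.CrossEntropyLoss(ignore_index=-100)  # softmax cross-entropy
        self.nsp_loss = nn.CrossEntropyLoss()

    def forward(self, enc, mlm_labels, is_next):
        # enc: (batch, seq_len, 768); mlm_labels hold the actual character ids at the replaced
        # positions and -100 elsewhere; is_next is 1 when the spliced sentence is the true next one.
        text_error = self.mlm_loss(self.mlm_fc(enc).transpose(1, 2), mlm_labels)
        sentence_error = self.nsp_loss(self.nsp_fc(enc[:, 0, :]), is_next)
        return text_error + sentence_error
```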
步骤S403:基于所述文字误差和句子误差,结合所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛。Step S403: Based on the text error and sentence error, train the first BERT model in combination with the first training data until the first BERT model converges.
在具体实现步骤S403的过程中,所述实际文字和实际句子来源于所述样本数据,即通过所述样本数据可以获得文字替换位置的实际文字和句子拼接位置的实际句子。若所述文字误差和句子误差均小于阈值,则将所述第一BERT收敛。若所述文字误差和句子误差均大于阈值,则基于所述文字误差和句子误差调节所述第一BERT模型的模型参数,使用所述第一训练数据继续训练所述第一BERT模型直至所述文字误差和句子误差均小于阈值。In the process of specifically implementing step S403, the actual text and the actual sentence are derived from the sample data, that is, the actual text at the text replacement position and the actual sentence at the sentence splicing position can be obtained through the sample data. If the text error and sentence error are both smaller than the threshold, the first BERT is converged. If the text error and sentence error are both greater than the threshold, adjust the model parameters of the first BERT model based on the text error and sentence error, and use the first training data to continue training the first BERT model until the Both the text error and the sentence error are less than the threshold.
在本发明实施例中,在获取要素解析模型之前,先基于第一子损失函数和第二子损失函数,通过第一训练数据训练第一BERT模型直至收敛,将收敛后的第一BERT模型的模型参数作为第二BERT模型的初始化模型参数,再基于训练数据训练第二BERT模型直至收敛获得要素解析模型,能提高要素解析的准确性。In the embodiment of the present invention, before obtaining the element analysis model, based on the first sub-loss function and the second sub-loss function, the first BERT model is trained through the first training data until convergence, and the converged first BERT model The model parameters are used as the initialization model parameters of the second BERT model, and then the second BERT model is trained based on the training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
与上述本发明实施例公开的一种法律文书要素解析方法相对应,参考图5,本发明实施例还提供了一种法律文书要素解析系统,所述法律文书要素解析系统包括:获取单元501、处理单元502和预测单元503。Corresponding to the method for analyzing elements of a legal document disclosed in the foregoing embodiment of the present invention, referring to FIG. 5, an embodiment of the present invention also provides a system for analyzing elements of a legal document. The system for analyzing elements of a legal document includes: an acquiring unit 501, The processing unit 502 and the prediction unit 503.
获取单元501,用于获取待解析的法律文书。The obtaining unit 501 is configured to obtain a legal document to be analyzed.
处理单元502,用于对所述法律文书进行分句处理,得到多条待解析语句。对所述法律文书的具体处理过程参见上述本发明实施例图2公开的步骤S202相对应的内容。The processing unit 502 is configured to perform sentence processing on the legal document to obtain multiple sentences to be parsed. For the specific processing process of the legal document, refer to the content corresponding to step S202 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
预测单元503,用于逐一将所述待解析语句输入预先建立的要素解析模型进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素,其中,所述要素解析模型由基于样本数据训练语言模型获得,所述语言模型用于根据预设数量的法律文本进行预训练确定所述要素解析模型的初始化模型参数。获取所述样本数据的过程参见上述本发明实施例图2公开的步骤S203相对应的内容。The prediction unit 503 is configured to input the sentence to be parsed into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is based on The sample data is obtained by training a language model, and the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model. For the process of obtaining the sample data, refer to the content corresponding to step S203 disclosed in FIG. 2 of the foregoing embodiment of the present invention.
在本发明实施例中,通过海量的法律文书预先训练语言模型得到要素解析模型,将需要解析的法律文书进行分句处理得到多条待解析语句,将每一条待解析语句作为要素解析模型的输入得到每条待解析语句中的要素,根据提取出来的案件要素进行法律判决。不需要人工将案件中的要素逐一提取出来,再根据人工提取的要素进行法律判决,节约人力成本和时间成本,提高判决的准确性和效率。In the embodiment of the present invention, the language model is pre-trained through a large number of legal documents to obtain the element analysis model, the legal documents to be analyzed are subjected to sentence processing to obtain multiple sentences to be analyzed, and each sentence to be analyzed is used as the input of the element analysis model Obtain the elements of each sentence to be parsed, and make legal judgments based on the extracted case elements. There is no need to manually extract the elements in the case one by one, and then make a legal judgment based on the manually extracted elements, saving labor and time costs, and improving the accuracy and efficiency of the judgment.
参考图6,示出了本发明实施例提供的一种法律文书要素解析系统的结构框图,当所述语言模型为BERT模型,所述预测单元503包括:处理模块5031、第一训练模块5032、设置模块5033和第二训练模块5034。Referring to FIG. 6, there is shown a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention. When the language model is a BERT model, the prediction unit 503 includes: a processing module 5031, a first training module 5032 Setting module 5033 and second training module 5034.
处理模块5031,用于对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,其中,所述样本数据基于对公开的法律文书进行分句处理获得。The processing module 5031 is configured to perform text replacement and sentence splicing processing on the sample data to obtain first training data, where the sample data is obtained based on sentence processing on a public legal document.
在具体实现中,所述处理模块5031具体用于随机将所述样本数据中的文字替换为预设字符,以及随机为所述样本数据中的第一语句拼接第二语句,其中,所述第二语句为所述第一语句对应的下一句或不是所述第一语句对应的下一句。具体内容参见上述本发明实施例图3公开的步骤S301相对应的内容。In specific implementation, the processing module 5031 is specifically configured to randomly replace text in the sample data with preset characters, and randomly splice a second sentence for the first sentence in the sample data, wherein the first sentence The second sentence is the next sentence corresponding to the first sentence or not the next sentence corresponding to the first sentence. For specific content, refer to the content corresponding to step S301 disclosed in FIG. 3 of the foregoing embodiment of the present invention.
第一训练模块5032,用于将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据,训练所述第一BERT模型直至所述第一BERT模型收敛。The first training module 5032 is configured to use the first training data as the input of the first BERT model, and combine the preset first loss function and the sample data to train the first BERT model until the first BERT The model converges.
设置模块5033,用于将收敛后的所述第一BERT模型的模型参数作为第二BERT模型的初始化模型参数。The setting module 5033 is configured to use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model.
第二训练模块5034,用于将第二训练数据作为所述第二BERT模型的输入,结合预设的第二损失函数训练所述第二BERT模型直至所述第二BERT模型收敛,得到所述要素解析模型,其中,所述第二训练数据通过从所述样本数据中选取预设数量的法律文书进行要素标注获得。具体训练所述第二BERT模型的过程参见上述本发明实施例图3公开的步骤S304相对应的内容。The second training module 5034 is configured to use second training data as the input of the second BERT model, and train the second BERT model in combination with a preset second loss function until the second BERT model converges to obtain the The element analysis model, wherein the second training data is obtained by selecting a preset number of legal documents from the sample data to perform element labeling. For the specific process of training the second BERT model, refer to the content corresponding to step S304 disclosed in FIG. 3 of the above embodiment of the present invention.
在本发明实施例中,通过第一训练数据训练第一BERT模型直至收敛,将收敛的第一BERT模型的模型参数作为第二BERT模型的初始化模型参数,并通过第二训练数据训练第二BERT模型直至收敛,得到要素解析模型。利用要素解析模型对进行分句处理后的法律文书进行要素解析,得到法律文书中每一语句所包含的要素,根据提取出来的案件要素进行分析和法律判决等操作,不需要人工将案件中的要素逐一提取出来,从而能有效降低人力成本和时间成本,提供判决准确性和效率。In the embodiment of the present invention, the first BERT model is trained through the first training data until convergence, the model parameters of the converged first BERT model are used as the initialization model parameters of the second BERT model, and the second BERT is trained through the second training data The model converges, and the element analysis model is obtained. The element analysis model is used to analyze the elements of the legal document after the clause processing, to obtain the elements contained in each sentence in the legal document, and perform operations such as analysis and legal judgments based on the extracted case elements. The elements are extracted one by one, which can effectively reduce labor costs and time costs, and provide judgment accuracy and efficiency.
参考图7,示出了本发明实施例提供的一种法律文书要素解析系统的结构框图,所述第一训练模块5032包括:预测子模块50321、误差子模块50322和训练子模块50323。Referring to FIG. 7, there is shown a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention. The first training module 5032 includes: a prediction submodule 50321, an error submodule 50322, and a training submodule 50323.
预测子模块50321,用于将所述第一训练数据作为所述第一BERT模型的输入,得到对应文字替换位置的文字预测结果,以及得到对应句子拼接位置的句子预测结果。The prediction sub-module 50321 is configured to use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position and obtain the sentence prediction result corresponding to the sentence splicing position.
误差子模块50322,用于使用第一子损失函数计算所述文字替换位置的实际文字和所述文字预测结果之间的文字误差,以及使用第二子损失函数计算所述句子拼接位置的实际句子与所述句子预测结果之间的句子误差。The error sub-module 50322 is configured to use the first sub-loss function to calculate the text error between the actual text at the text replacement position and the text prediction result, and to use the second sub-loss function to calculate the actual sentence at the sentence splicing position The sentence error with the sentence prediction result.
训练子模块50323,用于基于所述文字误差和句子误差,结合所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛。训练所述第一BERT模型的过程参见上述本发明实施例图4公开的步骤S403相对应的内容。The training sub-module 50323 is configured to train the first BERT model based on the text error and sentence error in combination with the first training data until the first BERT model converges. For the process of training the first BERT model, refer to the content corresponding to step S403 disclosed in FIG. 4 of the foregoing embodiment of the present invention.
其中,所述实际文字和实际句子来源于所述样本数据。Wherein, the actual text and actual sentence are derived from the sample data.
在本发明实施例中,在获取要素解析模型之前,先基于第一子损失函数和第二子损失函数,通过第一训练数据训练第一BERT模型直至收敛,将收敛后的第一BERT模型的模型参数作为第二BERT模型的初始化模型参数,再基于训练数据训练第二BERT模型直至收敛获得要素解析模型,能提高要素解析的准确性。In the embodiment of the present invention, before obtaining the element analysis model, based on the first sub-loss function and the second sub-loss function, the first BERT model is trained through the first training data until convergence, and the converged first BERT model The model parameters are used as the initialization model parameters of the second BERT model, and then the second BERT model is trained based on the training data until convergence to obtain the element analysis model, which can improve the accuracy of element analysis.
优选的,结合图5,参考图8,示出了本发明实施例提供的一种法律文书要素解析系统的结构框图,所述法律文书要素解析系统还包括:Preferably, referring to FIG. 5 and FIG. 8, a structural block diagram of a legal document element analysis system provided by an embodiment of the present invention is shown, and the legal document element analysis system further includes:
合并单元504,用于合并每条所述待解析语句包含的要素。The merging unit 504 is used to merge the elements contained in each sentence to be parsed.
需要说明的是,对应法律文书要素解析结果有以下两种需要,一种是只需要获得法律文书中每一语句包含的要素,另一种是需要将法律文书中每一语句包含的要素合并,得到该法律文书的要素集合。It should be noted that there are two requirements for the analysis results of the corresponding elements of the legal document. One is to obtain only the elements contained in each sentence in the legal document, and the other is to merge the elements contained in each sentence in the legal document. Get the set of elements of the legal document.
在本发明实施例中,根据实际需求,可以合并每条所述待解析语句包含的要素,得到所述待解析的法律文书的要素集合,以满足不同的法律需求。In the embodiment of the present invention, according to actual needs, the elements contained in each sentence to be parsed can be combined to obtain a set of elements of the legal document to be parsed to meet different legal requirements.
基于上述本发明实施例公开的法律文书要素解析系统,上述各个模块可以通过一种由处理器和存储器构成的硬件设备实现。具体为:上述各个模块作为程序单元存储于存储器中,由处理器执行存储在存储器中的上述程序单元来实现法律文书要素解析。Based on the legal document element analysis system disclosed in the foregoing embodiment of the present invention, the foregoing various modules may be implemented by a hardware device composed of a processor and a memory. Specifically, each of the foregoing modules is stored in the memory as a program unit, and the processor executes the foregoing program unit stored in the memory to realize the analysis of legal document elements.
其中,处理器中包含内核,由内核去存储器中调取相应的程序单元。内核可以设置一个或以上,通过调整内核参数来实现法律文书要素解析。Among them, the processor contains a kernel, which calls the corresponding program unit from the memory. One or more kernels can be set, and the analysis of legal document elements can be realized by adjusting kernel parameters.
存储器可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM),存储器包括至少一个存储芯片。The memory may include non-permanent memory in computer-readable media, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one Memory chip.
进一步的,本发明实施例提供了一种处理器,所述处理器用于运行程序,其中,所述程序运行时执行所述法律文书要素解析方法。Further, an embodiment of the present invention provides a processor configured to run a program, wherein the legal document element analysis method is executed when the program is running.
进一步的,本发明实施例提供了一种法律文书要素解析设备,该设备包括处理器、存储器及存储在存储器上并可在处理器上运行的程序,处理器执行程序时实现以下步骤:获取待解析的法律文书;对所述法律文书进行分句处理,得到多条待解析语句;逐一将所述待解析语句输入预先建立的要素解析模型进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素,其中,所述要素解析模型由基于样本数据训练语言模型获得,所述语言模型用于根据预设数量的法律文本进行预训练确定所述要素解析模型的初始化模型参数。Further, an embodiment of the present invention provides a device for analyzing elements of a legal document. The device includes a processor, a memory, and a program stored in the memory and running on the processor. When the processor executes the program, the following steps are implemented: Analyzed legal documents; perform sentence processing on the legal documents to obtain multiple sentences to be parsed; input the sentences to be parsed into the pre-established element analysis model for element analysis, and obtain each of the sentences in the legal document Elements included in the sentence to be parsed, wherein the element analysis model is obtained by training a language model based on sample data, and the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model.
其中,当所述语言模型为BERT模型,所述由基于样本数据训练语言模型获得要素解析模型的过程包括:对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,其中,所述样本数据基于对公开的法律文书进行分句处理获得;将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据,训练所述第一BERT模型直至所述第一BERT模型收敛;将收敛后的所述第一BERT模型的模型参数作为第二BERT模型的初始化模型参数;将第二训练数据作为所述第二BERT模型的输入,结合预设的第二损失函数训练所述第二BERT模型直至所述第二BERT模型收敛,得到所述要素解析模型,其中,所述第二训练数据通过从所述样本数据中选取预设数量的法律文书进行要素标注获得。Wherein, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model based on sample data includes: performing text replacement and sentence splicing processing on the sample data to obtain the first training data, wherein the The sample data is obtained based on the sentence processing of a public legal document; the first training data is used as the input of the first BERT model, and the first BERT model is trained in combination with the preset first loss function and the sample data Until the first BERT model converges; use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; use the second training data as the input of the second BERT model in combination with presets The second loss function of training the second BERT model until the second BERT model converges to obtain the element analysis model, wherein the second training data selects a preset number of legal documents from the sample data Perform feature labeling.
其中,所述将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据训练所述第一BERT模型直至所述第一BERT模型收敛,包括:将所述第一训练数据作为所述第一BERT模型的输入,得到对应文字替换位置的文字预测结果,以及得到对应句子拼接位置的句子预测结果;使用第一子损失函数计算所述文字替换位置的实际文字和所述文字预测结果之间的文字误差,以及使用第二子损失函数计算所述句子拼接位置的实际句子与所述句子预测结果之间的句子误差;基于所述文字误差和句子误差,结合 所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛;其中,所述实际文字和实际句子来源于所述样本数据。Wherein, the step of using the first training data as the input of the first BERT model and combining the preset first loss function and the sample data to train the first BERT model until the first BERT model converges includes: Use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position, and obtain the sentence prediction result corresponding to the sentence splicing position; use the first sub-loss function to calculate the text replacement position The text error between the actual text and the text prediction result, and the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result using the second sub-loss function; based on the text error and the sentence Error, training the first BERT model in combination with the first training data until the first BERT model converges; wherein the actual text and actual sentence are derived from the sample data.
其中,所述对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,包括:随机将所述样本数据中的文字替换为预设字符,以及随机为所述样本数据中的第一语句拼接第二语句,其中,所述第二语句为所述第一语句对应的下一句或不是所述第一语句对应的下一句。Wherein, performing text replacement and sentence splicing processing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly selecting the first training data in the sample data. The sentence is spliced into a second sentence, wherein the second sentence is the next sentence corresponding to the first sentence or not the next sentence corresponding to the first sentence.
其中,所述逐一将所述待解析语句作为预先建立的要素解析模型的输入进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素之后,还包括:合并每条所述待解析语句包含的要素。Wherein, the step of performing element analysis using the sentences to be parsed as the input of a pre-established element analysis model, and obtaining the elements contained in each sentence to be parsed in the legal document, further includes: merging each sentence Elements contained in the sentence to be parsed.
进一步的,本发明实施例还提供了一种存储介质,其上存储有程序,该程序被处理器执行时实现法律文书要素解析。Further, an embodiment of the present invention also provides a storage medium on which a program is stored, and when the program is executed by a processor, the analysis of elements of a legal document is realized.
本申请还提供了一种计算机程序产品,当在数据处理设备上执行时,适于执行初始化有如下方法步骤的程序:获取待解析的法律文书;对所述法律文书进行分句处理,得到多条待解析语句;逐一将所述待解析语句输入预先建立的要素解析模型进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素,其中,所述要素解析模型由基于样本数据训练语言模型获得,所述语言模型用于根据预设数量的法律文本进行预训练确定所述要素解析模型的初始化模型参数。This application also provides a computer program product, which when executed on a data processing device, is suitable for executing a program that initializes the following method steps: obtaining a legal document to be parsed; performing sentence processing on the legal document to obtain more Sentence to be parsed; input the sentence to be parsed into a pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is based on the sample The data training language model is obtained, and the language model is used for pre-training according to a preset number of legal texts to determine the initialization model parameters of the element analysis model.
其中,当所述语言模型为BERT模型,所述由基于样本数据训练语言模型获得要素解析模型的过程包括:对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,其中,所述样本数据基于对公开的法律文书进行分句处理获得;将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据,训练所述第一BERT模型直至所述第一BERT模型收敛;将收敛后的所述第一BERT模型的模型参数作为第二BERT模型的初始化模型参数;将第二训练数据作为所述第二BERT模型的输入,结合预设的第二损失函数训练所述第二BERT模型直至所述第二BERT模型收敛,得到所述要素解析模型,其中,所述第二训练数据通过从所述样本数据中选取预设数量的法律文书进行要素标注获得。Wherein, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model based on sample data includes: performing text replacement and sentence splicing processing on the sample data to obtain the first training data, wherein the The sample data is obtained based on the sentence processing of a public legal document; the first training data is used as the input of the first BERT model, and the first BERT model is trained in combination with the preset first loss function and the sample data Until the first BERT model converges; use the converged model parameters of the first BERT model as the initialization model parameters of the second BERT model; use the second training data as the input of the second BERT model in combination with presets The second loss function of training the second BERT model until the second BERT model converges to obtain the element analysis model, wherein the second training data selects a preset number of legal documents from the sample data Perform feature labeling.
其中,所述将所述第一训练数据作为第一BERT模型的输入,结合预设的第一损失函数和所述样本数据训练所述第一BERT模型直至所述第一BERT模型收敛,包括:将所述第一训练数据作为所述第一BERT模型的输入,得到对应文字替换位置的文字预测结果,以及得到对应句子拼接位置的句子预测结果;使用第一子损失函数计算所述文字替换位置的实际文字和所述文字预测结果之间的文字误差,以及使用第二子损失函数计算所述句子拼接位置的实际句子与所述句子预测结果之间的句子误差;基于所述文字误差和句子误差,结合所述第一训练数据训练所述第一BERT模型直至所述第一BERT模型收敛;其中,所述实际文字和实际句子来源于所述样本数据。Wherein, the step of using the first training data as the input of the first BERT model and combining the preset first loss function and the sample data to train the first BERT model until the first BERT model converges includes: Use the first training data as the input of the first BERT model to obtain the text prediction result corresponding to the text replacement position, and obtain the sentence prediction result corresponding to the sentence splicing position; use the first sub-loss function to calculate the text replacement position The text error between the actual text and the text prediction result, and the sentence error between the actual sentence at the sentence splicing position and the sentence prediction result using the second sub-loss function; based on the text error and the sentence Error, training the first BERT model in combination with the first training data until the first BERT model converges; wherein the actual text and actual sentence are derived from the sample data.
其中,所述对所述样本数据进行文字替换以及句子拼接处理得到第一训练数据,包括:随机将所述样本数据中的文字替换为预设字符,以及随机为所述样本数据中的第一语句拼接第二语句,其中,所述第二语句为所述第一语句对应的下一句或不是所述第一语句对应的下一句。Wherein, performing text replacement and sentence splicing processing on the sample data to obtain the first training data includes: randomly replacing text in the sample data with preset characters, and randomly selecting the first training data in the sample data. The sentence is spliced into a second sentence, wherein the second sentence is the next sentence corresponding to the first sentence or not the next sentence corresponding to the first sentence.
其中,所述逐一将所述待解析语句作为预先建立的要素解析模型的输入进行要素解析,得到所述法律文书中每条所述待解析语句包含的要素之后,还包括:合并每条所述待解析语句包含的要素。Wherein, the step of performing element analysis using the sentences to be parsed as the input of a pre-established element analysis model, and obtaining the elements contained in each sentence to be parsed in the legal document, further includes: merging each sentence Elements contained in the sentence to be parsed.
综上所述,本发明实施例提供一种法律文书要素解析方法及系统,该方法为:获取待解析的法律文书。对法律文书进行分句处理,得到多条待解析语句。逐一将待解析语句输入预先建立的要素解析模型进行要素解析,得到法律文书中每条待解析语句包含的要素,其中,要素解析模型由基于样本数据训练语言模型获得。在本方案中,通过海量的法律文书预先训练语言模型得到要素解析模型,将需要解析的法律文书进行分句处理得到多条待解析语句,将每一条待解析语句作为要素解析模型的输入得到每条待解析语句中的要素,根据提取出来的案件要素进行法律判决。不需要人工将案件中的要素逐一提取出来,再根据人工提取的要素进行分析和法律判决等操作,从而节约人力成本和时间成本,提高判决的准确性和效率。In summary, the embodiments of the present invention provide a method and system for analyzing elements of a legal document. The method is to obtain a legal document to be analyzed. Perform sentence processing on legal documents and get multiple sentences to be parsed. One by one, the sentences to be analyzed are input into the pre-established element analysis model for element analysis, and the elements contained in each sentence to be analyzed in the legal document are obtained. The element analysis model is obtained by training the language model based on sample data. In this solution, the element analysis model is obtained by pre-training the language model of a large number of legal documents, and the legal documents that need to be analyzed are segmented to obtain multiple sentences to be analyzed. Each sentence to be analyzed is used as the input of the element analysis model to obtain each The elements in the sentence to be parsed shall be judged according to the extracted case elements. There is no need to manually extract the elements of the case one by one, and then perform analysis and legal judgments based on the manually extracted elements, thereby saving labor and time costs, and improving the accuracy and efficiency of judgments.
本领域内的技术人员应明白,本申请的实施例可提供为方法、装置、客户端、或计算机程序产品。因此,本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且,本申请可采用在一个或多 个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。Those skilled in the art should understand that the embodiments of the present application can be provided as methods, devices, clients, or computer program products. Therefore, the present application may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware. Moreover, this application may adopt the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program codes.
本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器,使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。This application is described with reference to flowcharts and/or block diagrams of methods, equipment (systems), and computer program products according to the embodiments of this application. It should be understood that each process and/or block in the flowchart and/or block diagram, and the combination of processes and/or blocks in the flowchart and/or block diagram can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to generate a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment are generated It is a device that realizes the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中,使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品,该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions can also be stored in a computer-readable memory that can guide a computer or other programmable data processing equipment to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including the instruction device. The device implements the functions specified in one process or multiple processes in the flowchart and/or one block or multiple blocks in the block diagram.
这些计算机程序指令也可装载到计算机或其他可编程数据处理设备上,使得在计算机或其他可编程设备上执行一系列操作步骤以产生计算机实现的处理,从而在计算机或其他可编程设备上执行的指令提供用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的步骤。These computer program instructions can also be loaded on a computer or other programmable data processing equipment, so that a series of operation steps are executed on the computer or other programmable equipment to produce computer-implemented processing, so as to execute on the computer or other programmable equipment. The instructions provide steps for implementing functions specified in a flow or multiple flows in the flowchart and/or a block or multiple blocks in the block diagram.
在一个典型的配置中,计算设备包括一个或多个处理器(CPU)、输入/输出接口、网络接口和内存。In a typical configuration, the computing device includes one or more processors (CPU), input/output interfaces, network interfaces, and memory.
存储器可能包括计算机可读介质中的非永久性存储器,随机存取存储器(RAM)和/或非易失性内存等形式,如只读存储器(ROM)或闪存(flash RAM)。存储器是计算机可读介质的示例。The memory may include non-permanent memory in a computer readable medium, random access memory (RAM) and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
计算机可读介质包括永久性和非永久性、可移动和非可移动媒体可以由任何方法或技术来实现信息存储。信息可以是计算机可读指令、数据结构、程序的模块或其他数据。计算机的存储介质的例子包括,但不限于相变内存(PRAM)、静态随机存取存储器(SRAM)、动态随机存取存储器(DRAM)、其他类型的随机存取存储器(RAM)、只读存储器(ROM)、电可擦除可编程只读存储器(EEPROM)、快闪记忆体或其他内存技术、只读光盘只读存储器(CD-ROM)、数字多功能光盘(DVD)或其他光学存储、磁盒式磁带,磁带磁 磁盘存储或其他磁性存储设备或任何其他非传输介质,可用于存储可以被计算设备访问的信息。按照本文中的界定,计算机可读介质不包括暂存电脑可读媒体(transitory media),如调制的数据信号和载波。Computer-readable media include permanent and non-permanent, removable and non-removable media, and information storage can be realized by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, CD-ROM, digital versatile disc (DVD) or other optical storage, Magnetic cassettes, magnetic tape magnetic disk storage or other magnetic storage devices or any other non-transmission media can be used to store information that can be accessed by computing devices. According to the definition in this article, computer-readable media does not include transitory media, such as modulated data signals and carrier waves.
The embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to one another, and each embodiment focuses on its differences from the other embodiments. In particular, since the system embodiments are substantially similar to the method embodiments, their description is relatively brief, and reference may be made to the corresponding parts of the method embodiments. The systems and system embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules may be selected according to actual needs to achieve the objectives of the solutions of the embodiments. Those of ordinary skill in the art can understand and implement them without creative effort.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented by electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of their functions. Whether these functions are performed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may implement the described functions differently for each particular application, but such implementations should not be considered beyond the scope of the present invention.
The above description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

  1. A method for analyzing elements of a legal document, characterized in that the method comprises:
    obtaining a legal document to be parsed;
    performing sentence splitting on the legal document to obtain a plurality of sentences to be parsed;
    inputting the sentences to be parsed one by one into a pre-established element analysis model for element analysis, to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is obtained by training a language model based on sample data, and the language model is used to determine initialization model parameters of the element analysis model through pre-training on a preset number of legal texts.
  2. The method according to claim 1, characterized in that, when the language model is a BERT model, the process of obtaining the element analysis model by training the language model based on sample data comprises:
    performing text replacement and sentence splicing on the sample data to obtain first training data, wherein the sample data is obtained by performing sentence splitting on published legal documents;
    taking the first training data as an input of a first BERT model, and training the first BERT model in combination with a preset first loss function and the sample data until the first BERT model converges;
    taking model parameters of the converged first BERT model as initialization model parameters of a second BERT model;
    taking second training data as an input of the second BERT model, and training the second BERT model in combination with a preset second loss function until the second BERT model converges, to obtain the element analysis model, wherein the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling them with elements.
  3. The method according to claim 2, characterized in that taking the first training data as the input of the first BERT model and training the first BERT model in combination with the preset first loss function and the sample data until the first BERT model converges comprises:
    taking the first training data as the input of the first BERT model, to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
    calculating, using a first sub-loss function, a text error between the actual text at the text replacement position and the text prediction result, and calculating, using a second sub-loss function, a sentence error between the actual sentence at the sentence splicing position and the sentence prediction result;
    training the first BERT model based on the text error and the sentence error, in combination with the first training data, until the first BERT model converges;
    wherein the actual text and the actual sentence are derived from the sample data.
  4. The method according to claim 2, characterized in that performing text replacement and sentence splicing on the sample data to obtain the first training data comprises:
    randomly replacing text in the sample data with a preset character, and randomly splicing a second sentence to a first sentence in the sample data, wherein the second sentence either is or is not the next sentence corresponding to the first sentence.
  5. The method according to claim 1, characterized in that, after inputting the sentences to be parsed one by one into the pre-established element analysis model for element analysis to obtain the elements contained in each sentence to be parsed in the legal document, the method further comprises:
    merging the elements contained in each sentence to be parsed.
  6. A system for analyzing elements of a legal document, characterized in that the system comprises:
    an obtaining unit, configured to obtain a legal document to be parsed;
    a processing unit, configured to perform sentence splitting on the legal document to obtain a plurality of sentences to be parsed;
    a prediction unit, configured to input the sentences to be parsed one by one into a pre-established element analysis model for element analysis, to obtain the elements contained in each sentence to be parsed in the legal document, wherein the element analysis model is obtained by training a language model based on sample data, and the language model is used to determine initialization model parameters of the element analysis model through pre-training on a preset number of legal texts.
  7. The system according to claim 6, characterized in that, when the language model is a BERT model, the prediction unit comprises:
    a processing module, configured to perform text replacement and sentence splicing on the sample data to obtain first training data, wherein the sample data is obtained by performing sentence splitting on published legal documents;
    a first training module, configured to take the first training data as an input of a first BERT model and train the first BERT model in combination with a preset first loss function and the sample data until the first BERT model converges;
    a setting module, configured to take model parameters of the converged first BERT model as initialization model parameters of a second BERT model;
    a second training module, configured to take second training data as an input of the second BERT model and train the second BERT model in combination with a preset second loss function until the second BERT model converges, to obtain the element analysis model, wherein the second training data is obtained by selecting a preset number of legal documents from the sample data and labeling them with elements.
  8. The system according to claim 7, characterized in that the first training module comprises:
    a prediction submodule, configured to take the first training data as the input of the first BERT model, to obtain a text prediction result corresponding to a text replacement position and a sentence prediction result corresponding to a sentence splicing position;
    an error submodule, configured to calculate, using a first sub-loss function, a text error between the actual text at the text replacement position and the text prediction result, and to calculate, using a second sub-loss function, a sentence error between the actual sentence at the sentence splicing position and the sentence prediction result;
    a training submodule, configured to train the first BERT model based on the text error and the sentence error, in combination with the first training data, until the first BERT model converges;
    wherein the actual text and the actual sentence are derived from the sample data.
  9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program runs, a device on which the storage medium is located is controlled to execute the method for analyzing elements of a legal document according to any one of claims 1 to 5.
  10. A device for analyzing elements of a legal document, characterized by comprising a storage medium and a processor, wherein the storage medium stores a program and the processor is configured to run the program, and wherein, when the program runs, the method for analyzing elements of a legal document according to any one of claims 1 to 5 is executed.
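
The following sketches are illustrative only and are not part of the claims. This first one is a minimal Python sketch of the first-training-data construction described in claim 4: characters in the segmented sample sentences are randomly replaced with a preset mask character, and each first sentence is randomly spliced with either its true next sentence or an unrelated one. The mask token, masking ratio, and function names are assumptions; the application does not fix them.

```python
import random

MASK_TOKEN = "[MASK]"   # the "preset character" of claim 4; the exact token is an assumption
MASK_PROB = 0.15        # masking ratio is an assumption, not specified in the claims

def build_first_training_data(sentences):
    """Build (masked_pair, char_labels, is_next) examples from segmented sample sentences."""
    examples = []
    for i, first in enumerate(sentences[:-1]):
        # Randomly splice either the true next sentence or a random one (claim 4).
        if random.random() < 0.5:
            second, is_next = sentences[i + 1], 1
        else:
            second, is_next = random.choice(sentences), 0

        # Randomly replace characters with the preset mask character, keeping the
        # original characters as labels for the text-prediction task of claim 3.
        masked, labels = [], []
        for ch in first + second:
            if random.random() < MASK_PROB:
                masked.append(MASK_TOKEN)
                labels.append(ch)
            else:
                masked.append(ch)
                labels.append(None)   # position not predicted
        examples.append((masked, labels, is_next))
    return examples
```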
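Next, a hedged sketch of the first-stage training of claims 2 and 3, assuming the HuggingFace transformers library and the public bert-base-chinese checkpoint (neither is named in the application). BertForPreTraining carries a masked-text head and a next-sentence head, so its two cross-entropy terms stand in for the first and second sub-loss functions; after convergence the parameters are saved and reused to initialize the second BERT model.

```python
import torch
from transformers import BertTokenizer, BertForPreTraining, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")     # assumed checkpoint
model = BertForPreTraining.from_pretrained("bert-base-chinese")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)         # learning rate is an assumption

def first_stage_step(first_sentence, second_sentence, masked_char_labels, is_next):
    """One update of the first BERT model on a spliced, partially masked sentence pair.

    masked_char_labels must be token ids aligned with the tokenized pair, with -100
    at positions that were not replaced so they are ignored by the loss.
    """
    enc = tokenizer(first_sentence, second_sentence, return_tensors="pt",
                    truncation=True, max_length=512)
    out = model(**enc,
                labels=masked_char_labels,                         # first sub-loss: text error
                next_sentence_label=torch.tensor([is_next]))       # second sub-loss: sentence error
    out.loss.backward()   # out.loss is the sum of the two sub-losses
    optimizer.step()
    optimizer.zero_grad()
    return out.loss.item()

# After convergence, the first model's parameters initialize the second BERT model,
# which is then fine-tuned on element-labeled sentences (claim 2).
model.save_pretrained("legal-bert-pretrained")
element_model = BertForSequenceClassification.from_pretrained(
    "legal-bert-pretrained", num_labels=5)   # number of element types is an assumption
```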
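Finally, a sketch of the inference path of claims 1, 5, and 6: split the document into sentences, classify each sentence with the fine-tuned element analysis model, and merge the elements across sentences. The sentence-splitting rule, the model path, and the element label names are all assumptions made for illustration.

```python
import re
import torch
from transformers import BertTokenizer, BertForSequenceClassification

# Hypothetical element types; the application does not enumerate them.
ELEMENT_LABELS = ["claim", "defense", "court_finding", "judgment", "other"]

tokenizer = BertTokenizer.from_pretrained("legal-bert-element-model")   # assumed fine-tuned model path
model = BertForSequenceClassification.from_pretrained(
    "legal-bert-element-model", num_labels=len(ELEMENT_LABELS))
model.eval()

def split_sentences(document: str):
    """Split on common Chinese sentence-ending punctuation (an assumed splitting rule)."""
    return [s.strip() for s in re.split(r"(?<=[。！？；])", document) if s.strip()]

def parse_elements(document: str):
    """Return the document's elements, merged across its sentences (claims 1 and 5)."""
    elements = {}
    for sentence in split_sentences(document):
        enc = tokenizer(sentence, return_tensors="pt", truncation=True, max_length=512)
        with torch.no_grad():
            logits = model(**enc).logits
        label = ELEMENT_LABELS[int(logits.argmax(dim=-1))]
        elements.setdefault(label, []).append(sentence)
    return elements
```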
PCT/CN2019/126935 2019-07-30 2019-12-20 Method and system for parsing elements of legal document WO2021017383A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910695870.8 2019-07-30
CN201910695870.8A CN112329436A (en) 2019-07-30 2019-07-30 Legal document element analysis method and system

Publications (1)

Publication Number Publication Date
WO2021017383A1 true WO2021017383A1 (en) 2021-02-04

Family

ID=74229390

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/126935 WO2021017383A1 (en) 2019-07-30 2019-12-20 Method and system for parsing elements of legal document

Country Status (2)

Country Link
CN (1) CN112329436A (en)
WO (1) WO2021017383A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100095166A1 (en) * 2008-10-10 2010-04-15 Lecroy Corporation Protocol Aware Error Ratio Tester
CN108197163A (en) * 2017-12-14 2018-06-22 上海银江智慧智能化技术有限公司 A kind of structuring processing method based on judgement document
CN109992782A (en) * 2019-04-02 2019-07-09 深圳市华云中盛科技有限公司 Legal documents name entity recognition method, device and computer equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108241621B (en) * 2016-12-23 2019-12-10 北京国双科技有限公司 legal knowledge retrieval method and device
CN109447105A (en) * 2018-09-10 2019-03-08 平安科技(深圳)有限公司 Contract audit method, apparatus, computer equipment and storage medium
CN109815331A (en) * 2019-01-07 2019-05-28 平安科技(深圳)有限公司 Construction method, device and the computer equipment of text emotion disaggregated model
CN109766537A (en) * 2019-01-16 2019-05-17 北京未名复众科技有限公司 Study abroad document methodology of composition, device and electronic equipment


Also Published As

Publication number Publication date
CN112329436A (en) 2021-02-05

Similar Documents

Publication Publication Date Title
CN111931517B (en) Text translation method, device, electronic equipment and storage medium
WO2022121251A1 (en) Method and apparatus for training text processing model, computer device and storage medium
CN111737991B (en) Text sentence breaking position identification method and system, electronic equipment and storage medium
CN112270546A (en) Risk prediction method and device based on stacking algorithm and electronic equipment
CN110428823A (en) Speech understanding device and the speech understanding method for using the device
WO2021143206A1 (en) Single-statement natural language processing method and apparatus, computer device, and readable storage medium
CN112992125B (en) Voice recognition method and device, electronic equipment and readable storage medium
CN109389418A (en) Electric service client's demand recognition methods based on LDA model
CN114118065A (en) Chinese text error correction method and device in electric power field, storage medium and computing equipment
KR102409667B1 (en) Method of building training data of machine translation
CN113626608B (en) Semantic-enhancement relationship extraction method and device, computer equipment and storage medium
CN112951233A (en) Voice question and answer method and device, electronic equipment and readable storage medium
CN112434514B (en) Multi-granularity multi-channel neural network based semantic matching method and device and computer equipment
CN113918710A (en) Text data processing method and device, electronic equipment and readable storage medium
WO2021017383A1 (en) Method and system for parsing elements of legal document
CN112818688B (en) Text processing method, device, equipment and storage medium
CN113257230B (en) Voice processing method and device and computer storage medium
CN115357684A (en) Method and device for determining loss parameters of dialogue generation model
WO2020162240A1 (en) Language model score calculation device, language model creation device, methods therefor, program, and recording medium
CN114429121A (en) Method for extracting emotion and reason sentence pairs of test corpus
CN113849634A (en) Method for improving interpretability of depth model recommendation scheme
CN113283218A (en) Semantic text compression method and computer equipment
CN116882398B (en) Implicit chapter relation recognition method and system based on phrase interaction
US11664010B2 (en) Natural language domain corpus data set creation based on enhanced root utterances
Mitra et al. ICM: Intent and Conversational Mining from Conversation Logs

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19940072

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19940072

Country of ref document: EP

Kind code of ref document: A1