CN111581332A - Similar judicial case matching method and system based on triple deep hash learning - Google Patents


Info

Publication number
CN111581332A
Authority
CN
China
Prior art keywords
judicial case
document
matched
documents
hash
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010354059.6A
Other languages
Chinese (zh)
Inventor
尹义龙
聂秀山
刘兴波
崔超然
韩晓晖
马玉玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202010354059.6A priority Critical patent/CN111581332A/en
Publication of CN111581332A publication Critical patent/CN111581332A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/325 Hash tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The disclosure provides a similar judicial case matching method and system based on triple deep hash learning. The method comprises: obtaining a judicial case document to be matched; inputting the judicial case document to be matched into a pre-trained feature extraction model to obtain a feature representation vector of the judicial case document to be matched; inputting the feature representation vector of the judicial case document to be matched into a pre-trained triple deep hash learning model to obtain a hash code of the judicial case document to be matched; and calculating the similarity of judicial case documents based on the hash code of the judicial case document to be matched and the hash codes of known judicial case documents. Accurate similarity matching of judicial case documents is thereby achieved.

Description

Similar judicial case matching method and system based on triple deep hash learning
Technical Field
The disclosure relates to the technical field of natural language processing and big data retrieval, in particular to a similar judicial case matching method and system based on triple deep hash learning.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of society, the number of legal cases of all kinds is growing rapidly, and similar case matching technology has therefore attracted wide attention. To pursue accuracy, existing methods generally convert case documents into real-valued representations, measure similarity by computing distances between those representations, and judge the degree of matching accordingly. In implementing the present disclosure, the inventors found that this approach is not suitable for large-scale similar case matching scenarios.
Disclosure of Invention
In order to overcome the deficiencies of the prior art, the present disclosure provides a similar judicial case matching method and system based on triple deep hash learning.
In a first aspect, the present disclosure provides a similar judicial case matching method based on triple deep hash learning.
The similar judicial case matching method based on triple deep hash learning comprises:
acquiring a judicial case document to be matched;
inputting the judicial case document to be matched into a pre-trained feature extraction model to obtain a feature representation vector of the judicial case document to be matched;
inputting the feature representation vector of the judicial case document to be matched into a pre-trained triple deep hash learning model to obtain a hash code of the judicial case document to be matched; and
calculating the similarity of judicial case documents based on the hash code of the judicial case document to be matched and the hash codes of known judicial case documents.
In a second aspect, the present disclosure provides a similar judicial case matching system based on triple deep hash learning.
The similar judicial case matching system based on triple deep hash learning comprises:
an acquisition module configured to acquire a judicial case document to be matched;
a feature extraction module configured to input the judicial case document to be matched into a pre-trained feature extraction model to obtain a feature representation vector of the judicial case document to be matched;
a hash code extraction module configured to input the feature representation vector of the judicial case document to be matched into a pre-trained triple deep hash learning model to obtain a hash code of the judicial case document to be matched; and
a similarity matching module configured to calculate the similarity of judicial case documents based on the hash code of the judicial case document to be matched and the hash codes of known judicial case documents.
In a third aspect, the present disclosure also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs are stored in the memory, and when the electronic device is running, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first aspect.
In a fourth aspect, the present disclosure also provides a computer-readable storage medium for storing computer instructions which, when executed by a processor, perform the method of the first aspect.
In a fifth aspect, the present disclosure also provides a computer program (product) comprising a computer program for implementing the method of any one of the preceding first aspects when run on one or more processors.
Compared with the prior art, the beneficial effects of the present disclosure are:
inputting the judicial case document to be matched into a pre-trained feature extraction model yields a feature representation vector of the judicial case document to be matched, which facilitates accurate similarity matching of judicial case documents; and
inputting the feature representation vector of the judicial case document to be matched into the pre-trained triple deep hash learning model yields the hash code of the judicial case document to be matched, thereby realizing accurate similarity matching of judicial case documents.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a flow chart of the method of the first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and "comprising", and any variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The embodiments and features of the embodiments in the present disclosure may be combined with each other without conflict.
Example one
The embodiment provides a similar judicial case matching method based on triple deep hash learning;
as shown in fig. 1, the similar judicial case matching method based on triple deep hash learning includes:
S101: acquiring a judicial case document to be matched;
S102: inputting the judicial case document to be matched into a pre-trained feature extraction model to obtain a feature representation vector of the judicial case document to be matched;
S103: inputting the feature representation vector of the judicial case document to be matched into a pre-trained triple deep hash learning model to obtain a hash code of the judicial case document to be matched;
S104: calculating the similarity of judicial case documents based on the hash code of the judicial case document to be matched and the hash codes of known judicial case documents.
As one or more embodiments, in S101, the judicial case document to be matched is obtained through the following specific steps:
acquiring the judicial case document to be matched;
deleting characters that have no practical significance from the judicial case document to be matched; and
grouping the processed judicial case document into groups of N Chinese characters each, where N is a positive integer.
It should be understood that deleting characters that have no practical significance from the judicial case document to be matched comprises:
removing characters such as digits, punctuation marks, and function words without practical meaning by means of text preprocessing.
It should be understood that grouping the processed judicial case document into groups of N Chinese characters each comprises:
dividing the document into groups of 1024 Chinese characters.
Further, a document fragment with fewer than 1024 Chinese characters is padded with the digit 0.
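The preprocessing steps above (stripping characters with no practical significance, grouping into 1024-character segments, zero-padding the final segment) can be sketched as follows. This is a minimal illustration; the regular expression used to decide which characters count as "having no practical significance" is an assumption, since the disclosure does not give an exact character set:

```python
import re

GROUP_SIZE = 1024  # number of Chinese characters per group, as in the disclosure


def preprocess(document: str, group_size: int = GROUP_SIZE) -> list:
    """Strip non-substantive characters and split the document into
    fixed-size groups, padding the final group with the digit '0'."""
    # Keep only CJK unified ideographs; digits, punctuation and other
    # symbols are treated here as having no practical significance.
    cleaned = "".join(re.findall(r"[\u4e00-\u9fff]", document))
    groups = [cleaned[i:i + group_size]
              for i in range(0, len(cleaned), group_size)]
    if groups and len(groups[-1]) < group_size:
        groups[-1] = groups[-1].ljust(group_size, "0")  # pad with digit 0
    return groups
```

A 10-character document yields one group padded to 1024 characters; an empty document yields no groups.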
As one or more embodiments, in S102, the judicial case document to be matched is input into the pre-trained feature extraction model to obtain the feature representation vector of the judicial case document to be matched through the following specific steps:
inputting each group of Chinese characters into the pre-trained feature extraction model to obtain a vector representation, and repeating this step to obtain the vector representation corresponding to each group of Chinese characters; and
concatenating all vector representations to obtain the feature representation vector of the judicial case document to be matched.
Illustratively, the feature extraction model may be the natural language processing model BERT.
The natural language processing model BERT is used so that the number of characters in the document is reduced as much as possible while the document semantics are preserved, thereby reducing the dimensionality of the compressed document feature representation.
Specifically, the training set used to train the feature extraction model consists of multiple groups of Chinese characters with known vector representations.
It should be understood that inputting each group of Chinese characters into the pre-trained feature extraction model yields a 768-dimensional vector representation for each group.
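The group-wise encoding and concatenation of S102 can be sketched as below. A real implementation would run each group through a pre-trained BERT encoder (for example via the `transformers` library) to obtain its 768-dimensional vector; here a deterministic stub stands in for BERT so the concatenation logic is self-contained. The stub function, its seeding trick, and the function names are illustrative assumptions, not the disclosure's actual model:

```python
import numpy as np

BERT_DIM = 768  # dimensionality of each group's vector, as stated in the disclosure


def encode_group_stub(group: str) -> np.ndarray:
    """Placeholder for a BERT forward pass: returns a deterministic
    768-dimensional vector derived from the group's characters."""
    rng = np.random.default_rng(abs(hash(group)) % (2 ** 32))
    return rng.standard_normal(BERT_DIM)


def document_feature(groups: list) -> np.ndarray:
    """Encode each group and concatenate the per-group vectors into a
    single feature representation vector for the whole document."""
    vectors = [encode_group_stub(g) for g in groups]
    return np.concatenate(vectors) if vectors else np.zeros(0)
```

A document split into three groups thus yields a 3 × 768 = 2304-dimensional feature representation vector.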
As one or more embodiments, in S103, the triple deep hash learning model specifically includes: a deep neural network, the loss function of which is a triplet loss function.
As one or more embodiments, in S103, the training process of the pre-trained triple deep hash learning model includes:
constructing a hash learning model and constructing a training set; and
inputting the training set into the hash learning model for training, and stopping training when the triplet loss function reaches its minimum, thereby obtaining the pre-trained triple deep hash learning model.
Further, the training set is a plurality of document triplets; each document triplet comprises a known feature representation vector of each document in the three documents and a known hash code of each document in the three documents;
assuming that a document triplet is represented as (d, d1, d2), d represents a feature representation vector of a first document, d1 represents a feature representation vector of a second document, and d2 represents a feature representation vector of a third document, for the training set, the similarity between the feature representation vector of the first document and the feature representation vector of the second document is greater than the similarity between the feature representation vector of the first document and the feature representation vector of the third document.
Further, the loss function is:
(The loss formula is rendered as an image, Figure BDA0002472866590000061, in the original document.)
wherein F is the deep neural network; I, I+ and I- are the feature representation vectors extracted by the BERT model for the first, second and third documents, respectively; K is the hash code length; F(I), F(I+) and F(I-) are the hash codes of the first, second and third documents, respectively; and Ltriplet(F(I), F(I+), F(I-)) denotes the loss function.
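The exact formula appears only as an image in the original, but the surrounding description (hash codes F(I), F(I+), F(I-), with the first and second documents more similar than the first and third) matches a standard margin-based triplet loss. A minimal sketch under that assumption, with an arbitrarily chosen margin value:

```python
import numpy as np


def triplet_loss(f_i, f_ip, f_in, margin: float = 1.0) -> float:
    """Standard margin-based triplet loss: pull F(I) toward F(I+) and
    push it away from F(I-). The margin value is an assumption."""
    d_pos = np.sum((f_i - f_ip) ** 2)  # squared distance to the similar document
    d_neg = np.sum((f_i - f_in) ** 2)  # squared distance to the dissimilar document
    return float(max(0.0, d_pos - d_neg + margin))
```

When the positive pair is already much closer than the negative pair (by more than the margin), the loss is zero and the triplet contributes no gradient.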
The loss function is established based on triplet document similarity; that is, it is designed according to the consistency between the similarities among the three documents and the similarities among their corresponding hash codes, so that the final document hash codes preserve the similarity relations among the original documents.
Given the feature representations I, I+ and I- of three documents, the similarity between document I and document I+ is greater than the similarity between document I and document I-.
Using the document feature representations extracted by BERT in the previous step as input, a deep neural network is used to learn a nonlinear mapping from the document feature representation to Hamming space, which is used to generate the hash representation of an unknown document.
In this disclosure, to reduce training overhead and model complexity, F is defined as a deep neural network with two hidden layers. The first hidden layer uses the ReLU activation function to mitigate vanishing and exploding gradients, and the second hidden layer uses the sigmoid activation function to map outputs into the interval (0, 1), so that 0.5 can be used as the threshold for converting real values into binary codes (hash codes). The neural network is trained with stochastic gradient descent, with the learning rate set to 0.001 and the number of iterations set to 120 epochs.
The feature representation of a document is input into the pre-trained hash learning deep neural network to obtain a real-valued representation of the document. This real-valued representation is then binarized with a threshold of 0.5: values greater than 0.5 become 1 and values less than 0.5 become 0. Finally, the document is converted into a hash code of length K.
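The forward pass described above (two hidden layers, ReLU then sigmoid, binarization at 0.5) can be sketched as follows. The layer widths and the random weights are placeholders for illustration; in the disclosure the weights would be learned by stochastic gradient descent (learning rate 0.001, 120 epochs):

```python
import numpy as np


def relu(x):
    return np.maximum(0.0, x)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def hash_forward(x: np.ndarray, w1: np.ndarray, w2: np.ndarray) -> np.ndarray:
    """Forward pass of the two-hidden-layer hash network F: ReLU in the
    first hidden layer, sigmoid in the second to map outputs into (0, 1),
    then thresholding at 0.5 to produce a K-bit hash code."""
    h1 = relu(x @ w1)              # first hidden layer
    real_code = sigmoid(h1 @ w2)   # real-valued representation, length K
    return (real_code > 0.5).astype(np.uint8)  # binarize at 0.5


# Illustrative shapes: feature dim 8, hidden width 16, hash length K = 4.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((8, 16))
w2 = rng.standard_normal((16, 4))
code = hash_forward(rng.standard_normal(8), w1, w2)
```

The output is a binary vector of length K; every entry is 0 or 1 by construction of the threshold step.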
As one or more embodiments, in S104, the similarity of judicial case documents is calculated based on the hash code of the judicial case document to be matched and the hash codes of known judicial case documents, comprising: calculating the similarity of the judicial case documents according to the Hamming distance between the hash code of the judicial case document to be matched and the hash codes of the known judicial case documents.
Specifically, when the Hamming distance is less than a set threshold, the similarity of the judicial case documents is high; otherwise, the similarity of the judicial case documents is low.
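The Hamming-distance matching of S104 can be sketched directly on bit vectors; the threshold value below is an assumption, since the disclosure does not specify one:

```python
import numpy as np


def hamming_distance(code_a: np.ndarray, code_b: np.ndarray) -> int:
    """Hamming distance between two hash codes: the number of bit
    positions where they differ (equivalent to XOR plus popcount)."""
    return int(np.count_nonzero(code_a != code_b))


def is_similar(code_a, code_b, threshold: int = 8) -> bool:
    """Two judicial case documents are judged similar when the Hamming
    distance between their hash codes is below the set threshold."""
    return hamming_distance(code_a, code_b) < threshold
```

On packed integer codes the same distance can be computed with a hardware XOR and popcount, which is the source of the efficiency advantage discussed below.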
Most existing case matching algorithms compare case similarity using real-valued representations, which is unsuitable for large-scale matching. Hash methods can convert multimedia data such as documents, images and videos into compact binary codes while preserving the similarity relations among the original data. The distance between binary codes (also called hash codes) is measured by the Hamming distance, which can be computed quickly with hardware XOR operations. Hash methods therefore offer great advantages in both storage and efficiency.
By adopting hash learning, the present disclosure greatly reduces the storage overhead of document representations and improves the matching speed of similar cases, making the method suitable for large-scale similar case matching scenarios.
Table 1 shows a simulation experiment of the disclosed method, measured by matching accuracy. The dataset used for this task consists of legal documents from the public "network of official documents", where each sample consists of three legal documents. For each legal document, only the description of the facts is provided.
Each sample is represented by (d, d1, d2), where d, d1 and d2 each correspond to a document. For the training data, it is guaranteed that document d is more similar to d1 than to d2, i.e., sim(d, d1) > sim(d, d2). The dataset comprises five thousand document triplets in total, all of which concern private lending cases. 4500 document triplets were used as the training set and 500 as the test set.
Compared with the prior art, the method and the device have the advantages that Hash learning is adopted, so that the storage cost of document feature representation is greatly reduced, and the matching speed of similar cases is improved.
Table 1 Comparison of the accuracy of the present disclosure with other algorithms
(Table 1 is rendered as an image, Figure BDA0002472866590000081, in the original document.)
Example two
The embodiment provides a similar judicial case matching system based on triple deep hash learning;
the similar judicial case matching system based on triple deep hash learning comprises the following steps:
an acquisition module configured to: acquiring a judicial case document to be matched;
a feature extraction module configured to: inputting the judicial case document to be matched into a pre-trained feature extraction model to obtain a feature expression vector of the judicial case document to be matched;
a hash code extraction module configured to: simultaneously inputting the feature expression vectors of the judicial case documents to be matched into a pre-trained triple deep Hash learning model to obtain Hash codes of the judicial case documents to be matched;
a similarity matching module configured to: and calculating the similarity of the judicial case documents based on the hash codes of the judicial case documents to be matched and the hash codes of the known judicial case documents.
It should be noted here that the acquisition module, the feature extraction module, the hash code extraction module and the similarity matching module correspond to steps S101 to S104 in the first embodiment; the modules share the same implementation examples and application scenarios as the corresponding steps, but are not limited to the contents disclosed in the first embodiment. It should also be noted that the modules described above, as part of a system, may be implemented in a computer system as a set of computer-executable instructions.
In the foregoing embodiments, the descriptions of the embodiments have different emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The proposed system can be implemented in other ways. For example, the above-described system embodiments are merely illustrative, and for example, the division of the above-described modules is merely a logical functional division, and in actual implementation, there may be other divisions, for example, multiple modules may be combined or integrated into another system, or some features may be omitted, or not executed.
EXAMPLE III
The present embodiment also provides an electronic device, including: one or more processors, one or more memories, and one or more computer programs; wherein, a processor is connected with the memory, the one or more computer programs are stored in the memory, and when the electronic device runs, the processor executes the one or more computer programs stored in the memory, so as to make the electronic device execute the method according to the first embodiment.
It should be understood that in this embodiment, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The memory may include both read-only memory and random access memory, and may provide instructions and data to the processor, and a portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
In implementation, the steps of the above method may be performed by integrated logic circuits of hardware in a processor or instructions in the form of software.
The method in the first embodiment may be implemented directly by a hardware processor, or by a combination of hardware and software modules in the processor. The software modules may be located in RAM, flash memory, ROM, PROM, EPROM, registers, or other storage media well known in the art. The storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the method in combination with its hardware. To avoid repetition, details are not described here.
Those of ordinary skill in the art will appreciate that the various illustrative elements, i.e., algorithm steps, described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Example four
The present embodiments also provide a computer-readable storage medium for storing computer instructions, which when executed by a processor, perform the method of the first embodiment.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.

Claims (10)

1. A similar judicial case matching method based on triple deep hash learning, characterized by comprising:
acquiring a judicial case document to be matched;
inputting the judicial case document to be matched into a pre-trained feature extraction model to obtain a feature representation vector of the judicial case document to be matched;
inputting the feature representation vector of the judicial case document to be matched into a pre-trained triple deep hash learning model to obtain a hash code of the judicial case document to be matched; and
calculating the similarity of judicial case documents based on the hash code of the judicial case document to be matched and the hash codes of known judicial case documents.
2. The method as claimed in claim 1, characterized by obtaining the judicial case documents to be matched; the method comprises the following specific steps:
acquiring a judicial case document to be matched;
deleting characters which have no practical significance to the judicial case documents to be matched;
grouping the processed judicial case documents into a group according to N Chinese characters; n is a positive integer.
3. The method as claimed in claim 1, wherein the judicial case documents to be matched are input into the pre-trained feature extraction model to obtain the feature expression vectors of the judicial case documents to be matched; the method comprises the following specific steps:
inputting each group of grouped Chinese characters into a pre-trained feature extraction model to obtain vector representation; repeating the current step to obtain the vector representation corresponding to each group of Chinese characters;
and splicing all vector representations to obtain the feature representation vectors of the judicial case documents to be matched.
4. The method of claim 1, wherein the pre-trained triple deep hash learning model is trained by a process comprising:
constructing a Hash learning model; constructing a training set;
and inputting the training set into a Hash learning model for training, and stopping training when the triple loss function reaches the minimum value to obtain a pre-trained triple deep Hash learning model.
5. The method of claim 1, wherein the training set is a plurality of document triplets; each document triplet comprises a known feature representation vector of each document in the three documents and a known hash code of each document in the three documents;
assuming that a document triplet is represented as (d, d1, d2), d represents a feature representation vector of a first document, d1 represents a feature representation vector of a second document, and d2 represents a feature representation vector of a third document, for the training set, the similarity between the feature representation vector of the first document and the feature representation vector of the second document is greater than the similarity between the feature representation vector of the first document and the feature representation vector of the third document.
6. The method of claim 1, wherein a loss function based on triple document similarity is established, which is designed based on consistency of similarity between three documents and similarity between their corresponding hash codes, such that the final document hash code retains the similarity relationship between the original documents.
7. The method as claimed in claim 1, wherein the similarity of the judicial case documents is calculated based on the hash code of the judicial case documents to be matched and the hash code of the known judicial case documents; the method comprises the following steps: and calculating the similarity of the judicial case documents according to the hamming distance between the hash codes of the judicial case documents to be matched and the hash codes of the known judicial case documents.
8. A similar judicial case matching system based on triple deep hash learning, characterized by comprising:
an acquisition module configured to acquire a judicial case document to be matched;
a feature extraction module configured to input the judicial case document to be matched into a pre-trained feature extraction model to obtain a feature representation vector of the judicial case document to be matched;
a hash code extraction module configured to input the feature representation vector of the judicial case document to be matched into a pre-trained triple deep hash learning model to obtain a hash code of the judicial case document to be matched; and
a similarity matching module configured to calculate the similarity of judicial case documents based on the hash code of the judicial case document to be matched and the hash codes of known judicial case documents.
9. An electronic device, comprising: one or more processors, one or more memories, and one or more computer programs; wherein a processor is connected to the memory, the one or more computer programs being stored in the memory, the processor executing the one or more computer programs stored in the memory when the electronic device is running, to cause the electronic device to perform the method of any of the preceding claims 1-7.
10. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the method of any one of claims 1 to 7.
CN202010354059.6A 2020-04-29 2020-04-29 Similar judicial case matching method and system based on triple deep hash learning Pending CN111581332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010354059.6A CN111581332A (en) 2020-04-29 2020-04-29 Similar judicial case matching method and system based on triple deep hash learning


Publications (1)

Publication Number Publication Date
CN111581332A true CN111581332A (en) 2020-08-25

Family

ID=72127610

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010354059.6A Pending CN111581332A (en) 2020-04-29 2020-04-29 Similar judicial case matching method and system based on triple deep hash learning

Country Status (1)

Country Link
CN (1) CN111581332A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104657350A (en) * 2015-03-04 2015-05-27 中国科学院自动化研究所 Hash learning method for short text integrated with implicit semantic features
CN108629414A (en) * 2018-05-09 2018-10-09 清华大学 depth hash learning method and device
CN110134761A (en) * 2019-04-16 2019-08-16 深圳壹账通智能科技有限公司 Adjudicate document information retrieval method, device, computer equipment and storage medium
CN110222140A (en) * 2019-04-22 2019-09-10 中国科学院信息工程研究所 A kind of cross-module state search method based on confrontation study and asymmetric Hash

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308743A (en) * 2020-10-21 2021-02-02 上海交通大学 Trial risk early warning method based on triple similar tasks
CN112308743B (en) * 2020-10-21 2022-11-11 上海交通大学 Trial risk early warning method based on triple similar tasks

Similar Documents

Publication Publication Date Title
CN108959246B (en) Answer selection method and device based on improved attention mechanism and electronic equipment
CN110413785B (en) Text automatic classification method based on BERT and feature fusion
WO2020224219A1 (en) Chinese word segmentation method and apparatus, electronic device and readable storage medium
CN111814466A (en) Information extraction method based on machine reading understanding and related equipment thereof
CN110928997A (en) Intention recognition method and device, electronic equipment and readable storage medium
CN111680494B (en) Similar text generation method and device
WO2021056710A1 (en) Multi-round question-and-answer identification method, device, computer apparatus, and storage medium
CN110866098B (en) Machine reading method and device based on transformer and lstm and readable storage medium
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN112380837B (en) Similar sentence matching method, device, equipment and medium based on translation model
CN112084794A (en) Tibetan-Chinese translation method and device
CN111259113A (en) Text matching method and device, computer readable storage medium and computer equipment
CN111291794A (en) Character recognition method, character recognition device, computer equipment and computer-readable storage medium
CN112085091B (en) Short text matching method, device, equipment and storage medium based on artificial intelligence
CN111767714B (en) Text smoothness determination method, device, equipment and medium
CN115062134B (en) Knowledge question-answering model training and knowledge question-answering method, device and computer equipment
CN114064852A (en) Method and device for extracting relation of natural language, electronic equipment and storage medium
CN113722512A (en) Text retrieval method, device and equipment based on language model and storage medium
CN115859302A (en) Source code vulnerability detection method, device, equipment and storage medium
CN113360654B (en) Text classification method, apparatus, electronic device and readable storage medium
CN112232195B (en) Handwritten Chinese character recognition method, device and storage medium
CN114445808A (en) Swin transform-based handwritten character recognition method and system
Touati-Hamad et al. Arabic quran verses authentication using deep learning and word embeddings
CN113887169A (en) Text processing method, electronic device, computer storage medium, and program product
CN113449081A (en) Text feature extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination