Disclosure of Invention
In view of this, embodiments of the present invention provide a method, an apparatus, a device, and a storage medium for recommending case documents. Using natural language understanding and graph database technologies, one or more tag words are identified through multiple rounds of dialog with the inquirer; the tag words are then used to search a legal document database accurately and return the corresponding documents.
In a first aspect, an embodiment of the present invention provides a method for recommending a case document, including:
receiving a user query statement;
extracting tag words from the query sentence;
searching a graph database according to the tag words to acquire a case document ID set corresponding to the tag words;
querying a document database according to the case document ID set to obtain a case document set corresponding to the case document ID set;
wherein the graph database stores tag words and their corresponding case document IDs, and the document database stores the ID of each case document and the corresponding original case document.
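As a non-limiting illustration, the four steps above can be sketched with in-memory stand-ins for the two stores. The dictionaries TAG_INDEX and DOCUMENT_STORE, the function names, the substring-based tag extraction, and the choice to take the union of the ID sets of several tag words are all assumptions made for this sketch and are not specified by the embodiment.

```python
# Hypothetical in-memory stand-ins for the two stores (assumptions for illustration).
TAG_INDEX = {  # graph database: tag word -> case document IDs
    "defendant bears full responsibility": {"doc-001", "doc-003"},
    "multiple vehicle injury": {"doc-002", "doc-003"},
}
DOCUMENT_STORE = {  # document database: case document ID -> original case document
    "doc-001": "original text of case document 1",
    "doc-002": "original text of case document 2",
    "doc-003": "original text of case document 3",
}


def extract_tag_words(query, tag_index):
    """Simplified extraction: keep the tag words that literally occur in the query."""
    lowered = query.lower()
    return [tag for tag in tag_index if tag in lowered]


def retrieve_document_ids(tag_words, tag_index):
    """Collect the case document IDs of all tag words (union is an assumption)."""
    doc_ids = set()
    for tag in tag_words:
        doc_ids |= tag_index.get(tag, set())
    return doc_ids


def fetch_documents(doc_ids, store):
    """Look up each ID in the document store, in sorted ID order."""
    return {doc_id: store[doc_id] for doc_id in sorted(doc_ids) if doc_id in store}


def recommend(query, tag_index=TAG_INDEX, store=DOCUMENT_STORE):
    """Receive a query, extract tag words, search the index, fetch the documents."""
    tag_words = extract_tag_words(query, tag_index)
    return fetch_documents(retrieve_document_ids(tag_words, tag_index), store)
```

A call such as recommend("Cases where the defendant bears full responsibility") would return the documents stored under the IDs indexed by that tag word; a query containing no tag word returns an empty result.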
In a possible implementation manner, in the foregoing method provided in an embodiment of the present invention, before receiving the user query statement, the method further includes:
establishing a one-to-one corresponding relation between the case documents and the case document IDs, and storing the case documents and the case document IDs in the document database;
extracting a case document title from the original case document, and constructing case document metadata according to a case document ID and the case document title corresponding to the original case document;
identifying and labeling the case documents by using a trained multi-label classification model to generate corresponding case document labels;
and establishing a correspondence between the case document metadata and the case document labels, and inserting both into the graph database.
In a possible implementation manner, in the method provided in an embodiment of the present invention, before the identifying and labeling the case document by using the trained multi-label classification model, the method further includes:
extracting information paragraphs from the original case document to obtain plain text paragraphs of the case document;
performing word segmentation processing on the plain text paragraphs of the case document according to a legal vocabulary word segmentation dictionary to obtain a first word bag;
carrying out special word replacement on words in the first word bag to obtain a second word bag;
constructing a first text vector by using a vector constructor according to the second word bag, and carrying out denoising processing on the first text vector to obtain a second text vector;
dividing the second text vector into a training data set and a test data set;
and training, testing, and evaluating a machine-learning-based multi-label classification model by using the training data set and the test data set to obtain the trained multi-label classification model.
In a possible implementation manner, in the method provided in an embodiment of the present invention, the extracting a tag word from the query statement specifically includes:
segmenting the query statement by using a word segmentation dictionary in which keywords are stored to generate a third word bag, and generating a keyword set according to the third word bag;
judging whether the keyword set is empty;
if the keyword set is not empty, judging whether each keyword in the keyword set is the same as a tag word;
and if a keyword in the keyword set is the same as a tag word, taking that keyword as a tag word.
In a possible implementation manner, in the method provided in an embodiment of the present invention, after determining whether each keyword in the keyword set is the same as a tag word if the keyword set is not empty, the method further includes:
if not, searching the graph database for a corresponding label node, wherein the label node is associated with the case document metadata;
constructing a new keyword set according to the keywords corresponding to the child nodes of the label nodes;
constructing a reply sentence comprising the new keyword set;
and sending the reply sentence to a user.
In a second aspect, an embodiment of the present invention provides an apparatus for recommending case documents, including:
the receiving module is used for receiving a user query statement;
the extraction module is used for extracting tag words from the query statement;
the retrieval module is used for searching a graph database according to the tag words and acquiring a case document ID set corresponding to the tag words;
the query obtaining module is used for querying a document database according to the case document ID set to obtain a case document set corresponding to the case document ID set;
wherein the graph database stores tag words and their corresponding case document IDs, and the document database stores the ID of each case document and the corresponding original case document.
In a possible implementation manner, in the apparatus provided in an embodiment of the present invention, the apparatus further includes:
the establishing module is used for, before the receiving module receives the user query statement, establishing a one-to-one correspondence between the case documents and the case document IDs and storing them in the document database; extracting a case document title from the original case document, and constructing case document metadata according to the case document ID and the case document title corresponding to the original case document; identifying and labeling the case documents by using a trained multi-label classification model to generate corresponding case document labels; and establishing a correspondence between the case document metadata and the case document labels, and inserting both into the graph database.
In a possible implementation manner, in the apparatus provided in an embodiment of the present invention, the apparatus further includes a classification model training module, which is used for, before the establishing module identifies and labels the case documents by using the trained multi-label classification model: extracting information paragraphs from the original case document to obtain plain text paragraphs of the case document; performing word segmentation processing on the plain text paragraphs of the case document according to a legal vocabulary word segmentation dictionary to obtain a first word bag; carrying out special word replacement on words in the first word bag to obtain a second word bag; constructing a first text vector by using a vector constructor according to the second word bag, and carrying out denoising processing on the first text vector to obtain a second text vector; dividing the second text vector into a training data set and a test data set; and training, testing, and evaluating a machine-learning-based multi-label classification model by using the training data set and the test data set to obtain the trained multi-label classification model.
In a possible implementation manner, in the foregoing apparatus provided in an embodiment of the present invention, the extracting module includes:
the keyword unit is used for segmenting the query statement by using a word segmentation dictionary in which keywords are stored to generate a third word bag, and generating a keyword set according to the third word bag;
the tag word unit is used for judging whether the keyword set is empty; if the keyword set is not empty, judging whether each keyword in the keyword set is the same as a tag word; and if a keyword in the keyword set is the same as a tag word, taking that keyword as a tag word.
In a possible implementation manner, in the apparatus provided in an embodiment of the present invention, the apparatus further includes:
the construction module is used for, when the tag word unit determines that a keyword in the keyword set is not the same as a tag word, searching the graph database for a corresponding label node, wherein the label node is associated with the case document metadata; and constructing a new keyword set according to the keywords corresponding to the child nodes of the label node;
a reply module for constructing a reply sentence including the new keyword set; and sending the reply sentence to a user.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory and a processor;
the memory for storing a computer program;
wherein the processor executes the computer program in the memory to implement the method of any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium, in which a computer program is stored, and the computer program is used for implementing the method in any one of the first aspect when executed by a processor.
With the case document recommendation method, apparatus, device, and storage medium provided by the embodiments of the present invention, a user query statement is received, tag words are extracted from the query statement, a graph database is searched according to the tag words to obtain a case document ID set corresponding to the tag words, and a document database is queried according to the case document ID set to obtain the corresponding case document set. The graph database stores tag words and their corresponding case document IDs, and the document database stores the case document IDs and the corresponding original case documents. In this scheme, natural language understanding and graph database technologies are applied: the query intention of the user is determined by extracting key information from the query statement and interacting with the graph database, and the relevant document IDs are accurately returned from the strictly organized graph database, so that the corresponding case document set is returned from the document database and the user's requirements are met.
Example one
Fig. 1 is a schematic flow chart of a case document recommendation method according to an embodiment of the present invention, and as shown in fig. 1, the method may include the following steps:
S101, receiving a user query statement.
In practical applications, the execution subject of this embodiment may be a recommendation device of case documents. The recommendation device may be implemented as a virtual device, such as software code; as a physical device onto which the relevant execution code has been written, such as a USB drive; or as a physical device in which the relevant execution code is integrated, such as a chip or an intelligent terminal.
Specifically, a user seeking legal advice about a case usually looks up information on similar cases. The user therefore composes a query statement according to his or her query intention and inputs it into the recommendation device of case documents.
S102, extracting tag words from the query statement.
The user may not have a legal background, so the vocabulary used to describe the problem is often too colloquial. For example, a party involved in a multi-vehicle rear-end collision may be entirely unaware that a professional legal term such as "multiple vehicle injury" should be used when searching for similar cases. In order to determine the query intention of the user, according to an embodiment of the present invention, as shown in fig. 1A, step S102 may specifically include the following steps:
S102a, segmenting the query statement by using a word segmentation dictionary in which keywords are stored to generate a third word bag, and generating a keyword set according to the third word bag.
Specifically, the user query statement is segmented into a word bag by using a word segmentation dictionary in which the keywords are stored. Word embedding is then used to find words or phrases in the word bag whose meaning is the same as or similar to that of a keyword; these are converted into the corresponding keywords and put into the keyword set. Keywords that appear in the word bag directly are put into the keyword set as they are.
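As a non-limiting illustration, step S102a can be sketched as a greedy longest-match segmentation followed by keyword normalization. The keyword list, the SYNONYMS table, and the function names are hypothetical; in particular, the fixed synonym table is only a stand-in for the word embedding lookup described above, which this sketch does not implement.

```python
import re

# Hypothetical keyword dictionary and synonym table (assumptions for illustration).
KEYWORDS = {"traffic accident", "responsibility mode", "multiple vehicle injury"}
SYNONYMS = {  # stand-in for a word-embedding similarity lookup
    "pile-up": "multiple vehicle injury",
    "rear-end collision": "traffic accident",
}
VOCABULARY = KEYWORDS | set(SYNONYMS)


def segment(query, vocabulary):
    """Greedy longest-match segmentation of the query into a word bag."""
    tokens = re.findall(r"[a-z0-9\-]+", query.lower())
    max_len = max(len(phrase.split()) for phrase in vocabulary)
    bag, i = [], 0
    while i < len(tokens):
        # Try the longest candidate phrase first; fall back to a single token.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = " ".join(tokens[i:i + n])
            if phrase in vocabulary or n == 1:
                bag.append(phrase)
                i += n
                break
    return bag


def build_keyword_set(bag):
    """Keep keywords as they are and map known synonyms to their keyword."""
    keywords = []
    for word in bag:
        keyword = word if word in KEYWORDS else SYNONYMS.get(word)
        if keyword is not None and keyword not in keywords:
            keywords.append(keyword)
    return keywords
```

A query containing neither a keyword nor a known synonym yields an empty keyword set, which corresponds to the empty case handled in step S102b.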
S102b, judging whether the keyword set is empty. If it is empty, constructing a reply statement and sending the reply statement to the user.
Specifically, if the keyword set obtained by word segmentation is empty, no information related to the case can be obtained from the user query statement. In this case, a reply statement is constructed and sent to the user to prompt the user to input the query statement again.
S102c, if the keyword set is not empty, judging whether each keyword in the keyword set is the same as a tag word. If so, proceeding to step S102d. If not, proceeding to step S102e.
S102d, taking the keywords in the keyword set that are the same as tag words as the tag words.
S102e, searching the graph database for the corresponding label nodes, wherein the label nodes are associated with case document metadata; constructing a new keyword set according to the keywords corresponding to the child nodes of the label nodes; constructing a reply sentence including the new keyword set; and sending the reply sentence to the user.
Specifically, if no tag word is found in the keyword set, the nodes corresponding to the non-tag keywords in the keyword set are looked up in the graph database. After a corresponding node is found, its next-level child nodes are found according to the relationships of the node, and a new keyword set is generated from the names of the child nodes. A reply sentence including the new keyword set is then constructed and sent to the user, prompting the user to confirm the tag words. Through these multiple rounds of dialog based on natural language understanding and multiple rounds of interaction with the graph database, the query intention of the user can be confirmed.
For example, with the recommendation device of case documents abbreviated as "recommendation device", a multi-round dialog based on natural language understanding proceeds as follows:
The user: I want to see case documents of the traffic class.
The recommendation device: Traffic offences can be subdivided into five sub-categories: responsibility subject, responsibility constitution, grounds for mitigation or exemption of responsibility, responsibility mode, and litigation procedure. Which category are you interested in?
The user: I want to see what is in responsibility mode.
The recommendation device: Responsibility mode includes one classification basis, compensation for loss, and six labels: accident responsibility cannot be determined, defendant bears primary responsibility, defendant bears secondary responsibility, defendant bears full responsibility, both parties bear equal responsibility, and defendant bears no responsibility.
The user: Please help me find documents in which the defendant bears full responsibility.
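As a non-limiting illustration, the decision flow of steps S102b to S102e behind such a dialog can be sketched as follows. The CHILDREN table is a hypothetical, abbreviated stand-in for the child-node lookup in the graph database, and the function name, the reply wording, and the rule that any matching tag word ends the dialog are assumptions of this sketch.

```python
# Hypothetical, abbreviated stand-in for child-node lookup in the graph database.
CHILDREN = {
    "traffic offence": [
        "responsibility subject",
        "responsibility constitution",
        "responsibility mode",
        "litigation procedure",
    ],
    "responsibility mode": [
        "defendant bears full responsibility",
        "both parties bear equal responsibility",
        "defendant bears no responsibility",
    ],
}
TAG_WORDS = set(CHILDREN["responsibility mode"])


def handle_keywords(keywords):
    """Return (action, payload) for one round of the dialog."""
    if not keywords:
        # S102b: nothing usable was extracted, ask the user to rephrase.
        return ("reprompt", "Please describe your case again.")
    tag_words = [k for k in keywords if k in TAG_WORDS]
    if tag_words:
        # S102d: keywords that are tag words are used for retrieval.
        return ("tags", tag_words)
    # S102e: offer the child nodes of the matched nodes as a new keyword set.
    options = []
    for keyword in keywords:
        for child in CHILDREN.get(keyword, []):
            if child not in options:
                options.append(child)
    reply = "Which of these are you interested in: " + ", ".join(options) + "?"
    return ("clarify", reply)
```

Each "clarify" reply narrows the user's next query by one level of the graph, so the dialog ends once a query contains a tag word.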
S103, searching in a graph database according to the label words, and acquiring a case document ID set corresponding to the label words.
Specifically, the graph database stores tag words and their corresponding case document IDs. A case document ID is an identification code, in one-to-one correspondence with a case document, generated by using a universally unique identifier (UUID) generator. Searching the graph database by the tag words yields the case document ID set corresponding to the tag words.
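As a non-limiting illustration, the one-to-one assignment of case document IDs can be sketched with the UUID generator of the Python standard library. The function name and the use of a dictionary as the document store are assumptions of this sketch; the embodiment does not prescribe a UUID version, and version 4 (random) is chosen here only as an example.

```python
import uuid


def assign_document_ids(original_documents):
    """Give every original case document its own freshly generated UUID."""
    store = {}
    for document in original_documents:
        doc_id = str(uuid.uuid4())  # random UUID, assumed unique in practice
        store[doc_id] = document
    return store
```

Because each document receives a freshly generated identifier, two documents with identical text still receive different IDs.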
S104, querying a document database according to the case document ID set to obtain a case document set corresponding to the case document ID set.
Specifically, each case document ID and the original case documents corresponding to the case document ID one by one are stored in the document database, and the document database is queried according to the case document ID set, so that the case document set corresponding to the case document ID set can be obtained.
Optionally, the graph database is pre-established, and according to an embodiment of the present invention, as shown in fig. 2, before performing the step S101, the method may further include the following steps:
S201, establishing a one-to-one correspondence between the case documents and the case document IDs, and storing the case documents and the case document IDs in the document database.
S202, extracting a case document title from the original case document, and constructing case document metadata according to a case document ID and the case document title corresponding to the original case document.
Specifically, the case document title can be extracted from the case document in its original XML format, and the case document metadata is constructed from the case document ID and the case document title corresponding to the original case document. When a metadata node is created in the graph database, its node type is specified to distinguish it from other types of nodes, for example: "MERGE (n:File {ID: 'case document ID', title: 'case document title'})".
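As a non-limiting illustration, step S202 can be sketched as extracting the title from the XML document and preparing a parameterized form of the MERGE statement above. The element name title, the function names, and the use of query parameters ($ID, $title) are assumptions of this sketch; the statement is only built as a string with its parameters and is not sent to any database here.

```python
import xml.etree.ElementTree as ET


def build_metadata(doc_id, xml_text):
    """Build case document metadata from the ID and the <title> element (assumed name)."""
    root = ET.fromstring(xml_text)
    title = root.findtext("title", default="").strip()
    return {"ID": doc_id, "title": title}


def merge_file_node_statement(metadata):
    """Return a parameterized Cypher statement and its parameters for a File node."""
    query = "MERGE (n:File {ID: $ID, title: $title})"
    return query, dict(metadata)
```

Passing the ID and title as parameters rather than concatenating them into the statement avoids quoting problems when a title contains apostrophes.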
S203, identifying and labeling the case documents by using the trained multi-label classification model to generate corresponding case document labels.
S204, establishing a correspondence between the case document metadata and the case document labels, and inserting the case document metadata and the case document labels into the graph database.
Specifically, in the graph database, the case document metadata is matched against the set of tag words predicted by the trained multi-label classification model, and the relationship between the case document and each tag word is established, for example: "MATCH (a:File), (b:Tag) WHERE b.name = 'label name' CREATE (a)-[:TAGGED_BY]->(b)".
The trained multi-label classification model is obtained by training a multi-label classification model in advance. It is used to label existing case documents and case documents added in the future, so as to assist in creating the case document metadata stored in the graph database.
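As a non-limiting illustration, the relationships for one case document could be prepared as one parameterized statement per predicted tag word. The function name is hypothetical, and restricting the File node by its ID and using MERGE so that repeated runs do not duplicate a relationship are assumptions of this sketch rather than features stated in the embodiment; the statements are only built here, not executed.

```python
def tag_statements(doc_id, predicted_labels):
    """Return one (statement, parameters) pair per predicted tag word."""
    query = (
        "MATCH (a:File {ID: $ID}), (b:Tag {name: $name}) "
        "MERGE (a)-[:TAGGED_BY]->(b)"
    )
    return [(query, {"ID": doc_id, "name": label}) for label in predicted_labels]
```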
The multi-label classification model comprises a series of binary and multi-class classification models. Taking traffic offence case documents as an example, as shown in fig. 4, they can be subdivided into five sub-categories, which can be further subdivided into 42 more specific labels. Each label belongs to one sub-category, and the sub-categories describe largely different aspects of a case document. Taking the "responsibility mode" subclass as an example, it contains six different tag words, such as "accident responsibility cannot be determined", "defendant bears primary responsibility", and "defendant bears secondary responsibility". A case document cannot carry several of these labels at the same time; only one of them can apply, so a multi-class classification model can be trained for these labels. For some other labels, such as "multiple vehicle injury", a binary classification model is suitable.
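As a non-limiting illustration, the way several such models could be combined into one multi-label prediction is sketched below. The two rule-based functions are toy stand-ins for trained models, and all function names are hypothetical; the sketch shows only the composition, in which a model choosing among mutually exclusive labels contributes at most one label and each binary model contributes its label or nothing.

```python
def classify_responsibility_mode(text):
    """Toy stand-in for a model that picks at most one mutually exclusive label."""
    lowered = text.lower()
    if "full responsibility" in lowered:
        return "defendant bears full responsibility"
    if "equal responsibility" in lowered:
        return "both parties bear equal responsibility"
    return None


def involves_multiple_vehicles(text):
    """Toy stand-in for a binary model deciding whether one label applies."""
    return "vehicles" in text.lower()


def predict_labels(text, exclusive_models, binary_models):
    """Combine the outputs of all models into one label set for the document."""
    labels = []
    for model in exclusive_models:
        label = model(text)
        if label is not None:
            labels.append(label)
    for label, model in binary_models.items():
        if model(text):
            labels.append(label)
    return labels
```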
FIG. 4 illustrates a graph database for the traffic offence class, which contains five types of nodes: case type, subclass, classification basis, label, and case metadata. There are four relationships between nodes: a case type contains (:CONTAINS) subclasses, a subclass is based on (:BASED_ON) a classification basis, subclasses and classification bases have (:HAS) labels, and case metadata is tagged by (:TAGGED_BY) labels. The names of the case type, subclass, classification basis, and label nodes are attributes that are collectively called keywords, and the name of a label node is also called a tag word.
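As a non-limiting illustration, the node types and relationships described above can be modelled in memory as follows. The class CaseGraph, the node type names, and the sample nodes are hypothetical; a real deployment would use a graph database rather than this sketch.

```python
# Allowed (source type, relationship, target type) triples, following the text above.
ALLOWED_RELATIONS = {
    ("CaseType", "CONTAINS", "Subclass"),
    ("Subclass", "BASED_ON", "Basis"),
    ("Subclass", "HAS", "Tag"),
    ("Basis", "HAS", "Tag"),
    ("File", "TAGGED_BY", "Tag"),
}


class CaseGraph:
    """Minimal typed graph; node names double as keywords (or IDs for File nodes)."""

    def __init__(self):
        self.node_types = {}
        self.edges = set()

    def add_node(self, name, node_type):
        self.node_types[name] = node_type

    def add_edge(self, source, relation, target):
        signature = (self.node_types[source], relation, self.node_types[target])
        if signature not in ALLOWED_RELATIONS:
            raise ValueError(f"relation not allowed: {signature}")
        self.edges.add((source, relation, target))

    def children(self, source):
        """Names of the nodes one level below the given keyword node."""
        return sorted(t for s, r, t in self.edges if s == source and r != "TAGGED_BY")

    def tagged_files(self, tag_word):
        """IDs of the case metadata nodes tagged by the given tag word."""
        return sorted(s for s, r, t in self.edges if r == "TAGGED_BY" and t == tag_word)


graph = CaseGraph()
graph.add_node("traffic offence", "CaseType")
graph.add_node("responsibility mode", "Subclass")
graph.add_node("compensation for loss", "Basis")
graph.add_node("defendant bears full responsibility", "Tag")
graph.add_node("doc-001", "File")
graph.add_edge("traffic offence", "CONTAINS", "responsibility mode")
graph.add_edge("responsibility mode", "BASED_ON", "compensation for loss")
graph.add_edge("responsibility mode", "HAS", "defendant bears full responsibility")
graph.add_edge("doc-001", "TAGGED_BY", "defendant bears full responsibility")
```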
Optionally, according to an embodiment of the present invention, as shown in fig. 3, before performing step S203, the method may further include the following steps:
S301, extracting information paragraphs from the original case document to obtain plain text paragraphs of the case document.
S302, performing word segmentation processing on the case document plain text paragraphs according to the legal vocabulary word segmentation dictionary to obtain a first word bag.
Specifically, the extraction of information paragraphs directly affects the quality of the training data. A case document in the original XML format contains many information paragraphs that are useless for text classification learning, such as <title>, <case number>, <judgment date>, and <judge>; using them as corpus would add noise to the training data. By contrast, paragraphs such as <found at trial>, <the court finds>, and <the court holds> summarize the circumstances of the case at a high level and contain the full vocabulary that matters for defining the case, so text vectors built from these paragraphs give better text classification results.
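As a non-limiting illustration, step S301 can be sketched as keeping only the informative paragraphs of the XML document. The element names in INFORMATIVE_ELEMENTS are hypothetical English placeholders for the paragraph names mentioned above, and the function name is likewise an assumption of this sketch.

```python
import xml.etree.ElementTree as ET

# Hypothetical element names for the informative paragraphs (assumptions).
INFORMATIVE_ELEMENTS = ("trial_findings", "court_findings", "court_opinion")


def extract_plain_text(xml_text):
    """Return the text of the informative paragraphs, one paragraph per line."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for name in INFORMATIVE_ELEMENTS:
        for element in root.iter(name):
            text = "".join(element.itertext()).strip()
            if text:
                paragraphs.append(text)
    return "\n".join(paragraphs)
```

Elements that are not listed, such as the title or the case number, are simply never read, so they cannot add noise to the corpus.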
S303, performing special word replacement on the words in the first word bag to obtain a second word bag.
Specifically, for different classification problems, certain words and phrases in a case document influence classification learning more than others. For example, in "responsibility mode" documents, the frequency and position of the words "plaintiff" and "defendant" strongly influence whether the case document belongs to "defendant bears full responsibility", "defendant bears primary responsibility", or "defendant bears secondary responsibility". However, some case documents use only the names of the parties and omit the designations plaintiff and defendant. In this case, while the corpus is being constructed, the names of the parties are matched to plaintiff or defendant by semantic analysis of the <party information> paragraph, and the names are then replaced accordingly in the paragraphs extracted for the corpus. Similarly, in "multiple vehicle injury" documents, the different license plates are replaced with fixed replacement words such as "license plate 1", "license plate 2", and "license plate 3", which are added to the word segmentation dictionary; the expected effect is similar to that for responsibility mode. Special word replacement may therefore be performed on the words in the first word bag.
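As a non-limiting illustration, the special word replacement of step S303 can be sketched as follows. The mapping party_roles is assumed to have been obtained beforehand from the party information paragraph, and the license plate pattern (two capital letters followed by five digits) is a hypothetical format chosen only for this sketch.

```python
import re

# Hypothetical license plate format, assumed for illustration only.
PLATE_PATTERN = re.compile(r"[A-Z]{2}\d{5}")


def replace_special_words(words, party_roles):
    """Replace party names by their role and plates by numbered placeholders."""
    plate_placeholders = {}
    replaced = []
    for word in words:
        if word in party_roles:
            replaced.append(party_roles[word])
        elif PLATE_PATTERN.fullmatch(word):
            # The same plate always maps to the same placeholder.
            if word not in plate_placeholders:
                plate_placeholders[word] = f"license plate {len(plate_placeholders) + 1}"
            replaced.append(plate_placeholders[word])
        else:
            replaced.append(word)
    return replaced
```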
S304, according to the second word bag, a first text vector is constructed by using a vector constructor, and the first text vector is subjected to denoising processing to obtain a second text vector.
S305, dividing the second text vector into a training data set and a testing data set.
S306, training, testing, and evaluating a machine-learning-based multi-label classification model by using the training data set and the test data set to obtain the trained multi-label classification model.
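As a non-limiting illustration, steps S304 and S305 can be sketched with word-count vectors. The embodiment does not specify the vector constructor or the denoising method; the count-based vectors, the removal of words that occur in too few documents as the denoising step, the 25% test share, and all function names are assumptions of this sketch. The training of step S306 is not shown.

```python
import random
from collections import Counter


def build_vectors(bags):
    """Build one word-count vector per word bag over the full vocabulary."""
    vocabulary = sorted({word for bag in bags for word in bag})
    vectors = []
    for bag in bags:
        counts = Counter(bag)
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


def denoise(vocabulary, vectors, min_doc_freq=2):
    """Drop the columns of words that occur in fewer than min_doc_freq documents."""
    keep = [
        j for j in range(len(vocabulary))
        if sum(1 for vector in vectors if vector[j] > 0) >= min_doc_freq
    ]
    return [vocabulary[j] for j in keep], [[vector[j] for j in keep] for vector in vectors]


def split_dataset(samples, test_ratio=0.25, seed=0):
    """Shuffle reproducibly and split into (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_ratio))
    return shuffled[n_test:], shuffled[:n_test]
```

Fixing the random seed keeps the split reproducible, so that evaluation results of different training runs remain comparable.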
The case document recommendation method provided in this embodiment receives a user query statement, extracts tag words from the query statement, searches a graph database according to the tag words to obtain a case document ID set corresponding to the tag words, and queries a document database according to the case document ID set to obtain the corresponding case document set. The graph database stores tag words and their corresponding case document IDs, and the document database stores the case document IDs and the corresponding original case documents. In this scheme, natural language understanding and graph database technologies are applied: the query intention of the user is determined by extracting key information from the query statement and interacting with the graph database, and the relevant document IDs are accurately returned from the strictly organized graph database, so that the corresponding case document set is returned from the document database and the user's requirements are met.
The following are embodiments of the apparatus of the present invention that may be used to perform embodiments of the method of the present invention. For details which are not disclosed in the embodiments of the apparatus of the present invention, reference is made to the embodiments of the method of the present invention.
Fig. 5 is a schematic structural diagram of a case document recommendation apparatus according to a second embodiment of the present invention, and as shown in fig. 5, the apparatus includes:
a receiving module 510, configured to receive a user query statement.
An extracting module 520, configured to extract the tag word from the query statement.
A retrieval module 530, configured to search a graph database according to the tag words and obtain a case document ID set corresponding to the tag words.
A query obtaining module 540, configured to query the document database according to the case document ID set to obtain a case document set corresponding to the case document ID set.
The graph database stores tag words and their corresponding case document IDs. The document database stores the ID of each case document and the corresponding original case document.
According to an embodiment of the present invention, the apparatus may further include:
An establishing module 550, configured to, before the receiving module receives the user query statement: establish a one-to-one correspondence between the case documents and the case document IDs and store them in the document database; extract a case document title from the original case document, and construct case document metadata according to the case document ID and the case document title corresponding to the original case document; identify and label the case documents by using the trained multi-label classification model to generate corresponding case document labels; and establish a correspondence between the case document metadata and the case document labels, and insert both into the graph database.
According to an embodiment of the present invention, the apparatus may further include a classification model training module 560, configured to, before the establishing module identifies and labels the case documents by using the trained multi-label classification model: extract information paragraphs from the original case document to obtain plain text paragraphs of the case document; perform word segmentation processing on the plain text paragraphs of the case document according to a legal vocabulary word segmentation dictionary to obtain a first word bag; perform special word replacement on the words in the first word bag to obtain a second word bag; construct a first text vector by using a vector constructor according to the second word bag, and perform denoising processing on the first text vector to obtain a second text vector; divide the second text vector into a training data set and a test data set; and train, test, and evaluate a machine-learning-based multi-label classification model by using the training data set and the test data set to obtain the trained multi-label classification model.
According to an embodiment of the present invention, the extracting module 520 may include:
A keyword unit 521, configured to segment the query statement by using a word segmentation dictionary in which keywords are stored to generate a third word bag, and generate a keyword set according to the third word bag.
A tag word unit 522, configured to determine whether the keyword set is empty; if the keyword set is not empty, determine whether each keyword in the keyword set is the same as a tag word; and if a keyword in the keyword set is the same as a tag word, take that keyword as a tag word.
According to an embodiment of the present invention, the apparatus may further include:
A construction module 570, configured to, when the tag word unit determines that a keyword in the keyword set is not the same as a tag word, search the graph database for a corresponding label node, where the label node is associated with the case document metadata, and construct a new keyword set according to the keywords corresponding to the child nodes of the label node.
A reply module 580 for constructing a reply sentence including the new keyword set. And sending the reply sentence to a user.