
CN112148701A - File retrieval method and equipment

Info

Publication number: CN112148701A
Application number: CN202011010147.0A
Authority: CN
Inventors: 王喆龙
Original assignee: Ping An Zhitong Consulting Co Ltd Shanghai Branch
Current assignee: Ping An Zhitong Consulting Co Ltd Shanghai Branch
Priority date: 2020-09-23
Filing date: 2020-09-23
Publication date: 2020-12-29

Abstract

The application is applicable to the technical field of data processing, and provides a method and equipment for file retrieval, which comprise the following steps: respectively dividing each historical case into a plurality of data packets based on knowledge nodes contained in a preset legal knowledge graph; creating a case index table corresponding to the historical case according to the knowledge nodes associated with the data packets; each knowledge node is associated with a corresponding distributed storage node; storing each data packet of the historical case in the associated distributed storage node based on the case index table; determining a target node in the legal knowledge graph associated with the search keyword based on the received search keyword; and generating a file retrieval result according to the history cases contained in the distributed storage node corresponding to the target node. The method and the device reduce the time consumption of searching and improve the retrieval efficiency.

Description

File retrieval method and equipment

Technical Field

The present application belongs to the technical field of data processing, and in particular, to a method and an apparatus for file retrieval.

Background

As legal knowledge becomes more widespread, the public comes into contact with legal cases more and more often, and a user may want to look up a specific case for work or out of personal interest. However, because legal cases are numerous, manually screening them greatly increases both the time and the difficulty of selecting a case. How to provide an efficient means of retrieving legal cases is therefore a problem that urgently needs to be solved.

Existing legal case retrieval mainly relies on keyword-based search, which checks whether the keywords entered by a user appear in a text. Because legal cases tend to be long, a full-text keyword search must be run over each case, so the response time is long. When the number of legal cases is large, the search takes even longer, which increases the waiting time for file retrieval and reduces retrieval efficiency.

Disclosure of Invention

In view of this, embodiments of the present application provide a method and an apparatus for file retrieval, so as to solve the problems of the existing legal case retrieval technology: a full-text keyword search must be performed on the legal cases, the response time is long, and when the number of legal cases is large the search takes even longer, which increases the waiting time for file retrieval and lowers retrieval efficiency.

A first aspect of an embodiment of the present application provides a method for file retrieval, including:

respectively dividing each historical case into a plurality of data packets based on knowledge nodes contained in a preset legal knowledge graph;

creating a case index table corresponding to the historical case according to the knowledge nodes associated with the data packets; each knowledge node is associated with a corresponding distributed storage node; the case index table is used for storing the network addresses of the distributed storage nodes;

storing each data packet of the historical case in the associated distributed storage node based on the case index table;

determining a target node in the legal knowledge graph associated with the search keyword based on the received search keyword;

and generating a file retrieval result according to the history cases contained in the distributed storage node corresponding to the target node.

A second aspect of an embodiment of the present application provides an apparatus for file retrieval, including:

the data packet dividing unit is used for dividing each historical case into a plurality of data packets based on knowledge nodes contained in a preset legal knowledge graph;

the case index table creating unit is used for creating a case index table corresponding to the historical case according to the knowledge nodes associated with the data packets; each knowledge node is associated with a corresponding distributed storage node; the case index table is used for storing the network addresses of the distributed storage nodes;

the data packet storage unit is used for storing each data packet of the historical case into the associated distributed storage node based on the case index table;

the search keyword receiving unit is used for determining a target node in the legal knowledge graph, which is related to the search keyword, based on the received search keyword;

and the file retrieval result output unit is used for generating a file retrieval result according to the history cases contained in the distributed storage node corresponding to the target node.

A third aspect of embodiments of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the first aspect when executing the computer program.

A fourth aspect of embodiments of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the first aspect.

The method and the apparatus for file retrieval provided by the embodiments of the present application have the following advantages:

when the history case is stored, the history case is divided into a plurality of data packets according to the legal knowledge graph and is stored in a plurality of different nodes in a distributed mode, and the related data packets can be extracted from different distributed storage nodes through corresponding case index tables to regenerate the history case; during subsequent keyword retrieval, a target node can be determined through the search keywords, the historical cases corresponding to the data packets stored in the target node are the target cases searched at this time, and a file retrieval result is generated, so that the purpose of file retrieval is achieved. Compared with the prior retrieval technology of legal cases, because different distributed storage nodes store data packets of historical cases, and the data packets in each distributed storage node correspond to the same knowledge node, full-text search is not needed in the subsequent search process, and after a target node associated with a search keyword is determined, the historical case corresponding to the data packet stored in the target node is the target case of the retrieval, so that the time consumed by the search is greatly reduced, and the retrieval efficiency is improved.

Drawings

To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flowchart of a method for retrieving documents according to a first embodiment of the present application;

FIG. 2 is a flowchart illustrating a specific implementation of a method for retrieving documents according to a second embodiment of the present application;

FIG. 3 is a flowchart illustrating implementation details of S202 of a method for file retrieval according to a third embodiment of the present application;

FIG. 4 is an association network provided by an embodiment of the present application;

FIG. 5 is a flowchart illustrating a detailed implementation of a file retrieval method according to a fourth embodiment of the present application;

FIG. 6 is a schematic structural diagram of a case relationship tree according to an embodiment of the present application;

FIG. 7 is a flowchart of a specific implementation of S101 of a method for file retrieval according to a fifth embodiment of the present application;

FIG. 8 is a flowchart illustrating implementation details of S104 of a method for file retrieval according to a sixth embodiment of the present application;

FIG. 9 is a flowchart illustrating an implementation of S105 of a method for file retrieval according to a seventh embodiment of the present application;

FIG. 10 is a block diagram of a document retrieval apparatus according to an embodiment of the present application;

FIG. 11 is a schematic diagram of a terminal device according to another embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.

The method determines documents that have an association relationship with a user as candidate documents by acquiring user information of the user, thereby preliminarily screening the documents in a document database, and generates document feature vectors corresponding to the candidate documents. User feature vectors are generated from the user knowledge graph corresponding to each user and the browsing records in the user information; the document feature vectors and the user feature vectors are imported into a preset recommendation model, recommended documents are determined from the candidate documents, and a recommendation list containing the recommended documents is generated and output to the user. This achieves automatic generation of the recommendation list and solves the problems that the existing recommendation technology for legal documents cannot recommend accurately, which reduces recommendation efficiency and increases the time users spend searching for documents of interest.

In the embodiment of the present application, the main execution body of the flow is a terminal device. The terminal devices include but are not limited to: servers, computers, smart phones, tablets, and the like, capable of performing the task of document retrieval. Fig. 1 shows a flowchart of an implementation of a method for retrieving a file according to a first embodiment of the present application, which is detailed as follows:

In S101, based on the knowledge nodes contained in a preset legal knowledge graph, each historical case is divided into a plurality of data packets.

In this embodiment, the terminal device may have a legal knowledge graph prestored therein. The legal knowledge graph may be downloaded from a cloud server, and the graph provided by the cloud server may be generated from a plurality of standard legal texts. For example, the legal entities contained in standard legal texts such as the criminal law, the civil law, and the constitution are identified, and association relationships between different legal entities are established based on how often and where the legal entities occur together, so as to construct the legal knowledge graph. In a possible implementation manner, the legal knowledge graph may instead be constructed from all existing historical cases in a document database: the terminal device identifies the legal entities contained in the historical cases and establishes association relationships between different legal entities based on how often and where they occur together, so as to construct the legal knowledge graph.

In this embodiment, the legal knowledge graph pre-stored in the terminal device includes a plurality of knowledge nodes, and each knowledge node may correspond to one legal entity. For example, the legal entities may be "intellectual property", "trademark", and "litigator". Relationships exist between different legal entities; for example, "intellectual property" includes "trademark", so the two are in an inclusion relationship. The terminal device can create corresponding knowledge nodes for the different legal entities and generate the legal knowledge graph according to the association relationships among the knowledge nodes.

In this embodiment, a plurality of historical cases may be stored in the storage module of the terminal device. The historical cases may contain standard legal texts, that is, documents that define legal terms, such as the criminal law, the civil law, and the constitution. A historical case may also contain all the intermediate texts generated by the respective users when handling the legal case, as well as the decision on the legal case, such as prosecution documents, evidence of answers, and decision books. The terminal device can download the historical cases from the internet or receive them as uploads from users, configure a corresponding case identification for each historical case, and store the historical cases in a local storage module or on a cloud server. In a possible implementation manner, in order to improve the storage efficiency of historical cases, the terminal device may perform a duplicate check on all historical cases before storing them: it calculates a repetition rate between each pair of historical cases, identifies two historical cases as the same case if their repetition rate is greater than a preset repetition threshold, and merges the historical cases whose repetition rates exceed the threshold. This reduces the amount of duplicate data in the storage device and improves the storage efficiency of the database.
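By way of illustration only, the duplicate check described above may be sketched as follows. The description does not specify how the repetition rate is calculated; this sketch assumes a Jaccard overlap of three-word shingles, a threshold of 0.8, and a merge policy that keeps the first case of each duplicate group. The function names are hypothetical.

```python
from __future__ import annotations


def shingles(text: str, k: int = 3) -> set[tuple[str, ...]]:
    """Return the set of k-word shingles of a text (whitespace tokenization)."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def repetition_rate(a: str, b: str) -> float:
    """Jaccard overlap of shingle sets, an assumed stand-in for the repetition rate."""
    sa, sb = shingles(a), shingles(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def deduplicate(cases: dict[str, str], threshold: float = 0.8) -> dict[str, str]:
    """Keep a case only if it does not exceed the threshold against any kept case."""
    kept: dict[str, str] = {}
    for case_id, text in cases.items():
        if all(repetition_rate(text, other) <= threshold for other in kept.values()):
            kept[case_id] = text
    return kept
```

Comparing every case against every kept case is quadratic in the number of cases, which is acceptable for a sketch but would likely need an approximate method for a large case database.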

In this embodiment, the terminal device may perform semantic analysis on a historical case to determine whether the historical case contains a node keyword corresponding to any knowledge node in the legal knowledge graph and, if so, divide the historical case into a plurality of data packets of different data volumes based on the identified node keywords. Merging the text information contained in all the data packets yields the complete historical case, and storing the historical case in blocks across the distributed storage nodes can improve retrieval and extraction efficiency.
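By way of illustration only, the division in S101 may be sketched as follows. The description does not fix the granularity of a data packet; this sketch assumes sentence-level splitting, plain substring matching in place of semantic analysis, and a rule that sentences without any node keyword join the packet of the nearest preceding labelled sentence. `DataPacket`, `split_case`, and `merge_packets` are hypothetical names.

```python
from __future__ import annotations

import re
from dataclasses import dataclass, field


@dataclass
class DataPacket:
    case_id: str
    node: str  # knowledge node this packet is associated with
    sentences: list[tuple[int, str]] = field(default_factory=list)  # (position, text)


def split_case(case_id: str, text: str, node_keywords: dict[str, list[str]]) -> list[DataPacket]:
    """Divide a case into one packet per matched knowledge node."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    labels: list[str | None] = []
    for sentence in sentences:
        lowered = sentence.lower()
        labels.append(next(
            (node for node, keywords in node_keywords.items()
             if any(keyword.lower() in lowered for keyword in keywords)),
            None,
        ))
    if all(label is None for label in labels):
        return []  # no node keyword found, so the case is not divided
    # Unlabelled sentences join the preceding labelled sentence's packet;
    # leading unlabelled sentences join the first labelled one (assumption).
    current = next(label for label in labels if label is not None)
    packets: dict[str, DataPacket] = {}
    for position, (sentence, label) in enumerate(zip(sentences, labels)):
        current = label or current
        packets.setdefault(current, DataPacket(case_id, current)).sentences.append((position, sentence))
    return list(packets.values())


def merge_packets(packets: list[DataPacket]) -> str:
    """Reassemble the case text by restoring the original sentence order."""
    ordered = sorted(item for packet in packets for item in packet.sentences)
    return " ".join(text for _, text in ordered)
```

Recording the position of every sentence is what allows the packets, once scattered over different storage nodes, to be merged back into the complete historical case.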

In S102, a case index table corresponding to the historical case is created according to the knowledge nodes associated with the data packets; each knowledge node is associated with a corresponding distributed storage node; the case index table is used for storing the network addresses of the distributed storage nodes.

In this embodiment, the terminal device may configure a corresponding distributed storage node for each knowledge node in the legal knowledge graph in advance. Each distributed storage node stores the data packets corresponding to its associated knowledge node. The text of every data packet stored in a distributed storage node therefore contains the node keyword of that knowledge node. In the subsequent retrieval process, if a data packet of a historical case is stored in the distributed storage node of the target node corresponding to the search keyword, the historical case can be identified as a target case of the user's retrieval, so the retrieval result can be output quickly.

In this embodiment, the file retrieval system includes a plurality of distributed storage nodes, which form a distributed storage system; in addition to the distributed storage nodes, the distributed storage system may include an addressing root node. When the distributed storage system needs to retrieve a historical case, the case identification of the historical case is sent to the addressing root node, which obtains the case index table corresponding to the received case identification. The case index table records the fragments of the historical case, namely its data packets, together with the network addresses of the distributed storage nodes that hold them, so all the data packets of the historical case can be fetched from the distributed storage nodes based on the case index table and reassembled into the historical case for output.

In this embodiment, the terminal device may store an addressing table of each distributed storage node, query, according to the data packets obtained by dividing the historical case and the knowledge nodes corresponding to each data packet, the network addresses of the distributed storage nodes corresponding to each knowledge node, establish an association relationship between the data packets and the network addresses, and generate the case index table. It should be noted that each history case may correspond to a case index table.
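By way of illustration only, generating a case index table in S102 may be sketched as follows. The table layout, the packet identifiers, and the example addresses are assumptions; the description only requires that the table associate each data packet with the network address of the distributed storage node of its knowledge node.

```python
from __future__ import annotations


def build_case_index(
    case_id: str,
    packet_nodes: dict[str, str],    # packet id -> knowledge node
    node_addresses: dict[str, str],  # addressing table: knowledge node -> network address
) -> dict:
    """Create the index table for one historical case."""
    entries = {}
    for packet_id, node in packet_nodes.items():
        if node not in node_addresses:
            raise KeyError(f"no storage node configured for knowledge node {node!r}")
        entries[packet_id] = {"knowledge_node": node, "address": node_addresses[node]}
    return {"case_id": case_id, "packets": entries}


# Example with made-up addresses.
index = build_case_index(
    "case-1",
    {"p0": "trademark", "p1": "contract"},
    {"trademark": "10.0.0.1:9000", "contract": "10.0.0.2:9000"},
)
```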

In S103, storing each data packet of the history case in the associated distributed storage node based on the case index table.

In this embodiment, the terminal device may upload each data packet to a distributed storage node corresponding to an associated knowledge node for storage according to a network address recorded in the case index table, thereby implementing distributed storage of the entire historical case.
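By way of illustration only, storing the data packets in S103 and later regenerating the historical case from its case index table may be sketched as follows. An in-memory dictionary stands in for the distributed storage nodes, the index is simplified to a mapping from packet identifier to network address, and packet identifiers are assumed to sort in document order; all names are hypothetical.

```python
from __future__ import annotations


class InMemoryCluster:
    """Stand-in for the distributed storage system: address -> stored packets."""

    def __init__(self) -> None:
        self.nodes: dict[str, dict[tuple[str, str], str]] = {}

    def put(self, address: str, case_id: str, packet_id: str, text: str) -> None:
        self.nodes.setdefault(address, {})[(case_id, packet_id)] = text

    def get(self, address: str, case_id: str, packet_id: str) -> str:
        return self.nodes[address][(case_id, packet_id)]


def store_case(cluster: InMemoryCluster, index: dict, packets: dict[str, str]) -> None:
    """Upload every packet to the address recorded for it in the index table."""
    for packet_id, address in index["packets"].items():
        cluster.put(address, index["case_id"], packet_id, packets[packet_id])


def load_case(cluster: InMemoryCluster, index: dict) -> str:
    """Fetch all packets listed in the index table and merge them in packet order."""
    parts = [
        cluster.get(address, index["case_id"], packet_id)
        for packet_id, address in sorted(index["packets"].items())
    ]
    return " ".join(parts)
```

In a real deployment `put` and `get` would be network calls to the storage nodes, and the lookup in `load_case` would be performed by the addressing root node.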

In a possible implementation manner, in the case of a large amount of data, one knowledge node may correspond to a plurality of distributed storage nodes. In this case, the terminal device may select one of the plurality of distributed storage nodes as a target storage node for storing the data packet of the history case through a preset load balancing algorithm. The terminal device can obtain the occupancy rate of each distributed storage node and the network operation parameters of each distributed storage node. Specifically, if the occupancy rate of the distributed storage nodes is higher, the corresponding storage priority is higher; if the value of the network operation parameter of the distributed storage node is larger (where the value of the network operation parameter is used to indicate the data transmission rate and the signal-to-noise ratio of the distributed storage node, and therefore the larger the value is, the larger the data transmission rate and the higher the signal-to-noise ratio are indicated), the corresponding storage priority is higher. The terminal device can import the storage occupancy rate and the network operation parameters into the storage priority conversion model, respectively calculate the storage priority of each distributed storage node, and select the distributed storage node with the highest storage priority as the target storage node.
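By way of illustration only, the selection of a target storage node may be sketched as follows. The description does not define the storage priority conversion model; this sketch assumes a weighted sum of the occupancy rate and the network operation parameter, with both weights defaulting to 1.0. `StorageNode`, `storage_priority`, and `pick_target` are hypothetical names.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class StorageNode:
    address: str
    occupancy: float      # fraction of capacity in use, 0.0 to 1.0
    network_score: float  # larger means higher data rate and signal-to-noise ratio


def storage_priority(node: StorageNode, occupancy_weight: float = 1.0,
                     network_weight: float = 1.0) -> float:
    """Assumed conversion model: a weighted sum of the two inputs."""
    return occupancy_weight * node.occupancy + network_weight * node.network_score


def pick_target(nodes: list[StorageNode], **weights: float) -> StorageNode:
    """Select the node with the highest storage priority."""
    if not nodes:
        raise ValueError("no distributed storage node available")
    return max(nodes, key=lambda node: storage_priority(node, **weights))
```

The default weights follow the text as written, in which the priority rises with both inputs. Because the weights are parameters, a deployment that prefers less occupied nodes could pass a negative `occupancy_weight` without changing the selection logic.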

If there is a new case, the new case may be divided into a plurality of data packets through the operations from S101 to S103, and each data packet is stored in the distributed storage node corresponding to the associated knowledge node.

In S104, based on the received search keyword, a target node in the legal knowledge graph, which is associated with the search keyword, is determined.

In this embodiment, when the user needs to search for a history case, a search keyword may be sent to the terminal device. If the terminal device is a mobile terminal, such as a smart phone, a notebook computer, and the like, a user can input a search keyword for a case to be searched through an interaction module of the terminal device, and the terminal device can generate a file retrieval result according to the received search keyword; if the terminal device is a server, for example, a retrieval server, the user may generate a retrieval request through the user terminal, where the retrieval request includes the search keyword, send the retrieval request to the server through a client built in the user terminal and associated with the server, and after receiving the retrieval request, the server extracts the search keyword included in the retrieval request and generates a file retrieval result.

In this embodiment, the terminal device may match the search keyword with each knowledge node in the legal knowledge graph, and select a knowledge node matched with the search keyword as the target node.
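By way of illustration only, matching a search keyword against the knowledge nodes in S104 may be sketched as follows. The description does not specify the matching rule; this sketch assumes a case-insensitive exact match against the set of node keywords kept for each knowledge node, and the function name is hypothetical.

```python
from __future__ import annotations


def find_target_nodes(search_keyword: str, node_keywords: dict[str, set[str]]) -> list[str]:
    """Return the knowledge nodes whose node keywords match the search keyword."""
    needle = search_keyword.strip().lower()
    return [
        node
        for node, keywords in node_keywords.items()
        if needle in {keyword.lower() for keyword in keywords}
    ]


graph_keywords = {
    "intellectual property": {"intellectual property", "IP"},
    "trademark": {"trademark", "trade mark"},
}
targets = find_target_nodes(" Trademark ", graph_keywords)
```

Because only the node keywords are compared, the cost of this step depends on the size of the legal knowledge graph rather than on the total length of the stored cases.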

In S105, a file search result is generated according to the history cases included in the distributed storage node corresponding to the target node.

In this embodiment, each data packet stored in the distributed storage node of the target node contains the node keyword associated with that knowledge node, and the node keyword matches the search keyword. The text of the data packet therefore matches the search keyword, and the historical case to which the data packet belongs can be identified as a target case that the user is searching for. The terminal device can thus obtain the case identification of the historical case corresponding to each data packet in the target node (the case identification may be the case title of the historical case) and generate the file retrieval result from these case identifications.

In a possible implementation manner, if a plurality of search keywords is received, the terminal device may determine the display order of each historical case in the file retrieval result according to the number of search keywords that the historical case contains: the more search keywords a case matches, the earlier it is displayed. If several historical cases contain the same number of search keywords, their display order may be determined according to the number of times the search keywords occur in each historical case, with the case having more occurrences displayed earlier.
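By way of illustration only, the display order for multiple search keywords may be sketched as follows. The sketch assumes case-insensitive substring counting, sums the occurrences of all search keywords for the tie-break, and orders remaining ties by case identification; `rank_cases` is a hypothetical name.

```python
from __future__ import annotations


def rank_cases(cases: dict[str, str], keywords: list[str]) -> list[str]:
    """Order case ids by distinct keywords matched, then by total occurrences."""
    scored = []
    for case_id, text in cases.items():
        lowered = text.lower()
        counts = [lowered.count(keyword.lower()) for keyword in keywords]
        matched = sum(1 for count in counts if count > 0)
        if matched:  # cases matching no keyword are left out of the result
            scored.append((-matched, -sum(counts), case_id))
    return [case_id for _, _, case_id in sorted(scored)]


ranking = rank_cases(
    {
        "A": "trademark dispute over a trademark",
        "B": "trademark and contract dispute",
        "C": "contract contract contract",
        "D": "inheritance",
    },
    ["trademark", "contract"],
)
```

Here case B is displayed first because it matches both keywords, and case C precedes case A because, with one matched keyword each, it has three occurrences against two.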

As can be seen from the above, when the method for file retrieval provided by the embodiment of the application stores a historical case, the historical case is divided into a plurality of data packets according to the legal knowledge graph and is stored in a plurality of different nodes in a distributed manner, and the associated data packets can be extracted from different distributed storage nodes through corresponding case index tables to regenerate the historical case; during subsequent keyword retrieval, a target node can be determined through the search keywords, the historical cases corresponding to the data packets stored in the target node are the target cases searched at this time, and a file retrieval result is generated, so that the purpose of file retrieval is achieved. Compared with the prior retrieval technology of legal cases, because different distributed storage nodes store data packets of historical cases, and the data packets in each distributed storage node correspond to the same knowledge node, full-text search is not needed in the subsequent search process, and after a target node associated with a search keyword is determined, the historical case corresponding to the data packet stored in the target node is the target case of the retrieval, so that the time consumed by the search is greatly reduced, and the retrieval efficiency is improved.

Fig. 2 is a flowchart illustrating a specific implementation of a method for file retrieval according to a second embodiment of the present application. Referring to Fig. 2, in the method for file retrieval provided by this embodiment, before each historical case is divided into a plurality of data packets based on the knowledge nodes contained in the preset legal knowledge graph, the method further includes S201 to S204, which are detailed as follows:


in S201, semantic analysis is performed on all the historical cases in the case database to obtain a plurality of legal entities.

In this embodiment, the terminal device may determine the case keywords contained in the historical cases through a semantic analysis algorithm, identify the part of speech of each case keyword, and select the case keywords related to legal knowledge as the legal entities, thereby obtaining the legal entities of each historical case. Determining the case keywords through the semantic analysis algorithm may specifically include: dividing a historical case into a plurality of case sentences; extracting phrases from the sentences to obtain candidate keywords for each case sentence; identifying the part of speech of each candidate keyword; selecting the candidate keywords that are nouns as case keywords; and determining the association relationships among different case keywords from verbs, prepositions, and other connecting words.

In S202, clustering the legal entities based on the standard legal text to obtain a plurality of knowledge nodes.

In this embodiment, after obtaining the legal entities included in the legal text, the terminal device may perform clustering operation on the legal entities, and package a plurality of legal entities having an association relationship into the same knowledge node, thereby determining the knowledge node included in the legal text.

In one possible implementation, the historical cases contain text in several different languages, for example English text and the corresponding Chinese translation. The same entity therefore appears in the historical cases under translations in different languages, that is, as legal entities in different languages. Legal entities that correspond to the same entity have an association relationship, so the plurality of legal entities belonging to the same entity are clustered into one knowledge node.

In one possible implementation, the terminal device stores an alias list obtained from the internet or from legal texts. The same entity may go by several different names in different legal documents, for example two differently worded names that both denote the criminal law; although the names differ, the corresponding legal entity is the same, and the names have an alias relationship with each other. After the terminal device has identified the legal entities contained in the historical cases, it can identify the legal entities that have an alias relationship, cluster them, and represent them in the legal knowledge graph by the same knowledge node, which can improve the accuracy of the legal knowledge graph.
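By way of illustration only, clustering legal entities that have an alias relationship into a single knowledge node may be sketched as follows. The sketch assumes the alias list is available as pairs of names and uses a union-find structure so that aliases of aliases end up in the same cluster; the entity names in the example are made up, and `cluster_aliases` is a hypothetical name.

```python
from __future__ import annotations


def cluster_aliases(entities: list[str], alias_pairs: list[tuple[str, str]]) -> list[set[str]]:
    """Group entities connected by alias pairs; each group becomes one knowledge node."""
    parent = {entity: entity for entity in entities}

    def find(name: str) -> str:
        # Follow parent links to the cluster representative, halving the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for left, right in alias_pairs:
        if left in parent and right in parent:  # ignore names never seen in the cases
            parent[find(left)] = find(right)

    groups: dict[str, set[str]] = {}
    for entity in entities:
        groups.setdefault(find(entity), set()).add(entity)
    return list(groups.values())


nodes = cluster_aliases(
    ["Criminal Law", "Penal Code", "Contract Law", "Criminal Code"],
    [("Criminal Law", "Penal Code"), ("Penal Code", "Criminal Code")],
)
```

The same routine could serve the multilingual case, where a name and its translation are supplied as an alias pair.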

In S203, establishing an association relationship between the knowledge nodes according to the co-occurrence word segments of the knowledge nodes in all the history cases.

In this embodiment, different legal entities may appear in the same word segment of a historical case. Two legal entities that appear in the same word segment are identified as having a co-occurrence relationship, and a word segment of a historical case that contains a plurality of legal entities is identified as a co-occurrence word segment. For example, the word segment "civil law includes marital law, contractual law, etc." contains three legal entities, namely "civil law", "marital law", and "contractual law", so it is a co-occurrence word segment.

In this embodiment, the terminal device may obtain co-occurrence word segments of the legal entities corresponding to the plurality of different knowledge nodes, locate the legal entities obtained through the identification in the co-occurrence word segments, determine the association relationships among the plurality of legal entities based on the connecting words among the plurality of legal entities, and identify the association relationships among the legal entities as the association relationships among the corresponding knowledge nodes.

In S204, the legal knowledge graph is generated according to the association relationship and the knowledge node.

In this embodiment, the terminal device may connect knowledge nodes based on the association relationship, so that a legal knowledge graph about all identified legal entities may be generated.
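By way of illustration only, S203 and S204 may be sketched together as follows. The sketch assumes plain substring matching to locate legal entities, a small hypothetical table that maps connecting words to association types, and a simplification in which the first entity in a segment is treated as the head of every relation found in that segment.

```python
from __future__ import annotations

# Hypothetical mapping from connecting words to association types.
CONNECTIVES = {"includes": "inclusion", "belongs to": "membership"}


def build_graph(segments: list[str], entity_to_node: dict[str, str]) -> dict[tuple[str, str], str]:
    """Return graph edges (head node, tail node) -> association type."""
    edges: dict[tuple[str, str], str] = {}
    for segment in segments:
        lowered = segment.lower()
        # Locate every known entity in the segment, ordered by position.
        hits = sorted(
            (lowered.find(name.lower()), node)
            for name, node in entity_to_node.items()
            if name.lower() in lowered
        )
        nodes = list(dict.fromkeys(node for _, node in hits))
        if len(nodes) < 2:
            continue  # not a co-occurrence segment
        relation = next(
            (label for word, label in CONNECTIVES.items() if word in lowered),
            "co-occurrence",
        )
        head = nodes[0]
        for tail in nodes[1:]:
            edges[(head, tail)] = relation
    return edges


graph = build_graph(
    ["Civil law includes marital law, contractual law, etc.", "The contractual law was amended."],
    {"civil law": "civil law", "marital law": "marital law", "contractual law": "contractual law"},
)
```

The second segment contributes no edge because it mentions only one legal entity.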

In the embodiment of the application, semantic analysis is performed on the historical cases to determine the legal entities contained in each historical case, the legal entities that refer to the same content are clustered to generate knowledge nodes, and the legal knowledge graph is constructed from them, which can improve the degree of aggregation and the accuracy of the legal knowledge graph.

Fig. 3 shows a flowchart of a specific implementation of S202 of a method for file retrieval according to a third embodiment of the present application. Referring to Fig. 3, relative to the embodiment described in Fig. 2, in the method for file retrieval provided by this embodiment, S202 includes S2021 to S2024, which are described as follows:

Further, the clustering of the legal entities based on the standard legal text to obtain a plurality of knowledge nodes includes:

in S2021, the associated entity and the associated type of each of the legal entities are determined in the standard legal text.

In this embodiment, the terminal device may determine whether different legal entities have an alias relationship by downloading an alias list from the internet, and may determine whether different legal entities correspond to the same legal content by performing self-learning through a plurality of standard legal texts.

In this embodiment, the terminal device may mark legal entities in each standard legal text, obtain a sentence including a corresponding legal entity, obtain other entities except the legal entity in the sentence, identify the other entities as associated entities having an association relationship with the legal entity, and determine the association types between the legal entity and each associated entity from the sentence.

For example, suppose the legal entity whose alias relationships are to be determined is "civil law", and a standard legal text contains the statement "civil law includes marital law, contractual law, etc." Besides the legal entity "civil law", the statement contains two associated entities, "marital law" and "contractual law", and the association type between "civil law" and each of them is "inclusion".

In S2022, an association network of the legal entity is generated based on the association entity and the association type of the association entity.

In this embodiment, since the standard legal texts are specifically used to define legal concepts, the association types between different pieces of legal knowledge can be determined from them. Because the identified association relationships are based on the sentences in which the standard legal texts define the legal concepts, their accuracy is high. After the terminal device has extracted the associated entities of the legal entities and determined their association types for all standard legal texts, it may integrate all the associated entities and association types to generate the associated network of each legal entity.

In a possible implementation manner, the association network may be a star-shaped relationship network. The center of the star-shaped relationship network is the legal entity, the branch nodes are the associated entities having an association relationship with the legal entity, and the connection line between the center node and a branch node may be used to represent the association type between them. Fig. 4 illustrates an association network provided by an embodiment of the present application. Referring to fig. 4, the association network corresponds to the legal entity "civil law", whose associated entities include "marital law", "contractual law", "inheritance right law", "criminal law", and "civil compensation"; the association relationship between "civil law" and each associated entity is shown in the figure.
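The star-shaped association network described above can be pictured with a small data structure. The following is only a minimal sketch: the class name `AssociationNetwork`, the function `build_networks`, and the triple format are assumptions made for illustration and do not appear in the application.

```python
from dataclasses import dataclass, field


@dataclass
class AssociationNetwork:
    """Star-shaped network: one central legal entity plus typed branches."""
    center: str
    # Maps each associated entity (branch node) to its association type (edge).
    branches: dict = field(default_factory=dict)

    def add(self, entity: str, association_type: str) -> None:
        self.branches[entity] = association_type


def build_networks(triples):
    """Integrate (legal_entity, associated_entity, association_type) triples,
    assumed to have been extracted from standard legal texts, into one
    network per legal entity."""
    networks = {}
    for center, entity, association_type in triples:
        network = networks.setdefault(center, AssociationNetwork(center))
        network.add(entity, association_type)
    return networks


# Triples taken from the example statement in the description.
triples = [
    ("civil law", "marital law", "inclusion"),
    ("civil law", "contractual law", "inclusion"),
]
networks = build_networks(triples)
```

Here `networks["civil law"].branches` holds the two branch nodes of the "civil law" star together with their association types.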

In S2023, if the similarity between the associated networks of any two legal entities is greater than a preset similarity threshold, it is identified that an alias relationship exists between the two legal entities.

In this embodiment, after obtaining the associated network corresponding to each legal entity, the terminal device may calculate the similarity between any two associated networks. The similarity calculation specifically includes: counting a first number of associated entities that appear in both networks; identifying the association types of these shared associated entities; counting a second number of shared associated entities whose association types are also the same; and determining the similarity between the two associated networks according to the first number and the second number. The greater the first number, the greater the similarity between the two legal entities.

In a possible implementation manner, the terminal device may calculate the similarity between the two associated networks through a preset similarity calculation algorithm, which may specifically be a cosine similarity algorithm or a Euclidean distance algorithm. Specifically, the terminal device converts the two associated networks into corresponding vector matrices, determines the vector distance between the two vector matrices, and determines the similarity between the two associated networks based on the vector distance.
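As a rough illustration of the cosine variant, the sketch below encodes each associated network as a 0/1 vector over the union of (associated entity, association type) pairs and compares the two vectors. The application does not specify how a network is converted into a vector matrix, so this encoding, the function names, and the sample networks (including the "parallel" association type) are assumptions.

```python
import math


def to_vector(branches, vocabulary):
    """Encode a network as a 0/1 vector over (entity, association type) pairs."""
    return [1.0 if branches.get(entity) == assoc_type else 0.0
            for entity, assoc_type in vocabulary]


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical networks, written as {associated entity: association type}.
net_a = {"marital law": "inclusion", "contractual law": "inclusion"}
net_b = {"marital law": "inclusion", "criminal law": "parallel"}

# Shared vocabulary: every (entity, type) pair seen in either network.
vocabulary = sorted(set(net_a.items()) | set(net_b.items()))
similarity = cosine_similarity(to_vector(net_a, vocabulary),
                               to_vector(net_b, vocabulary))
```

With this encoding, one of the two pairs in each network is shared, so the similarity is approximately 0.5.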

In one possible implementation manner, the similarity may be calculated as follows. The terminal device may configure a corresponding basic weight for each shared associated entity, where the smaller the concept range of the associated entity, the higher the basic weight, and conversely, the larger the concept range, the lower the basic weight. For example, the concept range of "civil law" is relatively large, so its basic weight is relatively small and may be "1"; the concept range of "marital law" is relatively small, so its basic weight is relatively large and may be "2". If the association types corresponding to a shared associated entity are also the same, a preset weighting coefficient may be superimposed on the basic weight to obtain the similarity factor of that associated entity. The similarity between the two associated networks may then be calculated by superimposing the similarity factors of all the shared associated entities.

In this embodiment, if the similarity between the associated networks of two legal entities is less than or equal to the preset similarity threshold, it is identified that the two legal entities correspond to different legal concepts, that is, they are not aliases of the same concept; otherwise, if the similarity is greater than the similarity threshold, it is identified that the two legal entities correspond to the same legal concept and are aliases of each other.
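The weighted calculation and the threshold test can be sketched as follows. This is a minimal sketch under stated assumptions: "superimposing" is read as addition, and the names `weighted_similarity`, `is_alias`, `type_bonus`, and `default_weight`, as well as the sample association types and threshold values, are hypothetical.

```python
def weighted_similarity(net_a, net_b, base_weights, type_bonus=0.5, default_weight=1.0):
    """Sum the similarity factors of the associated entities shared by two networks.

    net_a / net_b map associated entity -> association type.
    base_weights maps associated entity -> basic weight (narrower concept, higher weight).
    type_bonus is the preset weighting coefficient, added when the types also match.
    """
    score = 0.0
    for entity in net_a.keys() & net_b.keys():
        factor = base_weights.get(entity, default_weight)
        if net_a[entity] == net_b[entity]:
            factor += type_bonus
        score += factor
    return score


def is_alias(net_a, net_b, base_weights, threshold):
    """Alias relationship only if the similarity is strictly greater than the threshold."""
    return weighted_similarity(net_a, net_b, base_weights) > threshold


# Basic weights follow the example values in the description: 1 for "civil law", 2 for "marital law".
base_weights = {"civil law": 1.0, "marital law": 2.0}
net_a = {"marital law": "inclusion", "civil law": "belongs to"}
net_b = {"marital law": "inclusion", "civil law": "parallel"}

score = weighted_similarity(net_a, net_b, base_weights)  # (2.0 + 0.5) + 1.0
```

Note that a similarity exactly equal to the threshold does not count as an alias relationship, which matches the "less than or equal to" branch described above.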

In S2024, the plurality of legal entities having the alias relationship are clustered to the same knowledge node.

In the embodiment, the terminal equipment clusters a plurality of legal entities with alias relations into the same knowledge node, so that alias association can be realized in the subsequent searching process, and the searching accuracy is improved.
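One way to picture the clustering step is to merge alias pairs with a union-find structure, so that every group of mutually aliased legal entities becomes one knowledge node. This is only a sketch: the application does not name a clustering algorithm, treating the alias relationship as transitive is an assumption, and the entity names below are hypothetical.

```python
def cluster_aliases(entities, alias_pairs):
    """Group legal entities into knowledge nodes, merging every alias pair."""
    parent = {entity: entity for entity in entities}

    def find(entity):
        # Follow parent links to the representative, halving the path on the way.
        while parent[entity] != entity:
            parent[entity] = parent[parent[entity]]
            entity = parent[entity]
        return entity

    for a, b in alias_pairs:
        parent[find(a)] = find(b)

    clusters = {}
    for entity in entities:
        clusters.setdefault(find(entity), set()).add(entity)
    return list(clusters.values())


# Hypothetical entities; the pair below is assumed to have passed the similarity threshold.
entities = ["marital law", "marriage law", "contract law"]
alias_pairs = [("marital law", "marriage law")]
knowledge_nodes = cluster_aliases(entities, alias_pairs)
```

Each returned set corresponds to one knowledge node, so a later search for any alias in the set can be routed to the same node.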

In the embodiment of the application, the associated networks of different legal entities are constructed, the similarity among the different associated networks is calculated, a plurality of different legal entities with alias relationships are identified and obtained, and clustering operation is performed, so that the accuracy of the legal knowledge graph can be improved, and the retrieval efficiency is further improved.

Fig. 5 is a flowchart illustrating a specific implementation of a file retrieval method according to a fourth embodiment of the present application. Referring to fig. 5, with respect to the embodiment shown in fig. 1, before dividing each historical case into a plurality of data packets based on knowledge nodes contained in the preset legal knowledge graph, the file retrieval method of this embodiment further includes S501 to S503, which are detailed as follows:

Further, before dividing each historical case into a plurality of data packets based on knowledge nodes contained in the preset legal knowledge graph, the method includes:

In S501, case labels of the historical cases in a case database are obtained, and a case relation tree corresponding to the case database is constructed based on the label grades of the case labels; the case relation tree comprises a plurality of branch nodes; each branch node is associated with one of the case labels.

In this embodiment, when a historical case is stored in the case database, a corresponding case label may be configured for the historical case according to dimensions such as its content, its title, and the channel through which the document was acquired. The case label may be entered manually by a user, or may be extracted automatically by the terminal device after semantic analysis of the document content. Historical cases in the case database may carry case labels that identify information of different dimensions, including a legal category label that identifies the legal field to which the historical case belongs. The terminal device can extract the legal category labels from all case labels of a historical case and determine the legal field corresponding to that historical case.

For example, the legal category label may be determined based on the document content of the historical case. If the document content of the historical case describes the process by which a certain user inherits property, the value of the legal category label may be "inheritance"; if the document content describes the process of determining the marital relationship between two users, the value of the legal category label may be "marital law". Of course, the document content of a historical case may relate to multiple legal fields. For example, if a historical case concerns the inheritance of the property of a partner in a divorced or widowed couple, the historical case relates to both "inheritance" and "marital" content, and its legal category labels may be "inheritance" and "marital".

In this embodiment, the cascade relationship among the case labels can be determined according to the size and coverage of the fields to which they belong. For example, if one case label is "civil law" and another is "marital law", then since marital law is a branch of civil law, "civil law" is the parent label node of "marital law", and "marital law" is a child label node of "civil law". The terminal device can generate the case relation tree corresponding to the case database according to the cascade relationship among the case labels, that is, the label grades, and store each historical case in association with the branch node corresponding to its case label, thereby classifying the historical cases based on the case relation tree.

Exemplarily, fig. 6 shows a schematic structural diagram of a case relation tree provided in an embodiment of the present application. Referring to fig. 6, the case relation tree includes a plurality of branch nodes, each corresponding to a case label. The cascade relationship between the branch nodes can be determined according to the size and containment relationship of the fields to which they belong. The coverage of a parent node contains the coverage of its child nodes, that is, the coverage of the parent node is larger than that of the child node, and the two are in a containment relationship. Each branch node may also be marked with the number of historical cases associated with it.
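The case relation tree can be sketched as a set of branch nodes linked by parent and child label relationships, with each historical case attached to the branch nodes named by its case labels. The names `BranchNode`, `build_tree`, and `import_cases`, the `parent_of` input format, and the case identifiers are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class BranchNode:
    label: str
    children: list = field(default_factory=list)  # child BranchNode objects
    cases: list = field(default_factory=list)     # identifiers of associated historical cases


def build_tree(parent_of):
    """Build branch nodes from a mapping of case label -> parent label (None for a root)."""
    nodes = {label: BranchNode(label) for label in parent_of}
    roots = []
    for label, parent in parent_of.items():
        if parent is None:
            roots.append(nodes[label])
        else:
            nodes[parent].children.append(nodes[label])
    return nodes, roots


def import_cases(nodes, case_labels):
    """Attach each historical case to every branch node named by its case labels."""
    for case_id, labels in case_labels.items():
        for label in labels:
            nodes[label].cases.append(case_id)


# Label grades taken from the examples in the description; case identifiers are hypothetical.
parent_of = {"civil law": None, "marital law": "civil law", "inheritance": "civil law"}
nodes, roots = build_tree(parent_of)
import_cases(nodes, {"case-001": ["inheritance"],
                     "case-002": ["inheritance", "marital law"]})
```

The number of historical cases marked on a branch node is then simply `len(nodes[label].cases)`; a case with several legal category labels is attached to several branch nodes.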

In S502, each historical case is imported into the associated branch node according to the case labels, and a legal sub-graph of the branch node is established based on the legal entities contained in all historical cases in the branch node.

In this embodiment, the terminal device may store each historical case in the corresponding branch node according to its case label. Historical cases belonging to the same branch node share the same case label, so the historical cases can be classified based on the case relation tree. The terminal device may also perform semantic analysis on each historical case in each branch node, extract the legal entities contained in each historical case, create knowledge nodes based on the legal entities, and generate the legal sub-graph of the branch node from all the knowledge nodes in the branch node.

In S503, the legal knowledge graph is generated according to the legal sub-graphs of all the branch nodes and the case relation tree.

In this embodiment, the terminal device connects the legal sub-graphs of the branch nodes based on the association relationships between the branch nodes in the case relation tree, thereby generating the legal knowledge graph.

In the embodiment of the application, the case relation tree corresponding to the case database is constructed, and the legal knowledge graph is established based on the case relation tree, so that the cascade relation among different knowledge nodes in the legal knowledge graph can be improved, and the accuracy of subsequent retrieval is improved.

Fig. 7 shows a flowchart of a specific implementation of S101 of the file retrieval method according to a fifth embodiment of the present application. Referring to fig. 7, with respect to the embodiments described in fig. 1 to 6, S101 of the file retrieval method provided by this embodiment includes S1011 to S1013, which are described as follows:

Further, the dividing each historical case into a plurality of data packets based on knowledge nodes contained in a preset legal knowledge graph comprises:

In S1011, the historical case is divided into a plurality of text segments according to a preset block data amount.

In this embodiment, the block data amount of a data packet may be set in advance for each distributed storage node, which facilitates management of the data packets. Specifically, the data amount of each data packet is not greater than the block data amount; on this basis, the terminal device may divide the historical case into a plurality of text segments, each of which has a data amount not greater than the block data amount.
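A minimal sketch of this division is given below. It assumes that the data amount is measured in UTF-8 bytes and that a segment must never split a character; both choices, and the function name `split_into_segments`, are assumptions not stated in the application.

```python
def split_into_segments(text, block_bytes):
    """Split text into segments whose UTF-8 size never exceeds block_bytes."""
    if block_bytes < 4:
        # A single UTF-8 character can occupy up to 4 bytes.
        raise ValueError("block_bytes must be at least 4")
    segments, current, size = [], [], 0
    for ch in text:
        ch_size = len(ch.encode("utf-8"))
        # Close the current segment before it would exceed the block data amount.
        if current and size + ch_size > block_bytes:
            segments.append("".join(current))
            current, size = [], 0
        current.append(ch)
        size += ch_size
    if current:
        segments.append("".join(current))
    return segments


# Each of these characters takes 3 bytes in UTF-8, so a 9-byte block holds three of them.
segments = split_into_segments("民法包括婚姻法", 9)
```

Concatenating the segments in order restores the original text, which is what later allows a target case to be reconstructed from its data packets.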

In S1012, keywords are extracted from each text segment based on the legal knowledge graph to obtain the text keywords corresponding to each text segment; the text keywords are recorded in knowledge nodes of the legal knowledge graph.

In this embodiment, after dividing the historical case into a plurality of text segments, the terminal device may perform keyword search on each text segment, determine whether the text segment includes a node keyword corresponding to any knowledge node of the legal knowledge graph, and identify the node keyword as a text keyword corresponding to the text segment if the text segment includes the node keyword corresponding to the knowledge node.

In a possible implementation manner, if the text segment contains node keywords corresponding to a plurality of knowledge nodes, associated nodes corresponding to the knowledge nodes can be obtained, the number of the associated keywords of the associated nodes contained in the text segment is counted, and the knowledge node containing the most associated keywords is selected as the knowledge node corresponding to the text segment.
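The selection rule for a text segment that matches several knowledge nodes can be sketched as follows. Plain substring matching stands in for the keyword search, ties are resolved in favor of the first candidate, and the graph contents are hypothetical; none of these details come from the application.

```python
def pick_knowledge_node(segment, graph):
    """Choose the knowledge node for a text segment.

    graph maps a node keyword to the set of keywords of its associated nodes.
    Among the nodes whose keyword occurs in the segment, the node with the most
    associated keywords also occurring in the segment is selected.
    """
    candidates = [node for node in graph if node in segment]
    if not candidates:
        return None

    def associated_hits(node):
        return sum(1 for keyword in graph[node] if keyword in segment)

    return max(candidates, key=associated_hits)


# Hypothetical fragment of a legal knowledge graph.
graph = {
    "inheritance": {"will", "heir"},
    "marriage": {"divorce", "spouse"},
}
segment = "The heir contested the will, so the inheritance was settled before the marriage ended."
chosen = pick_knowledge_node(segment, graph)
```

Both "inheritance" and "marriage" occur in the segment, but only the associated keywords of "inheritance" also occur, so that node is chosen.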

In S1013, an association relationship between the text segment and the knowledge node is established, and the data packet is generated according to the association relationship and the text segment.

In this embodiment, after determining the associated knowledge node of each text segment in the legal knowledge graph, the terminal device may establish an association relationship between the two, and encapsulate the association relationship in the data packet, so that the association relationship can be uploaded to the distributed storage node corresponding to the knowledge node during subsequent storage.
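The encapsulation step can be illustrated with a small record type that carries the text segment together with its associated knowledge node. The field names, the `sequence` field used to keep segment order, the index-table layout, and the network addresses below are all hypothetical; the application only states that the association relationship is encapsulated in the data packet and that the case index table stores the network addresses of the distributed storage nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPacket:
    case_id: str          # identifier of the historical case
    sequence: int         # position of the segment within the case
    knowledge_node: str   # knowledge node associated with the segment
    text: str             # the text segment itself


def make_packets(case_id, segments, node_for_segment):
    """Wrap each text segment and its associated knowledge node into a data packet."""
    return [DataPacket(case_id, i, node_for_segment[i], segment)
            for i, segment in enumerate(segments)]


def build_index_table(packets, node_address):
    """List, in segment order, the storage-node address that holds each packet."""
    return [node_address[p.knowledge_node] for p in packets]


segments = ["The estate passed to the heir. ", "The marriage had ended earlier."]
packets = make_packets("case-001", segments, ["inheritance", "marital law"])
# Hypothetical addresses of the distributed storage nodes tied to each knowledge node.
node_address = {"inheritance": "10.0.0.1:9000", "marital law": "10.0.0.2:9000"}
index_table = build_index_table(packets, node_address)
```

Because every packet records its knowledge node, it can be routed to the distributed storage node associated with that knowledge node when it is stored.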

In the embodiment of the application, the historical cases are divided into text segments based on the block data amount, so that the data amount of every data packet stays within the same limit, which improves the efficiency of subsequent data storage and management.

Fig. 8 shows a flowchart of a specific implementation of S104 of the file retrieval method according to a sixth embodiment of the present application. Referring to fig. 8, with respect to any one of the embodiments in fig. 1 to fig. 6, S104 of the file retrieval method provided by this embodiment includes S1041 to S1042, which are described as follows:

Further, the determining a target node in the legal knowledge graph associated with the search keyword based on the received search keyword comprises:

In S1041, semantic analysis is performed on the search keyword to determine the legal entities contained in the search keyword.

In this embodiment, when searching for historical cases, the user may input a corresponding search keyword, and the terminal device may perform semantic analysis on the search keyword to extract the legal entities it contains. A search keyword may be composed of several different legal entities; for example, the search keyword "intellectual property right" includes two legal entities, namely "intellectual property right" and "right". A search keyword may also have several different aliases or translated names. In either case, the terminal device can determine the corresponding legal entities from the search keyword input by the user, so as to realize search association.

In S1042, legal entities included in the search keyword are matched with knowledge nodes in the legal knowledge graph, and the knowledge nodes matched with the legal entities are identified as the target nodes associated with the search keyword.

In this embodiment, the terminal device may match the legal entity included in the search keyword with each knowledge node, and use the matched knowledge node as a target node corresponding to the search keyword.
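The matching of a search keyword against knowledge nodes can be sketched as below. Substring matching is used here as a stand-in for the semantic analysis, and the node identifiers and alias sets are hypothetical.

```python
def find_target_nodes(search_keyword, knowledge_nodes):
    """Return the knowledge nodes whose legal entities (including aliases)
    occur in the search keyword.

    knowledge_nodes maps a node identifier to the set of legal entity names
    clustered into that node.
    """
    targets = []
    for node_id, names in knowledge_nodes.items():
        if any(name in search_keyword for name in names):
            targets.append(node_id)
    return targets


# Hypothetical knowledge nodes, each holding a cluster of aliases.
knowledge_nodes = {
    "node-marriage": {"marital law", "marriage law"},
    "node-contract": {"contract law"},
}
targets = find_target_nodes("marriage law property division", knowledge_nodes)
```

Because aliases are clustered into one knowledge node, searching for either "marital law" or "marriage law" leads to the same target node.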

In the embodiment of the application, the search keyword is subjected to semantic analysis to determine the legal entity contained in the search keyword, and the corresponding knowledge node is searched based on the legal entity, so that the purpose of determining the knowledge node corresponding to the search result is achieved.

Fig. 9 shows a flowchart of a specific implementation of S105 of the file retrieval method according to a seventh embodiment of the present application. Referring to fig. 9, with respect to any one of the embodiments in fig. 1 to 6, S105 of the file retrieval method provided in this embodiment includes S1051 to S1052, which are described as follows:

Further, the generating a file retrieval result according to the history cases included in the distributed storage node corresponding to the target node includes:

In S1051, the stored data packets of the historical cases are extracted from the distributed storage node corresponding to the target node, and the search keyword is marked in the case language segment corresponding to each data packet.

In this embodiment, after determining the target node, the terminal device may query the data packets stored in the distributed storage node associated with the target node according to the network address of that distributed storage node, where each data packet corresponds to an associated historical case. The terminal device parses each data packet to obtain the case language segment stored in it, and marks the legal entity corresponding to the search keyword in the case language segment; the mark may use a prominent display mode such as red text or highlighting.

In S1052, encapsulating all case language segments marked with the search keyword, and generating the file retrieval result.

In this embodiment, the terminal device encapsulates all case language segments in the target node that are marked with the search keyword and generates the file retrieval result. From the file retrieval result, the user can see the case language segments containing the search keyword and the case identifiers of the historical cases to which they belong, and can therefore readily understand the search result. The user may select several historical cases from the file retrieval result as target cases. According to the user's selection instruction, the terminal device can obtain the case index table corresponding to each target case, extract the data packets of the target case from the distributed storage nodes, and reconstruct the target case for output. Complete historical cases therefore do not need to be obtained in the search stage; the full text of a historical case is obtained only when it needs to be downloaded, which improves the search speed and reduces the amount of data accessed in the file retrieval system.
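The marking, encapsulation, and on-demand reconstruction steps can be sketched as follows. Data packets are represented as plain dictionaries, the `[[` and `]]` markers stand in for the red or highlighted display, and the `sequence` field used to restore segment order is an assumption; the application does not describe the packet layout.

```python
def mark(text, entity, open_tag="[[", close_tag="]]"):
    """Wrap every occurrence of the legal entity in marker tags."""
    return text.replace(entity, f"{open_tag}{entity}{close_tag}")


def build_result(packets, entity):
    """Collect the marked case language segments found in a target storage node."""
    return [{"case_id": p["case_id"], "segment": mark(p["text"], entity)}
            for p in packets if entity in p["text"]]


def reconstruct(case_id, all_packets):
    """Rebuild the full text of a selected target case from its data packets."""
    own = sorted((p for p in all_packets if p["case_id"] == case_id),
                 key=lambda p: p["sequence"])
    return "".join(p["text"] for p in own)


# Hypothetical packets; in the described system they reside on different storage nodes.
packets = [
    {"case_id": "case-001", "sequence": 1, "text": " under inheritance law."},
    {"case_id": "case-001", "sequence": 0, "text": "The estate was divided"},
]
result = build_result(packets, "inheritance law")
full_text = reconstruct("case-001", packets)
```

Only the matching segment appears in `result`; the complete text is assembled by `reconstruct` solely for the cases the user selects, which mirrors the deferred retrieval described above.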

It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Fig. 10 shows a block diagram of a file retrieval apparatus according to an embodiment of the present application, where the file retrieval apparatus includes units for executing the steps in the embodiment corresponding to fig. 1. For details, please refer to the related description of the embodiment corresponding to fig. 1. For convenience of explanation, only the portions related to the present embodiment are shown.

Referring to fig. 10, the apparatus for file retrieval includes:

the data packet dividing unit 11 is configured to divide each historical case into a plurality of data packets based on knowledge nodes included in a preset legal knowledge graph;

a case index table creating unit 12, configured to create a case index table corresponding to the historical case according to the knowledge node associated with each data packet; each knowledge node is associated with a corresponding distributed storage node; the case index table is used for storing the network addresses of the distributed storage nodes;

a data packet storage unit 13, configured to store, based on the case index table, each data packet of the historical case in the associated distributed storage node;

a search keyword receiving unit 14, configured to determine, based on the received search keyword, a target node in the legal knowledge graph that is associated with the search keyword;

and a file search result output unit 15, configured to generate a file search result according to a history case included in the distributed storage node corresponding to the target node.

Optionally, the apparatus for file retrieval further includes:

the legal entity acquisition unit is used for performing semantic analysis on all the historical cases in the case database to obtain a plurality of legal entities;

the legal entity clustering unit is used for clustering the legal entities based on a standard legal text to obtain a plurality of knowledge nodes;

the incidence relation determining unit is used for establishing incidence relations among the knowledge nodes according to co-occurrence word segments of the knowledge nodes in all the historical cases;

and the first legal knowledge graph generating unit is used for generating the legal knowledge graph according to the incidence relation and the knowledge node.

Optionally, the legal entity clustering unit includes:

the associated entity determining unit is used for determining the associated entity and the associated type of each legal entity in the standard legal text;

an association network generating unit, configured to generate an association network of the legal entity based on the association entity and the association type of the association entity;

the similarity calculation unit is used for identifying that the two legal entities have an alias relationship if the similarity between the associated networks of any two legal entities is greater than a preset similarity threshold;

and the alias relationship identification unit is used for clustering a plurality of legal entities with the alias relationship to the same knowledge node.

Optionally, the apparatus for file retrieval further includes:

the case relation tree building unit is used for obtaining case labels of the historical cases in the case database and building a case relation tree corresponding to the case database based on the label grades of the case labels; the case relation tree comprises a plurality of branch nodes; each branch node is associated with one case label;

the legal sub-graph generating unit is used for importing each historical case into the associated branch node according to the case labels and establishing a legal sub-graph of the branch node based on the legal entities contained in all historical cases in the branch node;

and the second legal knowledge graph generating unit is used for generating the legal knowledge graph according to the legal sub-graphs of all the branch nodes and the case relation tree.

Optionally, the packet dividing unit 11 includes:

the text segment dividing unit is used for dividing the historical case into a plurality of text segments according to the preset block data size;

the text keyword extraction unit is used for respectively extracting keywords from each text segment based on the legal knowledge graph to obtain text keywords corresponding to each text segment; the text keywords are recorded in knowledge nodes of the legal knowledge graph;

and the data packet packaging unit is used for establishing the incidence relation between the text segment and the knowledge node and generating the data packet according to the incidence relation and the text segment.

Optionally, the search keyword receiving unit 14 includes:

the legal entity determining unit is used for performing semantic analysis on the search keyword and determining legal entities contained in the search keyword;

and the target node determining unit is used for matching legal entities contained in the search keywords with all knowledge nodes in the legal knowledge graph and identifying the knowledge nodes matched with the legal entities as the target nodes associated with the search keywords.

Optionally, the file retrieval result output unit 15 includes:

the search keyword marking unit is used for extracting the stored data packet of the historical case from the distributed storage node corresponding to the target node and marking the search keyword from the case language section corresponding to the data packet;

and the case language segment packaging unit is used for packaging all the case language segments marked with the search keywords to generate the file retrieval result.

Therefore, in the file retrieval device provided by the embodiment of the application, since the data packets of the historical cases are stored in different distributed storage nodes, and the data packets in each distributed storage node correspond to the same knowledge node, in the subsequent search process, full-text search is not required, and after the target node associated with the search keyword is determined, the historical case corresponding to the data packet stored in the target node is the target case of the current retrieval, so that the time consumed by the search is greatly reduced, and the retrieval efficiency is improved.

Fig. 11 is a schematic diagram of a terminal device according to another embodiment of the present application. As shown in fig. 11, the terminal device 11 of this embodiment includes: a processor 110, a memory 111, and a computer program 112, such as a file retrieval program, stored in the memory 111 and executable on the processor 110. When executing the computer program 112, the processor 110 implements the steps in the above-described embodiments of the file retrieval method, such as S101 to S105 shown in fig. 1. Alternatively, when executing the computer program 112, the processor 110 implements the functions of the units in the apparatus embodiments, such as the functions of the units 11 to 15 shown in fig. 10.

Illustratively, the computer program 112 may be divided into one or more units, which are stored in the memory 111 and executed by the processor 110 to accomplish the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 112 in the terminal device 11. For example, the computer program 112 may be divided into a packet dividing unit, a case index table creating unit, a packet storing unit, a search key receiving unit, and a file retrieval result outputting unit, each of which functions as described above.

The terminal device 11 may be a desktop computer, a notebook, a palm computer, a cloud server, or another computing device. The terminal device may include, but is not limited to, the processor 110 and the memory 111. Those skilled in the art will appreciate that fig. 11 is merely an example of the terminal device 11 and does not limit the terminal device 11, which may include more or fewer components than those shown, combine some components, or use different components; for example, the terminal device may also include input and output devices, network access devices, buses, etc.

The Processor 110 may be a Central Processing Unit (CPU), another general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

The memory 111 may be an internal storage unit of the terminal device 11, such as a hard disk or an internal memory of the terminal device 11. The memory 111 may also be an external storage device of the terminal device 11, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, or a Flash memory Card (Flash Card) provided on the terminal device 11. Further, the memory 111 may include both an internal storage unit and an external storage device of the terminal device 11. The memory 111 is used for storing the computer program and other programs and data required by the terminal device. The memory 111 may also be used to temporarily store data that has been output or is to be output.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The above-mentioned embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims

1. A method of document retrieval, comprising:

2. The document retrieval method according to claim 1, wherein before dividing each of the history cases into a plurality of data packages based on knowledge nodes contained in the preset legal knowledge graph, the method comprises:

performing semantic analysis on all the historical cases in the case database to obtain a plurality of legal entities;

clustering the legal entities based on standard legal texts to obtain a plurality of knowledge nodes;

establishing an incidence relation among the knowledge nodes according to the co-occurrence word segments of the knowledge nodes in all the historical cases;

and generating the legal knowledge graph according to the incidence relation and the knowledge node.

3. The document retrieval method of claim 2, wherein the clustering the plurality of legal entities based on standard legal texts to obtain a plurality of knowledge nodes comprises:

determining associated entities and associated types of each legal entity in the standard legal text;

generating an association network for the legal entity based on the associated entity and the association type of the associated entity;

if the similarity between the associated networks of any two legal entities is greater than a preset similarity threshold, identifying that the two legal entities have an alias relationship;

clustering a plurality of the legal entities having the alias relationship to the same knowledge node.

4. The document retrieval method according to claim 1, wherein before dividing each of the history cases into a plurality of data packages based on knowledge nodes contained in the preset legal knowledge graph, the method comprises:

acquiring case labels of the historical cases in a case database, and constructing a case relation tree corresponding to the case database based on the label grade of each case label; the case relation tree comprises a plurality of branch nodes; each branch node is associated with one case label;

importing each historical case into the associated branch node according to the case labels, and establishing a legal sub-graph of the branch node based on the legal entities contained in all historical cases in the branch node;

and generating the legal knowledge graph according to the legal sub-graphs of all the branch nodes and the case relation tree.

5. The document retrieval method according to any one of claims 1 to 4, wherein the dividing each historical case into a plurality of data packages based on knowledge nodes contained in a preset legal knowledge graph comprises:

dividing the historical case into a plurality of text segments according to a preset block data volume;

extracting keywords from each text segment based on the legal knowledge graph to obtain text keywords corresponding to each text segment; the text keywords are recorded in knowledge nodes of the legal knowledge graph;

and establishing an association relation between each text segment and the knowledge node, and generating the data packet according to the association relation and the text segment.
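
The three steps of claim 5 can be sketched as below. The example assumes the block data volume is measured in characters and that keyword extraction is a plain substring lookup against each node's terms; `split_into_packets`, `node_terms`, and the packet fields are hypothetical names.

```python
def split_into_packets(case_id, text, block_size, node_terms):
    """Split one historical case into data packets tagged with knowledge nodes.

    block_size: assumed preset block data volume, in characters.
    node_terms: hypothetical mapping of node id -> set of lowercase terms.
    """
    packets = []
    for start in range(0, len(text), block_size):
        segment = text[start:start + block_size]
        lowered = segment.lower()
        # A node is associated with the segment if any of its terms appears in it.
        nodes = sorted(
            node for node, terms in node_terms.items()
            if any(term in lowered for term in terms)
        )
        packets.append(
            {"case_id": case_id, "offset": start, "segment": segment, "nodes": nodes}
        )
    return packets
```

A fixed-size split can cut a keyword in half at a block boundary, so a practical version might split on sentence boundaries or overlap adjacent blocks.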

6. The file retrieval method according to any one of claims 1 to 4, wherein determining a target node in the legal knowledge graph associated with the search keyword based on the received search keyword comprises:

performing semantic analysis on the search keyword, and determining the legal entities contained in the search keyword;

and matching legal entities contained in the search keyword with all knowledge nodes in the legal knowledge graph, and identifying the knowledge nodes matched with the legal entities as the target nodes associated with the search keyword.
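
As an illustration of claim 6, the sketch below replaces the semantic analysis step with a simple term lookup, which is an assumption made to keep the example self-contained; `find_target_nodes` and `node_terms` are hypothetical names.

```python
def find_target_nodes(query, node_terms):
    """Return (legal entities found in the query, matching knowledge nodes).

    node_terms: hypothetical mapping of node id -> set of lowercase terms,
    including aliases clustered into the same node.
    """
    lowered = query.lower()
    # Stand-in for semantic analysis: any known term contained in the query.
    entities = sorted(
        {term for terms in node_terms.values() for term in terms if term in lowered}
    )
    entity_set = set(entities)
    # A node is a target node if it records at least one matched entity.
    targets = sorted(node for node, terms in node_terms.items() if terms & entity_set)
    return entities, targets
```

Because aliases share a node, a query that uses "larceny" resolves to the same target node as one that uses "theft".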

7. The file retrieval method according to any one of claims 1 to 4, wherein generating a file retrieval result according to the historical cases contained in the distributed storage node corresponding to the target node comprises:

extracting the stored data packets of the historical cases from the distributed storage node corresponding to the target node, and marking the search keyword in the case language segment corresponding to each data packet;

and packaging all the case language segments marked with the search keyword to generate the file retrieval result.
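
The two steps of claim 7 might look like the following sketch. The distributed storage is modeled as an in-memory dictionary keyed by knowledge node, and keywords are marked with double square brackets; both choices, along with the name `build_retrieval_result` and the packet fields, are assumptions for illustration.

```python
import re


def build_retrieval_result(storage, target_nodes, keyword):
    """Collect packets for the target nodes and mark the keyword in each segment.

    storage: hypothetical mapping of node id -> list of packets, where each
    packet is a dict with "case_id", "offset", and "segment" keys.
    """
    if not keyword:
        return {"keyword": keyword, "segments": []}
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    marked, seen = [], set()
    for node in target_nodes:
        for packet in storage.get(node, []):
            key = (packet["case_id"], packet["offset"])
            if key in seen:  # the same packet may be stored under several nodes
                continue
            seen.add(key)
            text, count = pattern.subn(lambda m: f"[[{m.group(0)}]]", packet["segment"])
            if count:  # keep only segments that actually contain the keyword
                marked.append(
                    {"case_id": packet["case_id"], "offset": packet["offset"], "segment": text}
                )
    return {"keyword": keyword, "segments": marked}
```

Only the storage nodes for the target nodes are read, which reflects the reduced search scope the method aims for.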

8. An apparatus for document retrieval, comprising:

9. A terminal device, characterized in that the terminal device comprises a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 7.

10. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.