CN110968664A - Document retrieval method, device, equipment and medium - Google Patents

Publication number
CN110968664A
Authority
CN
China
Prior art keywords
document
semantic vector
semantic
generation model
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811160590.9A
Other languages
Chinese (zh)
Inventor
张广鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201811160590.9A priority Critical patent/CN110968664A/en
Publication of CN110968664A publication Critical patent/CN110968664A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the application discloses a document retrieval method. A semantic vector generation model processes a first document input or uploaded by a user to generate a corresponding first semantic vector, and processes the documents stored in a document library to generate corresponding second semantic vectors. The similarity between the first semantic vector and each second semantic vector is then calculated, and a retrieval result is determined among the documents stored in the document library based on the similarity calculation result. Rather than relying on text similarity alone, the method uses the semantic vector generation model to determine the retrieval result based on semantic similarity. Because semantic features reflect the substantive content of a document better than text features do, the method can effectively improve the accuracy of document retrieval and ensure that the retrieved documents match the user's actual retrieval intention.

Description

Document retrieval method, device, equipment and medium
Technical Field
The present application relates to the field of information search technologies, and in particular, to a document retrieval method, apparatus, device, and medium.
Background
In general, a document here means a written instrument such as an official document, a letter, or a contract. Accordingly, document retrieval means using one document as the retrieval condition and retrieving related or similar documents from a corresponding document library.
Existing document retrieval technology is applied mainly in the judicial field, where it is used to retrieve legal documents. Legal documents are documents that organs such as the public security organs (including the state security organs), procuratorates, courts, prisons, reform-through-labor organs, judicial organs, and arbitration bodies produce according to law to handle litigation and non-litigation cases, together with documents of legal effect or legal significance that the parties to a case, lawyers, and law firms write themselves or have written on their behalf.
Specifically, when trying a case, a judge often uses the legal document awaiting judgment as the retrieval condition, retrieves related legal documents from a document library that stores legal documents, and consults them as reference material when deciding the final judgment. Legal researchers likewise often need to retrieve the legal documents relevant to a particular legal provision for focused study.
Current mainstream document retrieval methods fall into two types. The first uses an Elasticsearch (ES) system: the ES system compares the keywords of the document uploaded by the user with the keywords of each document in the document library, and returns as search results the documents whose keywords have the higher text similarity. The second performs element analysis on both the document uploaded by the user and the documents stored in the document library. The elements obtained in this way reflect the semantics of the documents they come from, and the final retrieval result is determined from the similarity between the elements.
The first method retrieves mainly on the text similarity between keywords. Text similarity does not fully represent semantic similarity, so the documents it retrieves may not match the user's actual retrieval intention. Whether the second method matches that intention depends on whether the analyzed elements accurately reflect the semantics of their documents; that is, the accuracy of the retrieval result is bounded by the accuracy of the element analysis. Obtaining a good element analysis requires domain experts and technicians to work together on element analysis rules and on labeled data, which consumes a great deal of time and labor. In other words, ensuring accurate retrieval results demands a large research and development investment.
Disclosure of Invention
The embodiment of the application provides a document retrieval method, a document retrieval device, document retrieval equipment and a document retrieval medium, which can improve the accuracy of document retrieval and ensure that the retrieved document meets the actual retrieval intention of a user.
In view of the above, a first aspect of the present application provides a document retrieval method, including:
acquiring a first document;
inputting the first document into a semantic vector generation model, and acquiring a first semantic vector output by the semantic vector generation model; the semantic vector generation model is a neural network model which takes a document as input and takes semantic vectors representing semantic features as output;
calculating the similarity between the first semantic vector and a second semantic vector corresponding to a document stored in a document library; the second semantic vector is obtained by processing the documents stored in the document library by using the semantic vector generation model;
and determining a retrieval result in the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
Optionally, the semantic vector generation model includes a word segmentation module, a word vector determination module, and a semantic vector determination module connected in series;
the word segmentation module is used for performing word segmentation on the document to obtain a first word segmentation sequence;
the word vector determination module is used for mapping each segmented word in the first word segmentation sequence to a corresponding word vector; the word vector represents the semantic features of its segmented word;
the semantic vector determination module is used for performing convolution processing on the word vectors corresponding to the segmented words in the first word segmentation sequence to obtain a semantic vector.
Optionally, the semantic vector generation model further includes a word truncation module;
the word truncation module is used for taking a preset number of segmented words from the first word segmentation sequence according to a preset truncation rule to form a second word segmentation sequence;
the word vector determination module is specifically configured to:
map each segmented word in the second word segmentation sequence to a corresponding word vector;
the semantic vector determination module is specifically configured to:
perform convolution processing on the word vectors corresponding to the segmented words in the second word segmentation sequence to obtain the semantic vector.
Optionally, the word vector determination module includes a Word2vec model, and the semantic vector determination module includes an Inception model.
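The serial pipeline described above (word segmentation, optional truncation, word vector lookup, convolution) can be illustrated with a minimal sketch. Everything in it is a stand-in assumed for illustration and not the patent's implementation: whitespace splitting replaces a real word segmenter, a hash-seeded pseudo-random vector replaces a trained Word2vec lookup, and one layer of fixed random filters with max pooling replaces the trained convolutional network. The names `segment`, `truncate`, `word_vector`, and `semantic_vector`, and all constants, are hypothetical.

```python
import hashlib
import random

DIM = 8          # word vector size (assumed)
MAX_WORDS = 6    # preset number of words kept by truncation (assumed)
KERNEL = 3       # convolution window width in words (assumed)
N_FILTERS = 4    # number of convolution filters (assumed)

def segment(text):
    # Stand-in for the word segmentation module: split on whitespace.
    return text.lower().split()

def truncate(words, limit=MAX_WORDS):
    # Truncation rule assumed here: keep the first `limit` words.
    return words[:limit]

def word_vector(word):
    # Stand-in for a trained Word2vec lookup: a deterministic
    # pseudo-random vector seeded from a hash of the word.
    seed = int.from_bytes(hashlib.md5(word.encode("utf-8")).digest()[:4], "little")
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(DIM)]

# Fixed random filters standing in for trained convolution weights.
_rng = random.Random(0)
FILTERS = [[[_rng.gauss(0.0, 1.0) for _ in range(DIM)]
            for _ in range(KERNEL)]
           for _ in range(N_FILTERS)]

def semantic_vector(text):
    vecs = [word_vector(w) for w in truncate(segment(text))]
    # Pad short inputs with zero vectors so at least one window exists.
    while len(vecs) < KERNEL:
        vecs.append([0.0] * DIM)
    pooled = []
    for filt in FILTERS:
        responses = []
        # Slide the filter over every window of KERNEL consecutive words.
        for start in range(len(vecs) - KERNEL + 1):
            window = vecs[start:start + KERNEL]
            responses.append(sum(w * x
                                 for frow, vrow in zip(filt, window)
                                 for w, x in zip(frow, vrow)))
        # Max pooling over positions gives one value per filter.
        pooled.append(max(responses))
    return pooled
```

Because of the truncation step, two texts that share their first `MAX_WORDS` words map to the same vector in this sketch, which is the trade-off a preset truncation rule accepts in exchange for fixed-size input.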
Optionally, the first document is a legal document;
in that case, before the first document is input into the semantic vector generation model, the method further includes:
determining the case description paragraph according to the labels corresponding to the paragraphs in the first document;
and inputting the first document into the semantic vector generation model includes:
inputting the case description paragraph into the semantic vector generation model.
Optionally, before inputting the first document into the semantic vector generation model, the method further includes:
determining a keyword corresponding to the first document as a first keyword;
determining, in the document library, keywords that match the first keyword as second keywords, wherein the text similarity between each second keyword and the first keyword is higher than a similarity threshold; the document library is used for storing documents and the keywords associated with them;
determining the document associated with the second keyword as a candidate document;
the calculating the similarity between the first semantic vector and a second semantic vector corresponding to a document stored in a document library includes:
and calculating the similarity between the first semantic vector and the second semantic vector corresponding to the candidate documents respectively.
Optionally, the second semantic vector is obtained by processing the documents stored in the document library by using the semantic vector generation model after the first document is acquired.
A second aspect of the present application provides a document retrieval apparatus, the apparatus comprising:
the acquisition module is used for acquiring a first document;
the processing module is used for inputting the first document into a semantic vector generation model and acquiring a first semantic vector output by the semantic vector generation model; the semantic vector generation model is a neural network model which takes a document as input and takes semantic vectors representing semantic features as output;
the calculation module is used for calculating the similarity between the first semantic vector and a second semantic vector corresponding to the document stored in the document library; the second semantic vector is obtained by processing the documents stored in the document library by using the semantic vector generation model;
and the determining module is used for determining a retrieval result in the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
A third aspect of the present application provides a computer-readable storage medium for storing program code for executing the document retrieval method provided by the first aspect.
A fourth aspect of the present application provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the document retrieval method provided by the first aspect above.
According to the technical scheme, the embodiment of the application has the following advantages:
The embodiment of the application provides a document retrieval method built on a semantic vector generation model, a neural network model that takes a document as input and outputs a semantic vector capable of representing the semantic features of that document. The model processes a first document input or uploaded by a user to generate a corresponding first semantic vector, and processes the documents stored in a document library to generate corresponding second semantic vectors. The similarity between the first semantic vector and each second semantic vector is then calculated, and a retrieval result is determined among the documents stored in the document library based on the similarity calculation result. Compared with the prior-art method of searching documents with an ES system, the method is not limited to text similarity: it generates semantic vectors capable of representing the semantic features of the documents and determines the retrieval result from the similarity between those vectors. In addition, the method needs no manual intervention to formulate complex semantic extraction rules, which greatly reduces the labor cost of research and development.
Drawings
FIG. 1 is a schematic diagram of an application scenario of a document retrieval method according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a document retrieval method according to an embodiment of the present application;
FIG. 3 is a schematic structural diagram of a semantic vector generation model according to an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an Inception model according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a document retrieval apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of a server according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," "fourth," and the like in the description and in the claims of the present application and in the drawings described above, if any, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
To address the technical problems of the prior art, namely the low accuracy of document retrieval results and the excessive cost of research and development, the embodiment of the application provides a document retrieval method that can improve the accuracy of document retrieval, ensure that the retrieved documents match the user's actual retrieval intention, and reduce to some extent the labor cost of research and development.
The following first introduces the core technical idea of the document retrieval method provided by the embodiment of the present application:
In the document retrieval method provided by the embodiment of the application, a document input or uploaded by a user is first obtained as the first document. The first document is then input into a semantic vector generation model, and the first semantic vector output by the model is obtained; the semantic vector generation model is a neural network model that takes a document as input and outputs a semantic vector capable of representing the semantic features of the document. Next, the similarity between the first semantic vector and the second semantic vector corresponding to each document stored in the document library is calculated, where the second semantic vectors are obtained by processing the documents stored in the document library with the same semantic vector generation model. Finally, a retrieval result is determined among the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
Compared with the prior-art method of searching documents with an ES system, the document retrieval method provided by the embodiment of the application is not limited to determining the retrieval result from text similarity. Instead, it processes the first document and the documents stored in the document library with the semantic vector generation model to generate semantic vectors capable of representing their semantic features, and determines the final retrieval result from the similarity between those vectors; that is, it determines the retrieval result based on semantic similarity. Because semantic features reflect the substantive content of a document better than text features do, the method can effectively improve the accuracy of the retrieval result and ensure that the retrieved documents match the user's actual retrieval intention. In addition, the method needs no manual intervention to formulate complex semantic extraction rules, which greatly reduces the labor cost of research and development.
It should be understood that the document retrieval method provided by the embodiment of the present application may be applied to a server that provides a retrieval service, and the server may specifically be an application server or a Web server. In an actual deployment, the server may be a standalone server or a server cluster. The server can respond to documents sent by multiple terminal devices at the same time and, for each of them, retrieve from the document library the documents semantically related to it as the retrieval result.
In order to facilitate understanding of the technical solution of the present application, a server is taken as an execution subject, and the document retrieval method provided by the embodiment of the present application is introduced with reference to an actual application scenario.
Referring to FIG. 1, FIG. 1 is a schematic diagram of an application scenario of a document retrieval method provided in an embodiment of the present application. The application scenario includes a terminal device 101 and a server 102. The terminal device 101 is configured to send a document input or uploaded by a user to the server 102, so that the server 102 performs document retrieval according to that document. The server 102 is configured to carry out the document retrieval method provided in the embodiment of the present application on the document sent by the terminal device 101, and to retrieve from the document library the documents semantically related to it as the retrieval result.
It should be understood that the document library may be stored in the server 102 or on another device; if it is stored on another device, the server 102 can call the document library on that device whenever it needs to search for related documents as the retrieval result.
When a user needs to search for a document, the user can input or upload a document as the search condition through the terminal device 101, and the terminal device 101 sends the received document to the server 102. The server 102 takes the document sent by the terminal device 101 as the first document and inputs it into the semantic vector generation model running on the server itself; by processing the first document, the model outputs a first semantic vector capable of representing the semantic features of the first document. The server then calculates the similarity between the first semantic vector and the second semantic vector corresponding to each document stored in the document library, where the second semantic vectors are obtained by processing the documents stored in the document library with the semantic vector generation model. Finally, it determines a retrieval result among the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
The semantic vector generation model executed in the server 102 is a neural network model that takes a document as an input and takes a semantic vector capable of expressing semantic features of the document as an output. The document retrieval method utilizes the semantic vector generation model to process the first document and documents stored in the document library to obtain semantic vectors corresponding to the documents, and further determines retrieval results based on the similarity between the semantic vectors.
It should be noted that the application scenario described in fig. 1 is only an example, and in practical application, the document retrieval method provided in the embodiment of the present application may also be applied to other application scenarios, and no specific limitation is made to the application scenario of the document retrieval method herein.
The document retrieval method provided by the present application is described below by way of example.
Referring to FIG. 2, FIG. 2 is a schematic flow chart of a document retrieval method according to an embodiment of the present application. For convenience of description, the following embodiments are described with a server as the execution subject. As shown in FIG. 2, the document retrieval method includes the following steps:
Step 201: acquiring a first document.
When a user needs to use a document as a retrieval condition to retrieve the document related to the document in the document library, the user can input the document as the retrieval condition in a document input box provided by the terminal device by operating the terminal device, and the terminal device correspondingly sends the document input by the user in the input box to the server in response to the operation initiated by the user for confirming the completion of the input; or, the user may select a document used as a search condition from files stored locally in the terminal device by clicking an upload button displayed on the terminal device, and upload the document to the server through the terminal device.
After receiving the document sent or uploaded by the terminal device, the server takes it as the first document and carries out the subsequent document retrieval operations on it.
It should be understood that the first document described above may be any type of document, and may specifically be a legal document, a treatise, or the like. The type of the first document is not limited in any way here.
Step 202: inputting the first document into a semantic vector generation model, and acquiring a first semantic vector output by the semantic vector generation model.
After acquiring the first document from the terminal device, the server inputs the first document into the semantic vector generation model that it runs; by processing the first document, the model outputs a first semantic vector capable of representing the semantic features of the first document.
The semantic vector generation model is a neural network model that takes a document as input and outputs a semantic vector representing the semantic features of the document. The structure of the semantic vector generation model and its working principle are described in detail in the next embodiment.
When the first document received by the server is a legal document, the server can extract the case description paragraph from the legal document before inputting it into the semantic vector generation model, and input only the case description paragraph. This keeps information in the first document that is unrelated to the case from influencing the final semantic vector.
In a specific implementation, the server may determine the case description paragraph in the first document according to the label corresponding to each paragraph. Because the format of a legal document is usually fixed, the server can label each paragraph of the first document in advance according to the paragraph template of the legal document, extract the paragraph labeled as the case description, and input that paragraph into the semantic vector generation model.
It should be understood that for other types of first documents, the server may also adopt a similar method to extract paragraphs including important contents from the first document, and then input the extracted paragraphs into the semantic vector generation model.
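The label-based extraction described above can be sketched as follows. The paragraph template is modeled here as a list of (label, opening marker) pairs, which is an assumption made for illustration; the source does not say how the template is represented. The function names, the label names, and the markers in the usage example are hypothetical.

```python
def label_paragraphs(paragraphs, template):
    # `template` is a list of (label, marker) pairs. A paragraph that
    # starts with a marker receives that label; anything else is "other".
    labeled = []
    for paragraph in paragraphs:
        label = "other"
        for name, marker in template:
            if paragraph.startswith(marker):
                label = name
                break
        labeled.append((label, paragraph))
    return labeled

def extract_labeled(paragraphs, template, wanted="case_description"):
    # Keep only the paragraphs whose label matches `wanted`, in order.
    return "\n".join(p for label, p in label_paragraphs(paragraphs, template)
                     if label == wanted)
```

For example, with `template = [("parties", "Plaintiff:"), ("case_description", "Facts:")]`, only paragraphs beginning with `Facts:` would be passed on to the semantic vector generation model. For other document types, the same function can be reused with a different template and `wanted` label.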
Step 203: calculating the similarity between the first semantic vector and a second semantic vector corresponding to a document stored in the document library.
After the semantic vector generation model outputs a first semantic vector for representing semantic features of a first document, the server calculates the similarity between the first semantic vector and second semantic vectors corresponding to documents stored in the document library, and specifically, the server may calculate the cosine similarity between the first semantic vector and each of the second semantic vectors.
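As a minimal sketch of the cosine similarity calculation mentioned above: for vectors a and b, the similarity is their dot product divided by the product of their Euclidean norms. The function names and the dictionary layout (document identifier mapped to its second semantic vector) are assumptions for illustration, as is the choice to return 0.0 when either vector is all zeros.

```python
import math

def cosine_similarity(a, b):
    # cos(a, b) = (a . b) / (|a| * |b|)
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumed convention: a zero vector is similar to nothing.
        return 0.0
    return dot / (norm_a * norm_b)

def score_library(first_vector, second_vectors):
    # `second_vectors` maps a document identifier to its second semantic vector.
    return {doc_id: cosine_similarity(first_vector, vec)
            for doc_id, vec in second_vectors.items()}
```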
Here, the second semantic vector is obtained by processing a document stored in the document library by using the semantic vector generation model. The second semantic vector may be obtained by processing a document stored in the document library by using the semantic vector generation model in advance, or may be obtained by inputting the document stored in the document library into the semantic vector generation model in real time after the server acquires the first document.
It should be understood that the above-mentioned document library corresponds to the first document input by the user, specifically, if the first document is a legal document, the document library called should be a legal document library storing the legal document, if the first document is a thesis, the document library called should be a thesis library storing the thesis, and so on.
It should be noted that a document library usually stores a large number of documents, so the amount of calculation the server must perform is very large in either case: whether it computes the second semantic vectors for the stored documents in advance, or computes them in real time after acquiring the first document, it must still calculate the similarity between the first semantic vector and every second semantic vector. To reduce this amount of calculation, after acquiring the first document the server may first use the ES system to screen out a number of candidate documents from the document library.
The server can determine a keyword corresponding to the first document as the first keyword, and then determine, in the document library, keywords matching the first keyword as second keywords, where the text similarity between each second keyword and the first keyword is higher than a similarity threshold; the document library stores documents and the keywords associated with them. The documents associated with the second keywords are then taken as candidate documents.
Specifically, the server may perform word segmentation processing on the first document, and perform semantic analysis on each segmented word obtained through the word segmentation processing to determine a first keyword corresponding to the first document. Then, determining a second keyword matched with the first keyword in a document library storing documents and keywords related to the documents, wherein the text similarity between the second keyword and the first keyword is higher than a similarity threshold; further, documents associated with the second keyword are taken as candidate documents.
It should be noted that the number of the selected candidate documents may also be preset, and then a preset number of documents may be selected from the document library as the candidate documents according to the text similarity between the first keyword and the second keyword. Of course, other methods may be used to obtain the candidate document, and the method of specifically selecting the candidate document is not limited herein.
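The keyword-based screening of candidate documents can be sketched as follows. The source does not specify how text similarity is measured, and this sketch does not reproduce the ES system's scoring; it assumes the standard-library `difflib.SequenceMatcher` ratio as a stand-in. The function names, the 0.8 threshold, and the dictionary layout (document identifier mapped to its associated keywords) are hypothetical.

```python
from difflib import SequenceMatcher

def text_similarity(a, b):
    # Stand-in text similarity in [0, 1]; not the ES system's scoring.
    return SequenceMatcher(None, a, b).ratio()

def candidate_documents(first_keywords, library_keywords, threshold=0.8):
    # `library_keywords` maps a document identifier to the keywords
    # associated with that document in the document library.
    candidates = []
    for doc_id, keywords in library_keywords.items():
        # A document becomes a candidate when any of its keywords is
        # more similar than the threshold to any first keyword.
        if any(text_similarity(k1, k2) > threshold
               for k1 in first_keywords for k2 in keywords):
            candidates.append(doc_id)
    return candidates
```

To return a preset number of candidates instead, one could rank documents by their best keyword similarity and keep the top entries; the thresholded version is shown only because it mirrors the similarity threshold described in the text.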
It should be understood that after the server determines the candidate documents, the similarity between the first semantic vector corresponding to the first document and the second semantic vectors corresponding to the respective candidate documents can be calculated.
Specifically, if the second semantic vector is obtained by the server through pre-calculation, the server may directly call the second semantic vector corresponding to each candidate document after determining the candidate document, and then calculate the similarity between the first semantic vector output by the semantic vector generation model and the second semantic vector corresponding to each candidate document. If the second semantic vector is calculated in real time after the server acquires the first document, after the server determines the candidate documents, the server only needs to calculate the second semantic vector corresponding to each candidate document, and then calculates the similarity between the first semantic vector output by the semantic vector generation model and the second semantic vector corresponding to each candidate document.
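The two cases above (second semantic vectors computed in advance versus computed in real time) can be combined in one lookup, sketched below under stated assumptions: precomputed vectors live in a dictionary used as a cache, `model` is any callable that maps document text to a vector, and the function name `second_vectors_for` is hypothetical.

```python
def second_vectors_for(candidate_ids, documents, model, precomputed=None):
    # `documents` maps a document identifier to its text.
    # `precomputed` maps a document identifier to an already computed
    # second semantic vector; missing entries are computed on demand.
    cache = {} if precomputed is None else precomputed
    vectors = {}
    for doc_id in candidate_ids:
        if doc_id not in cache:
            # Real-time case: only candidate documents are processed.
            cache[doc_id] = model(documents[doc_id])
        vectors[doc_id] = cache[doc_id]
    return vectors
```

Either way, only the candidate documents are touched, which is where the saving in calculation comes from.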
Step 204: determining a retrieval result in the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
After the server calculates the similarity between the first semantic vector and each second semantic vector, it determines a retrieval result among the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
In a possible implementation manner, the server may preset a similarity threshold, screen out second semantic vectors with similarities greater than the similarity threshold, and determine documents corresponding to the second semantic vectors as final retrieval results.
It should be understood that the similarity threshold herein may be set according to actual requirements, and the similarity threshold is not specifically limited herein.
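The threshold-based screening can be sketched as below. The text does not name a similarity measure, so cosine similarity is used here purely as an assumption; `cosine_similarity` and `filter_by_threshold` are hypothetical helper names.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (assumed measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_u * norm_v)


def filter_by_threshold(first_vector, second_vectors, threshold):
    """Keep the ids of documents whose similarity is greater than the threshold.

    `second_vectors` maps a document id to its second semantic vector.
    """
    return [
        doc_id
        for doc_id, vector in second_vectors.items()
        if cosine_similarity(first_vector, vector) > threshold
    ]
```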
In another possible implementation manner, the server may sort the similarities corresponding to the second semantic vectors in descending order, select the second semantic vectors corresponding to a preset number of top-ranked similarities, and take the documents corresponding to those second semantic vectors as the final retrieval results.
It should be understood that the preset number can be set according to actual requirements, and the preset number is not limited in any way.
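The ranking-based variant can be sketched as follows, assuming the similarities have already been calculated and are held in a mapping from document id to score. The name `top_k` and the tie-breaking by document id are illustrative choices, not part of the described method.

```python
def top_k(similarities, k):
    """Return the ids of the k documents with the highest similarity.

    `similarities` maps a document id to its similarity with the first
    semantic vector. Ties are broken by document id so the order is stable.
    """
    ranked = sorted(similarities.items(), key=lambda item: (-item[1], item[0]))
    return [doc_id for doc_id, _ in ranked[:k]]
```

For a large document library, `heapq.nlargest` from the standard library would avoid sorting every score; the full sort is kept here for readability.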
It should be noted that the server may also select a document from the document library as a final search result according to the similarity in other manners.
The document retrieval method is not limited to determining the retrieval result based on the text similarity, but processes the first document and documents stored in the document library by using the semantic vector generation model to generate semantic vectors capable of representing semantic features of the documents, and determines the final retrieval result based on the similarity between the semantic vectors, namely, the document retrieval method provided by the embodiment of the application determines the retrieval result based on the semantic similarity, and the semantic features can reflect the essential content of the documents more than the text features, so that the method provided by the embodiment of the application can effectively improve the accuracy of the document retrieval result, and the retrieved documents are ensured to meet the actual retrieval intention of a user. In addition, the document retrieval method provided by the embodiment of the application does not need manual intervention to formulate a complex semantic extraction rule, and the labor cost consumed in the research and development process is greatly reduced.
As described above, the document retrieval method provided in the embodiment of the present application needs to determine, based on the semantic vector generation model, a semantic vector capable of characterizing semantic features of an input document according to the input document. In order to further understand the document retrieval method provided by the embodiment of the present application, the semantic vector generation model is specifically described below with reference to the accompanying drawings.
Referring to fig. 3, fig. 3 is a schematic structural diagram of a semantic vector generation model 300 according to an embodiment of the present application. As shown in fig. 3, the semantic vector generation model 300 includes: a word segmentation module 301, a word vector determination module 302 and a semantic vector determination module 303 in cascade.
The word segmentation module 301 is configured to perform word segmentation on the document input to the semantic vector generation model 300 to obtain a word segmentation sequence corresponding to the document, which is used as a first word segmentation sequence. Specifically, the word segmentation module 301 may directly use a Language Technology Platform (LTP) to perform word segmentation processing on the document input to the semantic vector generation model 300, divide the document into a plurality of words, and then use the words to form a first word segmentation sequence.
The word vector determining module 302 is configured to map each participle in the first participle sequence output by the participle module 301 into a corresponding word vector, where the word vector can represent semantic features of the participle corresponding to the word vector, and accordingly, the word vectors can reflect semantic relevance between the words. Specifically, after the word segmentation module 301 obtains the first word segmentation sequence corresponding to the document, the first word segmentation sequence is output to the word vector determination module 302, the word vector determination module 302 maps each word segmentation in the first word segmentation sequence into a corresponding word vector, the word vectors can correspondingly reflect the semantics of the word segmentation corresponding to the word vector, and the similarity between the word vectors corresponding to the words with similar semantics is higher.
The word vector determining module 302 may specifically include a Word2vec model; that is, the Word2vec model serves as the word vector determining module 302 in the semantic vector generation model 300 and maps each participle in the first participle sequence to a fixed vector space, thereby obtaining a word vector corresponding to each participle.
It should be noted that the first word segmentation sequence obtained by the word segmentation module 301 usually includes a large number of participles. If the first word segmentation sequence were input directly to the word vector determination module 302, mapping every participle to a corresponding word vector would require a large amount of computation. To reduce the computation of the word vector determination module 302, a word segmentation interception module may additionally be disposed between the word segmentation module 301 and the word vector determination module 302.
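As a rough illustration of what the word vector determining module does, the sketch below maps each participle to a vector through a lookup table. The three-dimensional vectors are made-up toy values, not the output of a trained Word2vec model, and mapping unknown participles to a zero vector is an assumption.

```python
import math

# Toy embedding table (hypothetical values); a trained Word2vec model
# would supply vectors of much higher dimension.
EMBEDDINGS = {
    "court": [0.9, 0.1, 0.0],
    "tribunal": [0.8, 0.2, 0.1],
    "apple": [0.0, 0.1, 0.9],
}
DIM = 3


def map_to_word_vectors(tokens, embeddings=EMBEDDINGS, dim=DIM):
    """Map each participle to its word vector; unknown ones become zeros."""
    zero = [0.0] * dim
    return [list(embeddings.get(token, zero)) for token in tokens]


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Semantically close words end up with more similar vectors.
close = cosine(EMBEDDINGS["court"], EMBEDDINGS["tribunal"])
far = cosine(EMBEDDINGS["court"], EMBEDDINGS["apple"])
```

In this toy table `close` is larger than `far`, mirroring the statement that word vectors of semantically similar words have higher similarity.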
The word segmentation intercepting module is used for intercepting a preset number of word segments from the first word segmentation sequence according to a preset intercepting rule to form a second word segmentation sequence. Specifically, after the word segmentation intercepting module acquires the first word segmentation sequence output by the word segmentation module 301, a preset number of words are intercepted from the first word segmentation sequence according to a preset intercepting rule, and then a second word segmentation sequence is formed by utilizing the intercepted words, and the second word segmentation sequence is correspondingly output to the word vector determination module 302, so that the word vector determination module 302 respectively maps each word segmentation in the second word segmentation sequence into a corresponding word vector.
For the convenience of understanding, the working principle of the above word segmentation and interception module is illustrated as follows:
Suppose the preset interception rule is to intercept 600 participles from the first participle sequence to form the second participle sequence. If the number of participles in the first participle sequence is greater than 600, the first 550 participles and the last 50 participles are intercepted to form the second participle sequence, thereby filtering out the less relevant information in the middle. If the number of participles in the first participle sequence is less than 600, blank character marks are appended after the first participle sequence as padding participles until the number of participles reaches 600, and the padded sequence of 600 participles is used as the second participle sequence.
It should be understood that the preset interception rule may be set according to actual requirements, and the preset interception rule is not limited in any way.
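The example interception rule above (keep 600 participles: the first 550 and the last 50, or pad shorter sequences) can be sketched as follows. The function name `intercept` and the padding mark `"<PAD>"` are hypothetical; the text only says that blank character marks are used as padding.

```python
PAD = "<PAD>"  # hypothetical padding mark standing in for the blank character mark


def intercept(tokens, total=600, head=550, pad=PAD):
    """Build the second participle sequence of exactly `total` items.

    Longer sequences keep the first `head` and the last `total - head`
    participles; shorter sequences are padded at the end.
    """
    tail = total - head
    if len(tokens) > total:
        # Slicing from len(tokens) - tail also works when tail is 0.
        return tokens[:head] + tokens[len(tokens) - tail:]
    return tokens + [pad] * (total - len(tokens))
```

A sequence of exactly 600 participles is returned unchanged, since neither truncation nor padding is needed.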
The semantic vector determining module 303 is configured to perform convolution processing on the word vector output by the word vector determining module 302, and generate a semantic vector corresponding to the document input to the semantic vector generating model 300, where the semantic vector is obtained by performing high-level abstract summarization on the document. Specifically, if the word vector output by the word vector determination module 302 is a word vector corresponding to each participle in the first participle sequence, the semantic vector determination module 303 performs convolution processing on the word vector corresponding to each participle in the first participle sequence to obtain a semantic vector of the document; if the word vector output by the word vector determination module 302 is a word vector corresponding to each participle in the second participle sequence, the semantic vector determination module 303 performs convolution processing on the word vector corresponding to each participle in the second participle sequence to obtain a semantic vector of the document.
The semantic vector determining module 303 is composed of Inception models; specifically, it may be composed of a plurality of Inception models connected in series, or of only one Inception model.
The inventor's tests show that a semantic vector determining module 303 formed by connecting two Inception models in series has a good processing effect: it keeps the calculated semantic vector accurate while avoiding excessive computation. Each Inception model in the semantic vector determination module 303 is divided into four branches, each branch including different convolution kernels.
Referring to fig. 4, fig. 4 is a schematic structural diagram of a single Inception model included in the semantic vector determination module 303. The Inception model includes four branches: branch 401 consists of convolution kernels of size 1 x 1; branch 402 consists of a combination of convolution kernels of size 1 x 1 and convolution kernels of size 3 x 3; branch 403 consists of a combination of convolution kernels of size 1 x 1 and convolution kernels of size 5 x 5; and branch 404 consists of convolution kernels of size 1 x 1. The word vectors output by the word vector determination module 302 are processed by the two Inception models and a fully connected layer to output a semantic vector.
It should be understood that the structure of the Inception model shown in fig. 4 is just an example; in practical applications, an Inception model with any structure may be used as the semantic vector determination module. In addition, the semantic vector determination module may include any number of Inception models, and the number is not limited here.
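To make the four-branch structure of fig. 4 (an Inception-style block) more concrete, the following pure-Python sketch runs a single-channel matrix, such as a stack of word vectors, through four parallel branches with the kernel sizes listed above. It is only a sketch under stated assumptions: the kernels are fixed averaging kernels instead of learned weights, each branch has one channel, zero padding keeps the output size equal to the input size, and returning the branch outputs as a list of channels (that is, concatenating them) follows common Inception practice without being stated in the text.

```python
def conv2d_same(image, kernel):
    """Single-channel 2D convolution with zero padding ("same" output size).

    `image` is a list of rows; `kernel` is a square list of rows with odd size.
    """
    height, width = len(image), len(image[0])
    size = len(kernel)
    radius = size // 2
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            acc = 0.0
            for di in range(size):
                for dj in range(size):
                    y, x = i + di - radius, j + dj - radius
                    if 0 <= y < height and 0 <= x < width:
                        acc += image[y][x] * kernel[di][dj]
            out[i][j] = acc
    return out


def mean_kernel(size):
    """Averaging kernel used as a placeholder for learned weights."""
    return [[1.0 / (size * size)] * size for _ in range(size)]


def inception_block(image):
    """Apply the four branches and return their outputs as four channels."""
    k1, k3, k5 = mean_kernel(1), mean_kernel(3), mean_kernel(5)
    branch_401 = conv2d_same(image, k1)                     # 1 x 1
    branch_402 = conv2d_same(conv2d_same(image, k1), k3)    # 1 x 1 then 3 x 3
    branch_403 = conv2d_same(conv2d_same(image, k1), k5)    # 1 x 1 then 5 x 5
    branch_404 = conv2d_same(image, k1)                     # 1 x 1
    return [branch_401, branch_402, branch_403, branch_404]
```

Because every branch preserves the height and width of its input, the four outputs line up position by position, which is what allows a following layer to consume them together.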
The semantic vector generation model takes a document as input and takes a semantic vector capable of expressing the semantic features of the document as output. The first document and the documents stored in the document library are processed by using the semantic vector generation model to obtain semantic vectors corresponding to the documents, and then the retrieval result is determined based on the similarity between the semantic vectors.
In order to further understand the document retrieval method provided in the embodiment of the present application, the document retrieval method provided in the embodiment of the present application is described below by taking an application scenario of retrieving a legal document as an example.
When a user wants to use a certain legal document as a search condition to find related legal documents, the user can click an upload key on the document search interface displayed on the terminal device, which opens the terminal device's local storage. The user then selects, from the files stored locally on the terminal device, the legal document to serve as the search condition, and the terminal device uploads it to the server.
After receiving the legal document, the server takes the legal document as a first document; then, an ES system is utilized to search the legal documents with higher text similarity with the first document from a legal document library to serve as candidate documents; specifically, the server may preset to search 50 candidate documents, and the server may select 50 legal documents with the top ranked text similarity as the candidate documents.
And then, the server selects the case description sections corresponding to the first document and the candidate document according to the labels corresponding to the sections in the first document and the candidate document.
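Selecting the case description section by label can be sketched as below, assuming each document is already available as a list of (label, text) pairs. The label value `"case_description"` and the function name are hypothetical, since the text does not say how labels are represented.

```python
def select_case_description(paragraphs, label="case_description"):
    """Join the text of all paragraphs carrying the case description label.

    `paragraphs` is a list of (label, text) pairs in document order.
    """
    return "\n".join(text for tag, text in paragraphs if tag == label)
```

The same function would be applied to the first document and to every candidate document before they are passed to the semantic vector generation model.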
Then, case description segments corresponding to the first document and case description segments corresponding to the candidate documents are respectively input to the semantic vector generation model. A word segmentation module in the semantic vector generation model performs word segmentation on the input case description segment to obtain a first word segmentation sequence, and the first word segmentation sequence is output to a word segmentation interception module; the word segmentation intercepting module intercepts a preset number of words from the first word segmentation sequence according to a preset intercepting rule, forms a second word segmentation sequence by utilizing the intercepted words, and outputs the second word segmentation sequence to the word vector determining module; the word vector determining module is used for mapping each participle in the second participle sequence into corresponding word vectors respectively, the word vectors can represent the semantics of the corresponding participles, and the word vectors corresponding to the participles in the second participle sequence are output to the semantic vector determining module; finally, the semantic vector determining module performs convolution processing on the word vectors corresponding to the participles in the second participle sequence output by the word vector determining module to obtain the semantic vectors corresponding to the input case description segment. Thereby obtaining a first semantic vector corresponding to the case description segment of the first document and a second semantic vector corresponding to the case description segment of each candidate document.
And then, the server calculates the vector similarity between the first semantic vector and each second semantic vector, screens out the second semantic vectors with the similarity larger than a preset similarity threshold, takes the legal documents corresponding to the screened second semantic vectors as final retrieval results, and feeds the retrieval results back to the terminal equipment so as to display the finally determined retrieval results to the user through the terminal equipment.
Aiming at the document retrieval methods described above, the application also provides corresponding document retrieval devices so as to facilitate the application and implementation of the methods in practice.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a document retrieval apparatus 500 corresponding to the method shown in fig. 2, where the document retrieval apparatus 500 includes:
an obtaining module 501, configured to obtain a first document;
the processing module 502 is configured to input the first document into a semantic vector generation model, and obtain a first semantic vector output by the semantic vector generation model; the semantic vector generation model is a neural network model which takes a document as input and takes semantic vectors representing semantic features as output;
a calculating module 503, configured to calculate a similarity between the first semantic vector and a second semantic vector corresponding to a document stored in a document library; the second semantic vector is obtained by processing the documents stored in the document library by using the semantic vector generation model;
a determining module 504, configured to determine a retrieval result in the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
Optionally, the semantic vector generation model includes: a word segmentation module, a word vector determination module and a semantic vector determination module which are connected in series;
the word segmentation module is used for carrying out word segmentation processing on the document to obtain a first word segmentation sequence;
the word vector determining module is used for mapping each participle in the first participle sequence into corresponding word vectors respectively; the word vector is used for representing semantic features of the participles corresponding to the word vector;
the semantic vector determining module is used for performing convolution processing on word vectors corresponding to the participles in the first participle sequence to obtain semantic vectors.
Optionally, the semantic vector generation model further includes: a word segmentation and interception module;
the word segmentation intercepting module is used for intercepting a preset number of words from the first word segmentation sequence according to a preset intercepting rule to form a second word segmentation sequence;
the word vector determination module is specifically configured to:
mapping each participle in the second participle sequence into corresponding word vectors respectively;
the semantic vector determination module is specifically configured to:
and performing convolution processing on the word vectors corresponding to the participles in the second participle sequence to obtain semantic vectors.
Optionally, the word vector determination module includes a Word2vec model; and the semantic vector determination module includes an Inception model.
Optionally, the first document is a legal document;
the apparatus further comprises:
the extraction module is used for determining case description sections according to the labels corresponding to the paragraphs in the first document;
the processing module is specifically configured to:
and inputting the case description segment into the semantic vector generation model.
Optionally, the apparatus further comprises:
the screening module is used for determining a keyword corresponding to the first document as a first keyword; determining keywords matched with the first keyword in the document library as second keywords, wherein the text similarity between the second keywords and the first keyword is higher than a similarity threshold; the document library is used for storing documents and keywords related to the documents; determining the documents associated with the second keywords as candidate documents;
the calculation module is specifically configured to:
and calculating the similarity between the first semantic vector and the second semantic vector corresponding to the candidate documents respectively.
Optionally, the second semantic vector is obtained by processing the documents stored in the document library by using the semantic vector generation model after the first document is acquired.
The document retrieval device is not limited to determining a retrieval result based on the text similarity, but processes the first document and documents stored in the document library by using the semantic vector generation model to generate semantic vectors capable of representing semantic features of the documents, and determines a final retrieval result based on the similarity between the semantic vectors, namely, the document retrieval device provided by the embodiment of the application determines the retrieval result based on the semantic similarity, and the semantic features can reflect the essential content of the documents more than the text features, so that the device provided by the embodiment of the application can effectively improve the accuracy of the document retrieval result, and the retrieved documents are ensured to meet the actual retrieval intention of a user. In addition, the document retrieval device provided by the embodiment of the application does not need manual intervention to formulate a complex semantic extraction rule, so that the labor cost consumed in the research and development process is greatly reduced.
The embodiment of the present application also provides a document retrieval server. Referring to fig. 6, fig. 6 is a schematic structural diagram of a server provided by the embodiment of the present application. The server 600 may vary considerably depending on its configuration or performance, and may include one or more central processing units (CPUs) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 642 or data 644. The memory 632 and the storage medium 630 may be transient or persistent storage. The program stored in the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations for the server. Further, the central processor 622 may be configured to communicate with the storage medium 630 and to execute, on the server 600, the series of instruction operations stored in the storage medium 630.
The server 600 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input-output interfaces 658, and/or one or more operating systems 641, such as Windows Server, Mac OS X™, Unix™, Linux™, FreeBSD™, and so forth.
The steps performed by the server in the above embodiments may be based on the server structure shown in fig. 6.
The CPU 622 is configured to execute the following steps:
acquiring a first document;
inputting the first document into a semantic vector generation model, and acquiring a first semantic vector output by the semantic vector generation model; the semantic vector generation model is a neural network model which takes a document as input and takes semantic vectors representing semantic features as output;
calculating the similarity between the first semantic vector and a second semantic vector corresponding to a document stored in a document library; the second semantic vector is obtained by processing the documents stored in the document library by using the semantic vector generation model;
and determining a retrieval result in the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
Optionally, the CPU 622 can also execute the method steps of any specific implementation of the document retrieval method in the embodiment of the present application.
The embodiment of the present application further provides a computer-readable storage medium for storing a program code, where the program code is used to execute any one implementation of the document retrieval method described in the foregoing embodiments.
The present application further provides a computer program product including instructions, which when run on a computer, causes the computer to execute any one of the implementation manners of the document retrieval method described in the foregoing embodiments.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
The above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (10)

1. A document retrieval method, the method comprising:
acquiring a first document;
inputting the first document into a semantic vector generation model, and acquiring a first semantic vector output by the semantic vector generation model; the semantic vector generation model is a neural network model which takes a document as input and takes semantic vectors representing semantic features as output;
calculating the similarity between the first semantic vector and a second semantic vector corresponding to a document stored in a document library; the second semantic vector is obtained by processing the documents stored in the document library by using the semantic vector generation model;
and determining a retrieval result in the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
2. The method of claim 1, wherein the semantic vector generation model comprises: a word segmentation module, a word vector determination module and a semantic vector determination module which are connected in series;
the word segmentation module is used for carrying out word segmentation processing on the document to obtain a first word segmentation sequence;
the word vector determining module is used for mapping each participle in the first participle sequence into corresponding word vectors respectively; the word vector is used for representing semantic features of the participles corresponding to the word vector;
the semantic vector determining module is used for performing convolution processing on word vectors corresponding to the participles in the first participle sequence to obtain semantic vectors.
3. The method of claim 2, wherein the semantic vector generation model further comprises: a word segmentation and interception module;
the word segmentation intercepting module is used for intercepting a preset number of words from the first word segmentation sequence according to a preset intercepting rule to form a second word segmentation sequence;
the word vector determination module is specifically configured to:
mapping each participle in the second participle sequence into corresponding word vectors respectively;
the semantic vector determination module is specifically configured to:
and performing convolution processing on the word vectors corresponding to the participles in the second participle sequence to obtain semantic vectors.
4. The method of claim 2, wherein the word vector determination module comprises a Word2vec model; and the semantic vector determination module comprises an Inception model.
5. The method of any one of claims 1 to 4, wherein the first document is a legal document;
then before inputting the first document into a semantic vector generation model, the method further comprises:
determining case description sections according to the labels corresponding to the paragraphs in the first document;
then the inputting the first document into a semantic vector generation model includes:
and inputting the case description segment into the semantic vector generation model.
6. The method of claim 1, wherein prior to inputting the first document into a semantic vector generation model, the method further comprises:
determining a keyword corresponding to the first document as a first keyword;
determining keywords matched with the first keyword in the document library as second keywords, wherein the text similarity between the second keywords and the first keyword is higher than a similarity threshold; the document library is used for storing documents and keywords related to the documents;
determining the document associated with the second keyword as a candidate document;
the calculating the similarity between the first semantic vector and a second semantic vector corresponding to a document stored in a document library includes:
and calculating the similarity between the first semantic vector and the second semantic vector corresponding to the candidate documents respectively.
7. The method of claim 1, wherein the second semantic vector is obtained by processing documents stored in the document library using the semantic vector generation model after the first document is obtained.
8. A document retrieval apparatus, the apparatus comprising:
an acquisition module, configured to acquire a first document;
a processing module, configured to input the first document into a semantic vector generation model and acquire a first semantic vector output by the semantic vector generation model, wherein the semantic vector generation model is a neural network model that takes a document as input and outputs a semantic vector representing semantic features;
a calculation module, configured to calculate the similarity between the first semantic vector and a second semantic vector corresponding to each document stored in a document library, wherein the second semantic vector is obtained by processing the documents stored in the document library using the semantic vector generation model;
and a determining module, configured to determine a retrieval result among the documents stored in the document library according to the similarity between each second semantic vector and the first semantic vector.
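The division of work among the four modules can be sketched as one small class. This is an illustration under stated assumptions only: `DocumentRetriever`, `toy_model`, and `top_k` are hypothetical names, cosine similarity is an assumed measure, and `toy_model` is a word-count stand-in used so the example runs without a trained network. The claim itself requires a neural network model.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Callable, Dict, List, Sequence

Vector = Sequence[float]


def _cosine(u: Vector, v: Vector) -> float:
    # Assumed similarity measure; the claim does not name one.
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


@dataclass
class DocumentRetriever:
    model: Callable[[str], Vector]  # stand-in for the semantic vector generation model
    library: Dict[str, str]         # document id -> document text
    second_vectors: Dict[str, Vector] = field(init=False)

    def __post_init__(self) -> None:
        # Second semantic vectors come from the same model as the first.
        self.second_vectors = {d: self.model(t) for d, t in self.library.items()}

    def retrieve(self, first_document: str, top_k: int = 1) -> List[str]:
        first_vector = self.model(first_document)         # processing module
        scores = {d: _cosine(first_vector, v)             # calculation module
                  for d, v in self.second_vectors.items()}
        ranked = sorted(scores, key=scores.get, reverse=True)
        return ranked[:top_k]                             # determining module


def toy_model(text: str) -> Vector:
    # NOT a neural network: counts three fixed words, for illustration only.
    words = text.lower().split()
    return [float(words.count(w)) for w in ("theft", "contract", "loan")]


retriever = DocumentRetriever(
    model=toy_model,
    library={"a": "theft of goods", "b": "contract and loan dispute"},
)
result = retriever.retrieve("theft at night")
```

Here the second semantic vectors are computed once at construction time. Claim 7 instead computes them after the first document is obtained, which would move that step into `retrieve`.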
9. A computer-readable storage medium for storing program code for performing the document retrieval method of any one of claims 1-7.
10. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the document retrieval method of any one of claims 1-7.
CN201811160590.9A 2018-09-30 2018-09-30 Document retrieval method, device, equipment and medium Pending CN110968664A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811160590.9A CN110968664A (en) 2018-09-30 2018-09-30 Document retrieval method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811160590.9A CN110968664A (en) 2018-09-30 2018-09-30 Document retrieval method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN110968664A (en) 2020-04-07

Family

ID=70029055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811160590.9A Pending CN110968664A (en) 2018-09-30 2018-09-30 Document retrieval method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN110968664A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753069A (en) * 2020-06-09 2020-10-09 北京小米松果电子有限公司 Semantic retrieval method, device, equipment and storage medium
CN112183111A (en) * 2020-09-28 2021-01-05 亚信科技(中国)有限公司 Long text semantic similarity matching method and device, electronic equipment and storage medium
CN113177061A (en) * 2021-05-25 2021-07-27 马上消费金融股份有限公司 Searching method and device and electronic equipment
CN116401335A (en) * 2023-03-15 2023-07-07 北京擎盾信息科技有限公司 Quantitative retrieval method and device for legal documents, storage medium and electronic device

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090190839A1 (en) * 2008-01-29 2009-07-30 Higgins Derrick C System and method for handling the confounding effect of document length on vector-based similarity scores
US20100179933A1 (en) * 2009-01-12 2010-07-15 Nec Laboratories America, Inc. Supervised semantic indexing and its extensions
CN105808726A (en) * 2016-03-08 2016-07-27 浪潮软件股份有限公司 Method and apparatus for measuring similarity of documents
CN105808530A (en) * 2016-03-23 2016-07-27 苏州大学 Translation method and device in statistical machine translation
CN106502996A (en) * 2016-12-13 2017-03-15 深圳爱拼信息科技有限公司 A kind of judgement document's search method and server based on semantic matches
CN107122451A (en) * 2017-04-26 2017-09-01 北京科技大学 A kind of legal documents case by grader method for auto constructing
CN107491547A (en) * 2017-08-28 2017-12-19 北京百度网讯科技有限公司 Searching method and device based on artificial intelligence
CN108038091A (en) * 2017-10-30 2018-05-15 上海思贤信息技术股份有限公司 A kind of similar calculating of judgement document's case based on figure and search method and system
CN108170773A (en) * 2017-12-26 2018-06-15 百度在线网络技术(北京)有限公司 Media event method for digging, device, computer equipment and storage medium

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753069A (en) * 2020-06-09 2020-10-09 北京小米松果电子有限公司 Semantic retrieval method, device, equipment and storage medium
CN111753069B (en) * 2020-06-09 2024-05-07 北京小米松果电子有限公司 Semantic retrieval method, device, equipment and storage medium
CN112183111A (en) * 2020-09-28 2021-01-05 亚信科技(中国)有限公司 Long text semantic similarity matching method and device, electronic equipment and storage medium
CN113177061A (en) * 2021-05-25 2021-07-27 马上消费金融股份有限公司 Searching method and device and electronic equipment
CN113177061B (en) * 2021-05-25 2023-05-16 马上消费金融股份有限公司 Searching method and device and electronic equipment
CN116401335A (en) * 2023-03-15 2023-07-07 北京擎盾信息科技有限公司 Quantitative retrieval method and device for legal documents, storage medium and electronic device

Similar Documents

Publication Publication Date Title
CN108287858B (en) Semantic extraction method and device for natural language
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN107193962B (en) Intelligent map matching method and device for Internet promotion information
CN110781276A (en) Text extraction method, device, equipment and storage medium
CN112889042A (en) Identification and application of hyper-parameters in machine learning
CN110929038A (en) Entity linking method, device, equipment and storage medium based on knowledge graph
CN110968664A (en) Document retrieval method, device, equipment and medium
CN112270196A (en) Entity relationship identification method and device and electronic equipment
CN110929520B (en) Unnamed entity object extraction method and device, electronic equipment and storage medium
CN110909539A (en) Word generation method, system, computer device and storage medium of corpus
CN111984792A (en) Website classification method and device, computer equipment and storage medium
JP7198408B2 (en) Trademark information processing device and method, and program
CN112307337B (en) Associated recommendation method and device based on tag knowledge graph and computer equipment
CN110990563A (en) Artificial intelligence-based traditional culture material library construction method and system
CN112613321A (en) Method and system for extracting entity attribute information in text
CN117520800A (en) Training method, system, electronic equipment and medium for nutrition literature model
CN112328469B (en) Function level defect positioning method based on embedding technology
CN113569018A (en) Question and answer pair mining method and device
CN110287270B (en) Entity relationship mining method and equipment
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN111538903A (en) Method and device for determining search recommended word, electronic equipment and computer readable medium
CN108733702B (en) Method, device, electronic equipment and medium for extracting upper and lower relation of user query
US20170293863A1 (en) Data analysis system, and control method, program, and recording medium therefor
US20210318949A1 (en) Method for checking file data, computer device and readable storage medium
CN112182218A (en) Text data classification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication. Application publication date: 20200407