CN115495555A - Document retrieval method and system based on deep learning - Google Patents


Info

Publication number
CN115495555A
CN115495555A (application CN202211175734.4A)
Authority
CN
China
Prior art keywords
text
vector
query
word
candidate
Prior art date
Legal status
Pending
Application number
CN202211175734.4A
Other languages
Chinese (zh)
Inventor
温嘉宝
杨敏
贺倩明
Current Assignee
Shenzhen Deli Technology Co ltd
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Deli Technology Co ltd
Shenzhen Institute of Advanced Technology of CAS
Priority date
Filing date
Publication date
Application filed by Shenzhen Deli Technology Co ltd and Shenzhen Institute of Advanced Technology of CAS
Priority to CN202211175734.4A
Publication of CN115495555A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/3331 Query processing
    • G06F 16/334 Query execution
    • G06F 16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/279 Recognition of textual entities
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a document retrieval method and system based on deep learning. The system comprises: a recall module, which generates a plurality of candidate texts for the query text input by the user, based on pre-stored word vectors and text vectors; and a rearrangement module, which inputs the user's query text and the candidate texts into a trained long text encoder to obtain a query vector and candidate text vectors, computes the similarity between the query vector and each candidate text vector, and produces a ranked retrieval result. The method effectively addresses two problems in text retrieval: the unsatisfactory ranking quality of unsupervised models, and the inability of supervised text retrieval models to search directly over a large document collection. It significantly improves retrieval speed and accuracy, and is particularly suitable for fields that retrieve long texts with long-text queries.

Description

Document retrieval method and system based on deep learning
Technical Field
The invention relates to the technical field of document retrieval, in particular to a document retrieval method and system based on deep learning.
Background
Document retrieval is an information retrieval technique whose retrieval object is a document: the process of obtaining documents through search according to study and work needs. With the development of modern network technology, document retrieval is generally accomplished with computer technology. Document search languages can be divided into classification languages, subject languages, and so on. A standard document retrieval language can retrieve the required document material quickly and accurately, but its threshold of use is high: ordinary users cannot use it well because they do not know the classification rules or cannot express retrieval keywords accurately. How to retrieve the most relevant documents directly by analyzing natural language input by the user has long been a concern in academia and industry, and researchers have proposed many solutions to this problem.
One research result proposed a retrieval method that matches texts by a weighted average of word vectors based on smooth inverse frequency (SIF). First, the sentences of the query text and the documents are converted into word vector representations, and all word vectors in a sentence are averaged with weights to obtain an average sentence vector; next, the projection onto the first principal component of the matrix of all sentence vectors is subtracted from each average sentence vector. The method is unsupervised: no labeled data are needed and the training cost is low. However, because it only learns the distribution of all data in space, it struggles to capture the semantic relation between two specific texts; its output has some reference value, but is not good enough to serve as the final result.
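For illustration, the SIF-style weighting described above can be sketched as follows. The weight formula a/(a + p(w)), the word probabilities, and the vectors are toy assumptions for this sketch, not values from this patent:

```python
def sif_sentence_vector(words, word_vecs, word_prob, a=1e-3):
    """Weighted average of word vectors; frequent words get small weights."""
    dim = len(next(iter(word_vecs.values())))
    acc = [0.0] * dim
    for w in words:
        weight = a / (a + word_prob[w])  # smooth-inverse-frequency weight
        for i, x in enumerate(word_vecs[w]):
            acc[i] += weight * x
    return [x / len(words) for x in acc]
```

A rare word (small p(w)) then dominates the average, which is the intended effect of the weighting: content words outweigh function words.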
Another research result built a retrieval model on BERT (Bidirectional Encoder Representations from Transformers) with interactive attention: the question and the document are concatenated character-wise and fed into the model, the classification representation vector output by the model serves as the interaction vector, and a fully connected layer then outputs the relevance score. This method uses labeled data, and because the concatenated text is fed into the model it has some interaction capability; but since the concatenated input is limited to 512 tokens, the text is truncated during concatenation, a large amount of semantic information is lost, and the method is unsuitable for interaction between long texts.
In the prior art, text retrieval mainly uses two kinds of models: unsupervised text retrieval models and supervised text retrieval models. An unsupervised text retrieval model converts the words of a text into word vectors, generates text vectors from the word vectors through a series of algorithms, and finally computes the similarity between text pairs from the text vectors. The disadvantages of this approach are: 1) the word segmentation quality largely determines the word vector quality, and a general-purpose tokenizer cannot meet the word-granularity requirements of the legal field or segment certain legal phrases accurately, so poor word vectors are trained and retrieval quality ultimately suffers; 2) an unsupervised model learns the overall distribution of the data and cannot effectively capture the internal relations between texts, so interactivity between texts is low.
Supervised text retrieval models are typically based on pre-trained language models. For example, a BERT model concatenates the user-input text with a library text, computes their interaction features, and finally scores the relevance of question and document from those features. In this scheme the model itself limits the input length, and text beyond the limit is truncated, so semantic information is lost and the scheme does not fit scenarios where long text must be matched against long text. Moreover, such models are difficult to apply to large-scale data, because the model interaction cannot be computed offline.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provide a document retrieval method and system based on deep learning.
According to a first aspect of the present invention, there is provided a document retrieval system based on deep learning. The system comprises:
a recall module: configured to generate a plurality of candidate texts for the query text input by the user, based on pre-stored word vectors and text vectors;
a rearrangement module: configured to input the user's query text and the candidate texts into a trained long text encoder to obtain a query vector and candidate text vectors, and to compute the similarity between the query vector and the candidate text vectors to obtain a ranked retrieval result.
According to a second aspect of the present invention, a document retrieval method based on deep learning is provided. The method comprises the following steps:
generating a plurality of candidate texts for a query text input by a user, based on pre-stored word vectors and text vectors;
inputting the user's query text and the candidate texts into a trained long text encoder to obtain a query vector and candidate text vectors, and computing the similarity between the query vector and the candidate text vectors to obtain a ranked retrieval result.
Compared with the prior art, the invention effectively addresses the problems that unsupervised models rank poorly and that supervised text retrieval models cannot search directly over a large document collection, and it significantly improves retrieval speed and accuracy. Moreover, the invention applies to fields that retrieve long texts with long-text queries, and achieves higher retrieval accuracy and speed while requiring only a small amount of labeled data, so it has broad application prospects.
Other features of the present invention and advantages thereof will become apparent from the following detailed description of exemplary embodiments thereof, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description, serve to explain the principles of the invention.
FIG. 1 is an architecture diagram of a deep learning based document retrieval system according to one embodiment of the present invention;
fig. 2 is a flowchart of a document retrieval method based on deep learning according to an embodiment of the present invention.
Detailed Description
Various exemplary embodiments of the present invention will now be described in detail with reference to the accompanying drawings. It should be noted that: the relative arrangement of the components and steps, the numerical expressions and numerical values set forth in these embodiments do not limit the scope of the present invention unless specifically stated otherwise.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the invention, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
In all examples shown and discussed herein, any particular value should be construed as merely illustrative, and not limiting. Thus, other examples of the exemplary embodiments may have different values.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The invention provides a document retrieval system based on deep learning. Taking the legal field as an example: a user submits a judgment document, and the retrieval system must find the judgment documents most relevant to it in a database and return them, so that the user quickly obtains the required related material. In the following description, legal cases are used as the example for clarity, but it should be understood that the invention applies equally to other fields that retrieve long texts with long-text queries, such as plagiarism checking of papers or patent retrieval.
Legal case retrieval mainly faces three problems. 1) The volume of case documents is very large: according to the latest figures the total exceeds one hundred million documents and still grows by more than ten thousand per day; with this much data, the whole case library cannot be retrieved or ranked directly with a deep model. 2) In legal case retrieval, the user's input is itself a case, commonly thousands of characters long; recall with a traditional inverted-index ElasticSearch database is very slow and cannot account for synonyms, and the retrieved text also exceeds the input-length limits of existing retrieval models, making model text interaction difficult. 3) Legal case retrieval involves a large number of cases and deep domain knowledge, so manual annotation is difficult and costly and labeled data are scarce; how to build a fast and accurate deep-learning retrieval system from a small amount of labeled data is therefore an urgent problem.
Addressing these characteristics of legal case retrieval, the proposed deep-learning document retrieval system adopts a two-stage scheme of recall first and ranking second; overall, the system comprises a recall module and a rearrangement module.
The recall module quickly recalls a number of candidate answers and passes them to the rearrangement module for ranking. Because the text input by the user is long and cannot be retrieved with a traditional ElasticSearch database, vector retrieval is adopted.
For example, the recall module first generates offline text vectors with an unsupervised vector retrieval model, which weights word vectors by smoothed inverse frequency to produce text vectors. To generate better word vectors for the legal field, the recall module uses a specialized legal tokenizer trained on legal-domain annotation data, which greatly improves segmentation quality; it then generates the corresponding text semantic representations with the uSIF algorithm, for use in similarity computation at recall time. The recall module greatly improves both the recall rate and the recall speed for cases.
The rearrangement module adopts a supervised deep retrieval model whose allowed input length reaches thousands of characters; by incorporating contrastive learning during training, the model attains more accurate text representations on small data sets, further improving the system's overall retrieval quality.
For example, it solves the loss of semantic information that occurs when long-text interaction exceeds the maximum input length of a supervised model. In the rearrangement stage, a long-text pre-trained language model based on a sparse self-attention mechanism (Longformer) serves as the basic document encoder, raising the input-length limit to thousands of characters while reducing the computational complexity of the attention mechanism; the model is trained with legal-domain pre-training and contrastive learning so that it captures long-text semantic features. The candidate texts generated by the recall module are then reranked, improving the retrieval quality.
Embodiments of the recall module and the reorder module are described in detail below with reference to FIG. 1.
In one embodiment, the recall module performs the steps of:
and S1, generating a word vector.
Word segmentation is performed with a tokenizer trained on legal-domain annotation data, large-scale data are segmented, and word vectors are trained. For example, two million cases are sampled at random and segmented with the specialized legal tokenizer; the segmented text is used to train word vectors with fastText; and the trained vectors are finally stored in Redis for persistence.
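As a minimal sketch of what a dictionary-driven legal tokenizer might do, the fragment below performs forward maximum matching over a tiny hypothetical lexicon. The patent's actual tokenizer is a model trained on annotation data; the lexicon, the fallback rule, and the in-memory store below are assumptions for illustration only:

```python
# Toy lexicon of legal phrases; the real system uses a tokenizer trained
# on legal-domain annotation data, not a fixed word list.
LEGAL_LEXICON = {"民间借贷", "借款合同", "纠纷"}

def max_match(text, lexicon=LEGAL_LEXICON, max_len=4):
    """Forward maximum matching: at each position take the longest lexicon
    entry; fall back to a single character when nothing matches."""
    tokens, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in lexicon:
                tokens.append(piece)
                i += length
                break
    return tokens
```

A word-vector store can then simply map these tokens to vectors; a plain dict stands in here for the patent's Redis store.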
S2: generating text vectors.
For example, for all case data, after segmentation with the specialized legal tokenizer, the vector of each word is fetched from Redis to form a word vector matrix, which is fed to the uSIF algorithm to generate the original text vector. Singular value decomposition then yields the first 5 principal components, and subtracting them from the original text vector gives the final text vector: the component common to all text vectors is removed, leaving each text's distinctive features. The final text vectors are stored in a Milvus vector database for persistence.
In particular, for a segmented case text, the vector v_w of each of its words is fetched from Redis and stacked into the text's word vector matrix, which the uSIF algorithm converts into a text vector. The steps are as follows.

First, α is calculated: the fraction of words in the vocabulary that occur with probability greater than a threshold T, where T denotes the probability of a word occurring in a document and n denotes the average length of a text:

T = 1 - (1 - 1/|V|)^n (1)

α = ( Σ_{w∈V} 1[p(w) > T] ) / |V| (2)

where V denotes the vocabulary; |V| the vocabulary size; 1[·] the indicator function, taking 1 when the condition in brackets is true and 0 otherwise; and p(w) the probability that word w appears in the corpus.

Next, the original text vector ṽ_s is computed from the word vectors v_w of the words in the text with an additional weight, called the smoothed inverse frequency:

a = (1 - α) / (α · Z) (3)

ṽ_s = (1/|s|) · Σ_{w∈s} [ a / (a/2 + p(w)) ] · v_w (4)

where s denotes the text; |s| the text's length in words; and Z estimates the expectation of the sum of distances from all word vectors to the text vector. Equation (4) is derived from the Taylor expansion, at the point 0, of the probability that a given text vector generates a word; p(w) again denotes the probability of word w in the corpus.

The semantic expressiveness of a sentence is improved by subtracting the common component of the texts in the corpus (this common component is expected to come from high-frequency words, such as "of"). The text vectors ṽ_s of all texts in the corpus are obtained and arranged into a text vector matrix A; singular value decomposition of this matrix yields its first 5 principal components. The final text vector representation c_s is then obtained as follows:

λ_i = σ_i² / Σ_{j=1}^{m} σ_j² (5)

c_s = ṽ_s - Σ_{i=1}^{m} λ_i · proj_{c'_i}(ṽ_s) (6)

where σ_i denotes a singular value; λ_i the weight of each principal component; ṽ_s the original text vector; c'_i the singular vectors of the matrix A; proj the projection operation; and m the number of principal components, taken as 5 in this application.

The above describes the process of converting a text into a text vector.
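The uSIF conversion described above (threshold, word weighting, principal-component removal) can be sketched end to end with toy data. In this sketch, in-memory structures replace the Redis and Milvus stores, all inputs are invented, and the number of removed components shrinks when there are few documents:

```python
import numpy as np

def usif_embed(docs, word_vecs, word_prob, n_avg, m=5):
    """Toy uSIF: docs is a list of token lists; word_vecs maps token ->
    np.ndarray; word_prob maps token -> corpus probability."""
    V = len(word_prob)
    T = 1 - (1 - 1 / V) ** n_avg               # word occurs in an average text
    alpha = sum(p > T for p in word_prob.values()) / V
    Z = 0.5 * V                                 # partition-function estimate
    a = (1 - alpha) / (alpha * Z)               # needs >= 1 word above T

    def weight(w):                              # smoothed inverse frequency
        return a / (0.5 * a + word_prob[w])

    X = np.array([np.mean([weight(w) * word_vecs[w] for w in doc], axis=0)
                  for doc in docs])
    # common-component removal: project out the first m principal components
    m = min(m, min(X.shape))
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    lam = s[:m] ** 2 / np.sum(s[:m] ** 2)
    for i in range(m):
        c = Vt[i]
        X = X - lam[i] * np.outer(X @ c, c)
    return X
```

With only a handful of documents the removal cancels most of each vector; at corpus scale only the shared high-frequency component is subtracted.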
Finally, in order to facilitate the retrieval of the vectors, the generated vectors are stored in a Milvus vector database for persistence.
S3: for a new text input by the user, retrieving the most relevant vectors as the recalled texts.
The recall procedure mirrors vector generation: the case input by the user (the input text) is first segmented with the specialized legal tokenizer; the vector corresponding to each token is looked up in Redis; the uSIF algorithm produces the original text vector; the principal components obtained by the vector generation module are subtracted to give the final text vector; and finally the Milvus database is used to retrieve the K semantically most similar vectors and their corresponding document ids.
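The final lookup step can be illustrated with a brute-force cosine search. This stands in for the Milvus nearest-neighbour index (which would use an approximate index rather than an exhaustive scan); the ids and vectors are invented:

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def recall_top_k(query_vec, doc_vecs, k):
    """Return the ids of the k documents whose stored vectors are most
    similar to the query vector (brute force, for illustration)."""
    ranked = sorted(doc_vecs,
                    key=lambda d: cosine(query_vec, doc_vecs[d]),
                    reverse=True)
    return ranked[:k]
```

The K ids returned here are what the recall module would pass on to the rearrangement module.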
Second, rearrangement module:
in one embodiment, the rearrangement module uses a long-text deep model based on a sparse self-attention mechanism (Longformer) as the basic encoder. Positive and negative sample triples constructed from a legal retrieval annotation data set serve as the training data. A representation-based framework is adopted: each case text is encoded independently into a vector by the text encoder, and similarity is finally computed between the query and the K cases output by the recall module to obtain the reranking scores. The rearrangement module comprises a training module and a ranking module.
1) A training module:
to further improve the quality of the long text encoder (Longformer) generating document vectors, alleviate the problems of anisotropy of its native sentence representation vector space and low training data, in one embodiment, the long text encoder is fine-tuned using contrast learning. Constructing training data from case labeling data set, for a certain query text, the case related to the query text is a positive sample and the case unrelated to the query text is a negative sample, so that each query and the positive and negative samples of the query text form a triple
Figure BDA0003865064080000071
For a batch of training data, x i Is simply a positive sample of
Figure BDA0003865064080000072
And negative sample remover
Figure BDA0003865064080000073
The method comprises the steps of coding each text by a long text coder (Longformer), obtaining vector representation by an average pooling layer, calculating and constructing a similarity matrix by cosine similarity, wherein each row of the matrix represents x i All with the same batch of data
Figure BDA0003865064080000074
And
Figure BDA0003865064080000075
the training target is cross entropy and is expressed as:
Figure BDA0003865064080000076
wherein l i Indicating the loss of the ith text; e is a natural constant; sim () is a cosine similarity function used to calculate the similarity of two vectors; h is i As a text x i Is used to represent the vector of (a),
Figure BDA0003865064080000077
as text
Figure BDA0003865064080000078
Is used to represent the vector of (a),
Figure BDA0003865064080000079
as text
Figure BDA00038650640800000710
Is represented by a vector of (a). τ represents a temperature over-parameter.
Training with this contrastive learning improves the alignment and uniformity of the vectors the model generates: similar instances are mapped to nearby vectors, and the vector representations are distributed more uniformly in space, so that similarity computed between vectors faithfully reflects the similarity between the texts.
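For illustration, the per-query contrastive loss described above can be sketched as follows. The similarity values and the default temperature are toy assumptions, and real training would use encoder outputs rather than hand-set scores:

```python
from math import exp, log

def contrastive_loss(sim_pos, sim_others, tau=0.05):
    """Cross-entropy over in-batch similarities for one query: sim_pos is
    the similarity to the query's own positive sample, sim_others the
    similarities to every other positive and negative in the batch."""
    num = exp(sim_pos / tau)
    denom = num + sum(exp(s / tau) for s in sim_others)
    return -log(num / denom)
```

The loss falls as the query's own positive outscores the other in-batch samples, which is exactly the alignment the training targets.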
2) Sorting module
After the long text encoder (Longformer) has been fine-tuned with contrastive learning on a small data set, it captures long-text semantics well and generates high-quality vector representations. The query text input by the user and each candidate text recalled in the recall stage are input separately into the long text encoder (Longformer); the output of the first token, the [CLS] token, is taken as the query vector and the candidate text vector; the similarity of the query vector and each candidate text vector is computed with cosine similarity; and sorting the results from largest to smallest yields the final ranking.
In summary, the invention provides a deep-learning document retrieval system. Texts are processed into vectors so that texts of different lengths are represented uniformly; vector retrieval then recalls a number of texts as candidate answers; a pre-trained language model based on a sparse self-attention mechanism (Longformer), combined with contrastive learning, generates more accurate text vector representations; and finally a similarity function reranks the candidate texts. The system combines the advantages of vector retrieval and deep models and, while keeping retrieval fast, greatly improves the quality of long-text case retrieval.
Correspondingly, the invention also provides a document retrieval method based on deep learning, which is used for realizing one or more aspects of the system. For example, referring to fig. 2, the method includes: step S210, aiming at a query text input by a user, generating a plurality of candidate texts based on a pre-stored word vector and a text vector; step S220, inputting the query text input by the user and the candidate texts into a trained long text encoder to obtain a query vector and candidate text vectors, and calculating the similarity between the query vector and the candidate text vectors to obtain a ranked retrieval result.
To further verify the effect of the invention, retrieval experiments were run on the long legal case documents published on the China Judgments Online website, using 900 real-scenario user queries: 300 each for civil, criminal, and administrative cases. Three groups of experiments were carried out: the first used a general-purpose tokenizer, the vector recall module, and the rearrangement module; the second used the specialized legal tokenizer and the vector recall module without the rearrangement module; the third used the specialized legal tokenizer, the vector recall module, and the rearrangement module. The evaluation metrics were mean average precision (MAP), mean reciprocal rank (MRR), and normalized discounted cumulative gain (NDCG). After manual review by legal professionals, the third group performed best, and the second group outperformed the first. The results show that both the specialized legal tokenizer and the rearrangement module improve the system's retrieval quality, that is, they improve retrieval speed and accuracy.
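As a reference for the metrics named above, mean reciprocal rank can be computed as follows; the queries and relevance judgments are invented for the sketch, and MAP and NDCG follow the same per-query aggregation pattern:

```python
def mean_reciprocal_rank(ranked_lists, relevant):
    """MRR: average over queries of 1/rank of the first relevant document.
    ranked_lists maps query -> ranked doc ids; relevant maps query -> set
    of relevant doc ids."""
    total = 0.0
    for q, ranked in ranked_lists.items():
        for pos, doc in enumerate(ranked, start=1):
            if doc in relevant[q]:
                total += 1.0 / pos
                break
    return total / len(ranked_lists)
```

A query whose first relevant document sits at rank 2 contributes 0.5, so MRR rewards systems that surface a relevant case early.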
In conclusion, the invention provides a deep-learning document retrieval system that makes case retrieval with deep models practical: exploiting the characteristics of case retrieval, it searches in two stages, recall then rearrangement, improving both retrieval speed and accuracy. The recall module is based on an unsupervised vector retrieval model, imposes no text-length limit, and achieves high recall rate and recall speed. The rearrangement module adopts a supervised deep retrieval model, accepts case documents of thousands of characters, and strengthens the deep model's text representation with contrastive learning under a small amount of labeled data, further improving the system's overall retrieval quality. In particular, the advantages of the invention mainly include the following aspects:
1) A specialized legal tokenizer is used when generating word vectors, so legal-domain data are segmented more accurately.
2) In the text vector generation stage, uSIF is used as the vector generation algorithm, producing more accurate case semantic vectors without limiting the input length.
3) In the vector recall stage, uSIF generates the vectors and Milvus serves as the retrieval database, so accurate case semantic vectors are produced without input-length limits and semantically related texts are retrieved quickly.
4) In the rearrangement stage, a contrastive learning framework trains the long text encoder (Longformer), giving it good generalization on small data sets, a more uniform distribution in vector space, reduced anisotropy to some extent, and stronger semantic expressiveness.
The present invention may be a system, method and/or computer program product. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therewith for causing a processor to implement various aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as a punch card or an in-groove protruding structure with instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source code or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk, C++, Python, or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, which can execute the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are equivalent.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein. The scope of the invention is defined by the appended claims.

Claims (10)

1. A deep learning based document retrieval system comprising:
a recall module configured to generate, based on pre-stored word vectors and text vectors, a plurality of candidate texts for a query text input by a user;
a rearrangement module configured to input the query text input by the user and the plurality of candidate texts into a trained long text encoder to obtain a query vector and candidate text vectors, and to calculate the similarity between the query vector and the candidate text vectors to obtain a ranked retrieval result.
2. The system of claim 1, wherein the recall module generates the plurality of candidate texts according to:
for the query text input by the user, performing word segmentation with a legal-domain-specific word segmenter, looking up the word vector corresponding to each word in a pre-stored first database, and inputting the word vectors into the uSIF algorithm to obtain an original query text vector;
subtracting a pre-stored principal component from the original query text vector to obtain a final query text vector;
retrieving, from a pre-stored second database, the K vectors most similar in semantics to the final query text vector, together with their corresponding text identifiers, as the plurality of recalled candidate texts, wherein K is a preset integer.
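The recall steps of claim 2 can be sketched as follows. This is a minimal illustration with hypothetical in-memory data: a plain average of word vectors stands in for the uSIF weighting, a dict stands in for the first (word-vector) database, a NumPy array for the second (text-vector) database, and `word_vecs`, `doc_vecs`, `doc_ids`, and `pc` are all assumed names:

```python
import numpy as np

rng = np.random.default_rng(1)
# stand-ins for the pre-stored databases and principal component
word_vecs = {w: rng.normal(size=32) for w in ["contract", "breach", "damages", "court"]}
doc_vecs = rng.normal(size=(100, 32))
doc_vecs /= np.linalg.norm(doc_vecs, axis=1, keepdims=True)  # unit rows
doc_ids = [f"case-{i}" for i in range(100)]
pc = rng.normal(size=32)
pc /= np.linalg.norm(pc)

def recall(tokens, k=5):
    # average the word vectors (a simple stand-in for the uSIF weighting)
    q = np.mean([word_vecs[t] for t in tokens if t in word_vecs], axis=0)
    q = q - (q @ pc) * pc           # subtract the pre-stored principal component
    q /= np.linalg.norm(q)
    sims = doc_vecs @ q             # cosine similarity (rows are unit vectors)
    top = np.argsort(-sims)[:k]     # indices of the K most similar texts
    return [doc_ids[i] for i in top]

candidates = recall(["contract", "breach"], k=5)
```

In the patent, the lookup and the nearest-neighbor search are backed by persistent stores rather than in-memory arrays.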
3. The system of claim 2, wherein the first database and the second database are obtained according to the steps of:
randomly extracting existing case data, performing word segmentation with the legal-domain-specific word segmenter, inputting the segmented text into a text classifier to train word vectors, and storing the word vectors in the first database with a persistence operation;
for all the selected case data, after word segmentation with the legal-domain-specific word segmenter, obtaining the vector of each word from the first database to form a word vector matrix, which is input to the uSIF algorithm to generate an original text vector for each sample case; performing singular value decomposition on the original sample-case text vectors to obtain a preset number of principal components; subtracting the principal components from the original text vectors to obtain the final sample-case text vectors; and storing the final text vectors in the second database with a persistence operation.
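The principal-component step of claim 3 matches the SIF/uSIF-style post-processing of sentence vectors: compute the top singular vector(s) of the text-vector matrix and subtract each vector's projection onto them. A sketch under the assumption of synthetic data with one dominant shared direction:

```python
import numpy as np

rng = np.random.default_rng(2)
# hypothetical raw uSIF-style text vectors with a strong shared bias direction
raw = rng.normal(size=(500, 64)) + 3.0 * rng.normal(size=(1, 64))

def remove_principal_components(vecs: np.ndarray, n_pc: int = 1) -> np.ndarray:
    """Subtract each vector's projection onto the top-n_pc singular vectors."""
    # rows of vt are the principal directions of the (uncentered) vector set
    _, _, vt = np.linalg.svd(vecs, full_matrices=False)
    pcs = vt[:n_pc]                        # (n_pc, dim)
    return vecs - (vecs @ pcs.T) @ pcs     # remove each component's projection

final = remove_principal_components(raw, n_pc=1)
```

After removal, the common component shared by all sample-case vectors is largely gone, which is what makes the remaining directions more discriminative for retrieval.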
4. The system of claim 1, wherein the training process of the long text encoder comprises:
constructing training data from a labeled case-sample data set: for each query text, cases related to it are taken as positive samples and unrelated cases as negative samples, forming a triple $(x_i, x_i^+, x_i^-)$ of the query text and its positive and negative samples;
for a batch of training data, the positive sample of $x_i$ is $x_i^+$, and its negative samples are $x_i^-$ together with the positive and negative samples of the other texts in the batch; each query text is encoded by the long text encoder and a vector representation is obtained through an average pooling layer; a similarity matrix is constructed by cosine similarity calculation, in which row $i$ holds the similarity scores between $x_i$ and all $x_j^+$ and $x_j^-$ in the same batch;
training is carried out with the set loss function as the optimization target.
5. The system of claim 4, wherein the loss function is set to:

$$\ell_i = -\log \frac{e^{\mathrm{sim}(h_i,\, h_i^+)/\tau}}{\sum_{j=1}^{N}\left(e^{\mathrm{sim}(h_i,\, h_j^+)/\tau} + e^{\mathrm{sim}(h_i,\, h_j^-)/\tau}\right)}$$

wherein $\ell_i$ denotes the loss generated for the $i$-th text; $e$ is the natural constant; $\mathrm{sim}(\cdot,\cdot)$ is the cosine similarity function; $h_i$ is the vector representation of $x_i$, $h_j^+$ that of $x_j^+$, and $h_j^-$ that of $x_j^-$; $N$ is the batch size; and $\tau$ denotes a temperature hyperparameter.
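The loss of claims 4–5 follows the pattern of a supervised contrastive (InfoNCE-style) objective with hard negatives and in-batch negatives. A NumPy sketch (in the patent, `h` would come from the Longformer encoder with average pooling; here the vectors are given directly and all names are illustrative):

```python
import numpy as np

def contrastive_loss(h, h_pos, h_neg, tau=0.05):
    """InfoNCE-style loss with hard negatives and in-batch negatives.

    h, h_pos, h_neg: (N, d) arrays of query, positive, and negative vectors.
    Row i of the similarity matrix holds sim(h_i, h_j+) and sim(h_i, h_j-)
    for every j in the batch; only column i (the own positive) is "correct".
    """
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    h, h_pos, h_neg = norm(h), norm(h_pos), norm(h_neg)
    # (N, 2N): cosine similarities to all positives, then all negatives
    sims = np.concatenate([h @ h_pos.T, h @ h_neg.T], axis=1) / tau
    # cross-entropy where the target for row i is column i
    log_probs = sims - np.log(np.exp(sims).sum(axis=1, keepdims=True))
    n = len(h)
    return float(-log_probs[np.arange(n), np.arange(n)].mean())
```

When each query's positive is close to it and everything else is far, the loss approaches zero; mismatched positives drive it up, which is what the training signal exploits.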
6. The system of claim 3, wherein the first database is a Redis database, the second database is a Milvus vector database, and the text classifier is a fastText classifier.
7. The system of claim 1, wherein the similarity between the query vector and the candidate text vectors is calculated based on cosine similarity.
8. The system of claim 1, wherein the recall module converts text into a text vector according to:
for the obtained word vector sequence, weighting the word vectors by the smooth inverse frequency of each word in the domain to generate the original text vector, wherein the smooth inverse frequency of a word reflects how often the word occurs in the domain literature.
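The smooth-inverse-frequency weighting of claim 8 can be sketched with the classic SIF weight a / (a + p(w)), where p(w) is the word's relative frequency in the domain corpus (uSIF additionally estimates the parameter a from corpus statistics rather than fixing it; the dictionaries below are hypothetical):

```python
import numpy as np

def sif_sentence_vector(tokens, word_vecs, word_freq, a=1e-3):
    """Weighted average of word vectors with weights a / (a + p(w)):
    frequent words get small weights, rare words get large ones."""
    total = sum(word_freq.values())
    vecs, weights = [], []
    for t in tokens:
        if t in word_vecs:
            p = word_freq.get(t, 0) / total  # relative corpus frequency
            weights.append(a / (a + p))
            vecs.append(word_vecs[t])
    return np.average(np.array(vecs), axis=0, weights=weights)

# tiny hypothetical vocabulary: one very frequent word, one rare legal term
word_vecs = {"the": np.array([1.0, 0.0]), "tort": np.array([0.0, 1.0])}
word_freq = {"the": 9999, "tort": 1}
v = sif_sentence_vector(["the", "tort"], word_vecs, word_freq)
```

The rare, content-bearing word dominates the resulting sentence vector, which is the point of the weighting.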
9. A document retrieval method based on deep learning comprises the following steps:
generating, based on pre-stored word vectors and text vectors, a plurality of candidate texts for a query text input by a user;
inputting the query text input by the user and the plurality of candidate texts into a trained long text encoder to obtain a query vector and candidate text vectors, and calculating the similarity between the query vector and the candidate text vectors to obtain a ranked retrieval result.
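The final step of claim 9 reduces to sorting the recalled candidates by cosine similarity to the query vector. A minimal sketch with hand-picked vectors (the encoder outputs are assumed given):

```python
import numpy as np

def rerank(query_vec, cand_vecs, cand_ids):
    """Order recalled candidates by cosine similarity to the query vector."""
    q = query_vec / np.linalg.norm(query_vec)
    c = cand_vecs / np.linalg.norm(cand_vecs, axis=1, keepdims=True)
    scores = c @ q                       # cosine similarity per candidate
    order = np.argsort(-scores)          # descending similarity
    return [(cand_ids[i], float(scores[i])) for i in order]

q = np.array([1.0, 0.0])
cands = np.array([[0.9, 0.1],   # close to the query
                  [0.0, 1.0],   # orthogonal to the query
                  [1.0, 0.05]]) # closest to the query
ranked = rerank(q, cands, ["a", "b", "c"])
```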
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method as claimed in claim 9.
CN202211175734.4A 2022-09-26 2022-09-26 Document retrieval method and system based on deep learning Pending CN115495555A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211175734.4A CN115495555A (en) 2022-09-26 2022-09-26 Document retrieval method and system based on deep learning

Publications (1)

Publication Number Publication Date
CN115495555A true CN115495555A (en) 2022-12-20

Family

ID=84473299

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211175734.4A Pending CN115495555A (en) 2022-09-26 2022-09-26 Document retrieval method and system based on deep learning

Country Status (1)

Country Link
CN (1) CN115495555A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116069903A (en) * 2023-03-02 2023-05-05 特斯联科技集团有限公司 Class search method, system, electronic equipment and storage medium
CN116150625A (en) * 2023-03-08 2023-05-23 华院计算技术(上海)股份有限公司 Training method and device for text search model and computing equipment
CN116150625B (en) * 2023-03-08 2024-03-29 华院计算技术(上海)股份有限公司 Training method and device for text search model and computing equipment
CN116578693A (en) * 2023-07-14 2023-08-11 深圳须弥云图空间科技有限公司 Text retrieval method and device
CN116578693B (en) * 2023-07-14 2024-02-20 深圳须弥云图空间科技有限公司 Text retrieval method and device
CN116680420A (en) * 2023-08-02 2023-09-01 昆明理工大学 Low-resource cross-language text retrieval method and device based on knowledge representation enhancement
CN116680420B (en) * 2023-08-02 2023-10-13 昆明理工大学 Low-resource cross-language text retrieval method and device based on knowledge representation enhancement
CN116933091A (en) * 2023-09-19 2023-10-24 中国地质大学(武汉) Landslide vulnerability prediction method and device
CN116933091B (en) * 2023-09-19 2024-02-27 中国地质大学(武汉) Landslide vulnerability prediction method and device
CN117272995A (en) * 2023-11-21 2023-12-22 长威信息科技发展股份有限公司 Repeated work order recommendation method and device
CN117272995B (en) * 2023-11-21 2024-01-30 长威信息科技发展股份有限公司 Repeated work order recommendation method and device

Similar Documents

Publication Publication Date Title
CN109992782B (en) Legal document named entity identification method and device and computer equipment
CN109840287B (en) Cross-modal information retrieval method and device based on neural network
CN115495555A (en) Document retrieval method and system based on deep learning
CN106845411B (en) Video description generation method based on deep learning and probability map model
CN107301170B (en) Method and device for segmenting sentences based on artificial intelligence
CN112800170A (en) Question matching method and device and question reply method and device
CN110619043A (en) Automatic text abstract generation method based on dynamic word vector
CN115438166A (en) Keyword and semantic-based searching method, device, equipment and storage medium
CN110941958B (en) Text category labeling method and device, electronic equipment and storage medium
CN112434164B (en) Network public opinion analysis method and system taking topic discovery and emotion analysis into consideration
CN111881264B (en) Method and electronic equipment for searching long text in question-answering task in open field
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN114416979A (en) Text query method, text query equipment and storage medium
CN112926308B (en) Method, device, equipment, storage medium and program product for matching text
CN112528653B (en) Short text entity recognition method and system
CN112100413A (en) Cross-modal Hash retrieval method
CN113407814A (en) Text search method and device, readable medium and electronic equipment
CN113934835B (en) Retrieval type reply dialogue method and system combining keywords and semantic understanding representation
CN113806510B (en) Legal provision retrieval method, terminal equipment and computer storage medium
CN115062123A (en) Knowledge base question-answer pair generation method of conversation generation system
Asmawati et al. Sentiment analysis of text memes: A comparison among supervised machine learning methods
CN117494815A (en) File-oriented credible large language model training and reasoning method and device
CN113486143A (en) User portrait generation method based on multi-level text representation and model fusion
CN113254586A (en) Unsupervised text retrieval method based on deep learning
CN114626378A (en) Named entity recognition method and device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination