CN111767373A - Document retrieval method, document retrieval device, electronic equipment and storage medium - Google Patents

Document retrieval method, document retrieval device, electronic equipment and storage medium

Info

Publication number
CN111767373A
Authority
CN
China
Prior art keywords
vector
sentence
retrieved
vector set
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010617097.6A
Other languages
Chinese (zh)
Inventor
党升
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An International Smart City Technology Co Ltd
Original Assignee
Ping An International Smart City Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An International Smart City Technology Co Ltd filed Critical Ping An International Smart City Technology Co Ltd
Priority to CN202010617097.6A priority Critical patent/CN111767373A/en
Publication of CN111767373A publication Critical patent/CN111767373A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/3349: Reuse of stored results of previous queries

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to the technical field of artificial intelligence and provides a document retrieval method, a document retrieval device, electronic equipment and a storage medium. The method comprises the following steps: generating a sentence vector to be retrieved based on a sentence to be retrieved; determining, through a preset clustering model, a first vector set to which the sentence vector to be retrieved belongs; determining a second vector set to which the sentence vector to be retrieved belongs according to the distance between the sentence vector to be retrieved and the central vector of the first vector set; performing similarity calculation between the sentence vector to be retrieved and each sentence vector in the second vector set, so as to determine a target sentence vector in the second vector set; and outputting the document pointed to by the target sentence vector. The scheme can improve the retrieval efficiency of large-scale, semantics-based full-text retrieval. In addition, the application relates to blockchain technology, in that the clustering model can be stored in a blockchain.

Description

Document retrieval method, document retrieval device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a document retrieval method, a document retrieval apparatus, an electronic device, and a computer-readable storage medium.
Background
At present, frameworks such as Elasticsearch or Solr are often adopted in the industry to build search engines for large-scale full-text search. However, a search engine built on these frameworks must first segment the search content into words and then search with the segmented words as the basic unit; it cannot use the semantics of the search content to find results that use different words but have similar meanings. If, instead, the search content is analyzed and searched on the basis of semantics, each search requires a semantic similarity calculation between the search content and every item in the database, so the number of operations per search grows rapidly as the data volume increases, resulting in low search efficiency.
Disclosure of Invention
In view of this, embodiments of the present application provide a document retrieval method, a document retrieval apparatus, an electronic device, and a computer-readable storage medium, which can improve retrieval efficiency of performing large-scale full-text retrieval based on semantics.
A first aspect of an embodiment of the present application provides a document retrieval method, including:
generating a sentence vector to be retrieved based on the sentence to be retrieved;
determining a first vector set to which the sentence vector to be retrieved belongs through a preset clustering model, wherein the first vector set comprises sentence vectors, and the sentence vectors are used for pointing to documents in a preset database;
determining a second vector set to which the sentence vector to be retrieved belongs according to the distance between the sentence vector to be retrieved and a central vector of the first vector set, wherein the central vector is an average value of the sentence vectors of the first vector set, and the second vector set is a proper subset of the first vector set;
performing similarity calculation between the sentence vector to be retrieved and each sentence vector in the second vector set, so as to determine a target sentence vector in the second vector set;
and outputting the document pointed by the target sentence vector.
A second aspect of an embodiment of the present application provides a document retrieval apparatus including:
a sentence vector generating unit, configured to generate a sentence vector to be retrieved based on the sentence to be retrieved;
a first vector set determining unit, configured to determine, through a preset clustering model, a first vector set to which the sentence vector to be retrieved belongs, where the first vector set includes sentence vectors, and the sentence vectors are used for pointing to documents in a preset database;
a second vector set determining unit, configured to determine, according to a distance between the sentence vector to be retrieved and a center vector of the first vector set, a second vector set to which the sentence vector to be retrieved belongs, where the center vector is an average of the sentence vectors of the first vector set, and the second vector set is a proper subset of the first vector set;
a target sentence vector determining unit, configured to perform similarity calculation on the to-be-retrieved sentence vector and each sentence vector in the second vector set, so as to determine a target sentence vector in the second vector set;
and a retrieval result output unit, configured to output the document pointed to by the target sentence vector.
A third aspect of the embodiments of the present application provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the document retrieval method provided in the first aspect when executing the computer program.
A fourth aspect of embodiments of the present application provides a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the document retrieval method provided by the first aspect.
Implementing the document retrieval method, document retrieval device, electronic equipment and computer-readable storage medium provided by the embodiments of the present application has the following beneficial effects: the electronic equipment performs semantic retrieval with the sentence to be retrieved as the unit. Because the first vector sets, and the second vector sets under each first vector set, are divided in advance based on the database, semantic retrieval for the sentence to be retrieved does not need to traverse all sentence vectors stored in the database. Instead, the first vector set that best matches the sentence to be retrieved is determined first, which realizes a preliminary retrieval; the best-matching second vector set is then found within that first vector set, which realizes a deeper retrieval; and finally the best-matching sentence vectors are retrieved within that second vector set to obtain the final retrieval result. This process narrows the range of semantic retrieval, reduces its amount of calculation, and improves its efficiency.
Drawings
In order to illustrate the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments or of the prior art are briefly introduced below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic diagram of a database provided in an embodiment of the present application;
FIG. 2 is a schematic diagram of a construction process of a database-based vector space according to an embodiment of the present application;
FIG. 3 is a diagram of partitioning a second set of vectors based on a vector space, provided by an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating an implementation of a document retrieval method according to an embodiment of the present application;
FIG. 5 is a block diagram of a document retrieval apparatus according to an embodiment of the present application;
FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The document retrieval method according to the embodiment of the present application may be applied to electronic devices such as a server, a desktop computer, a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an Augmented Reality (AR)/Virtual Reality (VR) device, a notebook computer, a super-mobile personal computer (UMPC), a netbook, and a Personal Digital Assistant (PDA), and the specific type of the electronic device is not limited in the embodiment of the present application.
To facilitate understanding of the document retrieval method provided in the embodiments of the present application, a database and a vector space constructed based on the database are described below.
Referring to FIG. 1, FIG. 1 shows the structure of the database used in the document retrieval method provided in an embodiment of the present application. At least one document is stored in the database, and each document is composed of at least one sentence. To implement the document retrieval method provided by the embodiment of the present application, the electronic device needs to construct a vector space in advance based on the content stored in the database (including the documents and the sentences in each document).
Referring to FIG. 2, based on the database, the vector space is constructed as follows:
Step 201, assigning a unique document index number to each document;
In this embodiment, different documents correspond to different document index numbers. That is, there is a one-to-one correspondence between documents and document index numbers: each document index number uniquely points to one document, and each document is pointed to by only one document index number.
Step 202, for each document, assigning a unique sentence index number to each sentence forming the document;
In this embodiment, within a given document, different sentences correspond to different sentence index numbers. That is, within a given document, there is a one-to-one correspondence between sentences and sentence index numbers: each sentence index number uniquely points to one sentence in the document, and each sentence in the document is pointed to by only one sentence index number.
For example, both the document index numbers and the sentence index numbers may be numbered consecutively starting from 1, which is not limited herein. Assuming there are A documents in the database, the documents may be assigned the document index numbers 1, 2, 3, ..., A. Similarly, assuming there are B sentences in document 1, the sentences in document 1 may be assigned the sentence index numbers 1, 2, 3, ..., B; and assuming there are C sentences in document 2, the sentences in document 2 may be assigned the sentence index numbers 1, 2, 3, ..., C. Although the sentences in documents 1 and 2 are both numbered from 1, the documents to which they belong have different document index numbers, so no confusion arises.
Through steps 201 and 202, a particular document and a particular sentence in the database can be uniquely determined based on a given document index number and a given sentence index number.
Step 203, generating a corresponding sentence vector for each sentence, and establishing an association relationship among the sentence vector, the sentence index number of the sentence, and the document index number of the document to which the sentence belongs;
In this embodiment, the electronic device may generate the sentence vector for a sentence based on a preset sentence vector generation model. Illustratively, the sentence vector generation model may be a pre-trained Bidirectional Encoder Representations from Transformers (BERT) model; alternatively, it may be a word2vec model; the sentence vector generation model is not limited herein. The constructed association relationship can be expressed as Vec-S-D, where Vec is a sentence vector, S is the sentence index number of the sentence from which the sentence vector was generated, and D is the document index number of the document to which the sentence belongs. Through the association relationship, each sentence vector can point to a specific sentence of a specific document in the database.
For example, assuming that there are A documents in the database and each document is composed of B sentences on average, there are A × B sentences in the database; correspondingly, A × B sentence vectors may be generated.
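Steps 201 to 203 can be sketched as follows. This is a minimal illustration, not the applicant's implementation: `SentenceRecord`, `toy_embed` and `build_records` are hypothetical names, and `toy_embed` is a trivial stand-in for a real sentence vector generation model such as BERT or word2vec.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SentenceRecord:
    """One Vec-S-D association: sentence vector, sentence index, document index."""
    vec: tuple[float, ...]
    sentence_idx: int  # S, unique only within its document
    doc_idx: int       # D, unique across the database


def toy_embed(sentence: str) -> tuple[float, ...]:
    # Hypothetical stand-in for a sentence vector model: a few character counts.
    s = sentence.lower()
    return (float(len(s)), float(s.count("a")), float(s.count("e")))


def build_records(documents: list[list[str]]) -> list[SentenceRecord]:
    records = []
    # Step 201: document index numbers start at 1.
    for doc_idx, sentences in enumerate(documents, start=1):
        # Step 202: sentence index numbers restart at 1 in every document.
        for sentence_idx, sentence in enumerate(sentences, start=1):
            # Step 203: tie the vector to its (S, D) pair.
            records.append(SentenceRecord(toy_embed(sentence), sentence_idx, doc_idx))
    return records


documents = [["Cats sleep.", "Dogs bark."], ["Birds sing."]]
records = build_records(documents)
```

Because the pair (D, S) is unique even though S alone is not, each record points back to exactly one sentence of one document.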
Step 204, clustering the sentence vectors in the database based on a preset clustering algorithm to obtain at least one first vector set, and retaining the clustering model.
In this embodiment, the preset clustering algorithm may be the K-means clustering algorithm or another clustering algorithm, which is not limited herein. The clustering algorithm clusters the sentence vectors in the database; each cluster obtained is a first vector set, and the corresponding clustering model is retained after clustering is finished. Each first vector set can be regarded as a large class in which the sentence vectors belonging to that class are gathered. The central vector of each cluster is the central vector of the corresponding first vector set, and can specifically be represented by the average value of the sentence vectors of the first vector set.
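As an illustration of step 204, the sketch below runs a plain K-means loop in pure Python. It assumes Euclidean distance and a naive initialisation from the first k vectors; the text does not specify either, and a production system would more likely use a library implementation. The function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Central vector: component-wise average of the given vectors."""
    n = len(vectors)
    return tuple(sum(v[i] for v in vectors) / n for i in range(len(vectors[0])))


def nearest(vec, centers):
    """Index of the center closest to vec (Euclidean distance, an assumption)."""
    return min(range(len(centers)), key=lambda c: math.dist(vec, centers[c]))


def assign(vectors, centers):
    clusters = [[] for _ in centers]
    for v in vectors:
        clusters[nearest(v, centers)].append(v)
    return clusters


def kmeans(vectors, k, iterations=20):
    centers = list(vectors[:k])  # naive initialisation, for illustration only
    for _ in range(iterations):
        clusters = assign(vectors, centers)
        # An empty cluster keeps its previous center.
        centers = [mean_vector(c) if c else centers[i] for i, c in enumerate(clusters)]
    return centers, assign(vectors, centers)


vectors = [(0.0, 0.0), (10.0, 10.0), (0.0, 1.0), (10.0, 11.0), (1.0, 0.0), (11.0, 10.0)]
centers, first_vector_sets = kmeans(vectors, k=2)
```

Here the retained "clustering model" is simply the list of central vectors: classifying a new vector means calling `nearest` against them.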
In some embodiments, the electronic device may upload the clustering model to a blockchain for storage.
To ensure the security of the data and fairness and transparency for users, the clustering model can be uploaded to a blockchain for evidence storage. A user can then download the clustering model from the blockchain through their own device to verify whether the clustering model has been tampered with. The blockchain in this embodiment is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each data block contains the information of a batch of network transactions and is used to verify the validity (anti-counterfeiting) of that information and to generate the next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step 205, for each first vector set, dividing the first vector set to obtain at least two second vector sets under it.
In this embodiment, each first vector set may be further divided into at least two second vector sets, and the division may be based on the distance between vectors. Specifically: the distance between each sentence vector of the first vector set and the central vector of the first vector set is calculated; according to a preset distance threshold between vectors, the range from zero to the maximum of the calculated distances is divided into intervals, so as to obtain at least two preset distance intervals associated with the first vector set; each distance interval corresponds to one second vector set. Each second vector set obtained in this way is a proper subset of the corresponding first vector set. The division of the second vector sets is explained below with reference to FIG. 3:
FIG. 3 shows the central vector of a first vector set and the vector of that first vector set that is farthest from the central vector. The distance between these two vectors is the maximum distance from the central vector within the first vector set, denoted Lmax. Denoting the distance threshold between vectors as R and the number of distance intervals into which the first vector set is to be divided as k, the value of R can be calculated by
R = Lmax / k
After R is calculated, k circles are drawn with the end point of the central vector as the center and R, 2R, 3R, ..., kR as the radii in sequence. In FIG. 3, k is 4, and the first vector set shown in FIG. 3 is divided into Item1 (distance interval [0, R], i.e., a circle with the end point of the central vector as the center and R as the radius), Item2 (distance interval (R, 2R], i.e., a ring with the end point of the central vector as the center, R as the inner radius, and 2R as the outer radius), Item3 (distance interval (2R, 3R], i.e., a ring with the end point of the central vector as the center, 2R as the inner radius, and 3R as the outer radius), and Item4 (distance interval (3R, Lmax], i.e., a ring with the end point of the central vector as the center, 3R as the inner radius, and Lmax (i.e., 4R) as the outer radius). Assuming that a sentence vector Vec1 belongs to the first vector set and that the distance between Vec1 and the central vector is d, where d falls within the interval (R, 2R] (i.e., the end point of Vec1 falls within the ring centered on the end point of the central vector, with R as the inner radius and 2R as the outer radius), it can be determined that Vec1 belongs to the second vector set Item2 under the first vector set.
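The ring division described with FIG. 3 can be sketched as below, assuming R = Lmax / k (which is what radii R, 2R, ..., kR with kR = Lmax imply) and Euclidean distance. `ring_index` and `partition_rings` are hypothetical names; clamping distances larger than Lmax into the outermost ring is an extra assumption that the text does not address.

```python
import math


def ring_index(distance, radius_step, k):
    """Return j in 1..k such that distance lies in ((j-1)*R, j*R]; ring 1 also holds 0."""
    if radius_step == 0 or distance <= 0:
        return 1
    # Clamp to k so that a distance beyond Lmax lands in the outermost ring (assumption).
    return min(k, max(1, math.ceil(distance / radius_step)))


def partition_rings(vectors, center, k):
    distances = [math.dist(v, center) for v in vectors]
    l_max = max(distances)       # Lmax: farthest member from the central vector
    radius_step = l_max / k      # R = Lmax / k
    rings = {j: [] for j in range(1, k + 1)}
    for v, d in zip(vectors, distances):
        rings[ring_index(d, radius_step, k)].append(v)
    return radius_step, rings


center = (0.0, 0.0)
members = [(1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (4.0, 0.0), (1.5, 0.0)]
radius_step, rings = partition_rings(members, center, k=4)
```

In this toy data Lmax is 4 and k is 4, so R is 1; the vector at distance 1.5 falls in (R, 2R] and therefore lands in ring 2, matching the Vec1 and Item2 example.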
Thus, through the above process, the electronic device has constructed a vector space based on the database. The vector space contains at least one first vector set, and each first vector set contains at least two second vector sets. Based on the vector space, the sentence vector corresponding to each sentence in the database is assigned to one second vector set under one first vector set.
Referring to FIG. 4, FIG. 4 is a flowchart illustrating an implementation of a document retrieval method according to an embodiment of the present application. As shown in FIG. 4, the document retrieval method provided by the present embodiment may include:
Step 401, generating a sentence vector to be retrieved based on a sentence to be retrieved;
In this embodiment, the electronic device may generate the sentence vector to be retrieved for the sentence to be retrieved based on a preset sentence vector generation model. Illustratively, the sentence vector generation model may be a pre-trained BERT model; alternatively, it may be a word2vec model; the sentence vector generation model is not limited herein.
Step 402, determining a first vector set to which the sentence vector to be retrieved belongs through a preset clustering model;
In this embodiment, as described above, a clustering model and at least one first vector set have been configured in the electronic device in advance. Through the clustering model, the electronic device can classify the sentence vector to be retrieved into the appropriate first vector set; that is, the electronic device can determine, through the clustering model, the first vector set to which the sentence vector to be retrieved belongs. This step can be regarded as a preliminary retrieval that finds the large class, i.e., the first vector set, to which the sentence vector to be retrieved belongs.
Step 403, determining a second vector set to which the sentence vector to be retrieved belongs according to the distance between the sentence vector to be retrieved and the central vector of the first vector set;
In this embodiment, as mentioned above, a first vector set can be regarded as a large class, and a second vector set under it can be regarded as a small class under that large class. After determining the first vector set to which the sentence vector to be retrieved belongs, the electronic device may calculate the distance between the sentence vector to be retrieved and the central vector of the first vector set and, from the at least two preset distance intervals associated with the first vector set, determine the distance interval in which the distance falls as the target distance interval. Since each distance interval is associated with one second vector set under the first vector set, the second vector set associated with the target distance interval can be determined as the second vector set to which the sentence vector to be retrieved belongs. Through steps 402 and 403, the search range is narrowed step by step.
Step 404, performing similarity calculation on the sentence vector to be retrieved and each sentence vector in the second vector set to determine a target sentence vector in the second vector set;
In this embodiment, each second vector set contains at least one sentence vector. To retrieve, from the second vector set to which the sentence vector to be retrieved belongs, the target sentence vectors that best match it, the electronic device may perform similarity calculation between the sentence vector to be retrieved and each sentence vector of the second vector set. Generally, the higher the similarity result, the better the sentence vector to be retrieved matches the corresponding sentence vector; therefore, the electronic device may sort the sentence vectors belonging to the second vector set from highest to lowest similarity and determine the top N sentence vectors as the target sentence vectors. N is a preset positive integer; for example, N may be set to 3.
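Steps 402 to 404 can be put together in a short sketch over a hand-built toy index. Several choices here are assumptions rather than details from the text: the clustering model is reduced to a nearest-center lookup (as K-means prediction would be), the distance is Euclidean, the similarity measure is cosine similarity, and the `index` layout and `retrieve` function are hypothetical.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


# Toy index: each first vector set stores its central vector, ring width R,
# ring count k, and its second vector sets as lists of (Vec, S, D) records.
index = [
    {"center": (1.0, 1.0), "R": 1.0, "k": 2,
     "rings": {1: [((1.0, 1.5), 1, 1), ((1.5, 1.0), 2, 1)], 2: [((1.0, 3.0), 1, 2)]}},
    {"center": (10.0, 10.0), "R": 1.0, "k": 2,
     "rings": {1: [((10.0, 10.5), 3, 2)], 2: [((12.0, 10.0), 1, 3)]}},
]


def retrieve(query, index, top_n=3):
    # Step 402: pick the first vector set whose central vector is nearest.
    first = min(index, key=lambda s: math.dist(query, s["center"]))
    # Step 403: the distance to the central vector selects the ring (second vector set).
    d = math.dist(query, first["center"])
    ring = min(first["k"], max(1, math.ceil(d / first["R"])))
    candidates = first["rings"][ring]
    # Step 404: rank only the candidates in that ring and keep the top N.
    ranked = sorted(candidates, key=lambda rec: cosine(query, rec[0]), reverse=True)
    return ranked[:top_n]


hits = retrieve((1.2, 1.4), index, top_n=1)
```

Only the records in one ring of one cluster are scored, which is where the method saves work compared with scoring every stored sentence vector.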
Step 405, outputting the document pointed by the target sentence vector.
In this embodiment, as described above, every sentence vector is generated from a sentence, and the electronic device has established in advance the association relationship between each sentence vector and the document index number of the document to which its sentence belongs; moreover, based on a given document index number, the electronic device can uniquely identify a particular document in the database. On this basis, the electronic device can first acquire the document index number associated with the target sentence vector, then look up the document pointed to by that document index number in the database, and output the document as the retrieval result for the user to consult.
Optionally, the association relationship stored in the database further includes the sentence index number of the sentence (i.e., the association relationship among the sentence vector generated from the sentence, the sentence index number of the sentence, and the document index number of the document to which the sentence belongs); and, based on a given document index number and a given sentence index number, the electronic device can uniquely determine a specific document in the database and a specific sentence in that document. On this basis, the electronic device can first acquire the document index number and the sentence index number associated with the target sentence vector, then look up the document pointed to by the document index number in the database, look up the sentence pointed to by the sentence index number in that document, and output both the document and the sentence, so that the user can consult a more detailed retrieval result.
It should be noted that, when the number of the retrieved target sentence vectors is M, the electronic device outputs M sentences and M documents correspondingly.
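A minimal sketch of step 405 follows, assuming the database is a nested mapping from document index number to sentence index number to sentence text. That layout and the `resolve` function are hypothetical; the text only requires that (D, S) identify a sentence uniquely.

```python
# Hypothetical database layout: {document index D: {sentence index S: sentence text}}
database = {
    1: {1: "Cats sleep.", 2: "Dogs bark."},
    2: {1: "Birds sing."},
}


def resolve(target_records, database):
    """Map each target (Vec, S, D) record back to its document and sentence."""
    results = []
    for _vec, sentence_idx, doc_idx in target_records:
        sentences = database[doc_idx]
        results.append({
            "doc_idx": doc_idx,
            # Rebuild the document text from its sentences in index order.
            "document": " ".join(sentences[i] for i in sorted(sentences)),
            "sentence": sentences[sentence_idx],
        })
    return results


targets = [((1.0, 1.5), 2, 1), ((0.5, 0.5), 1, 2)]
results = resolve(targets, database)
```

One output entry is produced per target sentence vector, so M target vectors yield M sentences and M documents.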
Optionally, before step 401, the document retrieval method further includes:
performing sentence segmentation on a document to be retrieved to obtain at least one sentence forming the document to be retrieved;
determining each sentence of the at least one sentence, in turn, as the sentence to be retrieved;
In this embodiment, the electronic device may receive an input retrieval instruction and determine the document pointed to by the retrieval instruction as the document to be retrieved. The retrieval instruction may point to a document local to the electronic device, or to a document stored in a preset cloud, which is not limited herein. Since a document normally consists of at least one sentence, the electronic device may perform sentence segmentation on the document to be retrieved to obtain the at least one sentence forming it. Because this embodiment performs full-text retrieval, the electronic device may determine each of these sentences, in turn, as the sentence to be retrieved; that is, every sentence in the document to be retrieved is used as the sentence to be retrieved for the subsequent steps. It should be noted that, when the document to be retrieved consists of only one sentence, the electronic device need not perform sentence segmentation and may directly use the document to be retrieved as the sentence to be retrieved.
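The text does not say how sentences are segmented. A naive sketch, assuming that splitting after Western or Chinese terminal punctuation is good enough, could look like this (`split_sentences` is a hypothetical helper; it will wrongly split abbreviations such as "e.g." and decimals such as "3.14"):

```python
import re

# Split right after a sentence-ending mark, swallowing any following whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")


def split_sentences(text: str) -> list[str]:
    parts = _SENTENCE_END.split(text.strip())
    return [p for p in parts if p]  # drop the empty tail after the last mark


sentences = split_sentences("Cats sleep. Dogs bark! Birds sing?")
for sentence_to_retrieve in sentences:
    # Each sentence would be passed, in turn, to steps 401 to 405.
    pass
```

A document without any terminal punctuation comes back as a single sentence, which corresponds to the one-sentence case where segmentation is unnecessary.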
Optionally, the electronic device may dynamically update the content stored in the database. To this end, after step 403, the document retrieval method further includes:
assigning an idle document index number to the document to be retrieved;
establishing an association relationship between the document index number assigned to the document to be retrieved and the sentence vector to be retrieved;
and adding, based on the association relationship, the sentence vector to be retrieved to the sentence vectors belonging to the second vector set.
After the second vector set to which the sentence vector to be retrieved belongs is determined, the vector space constructed from the database can be updated; that is, the sentence vector to be retrieved is added to the sentence vectors belonging to that second vector set, so as to enrich the content stored in the database. As described above, the association relationship of each sentence vector stored in the vector space can be expressed as Vec-S-D (i.e., the sentence vector, the sentence index number of the corresponding sentence, and the document index number of the document to which the sentence belongs). On this basis, denote the sentence vector to be retrieved generated from the current sentence to be retrieved as Vec_new. When updating the database based on Vec_new, the electronic device may first assign an idle document index number D_new to the current document to be retrieved and an idle sentence index number S_new to the sentence to be retrieved corresponding to Vec_new, and then establish the association relationship among Vec_new, D_new and S_new. Based on this association relationship, the electronic device can add the sentence vector to be retrieved Vec_new to the corresponding second vector set, enriching the sentence vectors in that second vector set (that is, the sentence to be retrieved becomes a newly added sentence in the database, the document to be retrieved becomes a newly added document, and the sentence vector to be retrieved becomes a newly added sentence vector).
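The update can be sketched as below, assuming that an "idle" index number simply means one not yet in use and that a second vector set is a list of (Vec, S, D) records. `allocate_doc_index` and `add_sentence_vector` are hypothetical helpers.

```python
def allocate_doc_index(used_doc_indices: set[int]) -> int:
    """Pick an unused document index number (here: one past the largest in use)."""
    new_idx = max(used_doc_indices, default=0) + 1
    used_doc_indices.add(new_idx)
    return new_idx


def add_sentence_vector(second_vector_set, vec_new, s_new, d_new):
    """Store the Vec_new-S_new-D_new association in the chosen second vector set."""
    record = (vec_new, s_new, d_new)
    second_vector_set.append(record)
    return record


used_doc_indices = {1, 2}
second_vector_set = [((1.0, 1.5), 1, 1)]

d_new = allocate_doc_index(used_doc_indices)  # one index for the whole new document
# Sentences of the new document are numbered from 1, as for stored documents.
record = add_sentence_vector(second_vector_set, (1.1, 1.4), 1, d_new)
```

Note that this sketch leaves the central vector and ring boundaries unchanged; whether and when they are recomputed after insertions is not covered by the text.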
Assume that the number of sentences stored in the database is X, that Y first vector sets are preconfigured in the database, that each first vector set contains Z second vector sets on average, and that each second vector set contains W sentence vectors. With the document retrieval method provided by this embodiment, a retrieval involves about Y + Z + W calculations. With existing brute-force semantic matching, every sentence in the database must be matched once, so a retrieval involves about X calculations. The document retrieval method provided by this embodiment can therefore greatly reduce the number of calculations during retrieval and significantly improve retrieval speed and efficiency.
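To make the comparison concrete, the following sketch plugs in illustrative numbers. The values of Y, Z and W are invented for the example, and X = Y × Z × W holds only under the simplifying assumption that all sets are the same size.

```python
def hierarchical_cost(y: int, z: int, w: int) -> int:
    """Approximate calculations per query for the two-level scheme: Y + Z + W."""
    return y + z + w


def brute_force_cost(x: int) -> int:
    """Brute-force matching compares the query with every stored sentence: X."""
    return x


# Illustrative sizes only: 100 first vector sets, 10 second vector sets in each,
# and 1000 sentence vectors in each second vector set.
y, z, w = 100, 10, 1000
x = y * z * w  # total stored sentences if every set has the same size

speedup = brute_force_cost(x) / hierarchical_cost(y, z, w)
```

With these numbers the brute-force approach needs 1,000,000 comparisons per query while the two-level scheme needs about 1,110, a reduction of roughly 900 times.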
As can be seen from the above, in the document retrieval method provided in this embodiment, the electronic device establishes a vector space in advance based on the content stored in the database. The vector space stores the association relationship among the sentence vector generated from a sentence, the sentence index number of the sentence, and the document index number of the document to which the sentence belongs, so that the electronic device can uniquely determine a specific document and a specific sentence in the database based on a given document index number and a given sentence index number. The vector space structurally divides the sentence vectors in the database in advance: a plurality of first vector sets are configured, and a plurality of second vector sets are arranged under each first vector set. On this basis, the electronic device can perform semantic retrieval with the sentence to be retrieved as the unit. During retrieval, the electronic device does not traverse all the content stored in the database; instead, it first determines the first vector set that best matches the sentence to be retrieved, which realizes a preliminary retrieval; it then finds the best-matching second vector set within that first vector set, which realizes a deeper retrieval; and it finally retrieves the best-matching sentence vectors within that second vector set to obtain the final retrieval result. This process narrows the range of semantic retrieval, reduces its amount of calculation, and improves its efficiency.
Referring to fig. 5, fig. 5 is a block diagram of a document retrieval device according to an embodiment of the present disclosure. In this embodiment, each unit included in the document retrieval device is configured to execute the steps in the embodiments of the document retrieval method; for details, refer to the relevant description in the embodiments corresponding to the document retrieval method. For convenience of explanation, only the portions related to the present embodiment are shown.
Referring to fig. 5, the document retrieval apparatus 5 includes:
a sentence vector to be retrieved generating unit 501, configured to generate a sentence vector to be retrieved based on a sentence to be retrieved;
a first vector set determining unit 502, configured to determine, through a preset clustering model, a first vector set to which the sentence vector to be retrieved belongs, where the first vector set includes sentence vectors, and the sentence vectors are used to point to documents in a preset database;
a second vector set determining unit 503, configured to determine a second vector set to which the sentence vector to be retrieved belongs according to a distance between the sentence vector to be retrieved and a central vector of the first vector set, where the central vector is an average value of the sentence vectors of the first vector set, and the second vector set is a proper subset of the first vector set;
a target sentence vector determining unit 504, configured to perform similarity calculation on the sentence vector to be retrieved and each sentence vector in the second vector set, so as to determine a target sentence vector in the second vector set;
and a search result output unit 505 for outputting the document pointed by the target sentence vector.
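The cooperation of units 502 to 505 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: nearest-center assignment stands in for the preset clustering model, Euclidean distance and cosine similarity are assumed as the distance and similarity measures, and the helper names (`make_first_set`, `retrieve`) and the toy two-dimensional vectors are hypothetical. Sentence vector generation (unit 501) is outside the sketch, so the query is passed in as a vector.

```python
import math
from bisect import bisect_right


def mean_vector(vectors):
    """Center vector: component-wise average of a set's sentence vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def make_first_set(second_sets, bounds):
    """second_sets[i] holds the (vector, doc_id) pairs of distance interval i;
    bounds are the ascending boundaries between neighbouring intervals."""
    vectors = [vec for subset in second_sets for vec, _ in subset]
    return {"center": mean_vector(vectors), "bounds": bounds,
            "second_sets": second_sets}


def retrieve(query_vec, first_sets, documents, top_k=1):
    # Unit 502 (assumed here to be nearest-center assignment).
    first = min(first_sets, key=lambda s: euclidean(query_vec, s["center"]))
    # Unit 503: the distance to the center vector selects a distance interval.
    distance = euclidean(query_vec, first["center"])
    idx = min(bisect_right(first["bounds"], distance),
              len(first["second_sets"]) - 1)
    # Unit 504: similarity is computed against the selected second set only.
    ranked = sorted(first["second_sets"][idx],
                    key=lambda entry: cosine(query_vec, entry[0]),
                    reverse=True)
    # Unit 505: output the documents the target sentence vectors point to.
    return [documents[doc_id] for _, doc_id in ranked[:top_k]]


documents = {1: "doc-1", 2: "doc-2", 3: "doc-3", 4: "doc-4"}
set_a = make_first_set([[((3, 2), 1)], [((1, 0), 2), ((2, 4), 3)]],
                       bounds=[1.5])
set_b = make_first_set([[((-3, -2), 4)], [((-1, 0), 4), ((-2, -4), 4)]],
                       bounds=[1.5])

print(retrieve((1.2, 0.1), [set_a, set_b], documents))  # ['doc-2']
```

In this toy index each second vector set is a proper subset of its first vector set, and only the two sentence vectors in the selected second set are compared with the query.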
In an embodiment of the present application, the document retrieval device 5 further includes:
the sentence dividing processing unit is used for carrying out sentence dividing processing on the document to be retrieved to obtain at least one sentence forming the document to be retrieved;
and the sentence determining unit is used for sequentially determining any sentence in the at least one sentence as a sentence to be retrieved.
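The application does not state how sentence division is performed. The sketch below assumes a simple split on Latin and CJK sentence-ending punctuation; the regular expression and the function name `split_sentences` are illustrative assumptions, and a production system would likely need to handle abbreviations and decimal numbers.

```python
import re

# Assumed sentence terminators: Latin and CJK full stop, question mark and
# exclamation mark. A run of other characters plus an optional terminator
# is treated as one sentence.
_SENTENCE = re.compile(r"[^.!?。！？]+[.!?。！？]?")


def split_sentences(document):
    """Split a document to be retrieved into its constituent sentences."""
    return [m.strip() for m in _SENTENCE.findall(document) if m.strip()]


text = "Vectors are clustered in advance. Which set matches best? Output it"
# Each sentence is taken in turn as the sentence to be retrieved.
for sentence in split_sentences(text):
    print(sentence)
```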
In an embodiment of the present application, the document retrieval device 5 further includes:
the index number allocation unit is used for allocating an idle document index number to the document to be retrieved;
the association relationship establishing unit is used for establishing an association relationship between the document index number allocated to the document to be retrieved and the sentence vector to be retrieved;
and the updating unit is used for updating the sentence vector to be retrieved to the second vector set based on the association relationship.
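The three units above can be sketched as a small bookkeeping class. This is only an illustration: the class name `VectorIndex`, its attributes, and the use of an incrementing counter to find an idle document index number are assumptions rather than details given in the application.

```python
class VectorIndex:
    """Toy bookkeeping for document index numbers and associations."""

    def __init__(self):
        self.documents = {}      # document index number -> document text
        self.associations = {}   # (doc_id, sentence_no) -> sentence vector
        self._next_doc_id = 0    # assumed: the next idle index number

    def allocate_document_id(self, document):
        """Allocate an idle document index number to a new document."""
        doc_id = self._next_doc_id
        self._next_doc_id += 1
        self.documents[doc_id] = document
        return doc_id

    def add_sentence(self, doc_id, sentence_no, vector, second_set):
        """Record the association, then update the second vector set."""
        self.associations[(doc_id, sentence_no)] = tuple(vector)
        second_set.append((tuple(vector), doc_id, sentence_no))


index = VectorIndex()
second_set = []  # the second vector set determined for this sentence vector
doc_id = index.allocate_document_id("text of the document to be retrieved")
index.add_sentence(doc_id, 0, [0.1, 0.9], second_set)
print(second_set)  # [((0.1, 0.9), 0, 0)]
```

Storing the document index number alongside each vector lets a later retrieval go from a matched sentence vector back to its document.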
As an embodiment of the present application, the second vector set determining unit 503 includes:
a distance calculating subunit, configured to calculate a distance between the sentence vector to be retrieved and a center vector of the first vector set;
an interval determining subunit, configured to determine, among at least two preset distance intervals associated with the first vector set, a target distance interval in which the distance falls;
and a second category determining subunit, configured to determine a second vector set associated with the target distance interval as a second vector set to which the sentence vector to be retrieved belongs.
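The three subunits above amount to a distance calculation followed by an interval lookup. The sketch below assumes Euclidean distance and represents the preset distance intervals by their ascending inner boundaries, so that n - 1 boundaries describe n intervals; both choices, and the function name `second_set_index`, are assumptions for illustration.

```python
import math
from bisect import bisect_right


def second_set_index(query_vec, center_vec, boundaries):
    """Return the index of the distance interval (and hence of the second
    vector set) into which the query's distance to the center falls."""
    distance = math.sqrt(sum((q - c) ** 2
                             for q, c in zip(query_vec, center_vec)))
    # Interval i covers [boundaries[i - 1], boundaries[i]); the last
    # interval is open-ended.
    return bisect_right(boundaries, distance)


center = (0.0, 0.0)
boundaries = [0.5, 1.0]  # three intervals: [0, 0.5), [0.5, 1.0), [1.0, inf)
print(second_set_index((0.3, 0.0), center, boundaries))  # 0
print(second_set_index((0.0, 0.7), center, boundaries))  # 1
print(second_set_index((3.0, 4.0), center, boundaries))  # 2
```

Because the boundaries are sorted, the binary search finds the target interval without comparing the query against any sentence vector in the first vector set.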
In an embodiment of the present application, the document retrieval device 5 further includes:
an inter-vector distance calculation unit configured to calculate inter-vector distances between respective sentence vectors of the first vector set and a center vector of the first vector set, respectively;
and the interval division unit is used for dividing the range up to the maximum calculated inter-vector distance into intervals according to a preset inter-vector distance threshold, so as to obtain at least two preset distance intervals associated with the first vector set.
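One possible reading of this pre-processing step is sketched below: the center vector is the average of the first vector set, the largest distance to the center is found, and the range from zero to that maximum is cut into steps of the preset threshold. Euclidean distance, the rounding up of the interval count, and the function name `divide_intervals` are assumptions.

```python
import math


def divide_intervals(sentence_vectors, threshold):
    """Return (lower, upper) distance intervals for one first vector set."""
    n = len(sentence_vectors)
    center = [sum(col) / n for col in zip(*sentence_vectors)]
    distances = [
        math.sqrt(sum((x - c) ** 2 for x, c in zip(vec, center)))
        for vec in sentence_vectors
    ]
    max_distance = max(distances)
    # At least two intervals are produced, matching the embodiment.
    count = max(2, math.ceil(max_distance / threshold))
    return [(i * threshold, (i + 1) * threshold) for i in range(count)]


first_vector_set = [(0, 0), (2, 0), (4, 0), (10, 0)]  # center is (4, 0)
print(divide_intervals(first_vector_set, threshold=2.5))
# [(0.0, 2.5), (2.5, 5.0), (5.0, 7.5)]
```

Here the largest distance to the center is 6, so three intervals of width 2.5 are needed to cover it.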
As an embodiment of the present application, the search result output unit 505 includes:
the index number acquisition subunit is used for acquiring the document index number associated with the target sentence vector;
the target document searching subunit is used for searching a preset database for the document pointed to by the document index number;
and the document output subunit is used for outputting the found document.
As an embodiment of the present application, the target sentence vector determining unit 504 includes:
a sentence vector sorting subunit, configured to sort, according to a sequence from high to low of a result of similarity calculation, each sentence vector of the second vector set;
and the target sentence vector determining subunit is used for determining a preset number of sentence vectors sequenced in front as target sentence vectors.
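Sorting by similarity and keeping a preset number of top-ranked sentence vectors can be sketched as follows. Cosine similarity is assumed as the similarity measure, and `top_k_sentence_vectors` is a hypothetical helper name; `heapq.nlargest` returns the selected items in descending order of the key.

```python
import heapq
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k_sentence_vectors(query_vec, second_set, k):
    """Return the k sentence vectors most similar to the query,
    ordered from highest to lowest similarity."""
    return heapq.nlargest(
        k, second_set, key=lambda vec: cosine_similarity(query_vec, vec)
    )


query = (1.0, 0.0)
second_set = [(0.0, 1.0), (1.0, 1.0), (2.0, 0.1)]
print(top_k_sentence_vectors(query, second_set, k=2))
# [(2.0, 0.1), (1.0, 1.0)]
```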
As an embodiment of the present application, the clustering model is stored in a block chain.
In the embodiment of the application, the document retrieval device establishes a vector space in advance based on the content stored in the database. The vector space stores the association relationship among the sentence vector generated from a sentence, the sentence index number of the sentence, and the document index number of the document to which the sentence belongs, so that the document retrieval device can uniquely determine a specific document and a specific sentence in the database based on a given document index number and a given sentence index number. In the vector space, the sentence vectors are structurally divided in advance: a plurality of first vector sets are configured, and a plurality of second vector sets are arranged under each first vector set. On this basis, the document retrieval device can perform semantic retrieval in units of the sentence to be retrieved. During retrieval, the document retrieval device does not traverse all the content stored in the database. Instead, it first determines the first vector set that best matches the sentence to be retrieved, which realizes a preliminary retrieval; it then searches that first vector set for the best-matching second vector set, which realizes a deeper retrieval; and it finally searches that second vector set for the best-matching sentence vectors to obtain the final retrieval result. This process narrows the scope of semantic retrieval, reduces its computation, and improves its efficiency.
It should be noted that, because the contents of information interaction, execution process, and the like between the above units are based on the same concept, specific functions and technical effects thereof may be referred to in the method embodiment section, and are not described herein again.
Fig. 6 is a schematic structural diagram of an electronic device according to another embodiment of the present application. As shown in fig. 6, the electronic apparatus 6 of this embodiment includes: a processor 61, a memory 62 and a computer program 63, such as a program for a document retrieval method, stored in the memory 62 and executable on the processor 61. The processor 61 implements the steps in the embodiments of the document retrieval methods described above, such as steps 201 to 205 shown in fig. 2, when executing the computer program 63 described above. Alternatively, when the processor 61 executes the computer program 63, the functions of the units in the embodiment corresponding to fig. 5, for example, the functions of the units 501 to 505 shown in fig. 5, are implemented, and please refer to the related description in the embodiment corresponding to fig. 5, which is not described herein again.
Illustratively, the computer program 63 may be divided into one or more units, which are stored in the memory 62 and executed by the processor 61 to complete the present application. The one or more units may be a series of computer program instruction segments capable of performing specific functions, which are used to describe the execution process of the computer program 63 in the electronic device 6. For example, the computer program 63 may be divided into a sentence vector generation unit to be retrieved, a first vector set determination unit, a second vector set determination unit, a target sentence vector determination unit, and a retrieval result output unit, each of which functions as described above.
The electronic device may include, but is not limited to, a processor 61 and a memory 62. It will be appreciated by those skilled in the art that fig. 6 is merely an example of the electronic device 6 and does not constitute a limitation of the electronic device 6, which may include more or fewer components than those shown, combine some components, or use different components; for example, the electronic device may also include input-output devices, network access devices, buses, etc.
The processor 61 may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 62 may be an internal storage unit of the electronic device 6, such as a hard disk or a memory of the electronic device 6. The memory 62 may also be an external storage device of the electronic device 6, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash memory card (Flash Card) provided on the electronic device 6. Further, the memory 62 may include both an internal storage unit and an external storage device of the electronic device 6. The memory 62 is used to store the computer program and other programs and data required by the electronic device. The memory 62 may also be used to temporarily store data that has been output or is to be output.
As can be seen from the above, in this embodiment, the electronic device establishes a vector space in advance based on the content stored in the database. The vector space stores the association relationship among the sentence vector generated from a sentence, the sentence index number of the sentence, and the document index number of the document to which the sentence belongs, so that the electronic device can uniquely determine a specific document and a specific sentence in the database based on a given document index number and a given sentence index number. In the vector space, the sentence vectors are structurally divided in advance: a plurality of first vector sets are configured, and a plurality of second vector sets are arranged under each first vector set. On this basis, the electronic device can perform semantic retrieval in units of the sentence to be retrieved. During retrieval, the electronic device does not traverse all the content stored in the database. Instead, it first determines the first vector set that best matches the sentence to be retrieved, which realizes a preliminary retrieval; it then searches that first vector set for the best-matching second vector set, which realizes a deeper retrieval; and it finally searches that second vector set for the best-matching sentence vectors to obtain the final retrieval result. This process narrows the scope of semantic retrieval, reduces its computation, and improves its efficiency.
The embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program can implement the steps in the embodiments of the document retrieval method.
The embodiment of the present application further provides a computer program product which, when run on an electronic device, causes the electronic device to implement the steps in the above document retrieval method embodiments.
The above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present application and are intended to be included within the scope of the present application.

Claims (10)

1. A document retrieval method, comprising:
generating a sentence vector to be retrieved based on the sentence to be retrieved;
determining a first vector set to which the sentence vector to be retrieved belongs through a preset clustering model, wherein the first vector set comprises sentence vectors, and the sentence vectors are used for pointing to documents in a preset database;
determining a second vector set to which the sentence vector to be retrieved belongs according to the distance between the sentence vector to be retrieved and a central vector of the first vector set, wherein the central vector is an average value of the sentence vectors of the first vector set, and the second vector set is a proper subset of the first vector set;
performing similarity calculation between the sentence vector to be retrieved and each sentence vector of the second vector set, so as to determine a target sentence vector in the second vector set;
and outputting the document pointed by the target sentence vector.
2. The document retrieval method of claim 1, wherein the clustering model is stored in a blockchain; before generating a sentence vector to be retrieved based on a sentence to be retrieved, the document retrieval method further includes:
performing sentence division processing on a document to be retrieved to obtain at least one sentence forming the document to be retrieved;
and sequentially determining any sentence in the at least one sentence as a sentence to be retrieved.
3. The document retrieval method according to claim 2, wherein after the determining of the second vector set to which the sentence vector to be retrieved belongs, the document retrieval method further comprises:
allocating an idle document index number to the document to be retrieved;
establishing an association relationship between the document index number allocated to the document to be retrieved and the sentence vector to be retrieved;
and updating the sentence vector to be retrieved to the second vector set based on the association relationship.
4. The document retrieval method according to claim 1, wherein the determining the second vector set to which the sentence vector to be retrieved belongs according to the distance between the sentence vector to be retrieved and the central vector of the first vector set comprises:
calculating the distance between the sentence vector to be retrieved and the central vector of the first vector set;
determining a target distance interval in which the distance falls in at least two preset distance intervals associated with the first vector set;
and determining a second vector set associated with the target distance interval as a second vector set to which the sentence vector to be retrieved belongs.
5. The document retrieval method according to claim 4, wherein before the determining of the second vector set to which the sentence vector to be retrieved belongs according to the distance between the sentence vector to be retrieved and the center vector of the first vector set, the document retrieval method further comprises:
respectively calculating the inter-vector distance between each sentence vector of the first vector set and the central vector of the first vector set;
and dividing the range up to the maximum calculated inter-vector distance into intervals according to a preset inter-vector distance threshold, so as to obtain at least two preset distance intervals associated with the first vector set.
6. The document retrieval method of claim 1, wherein the outputting the document to which the target sentence vector points comprises:
acquiring a document index number associated with the target sentence vector;
searching a document pointed by the document index number in a preset database;
and outputting the searched documents.
7. The document retrieval method of claim 1, wherein the performing similarity calculation between the sentence vector to be retrieved and each sentence vector of the second vector set to determine a target sentence vector in the second vector set comprises:
sequencing each sentence vector of the second vector set according to the sequence of similarity calculation results from high to low;
and determining a preset number of sentence vectors ranked at the top as target sentence vectors.
8. A document retrieval apparatus, comprising:
the sentence vector generating unit is used for generating a sentence vector to be retrieved based on the sentence to be retrieved;
the first vector set determining unit is used for determining a first vector set to which the sentence vector to be retrieved belongs through a preset clustering model, wherein the first vector set comprises sentence vectors, and the sentence vectors are used for pointing to documents in a preset database;
a second vector set determining unit, configured to determine, according to a distance between the sentence vector to be retrieved and a center vector of the first vector set, a second vector set to which the sentence vector to be retrieved belongs, where the center vector is an average of the sentence vectors of the first vector set, and the second vector set is a proper subset of the first vector set;
a target sentence vector determining unit, configured to perform similarity calculation on the to-be-retrieved sentence vector and each sentence vector in the second vector set, so as to determine a target sentence vector in the second vector set;
and the retrieval result output unit is used for outputting the document pointed by the target sentence vector.
9. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 7 are implemented when the computer program is executed by the processor.
10. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
CN202010617097.6A 2020-06-30 2020-06-30 Document retrieval method, document retrieval device, electronic equipment and storage medium Pending CN111767373A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010617097.6A CN111767373A (en) 2020-06-30 2020-06-30 Document retrieval method, document retrieval device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010617097.6A CN111767373A (en) 2020-06-30 2020-06-30 Document retrieval method, document retrieval device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111767373A true CN111767373A (en) 2020-10-13

Family

ID=72724339

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010617097.6A Pending CN111767373A (en) 2020-06-30 2020-06-30 Document retrieval method, document retrieval device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111767373A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000112953A (en) * 1998-09-30 2000-04-21 Fujitsu Kiden Ltd Literature retrieval method and its system
CN103268340A (en) * 2013-05-21 2013-08-28 龚如宾 Format reflowable file establishing and drawing method based on hierarchical index
CN103927340A (en) * 2014-03-27 2014-07-16 中国科学院信息工程研究所 Ciphertext retrieval method
CN110598078A (en) * 2019-09-11 2019-12-20 京东数字科技控股有限公司 Data retrieval method and device, computer-readable storage medium and electronic device
CN110874417A (en) * 2018-09-04 2020-03-10 华为技术有限公司 Data retrieval method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination