CN115146027A - Text vectorization storage and retrieval method, device and computer equipment


Info

Publication number
CN115146027A
Authority
CN
China
Prior art keywords
vectors
vector
candidate
retrieved
candidate word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210606018.0A
Other languages
Chinese (zh)
Inventor
黄凯
毛宇
林昊
徐伟
邬稳
朱煜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Merchants Union Consumer Finance Co Ltd
Original Assignee
Merchants Union Consumer Finance Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Merchants Union Consumer Finance Co Ltd filed Critical Merchants Union Consumer Finance Co Ltd
Priority claimed from application CN202210606018.0A
Publication of CN115146027A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33: Querying
    • G06F 16/3331: Query processing
    • G06F 16/334: Query execution
    • G06F 16/3344: Query execution using natural language analysis
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/901: Indexing; Data structures therefor; Storage structures
    • G06F 16/9024: Graphs; Linked lists
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00: Handling natural language data
    • G06F 40/20: Natural language analysis
    • G06F 40/279: Recognition of textual entities
    • G06F 40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text vectorization retrieval method, apparatus, computer device, storage medium, and computer program product. The method comprises the following steps: acquiring a text set to be retrieved, and preprocessing each element in the text set to obtain a keyword set to be retrieved; vectorizing each element in the keyword set to obtain a corresponding word vector to be retrieved; acquiring a plurality of top-layer vectors of a graph structure, determining the similarity between the top-layer vectors and the word vector to be retrieved, and thereby determining a target vector; searching the graph structure from the top layer to the bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be retrieved; and acquiring the candidate text corresponding to each of the candidate word vectors and outputting the candidate texts as the retrieval result. By adopting a hierarchical graph structure, the retrieval complexity is reduced to logarithmic order, the retrieval time is greatly shortened, and the retrieval efficiency is improved.

Description

Text vectorization storage and retrieval method, device and computer equipment
Technical Field
The present application relates to the field of multimedia retrieval technologies, and in particular, to a text vectorization storage and retrieval method, apparatus, computer device, storage medium, and computer program product.
Background
With the development of internet technology, demand for text retrieval in scenarios such as intelligent customer service and synonym search has grown steadily.
Traditional text retrieval methods adopt schemes such as BM25 plus semantic matching, or inverted index plus BM25, for pipelined retrieval and recall; however, their retrieval time grows linearly with the size of the knowledge base, so response times over a massive knowledge base are long.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a text vectorization retrieval method, apparatus, computer device, computer-readable storage medium, and computer program product capable of improving retrieval efficiency.
In a first aspect, the present application provides a text vectorization retrieval method. The method comprises the following steps:
acquiring a text set to be retrieved, and preprocessing each element in the text set to be retrieved to obtain a keyword set to be retrieved;
vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved;
acquiring a plurality of top vectors of a graph structure, determining the similarity between the top vectors and a word vector to be retrieved, and determining a target vector;
based on the target vector, searching the graph structure from the top layer to the bottom layer to obtain a plurality of candidate word vectors corresponding to the word vector to be searched;
and acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text as a retrieval result.
In one embodiment, obtaining a plurality of top-level vectors of a graph structure, determining similarity between the top-level vectors and a to-be-retrieved word vector, and determining a target vector includes:
determining, among the plurality of top-level vectors, the top-level vector with the shortest distance to the word vector to be retrieved as the target vector.
In one embodiment, retrieving a graph structure from a top layer to a bottom layer based on a target vector to obtain a plurality of candidate word vectors corresponding to a word vector to be retrieved, includes:
and acquiring a plurality of candidate word vectors which are directly or indirectly connected with the target vector in the graph structure based on the target vector.
In one embodiment, the method includes:
acquiring a candidate text set, and preprocessing each element in the candidate text set to obtain a candidate keyword set;
vectorizing each element in the candidate keyword set to obtain a candidate word vector corresponding to each element in the candidate keyword set;
constructing a graph structure corresponding to a plurality of candidate word vectors; the graph structure corresponding to each candidate word vector comprises a plurality of layer structures, and each layer structure comprises at least one vector node; each layer structure is an undirected graph.
In one embodiment, constructing a graph structure corresponding to a plurality of candidate word vectors includes:
determining a mapping relation between the candidate keywords and the candidate word vectors based on the candidate word vectors corresponding to each element in the candidate keyword set;
selecting any plurality of candidate word vectors and constructing a bottom-layer structure; among the plurality of candidate word vectors in the bottom-layer structure, promoting each candidate word vector to the layer above with a preset probability to construct the previous layer structure;
and when the number of candidate word vectors in a layer structure is smaller than a preset number threshold, taking that layer structure as the top-layer structure, wherein the plurality of preset word vectors in the top-layer structure are the top-level vectors.
In one embodiment, selecting any of a plurality of candidate word vectors and constructing an underlying structure includes:
selecting any plurality of candidate word vectors as first vectors, and constructing edges corresponding to the first vectors;
based on the first vectors, searching for a plurality of second vectors whose distance to the first vectors is less than or equal to a distance threshold, and constructing edges corresponding to the second vectors;
and connecting the edge corresponding to the first vector with the edge corresponding to the second vector to obtain a bottom layer diagram.
In a second aspect, the present application also provides a text vectorization retrieval apparatus. The device comprises:
the system comprises a preprocessing module, a searching module and a searching module, wherein the preprocessing module is used for acquiring a text set to be searched and preprocessing each element in the text set to be searched to obtain a keyword set to be searched;
the vectorization processing module is used for vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved;
the target vector determining module is used for acquiring a plurality of top-level vectors of the graph structure, determining the similarity between the top-level vectors and the vector of the word to be retrieved and determining a target vector;
the candidate word vector acquisition module is used for searching the graph structure from the top layer to the bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be searched;
and the candidate text acquisition module is used for acquiring a candidate text corresponding to each candidate word vector in the multiple candidate word vectors and outputting the candidate text as a retrieval result.
In a third aspect, the present application also provides a computer device. The computer device comprises a memory storing a computer program and a processor implementing the steps of the method according to any of the embodiments described above when executing the computer program.
In a fourth aspect, the present application further provides a computer-readable storage medium. The computer-readable storage medium has stored thereon a computer program which, when executed by a processor, implements the steps of the method of any of the above embodiments.
In a fifth aspect, the present application further provides a computer program product. The computer program product comprises a computer program which, when being executed by a processor, carries out the steps of the method according to any of the embodiments described above.
According to the text vectorization retrieval method, apparatus, computer device, storage medium, and computer program product, a text set to be retrieved is acquired and each of its elements is preprocessed to obtain a keyword set to be retrieved; each element in the keyword set is then vectorized to obtain a corresponding word vector to be retrieved. Further, a plurality of top-layer vectors of the graph structure are acquired, the similarity between the top-layer vectors and the word vector to be retrieved is determined, and a target vector is determined; the graph structure is then searched from the top layer to the bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be retrieved. Finally, the candidate text corresponding to each candidate word vector is acquired and output as the retrieval result. Because the target vector is determined from the word vector to be retrieved and the hierarchical graph structure is then searched from top to bottom, the retrieval complexity is reduced to logarithmic order, the retrieval time is greatly shortened, and the retrieval efficiency is improved.
Drawings
FIG. 1 is a diagram of an exemplary environment in which a text-vectorized search method may be implemented;
FIG. 2 is a flowchart illustrating a text vectorization retrieval method according to an embodiment;
FIG. 3 is a schematic flow chart diagram illustrating a similarity value determination method in one embodiment;
FIG. 4 is a schematic flow chart diagram of the text vectorized storage and retrieval step in one embodiment;
FIG. 5 is a schematic diagram of a diagram structure in another embodiment;
FIG. 6 is a block diagram showing the structure of a text vectorization retrieval apparatus according to an embodiment;
FIG. 7 is a diagram illustrating an internal structure of a computer device according to an embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the present application and are not intended to limit the present application.
The text vectorization retrieval method provided by the embodiment of the application can be applied to the application environment shown in fig. 1. Wherein the terminal 102 communicates with the server 104 via a network. The data storage system may store data that the server 104 needs to process. The data storage system may be integrated on the server 104, or may be located on the cloud or other network server. The server 104 may provide the terminal 102 with an environment for text vectorized retrieval. The server 104 acquires a text set to be retrieved through the terminal 102, and preprocesses each element in the text set to be retrieved to obtain a keyword set to be retrieved; then, vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved; and then acquiring a plurality of top vectors of the graph structure, determining the similarity between the top vectors and the vector of the word to be retrieved, and determining a target vector. Further, the server 104 performs top-to-bottom retrieval on the graph structure based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be retrieved; and then, acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text serving as a search result to the terminal 102.
The terminal 102 may be, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, internet of things devices, and portable wearable devices, and the internet of things devices may be smart speakers, smart televisions, smart air conditioners, smart car-mounted devices, and the like. The portable wearable device can be a smart watch, a smart bracelet, a head-mounted device, and the like. The server 104 may be implemented as a stand-alone server or as a server cluster comprised of multiple servers.
The text vectorization retrieval method provided by the embodiments of the application can be applied to a server or a client alone, or to a system comprising both a client and a server, where it is implemented through interaction between the client and the server.
In one embodiment, as shown in fig. 2, a text vectorization retrieval method is provided, which is described by taking an example that the method is applied to a system implementation comprising a client and a server, and comprises the following steps 202 to 210.
Step 202, a text set to be retrieved is obtained, and each element in the text set to be retrieved is preprocessed to obtain a keyword set to be retrieved.
In this embodiment, as shown in fig. 3, the server obtains the text to be retrieved through the terminal; the text may be entered by the user through an input device (or converted from voice, video, or the like). The input device of the terminal may be a touch layer covering the display screen, a key, a trackball, or a touch pad arranged on the terminal, or an external keyboard, touch pad, or mouse.
In this embodiment, the server preprocesses the text to be retrieved by removing stop words and normalizing inconsistent capitalization, obtaining a case-consistent, stop-word-free text ready for splitting. The stop words are the punctuation marks, special symbols, and preset invalid words in the stop-word lexicon.
In this embodiment, the stop-word lexicon may be preset, manually modified by a user, or automatically updated by the system.
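The preprocessing described above can be sketched as follows; the stop-word list here is a hypothetical English stand-in for the patent's stop-word lexicon of punctuation marks, special symbols, and preset invalid words:

```python
import re

# Hypothetical stop-word list standing in for the stop-word lexicon.
STOP_WORDS = {"the", "a", "is", "?", "!", ","}

def preprocess(text):
    """Normalize case, split into tokens, and drop stop words,
    yielding the keyword set for one text to be retrieved."""
    text = text.lower()                         # consistent capitalization
    tokens = re.findall(r"\w+|[^\w\s]", text)   # words and punctuation marks
    return [t for t in tokens if t not in STOP_WORDS]
```

For example, `preprocess("Is the Server UP?")` keeps only the case-normalized content words.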
And 204, vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved.
In this embodiment, the server converts each element in the keyword set to be retrieved into a word vector, so as to obtain a word vector set to be retrieved corresponding to the keyword set to be retrieved.
In this embodiment, the length of each word vector in the set of word vectors to be retrieved may be made consistent. For example, the texts in the keyword set to be retrieved are token-encoded against a word-vector vocabulary: "who are you?" and "which one are you?" map to the id sequences [872, 3221, 6443] and [872, 3221, 1525, 855] respectively, and each sequence is then right-padded with 0 to a fixed length.
In this embodiment, as shown in fig. 3, the server may employ a bert-siamese matching model whose fixed input length may be set to 20. For example, when the elements in the keyword set to be retrieved map to the ids [872, 3221, 6443] and [872, 3221, 1525, 855], the server pads each sequence with 0 to the fixed length and adds the [CLS] and [SEP] codes to fit the input format of the bert-siamese matching model, obtaining the intermediate vectors [101, 872, 3221, 6443, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] and [101, 872, 3221, 1525, 855, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], each of length 20, where 101 is the [CLS] code and 102 is the [SEP] code.
In this embodiment, as shown in fig. 3, the server feeds each intermediate vector into the bert-siamese matching model, performs multi-level encoding, and finally obtains the vector encoding (i.e., the word vector to be retrieved): the first intermediate vector yields vector1, and the second yields vector2.
In this embodiment, by encoding the intermediate vector, the server can produce either a one-dimensional or a multi-dimensional vector representation.
In this embodiment, the server may convert elements in the keyword set to be retrieved into a multidimensional vector, which may reduce information loss (e.g., location information) in a text information vectorization process.
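The [CLS]/[SEP] wrapping and zero-padding described above can be sketched as follows, using the id values 101 ([CLS]), 102 ([SEP]), pad value 0, and fixed length 20 given in the description:

```python
CLS, SEP, PAD, MAX_LEN = 101, 102, 0, 20  # values from the description above

def to_model_input(token_ids):
    """Wrap token ids with [CLS]/[SEP] and right-pad with 0 to MAX_LEN,
    producing the intermediate vector fed to the matching model."""
    ids = [CLS] + list(token_ids[: MAX_LEN - 2]) + [SEP]
    return ids + [PAD] * (MAX_LEN - len(ids))
```

With the example ids from the description, `to_model_input([872, 3221, 6443])` reproduces the first intermediate vector above.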
Step 206, as shown in fig. 4, a plurality of top-level vectors of the graph structure are obtained, the similarity between the plurality of top-level vectors and the vector of the word to be retrieved is determined, and the target vector is determined.
In this embodiment, the lengths of the top-level vectors in the graph structure are the same, and the lengths of the top-level vectors are also the same as the length of the word vector to be retrieved.
In this embodiment, the server obtains a plurality of top-level vectors in the graph structure, compares each top-level vector with a plurality of to-be-retrieved word vectors, and determines a distance (i.e., a similarity) between each top-level vector and each to-be-retrieved word vector.
In this embodiment, the server may use the top-level vector with the highest similarity to the vector of the word to be retrieved as the target vector.
And step 208, searching the graph structure from the top layer to the bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be searched.
In this embodiment, the server may obtain, based on the target vector, all candidate word vectors in the graph structure corresponding to the word vector to be retrieved.
In another embodiment, after obtaining all candidate word vectors directly or indirectly connected to the target vector in the graph structure, the server may also screen out the top N candidate word vectors by similarity, where N is a natural number.
Step 210, obtaining a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text as a retrieval result.
In this embodiment, the server may output the search result to the terminal, and may display the search result on the terminal in a manner of voice, text, video, or the like.
In the text vectorization retrieval method, each element in a text set to be retrieved is preprocessed by acquiring the text set to be retrieved, so that a keyword set to be retrieved is obtained; and then vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved. Further, acquiring a plurality of top vectors of the graph structure, determining the similarity between the top vectors and the vector of the word to be retrieved, and determining a target vector; and searching the graph structure from the top layer to the bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be searched. And finally, acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text as a retrieval result. According to the method and the device, the target vector is determined based on the word vector to be retrieved, then the graph structure is retrieved from the top layer to the bottom layer based on the target vector, the hierarchical graph structure is adopted, the retrieval complexity is reduced to the logarithmic level, the retrieval time is greatly reduced, and the retrieval efficiency is improved.
In some embodiments, obtaining a plurality of top-level vectors of a graph structure, determining similarity between the plurality of top-level vectors and a to-be-retrieved word vector, and determining a target vector includes: and determining the top-level vector with the shortest distance to the word to be searched in the top-level vectors as a target vector.
In this embodiment, as shown in fig. 4, the server may calculate the similarity value between a top-level vector and the word vector to be retrieved using the cosine distance, whose value lies in the interval [0, 1]. For example, for a top-level vector [vector1] and a word vector to be retrieved [vector2], the server computes their cosine distance to obtain [vector1, vector2, label], where label is the similarity score in the interval [0, 1]; the larger the value, the shorter the distance between the top-level vector and the word vector to be retrieved, that is, the higher the similarity.
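One plausible reading of the [0, 1] similarity label is cosine similarity rescaled from [-1, 1] into [0, 1]; the (1 + cos)/2 rescaling below is an assumption for illustration, not stated in the patent:

```python
import math

def cosine_label(u, v):
    """Similarity label in [0, 1]: 1 means identical direction, 0 opposite.
    The (1 + cos) / 2 rescaling is an assumed mapping onto the patent's
    stated [0, 1] interval."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return (1 + dot / (nu * nv)) / 2
```

Parallel vectors score 1.0, orthogonal vectors 0.5, and opposite vectors 0.0 under this mapping.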
In some embodiments, retrieving the graph structure from a top layer to a bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be retrieved, includes: and acquiring a plurality of candidate word vectors which are directly or indirectly connected with the target vector in the graph structure based on the target vector.
In this implementation, the server may obtain, based on the target vector, a plurality of candidate word vectors in the graph structure that are directly or indirectly connected to the target vector.
In some exemplary embodiments, the server searches from the top layer of the graph structure and may use a greedy search to find the m neighboring vectors of the target vector in the top-layer structure that may connect to the next layer: [a1, a2, ..., am]. Further, the server may calculate the similarity between the word vector to be retrieved and each vector in [a1, a2, ..., am], and determine the closest of them as the nearest-neighbor node. The calculation can be flexibly configured, e.g., cosine distance or Euclidean distance, or an external neural network can be used for more complex scoring.
In this embodiment, the server may jump to the next layer based on the nearest neighbor node, perform the same search on the next layer, repeat the above steps until the lowest layer, and stop the search.
In this embodiment, the server may obtain all the nearest neighboring vectors as a plurality of candidate word vectors directly or indirectly connected to the target vector.
In another embodiment, based on the candidate word vectors directly or indirectly connected to the target vector in the graph structure, the server may also screen out the top N candidate word vectors during the top-to-bottom retrieval, where N is a natural number.
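The greedy top-down descent described in these paragraphs can be sketched as follows; the node ids, the squared-Euclidean metric, and the toy graph are illustrative assumptions (the patent also allows cosine distance or an external neural scorer):

```python
def layered_search(query, entry, layers, vecs):
    """Greedy top-down descent over a layered undirected graph.
    layers: list of adjacency dicts (node id -> neighbor ids), index 0 = bottom;
    vecs: node id -> vector. At each layer, hop to the neighbor closest to
    the query until no neighbor improves, then drop to the layer below."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    cur = entry
    for adj in reversed(layers):              # start at the top layer
        improved = True
        while improved:
            improved = False
            for nb in adj.get(cur, []):
                if d(query, vecs[nb]) < d(query, vecs[cur]):
                    cur, improved = nb, True  # greedy hop to closer neighbor
    return cur                                # nearest node found at the bottom
```

A small two-layer graph suffices to see the descent: the top layer routes coarsely, and the bottom layer refines to the true nearest neighbor.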
In some embodiments, the above method comprises: acquiring a candidate text set, and preprocessing each element in the candidate text set to obtain a candidate keyword set; vectorizing each element in the candidate keyword set to obtain a candidate word vector corresponding to each element in the candidate keyword set; constructing a graph structure corresponding to a plurality of candidate word vectors; the graph structure corresponding to each candidate word vector comprises a plurality of layer structures, and each layer structure comprises at least one vector node; each layer structure is an undirected graph.
In this embodiment, the stop-word lexicons used by the server in preprocessing the text set to be retrieved and the candidate text set may be the same, that is, the stop words used in preprocessing both sets are consistent.
In another embodiment, the stop words used in preprocessing the text set to be retrieved belong to a first stop-word lexicon, and those used in preprocessing the candidate text set belong to a second stop-word lexicon. The punctuation marks, special symbols, and preset invalid words in the first and second stop-word lexicons may be the same or different.
In this embodiment, the length of each candidate word vector is the same as the length of the word vector to be retrieved.
In some embodiments, constructing a graph structure corresponding to the plurality of candidate word vectors includes: determining a mapping relation between the candidate keywords and the candidate word vectors based on the candidate word vector corresponding to each element in the candidate keyword set; selecting any plurality of candidate word vectors and constructing a bottom-layer structure; among the plurality of candidate word vectors in the bottom-layer structure, promoting each candidate word vector to the layer above with a preset probability to construct the previous layer structure; and when the number of candidate word vectors in a layer structure is smaller than a preset number threshold, taking that layer structure as the top-layer structure, wherein the plurality of preset word vectors in the top-layer structure are the top-level vectors.
In this embodiment, the server may preprocess the candidate text set to obtain candidate keywords, and send the plurality of candidate keywords to the vector-encoding model to obtain the vector representation of the candidate text set.
In this embodiment, the mapping relationship between the candidate keyword and the candidate word vector may be represented in the form of a table, a dictionary, or the like.
In this embodiment, as shown in fig. 5, all candidate word vectors in the bottom-layer structure (layer = 0), together with the candidate word vectors directly or indirectly connected to them, fully cover all candidate word vectors corresponding to the candidate text set. For example, as shown in fig. 5, the graph structure includes three layer structures: a bottom-layer structure (layer = 0), a first-layer structure (layer = 1), and a second-layer structure (layer = 2), where the second-layer structure is the top-layer structure, and the vectors in each layer structure are directly or indirectly connected with the vectors in the layer below.
In this embodiment, the number of layer structures may be determined by the preset probability of promoting each candidate word vector to the layer above and by the total number of candidate word vectors corresponding to the candidate text set.
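A minimal simulation of how the promotion probability and threshold determine the number of layers; the probability p and the top-layer threshold below are illustrative values, not taken from the patent:

```python
import random

def build_layer_counts(n_vectors, p=0.5, top_threshold=4, seed=0):
    """Promote each vector in the current layer upward with probability p
    until a layer holds fewer than top_threshold vectors (the top layer).
    Returns the number of vectors per layer, bottom first."""
    rng = random.Random(seed)
    counts = [n_vectors]                       # bottom layer holds everything
    while counts[-1] >= top_threshold:
        promoted = sum(1 for _ in range(counts[-1]) if rng.random() < p)
        counts.append(promoted)                # next layer up
    return counts
```

With p = 0.5 each layer holds roughly half of the one below, so the layer count grows logarithmically in the number of vectors, which is what bounds the retrieval complexity.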
In some possible embodiments, selecting any plurality of candidate word vectors and constructing the bottom-layer structure includes: selecting any plurality of candidate word vectors as first vectors, and constructing edges corresponding to the first vectors; based on the first vectors, searching for a plurality of second vectors whose distance to the first vectors is less than or equal to a distance threshold, and constructing edges corresponding to the second vectors; and connecting the edges corresponding to the first vectors with the edges corresponding to the second vectors to obtain the bottom-layer graph.
In this embodiment, the server may use a greedy algorithm to find a plurality of neighboring vectors (i.e., the second vectors) of each first vector, and establish edge connections between the first vectors and the second vectors to form the bottom-layer structure.
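A simplified sketch of the bottom-layer construction: the patent finds second vectors greedily, whereas this illustration pairs candidate vectors exhaustively under a squared-Euclidean distance threshold (the names and the metric are assumptions):

```python
def build_bottom_layer(vecs, dist_threshold):
    """Undirected bottom-layer graph: connect every pair of candidate word
    vectors whose squared Euclidean distance is <= dist_threshold.
    vecs: node id -> vector; returns an adjacency dict."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    ids = list(vecs)
    adj = {i: [] for i in ids}
    for i in range(len(ids)):
        for j in range(i + 1, len(ids)):
            a, b = ids[i], ids[j]
            if d(vecs[a], vecs[b]) <= dist_threshold:
                adj[a].append(b)   # undirected edge: record both directions
                adj[b].append(a)
    return adj
```

The exhaustive pairing is O(n^2) and is only for illustration; the greedy neighbor search in the text avoids this cost at scale.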
It should be understood that, although the steps in the flowcharts of the above embodiments are displayed sequentially as indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated otherwise herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least some of the steps in those flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same moment but may be performed at different times, and whose execution order is not necessarily sequential; they may be performed in turn, or alternately with other steps or with at least part of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the present application further provides a text vectorization retrieval device for implementing the above text vectorization retrieval method. The implementation scheme for solving the problem provided by the device is similar to that described for the method, so the specific limitations in the one or more embodiments of the text vectorization retrieval device below may refer to the limitations of the text vectorization retrieval method above and are not repeated here.
In one embodiment, as shown in fig. 6, there is provided a text vectorization retrieval apparatus including: a pre-processing module 602, a vectorization processing module 604, a target vector determination module 606, a candidate word vector acquisition module 608, and a candidate text acquisition module 610, wherein:
the preprocessing module 602 is configured to obtain a text set to be retrieved, and preprocess each element in the text set to be retrieved to obtain a keyword set to be retrieved.
The vectorization processing module 604 is configured to perform vectorization processing on each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved.
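The preprocessing and vectorization performed by these two modules can be sketched as follows; the stopword list, whitespace tokenizer, and hash-based toy embedding are illustrative stand-ins, since the embodiment fixes neither a cleaning procedure nor an embedding model:

```python
import hashlib

STOPWORDS = {"the", "a", "an", "of", "to"}  # illustrative stopword list

def preprocess(texts):
    """Split each text to be retrieved into lower-cased keywords and
    drop stopwords, yielding the keyword set to be retrieved."""
    keywords = set()
    for text in texts:
        for token in text.lower().split():
            if token not in STOPWORDS:
                keywords.add(token)
    return keywords

def vectorize(keyword, dim=8):
    """Toy deterministic embedding: hash bytes mapped to [0, 1].
    A real system would use a trained word-embedding model here."""
    digest = hashlib.sha256(keyword.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]
```

Vectorizing every element of the keyword set then yields one word vector to be retrieved per keyword, as the module description states.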
The target vector determining module 606 is configured to obtain a plurality of top-level vectors of the graph structure, determine similarity between the plurality of top-level vectors and the to-be-retrieved word vector, and determine a target vector.
The candidate word vector obtaining module 608 is configured to perform top-to-bottom retrieval on the graph structure based on the target vector, and obtain multiple candidate word vectors corresponding to the word vector to be retrieved.
The candidate text obtaining module 610 is configured to obtain a candidate text corresponding to each candidate word vector in the multiple candidate word vectors, and output the candidate text as a search result.
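Taken together, the five modules implement the retrieval flow sketched below; this minimal sketch assumes Euclidean distance as the similarity measure and an adjacency map as the graph representation, neither of which is mandated by the embodiment, and all argument names are illustrative:

```python
import math

def retrieve(query_keywords, embed, top_level, adjacency, text_of):
    """End-to-end sketch of modules 602-610: embed each keyword to be
    retrieved, pick the nearest top-level vector as the target vector,
    then return the candidate texts whose vectors are connected to the
    target in the graph structure."""
    results = []
    for kw in query_keywords:
        q = embed(kw)
        # Target vector: the top-level vector closest to the query vector.
        target = min(top_level, key=lambda v: math.dist(v, q))
        # Candidate texts: texts of vectors connected to the target.
        for cand in adjacency.get(target, ()):
            results.append(text_of[cand])
    return results
```

The `embed` callable, `adjacency` map, and `text_of` map stand in for the stored embedding model, graph structure, and keyword-to-text mapping, respectively.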
In one embodiment, the target vector determination module 606 may include:
The target vector determination submodule is configured to determine, among the plurality of top-level vectors, the top-level vector with the shortest distance to the word vector to be retrieved as the target vector.
In one embodiment, the candidate word vector obtaining module 608 may include:
The candidate word vector acquisition submodule is configured to acquire, based on the target vector, a plurality of candidate word vectors directly or indirectly connected with the target vector in the graph structure.
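The direct-or-indirect connection described here amounts to a graph traversal starting from the target vector. A breadth-first sketch (the `limit` parameter is an illustrative addition for capping the candidate count, not part of the embodiment):

```python
from collections import deque

def connected_candidates(adjacency, target, limit=None):
    """Collect the candidate word vectors reachable from the target
    vector by following edges (directly or indirectly connected),
    using breadth-first search so nearer neighbors are found first."""
    seen = {target}
    order = []
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency.get(node, ()):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append(neighbour)
        if limit is not None and len(order) >= limit:
            break
    return order[:limit] if limit is not None else order
```

The adjacency map is the stored edge structure of one layer; running this from the target vector yields the plurality of candidate word vectors the submodule returns.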
In one embodiment, the apparatus further includes:
The candidate text preprocessing module is configured to acquire a candidate text set and preprocess each element in the candidate text set to obtain a candidate keyword set.
The candidate word vector generation module is configured to vectorize each element in the candidate keyword set to obtain a candidate word vector corresponding to each element in the candidate keyword set.
The graph structure construction module is configured to construct a graph structure corresponding to the plurality of candidate word vectors; the graph structure comprises a plurality of layer structures, each layer structure comprises at least one vector node, and each layer structure is an undirected graph.
In one embodiment, the graph structure building module may include:
The mapping relation determination submodule is configured to determine the mapping relation between the candidate keywords and the candidate word vectors based on the candidate word vector corresponding to each element in the candidate keyword set.
The bottom layer structure construction submodule is configured to select any plurality of candidate word vectors and construct the bottom layer structure; among the candidate word vectors in the bottom layer structure, each candidate word vector skips to the upper layer according to a preset probability so as to construct the next layer structure up.
The top layer structure determination submodule is configured to acquire the number of candidate word vectors in each layer structure and, when the number of candidate word vectors in a layer structure is smaller than a preset number threshold, take that layer structure as the top layer structure, with the plurality of preset word vectors in the top layer structure serving as the top-level vectors.
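The layer-building loop described by these submodules can be simulated as follows; the fixed random seed and the forced minimum of one promoted vector are illustrative safeguards, not part of the embodiment:

```python
import random

def build_layers(n_vectors, p_skip, top_threshold, seed=0):
    """Sketch of the layer-building loop: all candidate word vectors
    start at the bottom layer; each skips to the next layer up with
    probability p_skip; building stops when a layer holds fewer
    vectors than top_threshold, and that layer becomes the top layer.
    Returns the vector count per layer, bottom first."""
    rng = random.Random(seed)
    layers = [n_vectors]
    while layers[-1] >= top_threshold:
        promoted = sum(1 for _ in range(layers[-1]) if rng.random() < p_skip)
        if promoted == 0:
            promoted = 1  # keep at least one vector so a top layer exists
        layers.append(promoted)
    return layers
```

Here `p_skip` and `top_threshold` play the roles of the preset probability and the preset number threshold; the final entry of the returned list is the top layer structure, whose vectors serve as the top-level vectors.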
In one embodiment, the bottom layer structure construction submodule may include:
The first vector determination unit is configured to select any plurality of candidate word vectors as first vectors and construct edges corresponding to the first vectors.
The edge construction unit is configured to search, based on the plurality of first vectors, for a plurality of second vectors whose distance to the first vectors is smaller than or equal to the distance threshold, and to construct edges corresponding to the second vectors.
The edge connection unit is configured to connect the edges corresponding to the first vectors with the edges corresponding to the second vectors to obtain the bottom layer graph.
The modules in the above text vectorization retrieval device may be implemented wholly or partly by software, hardware, or a combination thereof. Each module may be embedded in, or independent of, a processor of the computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke and execute the operations corresponding to each module.
In one embodiment, a computer device is provided, which may be a server, the internal structure of which may be as shown in fig. 7. The computer device includes a processor, a memory, and a network interface connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operating system and the computer program to run on the non-volatile storage medium. The database of the computer device is used for storing the mapping relation between the candidate keywords and the candidate word vectors. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text vectorization retrieval method.
Those skilled in the art will appreciate that the structure shown in fig. 7 is merely a block diagram of part of the structure related to the solution of the present application and does not limit the computer device to which the solution is applied; a particular computer device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
In one embodiment, a computer device is provided, comprising a memory and a processor, the memory having a computer program stored therein, the processor implementing the following steps when executing the computer program: acquiring a text set to be retrieved, and preprocessing each element in the text set to be retrieved to obtain a keyword set to be retrieved; vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved; acquiring a plurality of top vectors of a graph structure, determining the similarity between the top vectors and a word vector to be retrieved, and determining a target vector; based on the target vector, searching the graph structure from the top layer to the bottom layer to obtain a plurality of candidate word vectors corresponding to the word vector to be searched; and acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text as a retrieval result.
In one embodiment, when executing the computer program, the processor further implements obtaining a plurality of top-level vectors of the graph structure, determining the similarity between the plurality of top-level vectors and the word vector to be retrieved, and determining the target vector, which may include: determining, among the plurality of top-level vectors, the top-level vector with the shortest distance to the word vector to be retrieved as the target vector.
In one embodiment, when the processor executes the computer program, the method further performs top-level to bottom-level search on the graph structure based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be searched, and the step of obtaining the candidate word vector may include: and acquiring a plurality of candidate word vectors which are directly or indirectly connected with the target vector in the graph structure based on the target vector.
In one embodiment, the processor, when executing the computer program, further performs the following steps, which may include: acquiring a candidate text set, and preprocessing each element in the candidate text set to obtain a candidate keyword set; vectorizing each element in the candidate keyword set to obtain a candidate word vector corresponding to each element in the candidate keyword set; constructing a graph structure corresponding to a plurality of candidate word vectors; the graph structure corresponding to each candidate word vector comprises a plurality of layer structures, and each layer structure comprises at least one vector node; each layer structure is an undirected graph.
In one embodiment, the processor, when executing the computer program, further implements constructing a graph structure corresponding to the plurality of candidate word vectors, which may include: determining a mapping relation between the candidate keywords and the candidate word vectors based on the candidate word vectors corresponding to each element in the candidate keyword set; selecting any plurality of candidate word vectors and constructing a bottom layer structure; in a plurality of candidate word vectors in a bottom layer structure, each candidate word vector jumps according to a preset probability and constructs a previous layer structure; and when the number of the candidate word vectors in each layer structure is smaller than a preset number threshold, taking the layer structure corresponding to the candidate word vectors with the number smaller than the preset number threshold as a top layer structure, wherein a plurality of preset word vectors in the top layer structure are top layer vectors.
In one embodiment, when executing the computer program, the processor further implements selecting any plurality of candidate word vectors to construct the bottom layer structure, which may include: selecting any plurality of candidate word vectors as first vectors, and constructing edges corresponding to the first vectors; based on the first vectors, searching for a plurality of second vectors whose distance to the first vectors is smaller than or equal to a distance threshold, and constructing edges corresponding to the second vectors; and connecting the edges corresponding to the first vectors with the edges corresponding to the second vectors to obtain the bottom layer graph.
In one embodiment, a computer-readable storage medium is provided, having a computer program stored thereon, which when executed by a processor, performs the steps of: acquiring a text set to be retrieved, and preprocessing each element in the text set to be retrieved to obtain a keyword set to be retrieved; vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved; acquiring a plurality of top vectors of a graph structure, determining the similarity between the top vectors and a word vector to be retrieved, and determining a target vector; based on the target vector, searching the graph structure from the top layer to the bottom layer to obtain a plurality of candidate word vectors corresponding to the word vector to be searched; and acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text as a retrieval result.
In one embodiment, the computer program when executed by the processor further implements obtaining a plurality of top-level vectors of the graph structure, determining similarity between the plurality of top-level vectors and the word vector to be retrieved, and determining the target vector may include: and determining a top-level vector with the shortest distance to the word vector to be searched in the top-level vectors as a target vector.
In one embodiment, when executed by the processor, the computer program further implements top-to-bottom retrieval of the graph structure based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be retrieved, which may include: and acquiring a plurality of candidate word vectors which are directly or indirectly connected with the target vector in the graph structure based on the target vector.
In one embodiment, the computer program when executed by the processor further performs the following steps, which may include: acquiring a candidate text set, and preprocessing each element in the candidate text set to obtain a candidate keyword set; vectorizing each element in the candidate keyword set to obtain a candidate word vector corresponding to each element in the candidate keyword set; constructing a graph structure corresponding to a plurality of candidate word vectors; the graph structure corresponding to each candidate word vector comprises a plurality of layer structures, and each layer structure comprises at least one vector node; each layer structure is an undirected graph.
In one embodiment, the computer program when executed by the processor further implements constructing a graph structure corresponding to a plurality of candidate word vectors, which may include: determining a mapping relation between the candidate keywords and the candidate word vectors based on the candidate word vectors corresponding to each element in the candidate keyword set; selecting any plurality of candidate word vectors and constructing a bottom layer structure; in a plurality of candidate word vectors in a bottom layer structure, each candidate word vector jumps according to a preset probability and constructs a previous layer structure; and when the number of the candidate word vectors in each layer structure is smaller than a preset number threshold, taking the layer structure corresponding to the candidate word vectors with the number smaller than the preset number threshold as a top layer structure, wherein a plurality of preset word vectors in the top layer structure are top layer vectors.
In one embodiment, when executed by a processor, the computer program further implements selecting any plurality of candidate word vectors to construct the bottom layer structure, which may include: selecting any plurality of candidate word vectors as first vectors, and constructing edges corresponding to the first vectors; based on the first vectors, searching for a plurality of second vectors whose distance to the first vectors is smaller than or equal to a distance threshold, and constructing edges corresponding to the second vectors; and connecting the edges corresponding to the first vectors with the edges corresponding to the second vectors to obtain the bottom layer graph.
In one embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, performs the steps of: acquiring a text set to be retrieved, and preprocessing each element in the text set to be retrieved to obtain a keyword set to be retrieved; vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved; acquiring a plurality of top vectors of a graph structure, determining the similarity between the top vectors and a word vector to be retrieved, and determining a target vector; based on the target vector, searching the graph structure from the top layer to the bottom layer to obtain a plurality of candidate word vectors corresponding to the word vector to be searched; and acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text as a retrieval result.
In one embodiment, the computer program when executed by the processor further implements obtaining a plurality of top-level vectors of the graph structure, determining the similarity between the plurality of top-level vectors and the word vector to be retrieved, and determining the target vector, which may include: determining, among the plurality of top-level vectors, the top-level vector with the shortest distance to the word vector to be retrieved as the target vector.
In one embodiment, when executed by the processor, the computer program further implements top-to-bottom retrieval of the graph structure based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be retrieved, which may include: acquiring a plurality of candidate word vectors which are directly or indirectly connected with the target vector in the graph structure based on the target vector.
In one embodiment, the computer program when executed by the processor further performs the following steps, which may include: acquiring a candidate text set, and preprocessing each element in the candidate text set to obtain a candidate keyword set; vectorizing each element in the candidate keyword set to obtain a candidate word vector corresponding to each element in the candidate keyword set; constructing a graph structure corresponding to a plurality of candidate word vectors; the graph structure corresponding to each candidate word vector comprises a plurality of layer structures, and each layer structure comprises at least one vector node; each layer structure is an undirected graph.
In one embodiment, the computer program when executed by the processor further implements constructing a graph structure corresponding to a plurality of candidate word vectors, which may include: determining a mapping relation between the candidate keywords and the candidate word vectors based on the candidate word vectors corresponding to each element in the candidate keyword set; selecting any plurality of candidate word vectors and constructing a bottom layer structure; in a plurality of candidate word vectors in a bottom layer structure, each candidate word vector jumps according to a preset probability and constructs a previous layer structure; and when the number of the candidate word vectors in each layer structure is smaller than a preset number threshold, taking the layer structure corresponding to the candidate word vectors with the number smaller than the preset number threshold as a top layer structure, wherein a plurality of preset word vectors in the top layer structure are top layer vectors.
In one embodiment, when executed by the processor, the computer program further implements selecting any plurality of candidate word vectors to construct the bottom layer structure, which may include: selecting any plurality of candidate word vectors as first vectors, and constructing edges corresponding to the first vectors; based on the first vectors, searching for a plurality of second vectors whose distance to the first vectors is smaller than or equal to a distance threshold, and constructing edges corresponding to the second vectors; and connecting the edges corresponding to the first vectors with the edges corresponding to the second vectors to obtain the bottom layer graph.
It should be noted that the user information (including but not limited to user device information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or fully authorized by all parties.
It will be understood by those skilled in the art that all or part of the processes in the methods of the above embodiments may be implemented by instructing the relevant hardware through a computer program, which may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the above method embodiments. Any reference to memory, databases, or other media used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (MRAM), ferroelectric random access memory (FRAM), phase change memory (PCM), graphene memory, and the like. The volatile memory may include random access memory (RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take many forms, such as static random access memory (SRAM) or dynamic random access memory (DRAM). The databases referred to in the embodiments provided herein may include at least one of relational and non-relational databases; non-relational databases may include, but are not limited to, blockchain-based distributed databases. The processors referred to in the embodiments provided herein may be, without limitation, general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic devices, or data processing logic devices based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described, but any such combination should be considered within the scope of this specification as long as it contains no contradiction.
The above embodiments express only several implementations of the present application, and their description is relatively specific and detailed, but they should not therefore be construed as limiting the scope of the patent application. It should be noted that a person skilled in the art can make several variations and improvements without departing from the concept of the present application, and these all fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the appended claims.

Claims (10)

1. A text vectorization retrieval method, the method comprising:
acquiring a text set to be retrieved, and preprocessing each element in the text set to be retrieved to obtain a keyword set to be retrieved;
vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved;
obtaining a plurality of top-level vectors of a graph structure, determining the similarity between the top-level vectors and the to-be-retrieved word vector, and determining a target vector;
based on the target vector, searching the graph structure from the top layer to the bottom layer to obtain a plurality of candidate word vectors corresponding to the word vector to be searched;
and acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors, and outputting the candidate text as a retrieval result.
2. The method according to claim 1, wherein the obtaining a plurality of top-level vectors of a graph structure, determining the similarity between the plurality of top-level vectors and the word vector to be retrieved, and determining a target vector comprises:
and determining a top-level vector with the shortest distance to the word vector to be searched in the top-level vectors as a target vector.
3. The method of claim 2, wherein the retrieving the graph structure from a top layer to a bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be retrieved, comprises:
and acquiring a plurality of candidate word vectors which are directly or indirectly connected with the target vector in the graph structure based on the target vector.
4. The method according to claim 1, wherein the method further comprises:
acquiring a candidate text set, and preprocessing each element in the candidate text set to obtain a candidate keyword set;
vectorizing each element in the candidate keyword set to obtain a candidate word vector corresponding to each element in the candidate keyword set;
constructing a graph structure corresponding to a plurality of candidate word vectors; the graph structure corresponding to each candidate word vector comprises a plurality of layer structures, and each layer structure comprises at least one vector node; each of the layer structures is an undirected graph.
5. The method of claim 4, wherein constructing a graph structure corresponding to the plurality of candidate word vectors comprises:
determining a mapping relation between the candidate keywords and the candidate word vectors based on the candidate word vectors corresponding to each element in the candidate keyword set;
selecting any plurality of candidate word vectors and constructing a bottom layer structure; among the candidate word vectors in the bottom layer structure, skipping each candidate word vector to the upper layer according to a preset probability so as to construct the next layer structure up;
and acquiring the number of candidate word vectors in each layer structure, and when the number of candidate word vectors in a layer structure is smaller than a preset number threshold, taking that layer structure as the top layer structure, wherein a plurality of preset word vectors in the top layer structure are the top-level vectors.
6. The method of claim 5, wherein the selecting any plurality of candidate word vectors to construct a bottom layer structure comprises:
selecting any plurality of candidate word vectors as first vectors, and constructing edges corresponding to the first vectors;
based on the plurality of first vectors, searching for a plurality of second vectors whose distance to the first vectors is smaller than or equal to a distance threshold, and constructing edges corresponding to the second vectors;
and connecting the edges corresponding to the first vectors with the edges corresponding to the second vectors to obtain a bottom layer graph.
7. A text vectorization retrieval apparatus, the apparatus comprising:
the system comprises a preprocessing module, a searching module and a searching module, wherein the preprocessing module is used for acquiring a text set to be searched and preprocessing each element in the text set to be searched to obtain a keyword set to be searched;
the vectorization processing module is used for vectorizing each element in the keyword set to be retrieved to obtain a word vector to be retrieved corresponding to each element in the keyword set to be retrieved;
the target vector determining module is used for acquiring a plurality of top-level vectors of a graph structure, determining the similarity between the top-level vectors and the to-be-retrieved word vector and determining a target vector;
the candidate word vector acquisition module is used for searching the graph structure from the top layer to the bottom layer based on the target vector to obtain a plurality of candidate word vectors corresponding to the word vector to be searched;
and the candidate text acquisition module is used for acquiring a candidate text corresponding to each candidate word vector in the plurality of candidate word vectors and outputting the candidate text as a retrieval result.
8. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor, when executing the computer program, implements the steps of the method of any of claims 1 to 6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the method of any one of claims 1 to 6.
10. A computer program product comprising a computer program, characterized in that the computer program realizes the steps of the method of any one of claims 1 to 6 when executed by a processor.
CN202210606018.0A 2022-05-31 2022-05-31 Text vectorization storage and retrieval method, device and computer equipment Pending CN115146027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210606018.0A CN115146027A (en) 2022-05-31 2022-05-31 Text vectorization storage and retrieval method, device and computer equipment

Publications (1)

Publication Number Publication Date
CN115146027A true CN115146027A (en) 2022-10-04

Family

ID=83407055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210606018.0A Pending CN115146027A (en) 2022-05-31 2022-05-31 Text vectorization storage and retrieval method, device and computer equipment

Country Status (1)

Country Link
CN (1) CN115146027A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Country or region after: China

Address after: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant after: Zhaolian Consumer Finance Co.,Ltd.

Address before: 518000 Room 201, building A, No. 1, Qian Wan Road, Qianhai Shenzhen Hong Kong cooperation zone, Shenzhen, Guangdong (Shenzhen Qianhai business secretary Co., Ltd.)

Applicant before: MERCHANTS UNION CONSUMER FINANCE Co.,Ltd.

Country or region before: China
