CN112905644B - Mixed search method fusing structured data and unstructured data - Google Patents
Mixed search method fusing structured data and unstructured data
- Publication number
- CN112905644B (application CN202110285108.XA)
- Authority
- CN
- China
- Prior art keywords
- vector
- structured
- unstructured
- entity
- data
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2455—Query execution
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2228—Indexing structures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/288—Entity relationship models
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a hybrid search method fusing structured data and unstructured data. First, the structured data and the unstructured data contained in each entity in a data set are vectorized separately to obtain an entity vector comprising a structured vector and an unstructured vector. Second, a neighbor graph fusing structured and unstructured data is constructed based on the combined similarity of the structured vectors and the unstructured vectors. Then, the structured and unstructured data contained in the query entity are vectorized to obtain a hybrid query vector comprising a structured vector and an unstructured vector. Finally, a hybrid search is performed for the hybrid query vector on the fused structured and unstructured data neighbor graph using a greedy algorithm to obtain the nearest neighbors of the query entity. The invention thereby searches unstructured data and structured data simultaneously, and its efficiency is greatly improved compared with the current approach of two separate indexing systems.
Description
Technical Field
The invention relates to the field of approximate nearest neighbor search, in particular to a hybrid search method fusing structured data and unstructured data.
Background
Various internet and intelligent applications generate massive amounts of unstructured data (pictures, video, voice, and the like) and structured data (numbers, symbols, labels, and the like). Efficiently querying large-scale data to obtain useful information is a core technology of many artificial intelligence applications. Structured data querying based on relational databases is mature and widely used, and unstructured data search is being rapidly adopted in a variety of scenarios with the development of deep-learning vectorization techniques. As requirements on the consistency of query results increase, many scenarios need to search structured data and unstructured data at the same time, that is, to perform hybrid search.
Hybrid search is currently a research hotspot in the field of approximate nearest neighbor search and is already applied in practice on platforms such as e-commerce. However, current hybrid search systems are implemented mainly by querying structured and unstructured data separately and then merging the query results. This merge-based approach suffers from low query speed and low query result precision. There is an urgent need for an efficient hybrid search solution that queries structured and unstructured data simultaneously and meets the query accuracy requirements.
Disclosure of Invention
The invention provides a hybrid search method fusing structured data and unstructured data, which searches unstructured data and structured data simultaneously; its efficiency is greatly improved compared with the current approach of two separate index systems.
The hybrid search method fusing structured data and unstructured data provided by the invention comprises the following steps:
(1) vectorizing the structured data and the unstructured data contained in each entity in a data set separately to obtain an entity vector comprising a structured vector and an unstructured vector;
(2) constructing a neighbor graph fusing structured and unstructured data based on the combined similarity of the structured vectors and the unstructured vectors;
(3) vectorizing the structured and unstructured data contained in a query entity in the same way as in (1) to obtain a hybrid query vector comprising a structured vector and an unstructured vector;
(4) performing a hybrid search for the hybrid query vector on the fused structured and unstructured data neighbor graph using a greedy algorithm to obtain the nearest neighbors of the query entity.
In step (1), the structured and unstructured data contained in each entity $e_i$ in the data set $S$ are vectorized separately to obtain an entity vector $(\alpha_i, \beta_i)$ comprising an unstructured vector $\alpha_i$ and a structured vector $\beta_i$. The data set $S$ is represented as:
$$S = \{e_i \mid i = 1, 2, \ldots, N\}$$
where $e_i$ is the $i$-th entity in the data set and $N$ is the number of entities in the data set.
The unstructured vector $\alpha_i$ is expressed as:
$$\alpha_i = (\alpha_i^{1}, \alpha_i^{2}, \ldots, \alpha_i^{m})$$
where $m$ is the dimension of the unstructured vector and $\alpha_i^{j}$ is the value of the unstructured vector $\alpha_i$ in the $j$-th dimension.
The structured vector $\beta_i$ is expressed as:
$$\beta_i = (\beta_i^{1}, \beta_i^{2}, \ldots, \beta_i^{n})$$
where $n$ is the dimension of the structured vector and $\beta_i^{j}$ is the value of the structured vector $\beta_i$ in the $j$-th dimension.
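The entity vector described above can be pictured as a small record holding both parts. The following is a minimal sketch, not the patent's implementation: the `EntityVector` class, the `encode_structured` helper, and the vocabulary-index encoding of attributes are all hypothetical, and the unstructured embedding is assumed to be precomputed by some deep-learning model.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class EntityVector:
    """Entity vector (alpha_i, beta_i): unstructured part plus structured part."""
    alpha: Tuple[float, ...]  # unstructured vector, dimension m
    beta: Tuple[int, ...]     # structured vector, dimension n


def encode_structured(attrs: Dict[str, str],
                      vocab: Dict[str, List[str]]) -> Tuple[int, ...]:
    """Hypothetical encoder: map each attribute value to its index in a fixed
    vocabulary, visiting attributes in sorted name order."""
    return tuple(vocab[name].index(attrs[name]) for name in sorted(vocab))


def vectorize(embedding: Sequence[float], attrs: Dict[str, str],
              vocab: Dict[str, List[str]]) -> EntityVector:
    # The embedding stands in for the output of a vectorization model
    # (assumed to be computed elsewhere).
    alpha = tuple(float(x) for x in embedding)
    return EntityVector(alpha, encode_structured(attrs, vocab))


# Illustrative data only.
vocab = {"brand": ["acme", "globex"], "color": ["red", "green", "blue"]}
e1 = vectorize([0.1, 0.9, 0.3], {"brand": "globex", "color": "red"}, vocab)
```

Here `e1.alpha` has dimension m = 3 and `e1.beta` has dimension n = 2. Query entities would be passed through the same `vectorize` function, in line with step (3).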
In step (2), constructing the fused structured and unstructured data neighbor graph based on the combined similarity of the structured vectors and the unstructured vectors means that the similarity between entity vectors $(\alpha_i, \beta_i)$ is evaluated by a hybrid distance calculation, so that each entity vector $(\alpha_i, \beta_i)$ is connected to its $K$ nearest neighbors under the hybrid distance $d$. The distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$ between entity vector $(\alpha_1, \beta_1)$ and entity vector $(\alpha_2, \beta_2)$ is calculated as:
$$d((\alpha_1, \beta_1), (\alpha_2, \beta_2)) = d_1(\alpha_1, \alpha_2) + w_b \cdot d_2(\beta_1, \beta_2)$$
where $d_1(\alpha_1, \alpha_2)$ is the unstructured vector distance, $d_2(\beta_1, \beta_2)$ is the structured vector distance, and $w_b$ is the weight given to the structured vector distance when constructing the neighbor graph. The weight $w_b$ regulates the proportions of the unstructured vector distance $d_1(\alpha_1, \alpha_2)$ and the structured vector distance $d_2(\beta_1, \beta_2)$ in the hybrid distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$, and thus influences the performance of the constructed fused structured and unstructured data neighbor graph for hybrid search.
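As a worked illustration of the hybrid distance, using made-up numbers that do not come from the patent, suppose $d_1(\alpha_1, \alpha_2) = 0.6$, $d_2(\beta_1, \beta_2) = 2$, and $w_b = 0.5$:

$$d((\alpha_1, \beta_1), (\alpha_2, \beta_2)) = 0.6 + 0.5 \cdot 2 = 1.6$$

With $w_b = 0$ the same pair would be at distance $0.6$, so only the unstructured part would matter; with $w_b = 2$ it would be at distance $0.6 + 2 \cdot 2 = 4.6$, so disagreement in the structured attributes would dominate the choice of neighbors.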
In step (4), while the hybrid query vector $q$ is searched on the fused structured and unstructured data neighbor graph using a greedy algorithm to obtain the nearest neighbors of the query entity, the following distance calculation is used. The hybrid distance $d$ between the hybrid query vector $q = (q_\alpha, q_\beta)$ and an entity vector $(\alpha_i, \beta_i)$ is:
$$d(q, (\alpha_i, \beta_i)) = d_1(q_\alpha, \alpha_i) + w_s \cdot d_2(q_\beta, \beta_i)$$
where $q_\alpha$ is the unstructured vector of the hybrid query vector $q$, $q_\beta$ is the structured vector of $q$, and $w_s$ adjusts the proportions of the unstructured vector distance $d_1(q_\alpha, \alpha_i)$ and the structured vector distance $d_2(q_\beta, \beta_i)$ in the hybrid distance. The performance of the hybrid search is regulated by changing $w_s$.
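To make the role of the search-time weight concrete, the sketch below ranks two entities against one query under two different values of the weight. It is only an illustration under stated assumptions: Euclidean distance for the unstructured part and a count of mismatched attributes for the structured part are choices made here, and the names `d1`, `d2`, `query_distance`, and `rank` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Entity = Tuple[Sequence[float], Sequence[int]]  # (unstructured, structured)


def d1(a: Sequence[float], b: Sequence[float]) -> float:
    """Unstructured vector distance; Euclidean is an assumption."""
    return math.dist(a, b)


def d2(a: Sequence[int], b: Sequence[int]) -> float:
    """Structured vector distance; mismatch counting is an assumption."""
    return float(sum(x != y for x, y in zip(a, b)))


def query_distance(q: Entity, e: Entity, w_s: float) -> float:
    """Hybrid distance between a query and an entity, weighted by w_s."""
    return d1(q[0], e[0]) + w_s * d2(q[1], e[1])


def rank(q: Entity, entities: List[Entity], w_s: float) -> List[int]:
    """Entity indices sorted from nearest to farthest under the hybrid distance."""
    return sorted(range(len(entities)),
                  key=lambda i: query_distance(q, entities[i], w_s))


entities = [
    ((0.0, 0.0), (1, 1)),  # 0: identical embedding, both attributes differ
    ((1.0, 0.0), (0, 0)),  # 1: farther embedding, both attributes match
]
q = ((0.0, 0.0), (0, 0))
low = rank(q, entities, w_s=0.1)   # structured part barely counts
high = rank(q, entities, w_s=1.0)  # structured part counts fully
```

With the small weight, entity 0 ranks first because its embedding matches the query; with the larger weight, entity 1 ranks first because its attributes match. Which setting is appropriate depends on the application.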
The beneficial effects of the invention are as follows. The hybrid search method based on the fused structured and unstructured data neighbor graph obtains entity vectors by vectorizing the structured data and the unstructured data in the entity data separately, and constructs the fused structured and unstructured data neighbor graph based on the combined similarity of the structured vectors and the unstructured vectors. It then vectorizes the structured data and the unstructured data in the query entity separately to obtain a hybrid query vector, and executes a greedy search on the constructed neighbor graph using the hybrid query vector. Hybrid search over unstructured data and structured data is thereby realized, and efficiency is greatly improved compared with the current approach of two separate index systems.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
In order to make the technical solution and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of the present invention, which mainly comprises the following steps:
(1) Vectorizing the structured data and the unstructured data contained in each entity in the data set separately to obtain an entity vector comprising a structured vector and an unstructured vector.
Specifically, the structured and unstructured data contained in each entity $e_i$ in the data set $S$ are vectorized separately to obtain an entity vector $(\alpha_i, \beta_i)$ comprising an unstructured vector $\alpha_i$ and a structured vector $\beta_i$. The data set $S$ is represented as:
$$S = \{e_i \mid i = 1, 2, \ldots, N\}$$
where $e_i$ is the $i$-th entity in the data set and $N$ is the number of entities in the data set.
The unstructured vector $\alpha_i$ is expressed as:
$$\alpha_i = (\alpha_i^{1}, \alpha_i^{2}, \ldots, \alpha_i^{m})$$
where $m$ is the dimension of the unstructured vector and $\alpha_i^{j}$ is the value of the unstructured vector $\alpha_i$ in the $j$-th dimension.
The structured vector $\beta_i$ is expressed as:
$$\beta_i = (\beta_i^{1}, \beta_i^{2}, \ldots, \beta_i^{n})$$
where $n$ is the dimension of the structured vector and $\beta_i^{j}$ is the value of the structured vector $\beta_i$ in the $j$-th dimension.
(2) Constructing a neighbor graph fusing structured and unstructured data based on the combined similarity of the structured vectors and the unstructured vectors. During graph construction, the similarity between entity vectors $(\alpha_i, \beta_i)$ is evaluated by a hybrid distance calculation, so that each entity vector $(\alpha_i, \beta_i)$ is connected to its $K$ nearest neighbors under the hybrid distance $d$.
The distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$ between entity vector $(\alpha_1, \beta_1)$ and entity vector $(\alpha_2, \beta_2)$ is calculated as:
$$d((\alpha_1, \beta_1), (\alpha_2, \beta_2)) = d_1(\alpha_1, \alpha_2) + w_b \cdot d_2(\beta_1, \beta_2)$$
where $d_1(\alpha_1, \alpha_2)$ is the unstructured vector distance, $d_2(\beta_1, \beta_2)$ is the structured vector distance, and $w_b$ is the weight given to the structured vector distance when constructing the neighbor graph. The weight $w_b$ regulates the proportions of the unstructured vector distance $d_1(\alpha_1, \alpha_2)$ and the structured vector distance $d_2(\beta_1, \beta_2)$ in the hybrid distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$, and thus influences the performance of the constructed fused structured and unstructured data neighbor graph for hybrid search.
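The graph construction in step (2) can be sketched with a brute-force K-nearest-neighbor search over all entity pairs. This is an illustrative sketch only: the patent text does not state which construction algorithm is used, and the Euclidean unstructured distance, the mismatch-count structured distance, and the names `hybrid_distance` and `build_knn_graph` are assumptions made here. The quadratic cost makes it suitable only for small examples.

```python
import math
from typing import Dict, List, Sequence, Tuple

Entity = Tuple[Sequence[float], Sequence[int]]  # (unstructured, structured)


def hybrid_distance(e1: Entity, e2: Entity, w_b: float) -> float:
    """d = d1 + w_b * d2, with assumed concrete choices for d1 and d2."""
    d_unstructured = math.dist(e1[0], e2[0])                  # assumed Euclidean
    d_structured = sum(x != y for x, y in zip(e1[1], e2[1]))  # assumed mismatch count
    return d_unstructured + w_b * d_structured


def build_knn_graph(entities: List[Entity], k: int,
                    w_b: float) -> Dict[int, List[int]]:
    """Connect every entity to its k nearest neighbors under the hybrid distance."""
    graph: Dict[int, List[int]] = {}
    for i, e in enumerate(entities):
        others = [j for j in range(len(entities)) if j != i]
        others.sort(key=lambda j: hybrid_distance(e, entities[j], w_b))
        graph[i] = others[:k]
    return graph


# Illustrative data: two spatial clusters, one attribute that cuts across them.
entities = [
    ((0.0, 0.0), (0,)),
    ((0.1, 0.0), (1,)),
    ((1.0, 0.0), (0,)),
    ((1.1, 0.0), (1,)),
]
graph_fused = build_knn_graph(entities, k=1, w_b=2.0)
graph_plain = build_knn_graph(entities, k=1, w_b=0.0)
```

With `w_b=2.0` each entity links to the entity that shares its attribute value, even though it is farther away in the embedding space; with `w_b=0.0` each entity links to its spatial neighbor instead.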
The structured vector distance $d_2(\beta_1, \beta_2)$ is defined as follows:
[formula not recoverable from the source text]
In the formula, $\beta_1^{i}$ and $\beta_2^{i}$ are the values of the structured vectors $\beta_1$ and $\beta_2$ in the $i$-th dimension, and $M$ is the dimension of the structured vector. The distance function for the values in the $i$-th dimension of the structured vectors is defined as follows:
[formula not recoverable from the source text]
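The formulas for the structured vector distance are not reproduced in this text, so the following is only a sketch under an assumption, not the patent's definition: it assumes that the per-dimension function returns 0 when the two values agree and 1 otherwise, and that the structured distance sums this function over all M dimensions. The names `dimension_distance` and `structured_distance` are hypothetical.

```python
from typing import Sequence


def dimension_distance(x: int, y: int) -> float:
    """ASSUMED per-dimension function: 0 if the values agree, 1 otherwise."""
    return 0.0 if x == y else 1.0


def structured_distance(beta_1: Sequence[int], beta_2: Sequence[int]) -> float:
    """ASSUMED aggregation: sum of per-dimension distances over M dimensions."""
    if len(beta_1) != len(beta_2):
        raise ValueError("structured vectors must have the same dimension M")
    return sum(dimension_distance(x, y) for x, y in zip(beta_1, beta_2))


# Illustrative values only: the vectors differ in the second dimension.
example = structured_distance((1, 0, 2), (1, 1, 2))
```

Under this assumption the distance counts the attributes on which two entities disagree, so identical structured vectors are at distance zero.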
(3) Vectorizing the structured and unstructured data contained in a query entity in the same way as in (1) to obtain a hybrid query vector comprising a structured vector and an unstructured vector.
(4) Performing a hybrid search for the hybrid query vector on the fused structured and unstructured data neighbor graph using a greedy algorithm to obtain the nearest neighbors of the query entity. Specifically, the following distance calculation is used while the hybrid query vector $q$ is searched on the fused neighbor graph.
The hybrid distance $d$ between the hybrid query vector $q = (q_\alpha, q_\beta)$ and an entity vector $(\alpha_i, \beta_i)$ is:
$$d(q, (\alpha_i, \beta_i)) = d_1(q_\alpha, \alpha_i) + w_s \cdot d_2(q_\beta, \beta_i)$$
where $q_\alpha$ is the unstructured vector of the hybrid query vector $q$, $q_\beta$ is the structured vector of $q$, and $w_s$ adjusts the proportions of the unstructured vector distance $d_1(q_\alpha, \alpha_i)$ and the structured vector distance $d_2(q_\beta, \beta_i)$ in the hybrid distance. The performance of the hybrid search is regulated by changing $w_s$.
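The greedy search in step (4) can be sketched as a simple hill climb on the neighbor graph: start at some node, move to whichever neighbor is closest to the query, and stop when no neighbor improves. The text only says "greedy algorithm", so this is the simplest reading and an assumption; practical graph-based search commonly keeps a queue of several candidates rather than one node. The distance choices and the names `query_distance` and `greedy_search` are likewise hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Entity = Tuple[Sequence[float], Sequence[int]]  # (unstructured, structured)


def query_distance(q: Entity, e: Entity, w_s: float) -> float:
    """Hybrid distance; Euclidean plus mismatch count are assumed choices."""
    return math.dist(q[0], e[0]) + w_s * sum(x != y for x, y in zip(q[1], e[1]))


def greedy_search(graph: Dict[int, List[int]], entities: List[Entity],
                  q: Entity, w_s: float, start: int = 0) -> Tuple[int, float]:
    """Walk the graph, always moving to the neighbor nearest to the query."""
    current = start
    current_dist = query_distance(q, entities[current], w_s)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = query_distance(q, entities[nb], w_s)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:  # no neighbor is closer: local minimum reached
            return current, current_dist
        current, current_dist = best, best_dist


# Illustrative data: four entities on a line, linked as a chain.
entities = [((0.0,), (0,)), ((1.0,), (0,)), ((2.0,), (1,)), ((3.0,), (1,))]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
q = ((2.9,), (1,))
node, dist = greedy_search(graph, entities, q, w_s=1.0, start=0)
```

In this toy chain the walk goes 0, 1, 2, 3 and stops at entity 3, whose embedding and attribute both match the query most closely. Because the hybrid distance strictly decreases at every move, the loop always terminates, although on a real graph it may stop at a local minimum rather than the true nearest neighbor.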
Claims (1)
1. A hybrid search method fusing structured and unstructured data, comprising the steps of:
(1) respectively vectorizing structured data and unstructured data contained in each entity in a data set to obtain an entity vector containing a structured vector and an unstructured vector;
(2) constructing a fusion structured and unstructured data neighbor graph based on the structured vector and unstructured vector similarity combination;
(3) vectorizing structured and unstructured data contained in a query entity in the same way as (1) to obtain a mixed query vector containing a structured vector and an unstructured vector;
(4) performing hybrid search on the hybrid query vector on the fusion structured and unstructured data neighbor graph through a greedy algorithm to obtain the nearest neighbor of a query entity;
wherein step (1) comprises vectorizing the structured and unstructured data contained in each entity $e_i$ in the data set $S$ separately to obtain an entity vector $(\alpha_i, \beta_i)$ comprising an unstructured vector $\alpha_i$ and a structured vector $\beta_i$; wherein the data set $S$ is represented as:
$$S = \{e_i \mid i = 1, 2, \ldots, N\}$$
wherein $e_i$ is the $i$-th entity in the data set and $N$ is the number of entities in the data set;
the unstructured vector $\alpha_i$ is expressed as:
$$\alpha_i = (\alpha_i^{1}, \alpha_i^{2}, \ldots, \alpha_i^{m})$$
where $m$ is the dimension of the unstructured vector and $\alpha_i^{j}$ is the value of the unstructured vector $\alpha_i$ in the $j$-th dimension;
the structured vector $\beta_i$ is expressed as:
$$\beta_i = (\beta_i^{1}, \beta_i^{2}, \ldots, \beta_i^{n})$$
where $n$ is the dimension of the structured vector and $\beta_i^{j}$ is the value of the structured vector $\beta_i$ in the $j$-th dimension;
wherein constructing the fused structured and unstructured data neighbor graph based on the combined similarity of the structured vectors and the unstructured vectors in step (2) means that the similarity between entity vectors $(\alpha_i, \beta_i)$ is evaluated by a hybrid distance calculation, so that each entity vector $(\alpha_i, \beta_i)$ is connected to its $K$ nearest neighbors under the hybrid distance $d$; the distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$ between entity vector $(\alpha_1, \beta_1)$ and entity vector $(\alpha_2, \beta_2)$ is calculated as:
$$d((\alpha_1, \beta_1), (\alpha_2, \beta_2)) = d_1(\alpha_1, \alpha_2) + w_b \cdot d_2(\beta_1, \beta_2)$$
wherein $d_1(\alpha_1, \alpha_2)$ is the unstructured vector distance, $d_2(\beta_1, \beta_2)$ is the structured vector distance, and $w_b$ is the weight given to the structured vector distance when constructing the neighbor graph, used to regulate the proportions of the unstructured vector distance $d_1(\alpha_1, \alpha_2)$ and the structured vector distance $d_2(\beta_1, \beta_2)$ in the hybrid distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$;
wherein, in step (4), while the hybrid query vector $q$ is searched on the fused structured and unstructured data neighbor graph using a greedy algorithm to obtain the nearest neighbors of the query entity, the hybrid distance $d$ between the hybrid query vector $q = (q_\alpha, q_\beta)$ and an entity vector $(\alpha_i, \beta_i)$ is calculated as:
$$d(q, (\alpha_i, \beta_i)) = d_1(q_\alpha, \alpha_i) + w_s \cdot d_2(q_\beta, \beta_i)$$
wherein $q_\alpha$ is the unstructured vector of the hybrid query vector $q$, $q_\beta$ is the structured vector of $q$, and $w_s$ adjusts the proportions of the unstructured vector distance $d_1(q_\alpha, \alpha_i)$ and the structured vector distance $d_2(q_\beta, \beta_i)$ in the hybrid distance, the performance of the hybrid search being regulated by changing $w_s$.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110285108.XA CN112905644B (en) | 2021-03-17 | 2021-03-17 | Mixed search method fusing structured data and unstructured data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110285108.XA CN112905644B (en) | 2021-03-17 | 2021-03-17 | Mixed search method fusing structured data and unstructured data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112905644A (en) | 2021-06-04 |
CN112905644B (en) | 2022-08-02 |
Family
ID=76106595
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110285108.XA Active CN112905644B (en) | 2021-03-17 | 2021-03-17 | Mixed search method fusing structured data and unstructured data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112905644B (en) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103412925A (en) * | 2013-08-13 | 2013-11-27 | 南京烽火星空通信发展有限公司 | System and method for integrated searching of structured data and unstructured data |
EP2836920A1 (en) * | 2012-04-09 | 2015-02-18 | Vivek Ventures, LLC | Clustered information processing and searching with structured-unstructured database bridge |
WO2017180475A1 (en) * | 2016-04-15 | 2017-10-19 | 3M Innovative Properties Company | Query optimizer for combined structured and unstructured data records |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20080077570A1 (en) * | 2004-10-25 | 2008-03-27 | Infovell, Inc. | Full Text Query and Search Systems and Method of Use |
US8930389B2 (en) * | 2009-10-06 | 2015-01-06 | International Business Machines Corporation | Mutual search and alert between structured and unstructured data stores |
US20180032930A1 (en) * | 2015-10-07 | 2018-02-01 | 0934781 B.C. Ltd | System and method to Generate Queries for a Business Database |
US11093842B2 (en) * | 2018-02-13 | 2021-08-17 | International Business Machines Corporation | Combining chemical structure data with unstructured data for predictive analytics in a cognitive system |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP2836920A1 (en) * | 2012-04-09 | 2015-02-18 | Vivek Ventures, LLC | Clustered information processing and searching with structured-unstructured database bridge |
CN103412925A (en) * | 2013-08-13 | 2013-11-27 | 南京烽火星空通信发展有限公司 | System and method for integrated searching of structured data and unstructured data |
WO2017180475A1 (en) * | 2016-04-15 | 2017-10-19 | 3M Innovative Properties Company | Query optimizer for combined structured and unstructured data records |
Non-Patent Citations (1)
Title |
---|
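"Research on the structured public security intelligence process based on complex networks" (基于复杂网络的结构化公安情报流程研究); Fan Shu (樊舒) et al.; Journal of Intelligence (《情报杂志》); 2020-10-31; pp. 86-91 * |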
Also Published As
Publication number | Publication date |
---|---|
CN112905644A (en) | 2021-06-04 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106897374B (en) | Personalized recommendation method based on track big data nearest neighbor query | |
Zhang et al. | Adversarial separation network for cross-network node classification | |
CN113761221B (en) | Knowledge graph entity alignment method based on graph neural network | |
CN114565053A (en) | Deep heterogeneous map embedding model based on feature fusion | |
CN109376797B (en) | Network traffic classification method based on binary encoder and multi-hash table | |
CN116049450A (en) | Multi-mode-supported image-text retrieval method and device based on distance clustering | |
CN108491628B (en) | Product design demand driven three-dimensional CAD assembly model clustering and searching method | |
Chen et al. | Personalized travel route recommendation algorithm based on improved genetic algorithm | |
CN112035689A (en) | Zero sample image hash retrieval method based on vision-to-semantic network | |
CN114359902B (en) | Three-dimensional point cloud semantic segmentation method based on multi-scale feature fusion | |
CN112905644B (en) | Mixed search method fusing structured data and unstructured data | |
CN113870312A (en) | Twin network-based single target tracking method | |
Chang et al. | Trajectory similarity measurement: An efficiency perspective | |
CN117453727A (en) | Data vectorization retrieval method and device based on reasoning in database | |
CN112765490A (en) | Information recommendation method and system based on knowledge graph and graph convolution network | |
CN116050517B (en) | Public security field oriented multi-mode data management method and system | |
CN108319727A (en) | A method of any two points shortest path in social networks is found based on community structure | |
CN116955650A (en) | Information retrieval optimization method and system based on small sample knowledge graph completion | |
CN113990408A (en) | Molecular diagram comparison learning method based on chemical element knowledge graph | |
CN113239219A (en) | Image retrieval method, system, medium and equipment based on multi-modal query | |
Zhang et al. | Active learning for information retrieval: Using 3D models as an example | |
Huang | Research on graph network recommendation algorithm based on random walk and convolutional neural network | |
CN113761243A (en) | Online retrieval method and system | |
CN117688121B (en) | SubGNN geographic knowledge graph representation learning method for injecting spatial features | |
Yang | Gc-mobileseg: Fast and accurate semantic segmentation network on mobile devices with global context modeling |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||