CN112905644B - Mixed search method fusing structured data and unstructured data - Google Patents

Info

Publication number
CN112905644B
Authority
CN
China
Prior art keywords
vector
structured
unstructured
entity
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110285108.XA
Other languages
Chinese (zh)
Other versions
CN112905644A (en)
Inventor
徐小良
王梦召
吕凌威
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University
Priority to CN202110285108.XA
Publication of CN112905644A
Application granted
Publication of CN112905644B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a hybrid search method fusing structured data and unstructured data. First, the structured data and unstructured data contained in each entity in a data set are vectorized separately to obtain an entity vector comprising a structured vector and an unstructured vector. Second, a neighbor graph fusing structured and unstructured data is constructed based on the combined similarity of the structured and unstructured vectors. Then, the structured and unstructured data contained in a query entity are vectorized to obtain a hybrid query vector comprising a structured vector and an unstructured vector. Finally, a greedy algorithm performs a hybrid search for the hybrid query vector on the fused neighbor graph to obtain the nearest neighbor of the query entity. The invention thus searches unstructured data and structured data simultaneously, and its efficiency is greatly improved compared with the current approach of two separate index systems.

Description

Mixed search method fusing structured data and unstructured data
Technical Field
The invention relates to the field of approximate nearest neighbor search, in particular to a hybrid search method fusing structured data and unstructured data.
Background
Internet and intelligent applications generate massive amounts of unstructured data (pictures, videos, voice and the like) and structured data (numbers, symbols, labels and the like), and efficiently querying large-scale data for useful information is a core technology of many artificial intelligence applications. Structured data querying based on relational databases is mature and widely applied, and unstructured data search is also being rapidly adopted in various scenarios as deep-learning vectorization technology develops. As requirements on the consistency of query results increase, many scenarios need to search structured data and unstructured data at the same time, i.e. perform a hybrid search.
Hybrid search is currently a research hotspot in the field of approximate nearest neighbor search and is already applied in practice on platforms such as e-commerce. However, current hybrid search systems are implemented primarily by querying the structured and unstructured data separately and then merging the two sets of results. This separate-then-merge approach suffers from low query speed and low query result precision. There is an urgent need for an efficient hybrid search solution that can query structured and unstructured data simultaneously and meet query accuracy requirements.
Disclosure of Invention
The invention provides a hybrid search method fusing structured data and unstructured data, which searches the unstructured data and the structured data simultaneously, and whose efficiency is greatly improved compared with the current approach of two separate index systems.
The specific content of the hybrid search method fusing the structured data and the unstructured data provided by the invention is as follows:
(1) respectively vectorizing structured data and unstructured data contained in each entity in a data set to obtain an entity vector containing a structured vector and an unstructured vector;
(2) constructing a fusion structured and unstructured data neighbor graph based on the structured vector and unstructured vector similarity combination;
(3) vectorizing structured and unstructured data contained in a query entity in the same way as (1) to obtain a mixed query vector containing a structured vector and an unstructured vector;
(4) performing a hybrid search for the hybrid query vector on the fused structured and unstructured data neighbor graph through a greedy algorithm to obtain the nearest neighbor of the query entity.
In step (1), the structured and unstructured data contained in each entity $e_i$ in the data set $S$ are vectorized separately, yielding an entity vector $(\alpha_i, \beta_i)$ composed of an unstructured vector $\alpha_i$ and a structured vector $\beta_i$. The data set $S$ is expressed as:
$$S = \{ e_i \mid i = 1, 2, \ldots, N \}$$
where $e_i$ is the $i$-th entity in the data set and $N$ is the number of entities in the data set.
The unstructured vector $\alpha_i$ is expressed as:
$$\alpha_i = (\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^m)$$
where $m$ is the dimension of the unstructured vector and $\alpha_i^j$ is the value of the unstructured vector $\alpha_i$ in the $j$-th dimension.
The structured vector $\beta_i$ is expressed as:
$$\beta_i = (\beta_i^1, \beta_i^2, \ldots, \beta_i^n)$$
where $n$ is the dimension of the structured vector and $\beta_i^j$ is the value of the structured vector $\beta_i$ in the $j$-th dimension.
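As an illustration of step (1), the following minimal Python sketch shows one way an entity vector $(\alpha_i, \beta_i)$ might be represented. The names `EntityVector` and `encode_structured`, the integer coding of attribute labels, and the sample values are all assumptions made for this example; the source does not specify how either kind of data is vectorized, and the unstructured vectors here are hard-coded stand-ins for the output of some embedding model.

```python
from dataclasses import dataclass
from typing import Dict, Sequence, Tuple


@dataclass(frozen=True)
class EntityVector:
    """Entity vector (alpha_i, beta_i) of one entity e_i."""
    alpha: Tuple[float, ...]  # unstructured vector, dimension m
    beta: Tuple[int, ...]     # structured vector, dimension n


def encode_structured(attrs: Dict[str, str],
                      fields: Sequence[str],
                      codes: Dict[str, Dict[str, int]]) -> Tuple[int, ...]:
    """Map each structured attribute to an integer code (hypothetical scheme).

    One dimension per field; a label seen for the first time in a field
    receives the next unused integer for that field.
    """
    beta = []
    for field in fields:
        table = codes.setdefault(field, {})
        beta.append(table.setdefault(attrs[field], len(table)))
    return tuple(beta)


# Hypothetical data set S with N = 2 entities, m = 2 and n = 2.
codes: Dict[str, Dict[str, int]] = {}
fields = ("brand", "color")
e1 = EntityVector(alpha=(0.1, 0.9),
                  beta=encode_structured({"brand": "A", "color": "red"}, fields, codes))
e2 = EntityVector(alpha=(0.2, 0.8),
                  beta=encode_structured({"brand": "B", "color": "red"}, fields, codes))
```

Keeping `alpha` and `beta` as separate fields, rather than concatenating them, mirrors the source's pairing $(\alpha_i, \beta_i)$ and lets each part use its own distance function.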
In step (2), constructing the fused structured and unstructured data neighbor graph based on the combined similarity of the structured and unstructured vectors means that the similarity between the entity vectors $(\alpha_i, \beta_i)$ is evaluated by a hybrid distance, and each entity vector $(\alpha_i, \beta_i)$ is connected to the $K$ neighbors nearest to it under the hybrid distance $d$. The distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$ between entity vector $(\alpha_1, \beta_1)$ and entity vector $(\alpha_2, \beta_2)$ is computed as:
$$d((\alpha_1, \beta_1), (\alpha_2, \beta_2)) = d_1(\alpha_1, \alpha_2) + w_b \cdot d_2(\beta_1, \beta_2)$$
where $d_1(\alpha_1, \alpha_2)$ is the unstructured vector distance, $d_2(\beta_1, \beta_2)$ is the structured vector distance, and $w_b$ is the weight of the structured vector distance when constructing the neighbor graph. The weight $w_b$ regulates the relative proportions of $d_1(\alpha_1, \alpha_2)$ and $d_2(\beta_1, \beta_2)$ in the hybrid distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$, and thereby affects the hybrid search performance of the constructed fused structured and unstructured data neighbor graph.
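The hybrid distance can be sketched in a few lines of Python. This passage does not fix the component distances, so the sketch assumes Euclidean distance for $d_1$ and a count of mismatched dimensions for $d_2$; both choices, the function names, and the sample vectors are illustrative assumptions rather than the patent's definitions.

```python
import math
from typing import Sequence, Tuple

Entity = Tuple[Sequence[float], Sequence[int]]  # (alpha, beta)


def d1(a: Sequence[float], b: Sequence[float]) -> float:
    """Unstructured vector distance. ASSUMPTION: Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def d2(a: Sequence[int], b: Sequence[int]) -> float:
    """Structured vector distance. ASSUMPTION: number of mismatched dimensions."""
    return float(sum(x != y for x, y in zip(a, b)))


def hybrid_distance(e1: Entity, e2: Entity, w: float) -> float:
    """d((a1, b1), (a2, b2)) = d1(a1, a2) + w * d2(b1, b2)."""
    (a1, b1), (a2, b2) = e1, e2
    return d1(a1, a2) + w * d2(b1, b2)


# Hypothetical entities: d1 = 5.0 (a 3-4-5 triangle), d2 = 1 (one label differs).
example = hybrid_distance(((0.0, 0.0), (1, 2)), ((3.0, 4.0), (1, 3)), w=0.5)
```

With `w = 0` the structured part is ignored entirely, while a large `w` makes attribute mismatches dominate; this is the trade-off that $w_b$ controls during graph construction.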
In step (4), while the greedy algorithm performs the hybrid search for the hybrid query vector $q$ on the fused structured and unstructured data neighbor graph to obtain the nearest neighbor of the query entity, the following distance is used. The hybrid distance $d$ between the hybrid query vector $q = (q_\alpha, q_\beta)$ and an entity vector $(\alpha_i, \beta_i)$ is:
$$d(q, (\alpha_i, \beta_i)) = d_1(q_\alpha, \alpha_i) + w_s \cdot d_2(q_\beta, \beta_i)$$
where $q_\alpha$ is the unstructured vector of the hybrid query vector $q$, $q_\beta$ is the structured vector of $q$, and $w_s$ adjusts the relative proportions of the unstructured vector distance $d_1(q_\alpha, \alpha_i)$ and the structured vector distance $d_2(q_\beta, \beta_i)$ in the hybrid distance. Changing $w_s$ thereby regulates the performance of the hybrid search.
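A small worked example may help show how the search weight $w_s$ changes which entity counts as nearest. The two candidate entities $x$ and $y$ and all distance values below are hypothetical, chosen only for illustration.

```latex
% Hypothetical component distances from the query q to candidates x and y
\begin{aligned}
d_1(q_\alpha, \alpha_x) &= 0.5, & d_2(q_\beta, \beta_x) &= 2, \\
d_1(q_\alpha, \alpha_y) &= 0.6, & d_2(q_\beta, \beta_y) &= 0.
\end{aligned}

% With w_s = 0 only the unstructured distance counts, so x is nearer:
d(q, x) = 0.5 + 0 \cdot 2 = 0.5 \;<\; d(q, y) = 0.6 + 0 \cdot 0 = 0.6

% With w_s = 0.1 the structured mismatch of x is penalized, so y is nearer:
d(q, x) = 0.5 + 0.1 \cdot 2 = 0.7 \;>\; d(q, y) = 0.6 + 0.1 \cdot 0 = 0.6
```

Under these assumed values, even a small $w_s$ is enough to promote the candidate whose structured attributes match the query.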
The beneficial effects of the invention are as follows. The hybrid search method based on the fused structured and unstructured data neighbor graph vectorizes the structured data and the unstructured data in each entity separately to obtain entity vectors, and constructs the fused neighbor graph based on the combined similarity of the structured and unstructured vectors. It then vectorizes the structured data and the unstructured data in the query entity separately to obtain a hybrid query vector, and executes a greedy search on the constructed neighbor graph using that vector. This realizes a hybrid search over unstructured data and structured data simultaneously, and the efficiency is greatly improved compared with the current approach of two separate index systems.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
In order to make the technical solution and advantages of the present invention more apparent, the present invention will be further described with reference to the accompanying drawings.
Fig. 1 is a schematic flow chart of the present invention, which mainly comprises the following steps:
(1) respectively vectorizing structured data and unstructured data contained in each entity in the data set to obtain an entity vector containing a structured vector and an unstructured vector;
In this step, the structured and unstructured data contained in each entity $e_i$ in the data set $S$ are vectorized separately, yielding an entity vector $(\alpha_i, \beta_i)$ composed of an unstructured vector $\alpha_i$ and a structured vector $\beta_i$. The data set $S$ is expressed as:
$$S = \{ e_i \mid i = 1, 2, \ldots, N \}$$
where $e_i$ is the $i$-th entity in the data set and $N$ is the number of entities in the data set.
The unstructured vector $\alpha_i$ is expressed as:
$$\alpha_i = (\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^m)$$
where $m$ is the dimension of the unstructured vector and $\alpha_i^j$ is the value of the unstructured vector $\alpha_i$ in the $j$-th dimension.
The structured vector $\beta_i$ is expressed as:
$$\beta_i = (\beta_i^1, \beta_i^2, \ldots, \beta_i^n)$$
where $n$ is the dimension of the structured vector and $\beta_i^j$ is the value of the structured vector $\beta_i$ in the $j$-th dimension.
(2) Constructing a fused structured and unstructured data neighbor graph based on the combined similarity of the structured and unstructured vectors. During graph construction, the similarity between the entity vectors $(\alpha_i, \beta_i)$ is evaluated by a hybrid distance, and each entity vector $(\alpha_i, \beta_i)$ is connected to the $K$ neighbors nearest to it under the hybrid distance $d$.
The distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$ between entity vector $(\alpha_1, \beta_1)$ and entity vector $(\alpha_2, \beta_2)$ is computed as:
$$d((\alpha_1, \beta_1), (\alpha_2, \beta_2)) = d_1(\alpha_1, \alpha_2) + w_b \cdot d_2(\beta_1, \beta_2)$$
where $d_1(\alpha_1, \alpha_2)$ is the unstructured vector distance, $d_2(\beta_1, \beta_2)$ is the structured vector distance, and $w_b$ is the weight of the structured vector distance when constructing the neighbor graph. The weight $w_b$ regulates the relative proportions of $d_1(\alpha_1, \alpha_2)$ and $d_2(\beta_1, \beta_2)$ in the hybrid distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$, and thereby affects the hybrid search performance of the constructed fused structured and unstructured data neighbor graph.
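The graph construction in step (2) can be illustrated with a brute-force sketch that connects every entity to its $K$ nearest neighbors under the hybrid distance. The source does not say how the graph is built efficiently, so the exhaustive $O(N^2)$ comparison here is only a stand-in; Euclidean $d_1$, mismatch-count $d_2$, the name `build_knn_graph`, and the sample entities are likewise assumptions.

```python
import math
from typing import Dict, List, Sequence, Tuple

Entity = Tuple[Tuple[float, ...], Tuple[int, ...]]  # (alpha, beta)


def hybrid_distance(e1: Entity, e2: Entity, w: float) -> float:
    """d = d1 + w * d2. ASSUMPTIONS: d1 is Euclidean, d2 counts mismatched dimensions."""
    (a1, b1), (a2, b2) = e1, e2
    d1 = math.sqrt(sum((x - y) ** 2 for x, y in zip(a1, a2)))
    d2 = sum(x != y for x, y in zip(b1, b2))
    return d1 + w * d2


def build_knn_graph(entities: Sequence[Entity], k: int, w_b: float) -> Dict[int, List[int]]:
    """Connect each entity to its k nearest neighbors by brute force."""
    graph: Dict[int, List[int]] = {}
    for i, e in enumerate(entities):
        others = [j for j in range(len(entities)) if j != i]
        # Sort the remaining entities by hybrid distance to entity i.
        others.sort(key=lambda j: hybrid_distance(e, entities[j], w_b))
        graph[i] = others[:k]
    return graph


# Hypothetical data: entity 1 is closer to entity 0 in alpha, but only
# entity 2 shares entity 0's structured label.
entities: List[Entity] = [
    ((0.0, 0.0), (0,)),
    ((1.0, 0.0), (1,)),
    ((1.5, 0.0), (0,)),
]
graph_unweighted = build_knn_graph(entities, k=1, w_b=0.0)
graph_weighted = build_knn_graph(entities, k=1, w_b=1.0)
```

In this toy data the nearest neighbor of entity 0 is entity 1 when `w_b = 0.0` and entity 2 when `w_b = 1.0`, which shows how the build weight shapes the edges that the later search can follow.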
The structured vector distance $d_2(\beta_1, \beta_2)$ is defined as follows:
Figure BDA0002980129400000045
where
Figure BDA0002980129400000046
are the values of the structured vectors $\beta_1$ and $\beta_2$ in the $i$-th dimension, $M$ is the dimension of the structured vector, and
Figure BDA0002980129400000047
is the distance function for the values in the $i$-th dimension of the structured vectors, defined as follows:
Figure BDA0002980129400000048
(3) vectorizing structured and unstructured data contained in a query entity in the same way as (1) to obtain a mixed query vector containing a structured vector and an unstructured vector;
(4) A greedy algorithm performs a hybrid search for the hybrid query vector on the fused structured and unstructured data neighbor graph to obtain the nearest neighbor of the query entity. Specifically, the following distance is used during this search.
The hybrid distance $d$ between the hybrid query vector $q = (q_\alpha, q_\beta)$ and an entity vector $(\alpha_i, \beta_i)$ is:
$$d(q, (\alpha_i, \beta_i)) = d_1(q_\alpha, \alpha_i) + w_s \cdot d_2(q_\beta, \beta_i)$$
where $q_\alpha$ is the unstructured vector of the hybrid query vector $q$, $q_\beta$ is the structured vector of $q$, and $w_s$ adjusts the relative proportions of the unstructured vector distance $d_1(q_\alpha, \alpha_i)$ and the structured vector distance $d_2(q_\beta, \beta_i)$ in the hybrid distance. Changing $w_s$ thereby regulates the performance of the hybrid search.
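The greedy search in step (4) can be sketched as a simple best-first walk: starting from some entry node, move to whichever neighbor is closer to the query under the hybrid distance, and stop when no neighbor improves on the current node. The source only says that a greedy algorithm is used, so this single-path variant, the fixed entry node, the hand-made chain graph, and the component distances (Euclidean $d_1$, mismatch-count $d_2$) are all assumptions for illustration.

```python
import math
from typing import Dict, List, Sequence, Tuple

Entity = Tuple[Tuple[float, ...], Tuple[int, ...]]  # (alpha, beta)


def hybrid_distance(q: Entity, e: Entity, w: float) -> float:
    """d = d1 + w * d2. ASSUMPTIONS: d1 is Euclidean, d2 counts mismatched dimensions."""
    (qa, qb), (a, b) = q, e
    d1 = math.sqrt(sum((x - y) ** 2 for x, y in zip(qa, a)))
    d2 = sum(x != y for x, y in zip(qb, b))
    return d1 + w * d2


def greedy_search(graph: Dict[int, List[int]], entities: Sequence[Entity],
                  query: Entity, w_s: float, start: int = 0) -> Tuple[int, float]:
    """Walk the neighbor graph, always moving to the closest improving neighbor."""
    current = start
    current_dist = hybrid_distance(query, entities[current], w_s)
    while True:
        best, best_dist = current, current_dist
        for nb in graph[current]:
            d = hybrid_distance(query, entities[nb], w_s)
            if d < best_dist:
                best, best_dist = nb, d
        if best == current:  # local minimum: no neighbor is closer to the query
            return current, current_dist
        current, current_dist = best, best_dist


# Hypothetical entities on a line, linked as a chain 0 - 1 - 2 - 3.
entities: List[Entity] = [((0.0,), (0,)), ((1.0,), (0,)), ((2.0,), (1,)), ((3.0,), (1,))]
graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
query: Entity = ((2.9,), (1,))
nearest, dist = greedy_search(graph, entities, query, w_s=1.0, start=0)
```

Because the distance to the query strictly decreases at every move, the walk always terminates, though on a real graph it may stop at a local minimum rather than the true nearest neighbor; that is the usual trade-off of approximate nearest neighbor search.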

Claims (1)

1. A hybrid search method fusing structured and unstructured data, comprising the steps of:
(1) respectively vectorizing structured data and unstructured data contained in each entity in a data set to obtain an entity vector containing a structured vector and an unstructured vector;
(2) constructing a fusion structured and unstructured data neighbor graph based on the structured vector and unstructured vector similarity combination;
(3) vectorizing structured and unstructured data contained in a query entity in the same way as (1) to obtain a mixed query vector containing a structured vector and an unstructured vector;
(4) performing hybrid search on the hybrid query vector on the fusion structured and unstructured data neighbor graph through a greedy algorithm to obtain the nearest neighbor of a query entity;
wherein step (1) comprises separately vectorizing the structured and unstructured data contained in each entity $e_i$ in the data set $S$ to obtain an entity vector $(\alpha_i, \beta_i)$ comprising an unstructured vector $\alpha_i$ and a structured vector $\beta_i$; wherein the data set $S$ is expressed as:
$$S = \{ e_i \mid i = 1, 2, \ldots, N \}$$
wherein $e_i$ is the $i$-th entity in the data set and $N$ is the number of entities in the data set;
the unstructured vector $\alpha_i$ is expressed as:
$$\alpha_i = (\alpha_i^1, \alpha_i^2, \ldots, \alpha_i^m)$$
where $m$ is the dimension of the unstructured vector and $\alpha_i^j$ is the value of the unstructured vector $\alpha_i$ in the $j$-th dimension;
the structured vector $\beta_i$ is expressed as:
$$\beta_i = (\beta_i^1, \beta_i^2, \ldots, \beta_i^n)$$
where $n$ is the dimension of the structured vector and $\beta_i^j$ is the value of the structured vector $\beta_i$ in the $j$-th dimension;
wherein constructing the fused structured and unstructured data neighbor graph based on the combined similarity of the structured and unstructured vectors in step (2) means that the similarity between the entity vectors $(\alpha_i, \beta_i)$ is evaluated by a hybrid distance, and each entity vector $(\alpha_i, \beta_i)$ is connected to the $K$ neighbors nearest to it under the hybrid distance $d$; the distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$ between entity vector $(\alpha_1, \beta_1)$ and entity vector $(\alpha_2, \beta_2)$ is computed as:
$$d((\alpha_1, \beta_1), (\alpha_2, \beta_2)) = d_1(\alpha_1, \alpha_2) + w_b \cdot d_2(\beta_1, \beta_2)$$
wherein $d_1(\alpha_1, \alpha_2)$ is the unstructured vector distance, $d_2(\beta_1, \beta_2)$ is the structured vector distance, and $w_b$ is the weight of the structured vector distance when constructing the neighbor graph, used to regulate the relative proportions of the unstructured vector distance $d_1(\alpha_1, \alpha_2)$ and the structured vector distance $d_2(\beta_1, \beta_2)$ in the hybrid distance $d((\alpha_1, \beta_1), (\alpha_2, \beta_2))$;
wherein, in step (4), while the greedy algorithm performs the hybrid search for the hybrid query vector $q$ on the fused structured and unstructured data neighbor graph to obtain the nearest neighbor of the query entity, the following hybrid distance is used; the hybrid distance $d$ between the hybrid query vector $q = (q_\alpha, q_\beta)$ and an entity vector $(\alpha_i, \beta_i)$ is:
$$d(q, (\alpha_i, \beta_i)) = d_1(q_\alpha, \alpha_i) + w_s \cdot d_2(q_\beta, \beta_i)$$
wherein $q_\alpha$ is the unstructured vector of the hybrid query vector $q$, $q_\beta$ is the structured vector of $q$, and $w_s$ adjusts the relative proportions of the unstructured vector distance $d_1(q_\alpha, \alpha_i)$ and the structured vector distance $d_2(q_\beta, \beta_i)$ in the hybrid distance, so that changing $w_s$ regulates the performance of the hybrid search.
CN202110285108.XA 2021-03-17 2021-03-17 Mixed search method fusing structured data and unstructured data Active CN112905644B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110285108.XA CN112905644B (en) 2021-03-17 2021-03-17 Mixed search method fusing structured data and unstructured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110285108.XA CN112905644B (en) 2021-03-17 2021-03-17 Mixed search method fusing structured data and unstructured data

Publications (2)

Publication Number Publication Date
CN112905644A (en) 2021-06-04
CN112905644B (en) 2022-08-02

Family

ID=76106595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110285108.XA Active CN112905644B (en) 2021-03-17 2021-03-17 Mixed search method fusing structured data and unstructured data

Country Status (1)

Country Link
CN (1) CN112905644B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103412925A (en) * 2013-08-13 2013-11-27 南京烽火星空通信发展有限公司 System and method for integrated searching of structured data and unstructured data
EP2836920A1 (en) * 2012-04-09 2015-02-18 Vivek Ventures, LLC Clustered information processing and searching with structured-unstructured database bridge
WO2017180475A1 (en) * 2016-04-15 2017-10-19 3M Innovative Properties Company Query optimizer for combined structured and unstructured data records

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080077570A1 (en) * 2004-10-25 2008-03-27 Infovell, Inc. Full Text Query and Search Systems and Method of Use
US8930389B2 (en) * 2009-10-06 2015-01-06 International Business Machines Corporation Mutual search and alert between structured and unstructured data stores
US20180032930A1 (en) * 2015-10-07 2018-02-01 0934781 B.C. Ltd System and method to Generate Queries for a Business Database
US11093842B2 (en) * 2018-02-13 2021-08-17 International Business Machines Corporation Combining chemical structure data with unstructured data for predictive analytics in a cognitive system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
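Title
"Research on the structured public security intelligence process based on complex networks"; 樊舒 et al.; 《情报杂志》 (Journal of Intelligence); 20201031; 86-91 *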

Also Published As

Publication number Publication date
CN112905644A (en) 2021-06-04

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant