CN113051302B - Overall design-oriented multi-dimensional data matching method and device and computer storage medium - Google Patents

Publication number
CN113051302B
Authority
CN
China
Prior art keywords
multidimensional data
item
data item
matched
multidimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110419464.6A
Other languages
Chinese (zh)
Other versions
CN113051302A (en)
Inventor
叶东
孙兆伟
张洪珠
李晖
高祥博
赵翰墨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Application filed by Harbin Institute of Technology
Priority to CN202110419464.6A
Publication of CN113051302A
Application granted
Publication of CN113051302B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G06F16/2255 Hash tables

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose an overall design-oriented multidimensional data matching method and device, and a computer storage medium. The method may comprise: establishing a corresponding hash value index for each multidimensional data item in a multidimensional data table according to a set hash function; when the matching strategy is exact matching, determining the hash value of the multidimensional data item to be matched according to the hash function, and searching the multidimensional data table for a set number of first target multidimensional data items according to that hash value; and when the matching strategy is similarity matching, obtaining, item by item within the multidimensional data table, the similarity between the multidimensional data item to be matched and each multidimensional data item based on a set weighted Euclidean distance strategy, and selecting from the multidimensional data table a set number of second target multidimensional data items with the highest similarity.

Description

Overall design-oriented multi-dimensional data matching method and device and computer storage medium
Technical Field
Embodiments of the invention relate to the field of information technology, and in particular to an overall design-oriented multidimensional data matching method and device and a computer storage medium.
Background
As data volumes grow explosively, the value latent in data keeps increasing, and mining valuable information and knowledge from big data is currently a popular research direction. Among the many big data mining and machine learning problems, efficiently performing exact matching and similarity matching over large-scale data is a fundamental one. In data cleansing, for example, redundant data must first be identified through exact matching and similarity computation between data items and then deleted, so as to reduce wasted storage space; and when a retrieval query is executed, the query input must be quickly matched against the massive set of data items in the database to obtain the data that best satisfy the query.
In a large-scale parameter library, the available data are not limited to simple single-dimension values; they are multidimensional data objects with multiple attribute dimensions and values. For example, an article of a given type has several attributes at once, such as mass and power. Current similarity matching algorithms for multidimensional data generally compute similarity from inter-object distances, for example with methods based on the Euclidean distance or the minimum bounding rectangle. Because the similarity is computed from distance alone, the matching result is not necessarily the one the user most expects.
Disclosure of Invention
In view of this, embodiments of the invention aim to provide an overall design-oriented multidimensional data matching method and device and a computer storage medium that can reduce the time complexity of the matching process.
The technical solutions of the embodiments of the invention are implemented as follows:
In a first aspect, an embodiment of the invention provides an overall design-oriented multidimensional data matching method, the method comprising:
establishing a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function;
when the matching strategy is exact matching, determining the hash value of the multidimensional data item to be matched according to the hash function, and searching the multidimensional data table for a set number of first target multidimensional data items according to that hash value, wherein the first target multidimensional data items exactly match the multidimensional data item to be matched;
when the matching strategy is similarity matching, obtaining, item by item within the multidimensional data table, the similarity between the multidimensional data item to be matched and each multidimensional data item based on a set weighted Euclidean distance strategy, and selecting from the multidimensional data table a set number of second target multidimensional data items with the highest similarity, wherein the second target multidimensional data items are similarity-matched to the multidimensional data item to be matched.
In a second aspect, an embodiment of the invention provides an overall design-oriented multidimensional data matching apparatus, comprising an establishing part, an exact matching part and a similarity matching part, wherein:
the establishing part is configured to establish a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function;
the exact matching part is configured to, when the matching strategy is exact matching, determine the hash value of the multidimensional data item to be matched according to the hash function, and search the multidimensional data table for a set number of first target multidimensional data items according to that hash value, wherein the first target multidimensional data items exactly match the multidimensional data item to be matched;
the similarity matching part is configured to, when the matching strategy is similarity matching, obtain, item by item within the multidimensional data table, the similarity between the multidimensional data item to be matched and each multidimensional data item based on a set weighted Euclidean distance strategy, and select from the multidimensional data table a set number of second target multidimensional data items with the highest similarity, wherein the second target multidimensional data items are similarity-matched to the multidimensional data item to be matched.
In a third aspect, an embodiment of the invention provides a computer storage medium storing an overall design-oriented multidimensional data matching program which, when executed by at least one processor, implements the steps of the overall design-oriented multidimensional data matching method of the first aspect.
Embodiments of the invention provide an overall design-oriented multidimensional data matching method and device and a computer storage medium. Hash values are used for exact matching of multidimensional data items, and a weighted Euclidean distance is used for similarity matching, so that the time complexity of matching can be reduced without changing the matching accuracy.
Drawings
FIG. 1 is a schematic flow chart of an overall design-oriented multidimensional data matching method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an implementation of exact matching according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a multidimensional data table according to an embodiment of the present invention;
FIG. 4 is a schematic diagram of a multidimensional data item to be matched according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an implementation of similarity matching according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of an overall design-oriented multidimensional data matching apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a hardware structure of a computing device according to an embodiment of the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.
First, some terms related to the embodiments of the present invention are explained to facilitate understanding by those skilled in the art.
In the embodiments of the present invention, the multidimensional data table may be a data table having m dimensions and containing n data items. In a specific implementation each dimension may correspond to one attribute, so "multidimensional" and "multi-attribute" are used interchangeably below. Each data item may be regarded as a multidimensional data item. In current large-scale data scenarios the number of multidimensional data items in the table may reach the order of ten thousand. In some examples, each multidimensional data item corresponds to a row of the multidimensional data table and each dimension corresponds to a column. On this basis, the overall design-oriented multidimensional data matching scheme set forth in the embodiments of the present invention aims to find, in the multidimensional data table described above, the multidimensional data items that match the multidimensional data item to be matched.
Exact matching means matching the multidimensional data item to be matched against each multidimensional data item in the multidimensional data table, one by one and dimension by dimension, so as to obtain the multidimensional data items whose values are identical to those of the item to be matched in every dimension.
Similarity matching also means matching the multidimensional data item to be matched against each multidimensional data item in the multidimensional data table one by one, but the values in the individual dimensions are allowed to differ rather than being required to be completely identical.
The weighted Euclidean distance between multidimensional data items x and y is denoted ad(x, y). Taking x and y as m-dimensional data items as an example, it can be expressed as:
$$ad(x, y) = \sqrt{\sum_{i=1}^{m} a_i \,(x_i - y_i)^2}$$
where $x_i$ and $y_i$ denote the values of the multidimensional data items x and y in the i-th dimension. Unlike the conventional Euclidean distance, the coefficient in front of each squared-difference term is $a_i$ ($1 \le i \le m$), the weight corresponding to the i-th dimension, rather than the constant 1.
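The definition above can be illustrated with a minimal Python sketch. The function name `weighted_euclidean` and the sample values are illustrative assumptions, not part of the patent text.

```python
import math
from typing import Sequence


def weighted_euclidean(x: Sequence[float], y: Sequence[float], a: Sequence[float]) -> float:
    """Weighted Euclidean distance ad(x, y) with per-dimension weights a."""
    if not (len(x) == len(y) == len(a)):
        raise ValueError("x, y and a must have the same number of dimensions")
    # Each squared difference is scaled by its dimension's weight before summing.
    return math.sqrt(sum(ai * (xi - yi) ** 2 for ai, xi, yi in zip(a, x, y)))


x = [3.0, 0.0]
y = [0.0, 4.0]
print(weighted_euclidean(x, y, [1.0, 1.0]))  # all weights 1: ordinary Euclidean distance
print(weighted_euclidean(x, y, [4.0, 0.0]))  # only the first dimension contributes
```

With all weights equal to 1 the function reduces to the conventional Euclidean distance; setting a weight to 0 removes that dimension from the comparison entirely.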
Given the above definitions, note that in the overall design of a satellite the design parameters typically have multiple attributes and can therefore be regarded as multidimensional data. In some examples, selecting suitable design parameters from an existing parameter library can be treated as a multidimensional data matching task in which identical or similar design parameters must be retrieved from the library. To meet this requirement, embodiments of the present invention aim to provide an overall design-oriented multidimensional data matching scheme that reduces the time complexity of the matching process without changing the matching accuracy.
Based on this, referring to FIG. 1, an overall design-oriented multidimensional data matching method is shown, which may include:
S101: establishing a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function;
S102: when the matching strategy is exact matching, determining the hash value of the multidimensional data item to be matched according to the hash function, and searching the multidimensional data table for a set number of first target multidimensional data items according to that hash value, wherein the first target multidimensional data items exactly match the multidimensional data item to be matched;
S103: when the matching strategy is similarity matching, obtaining, item by item within the multidimensional data table, the similarity between the multidimensional data item to be matched and each multidimensional data item based on a set weighted Euclidean distance strategy, and selecting from the multidimensional data table a set number of second target multidimensional data items with the highest similarity, wherein the second target multidimensional data items are similarity-matched to the multidimensional data item to be matched.
For the technical solution shown in FIG. 1, it should be noted that hash values are used for exact matching of multidimensional data items and the weighted Euclidean distance is used for similarity matching, so that the time complexity of matching can be reduced without changing the matching accuracy. The matching strategy refers to the way in which the matching search is performed; in some examples, a selection instruction received from the user indicates whether the strategy to be used is exact matching or similarity matching.
For the solution shown in FIG. 1, in some possible implementations, establishing a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function includes:
determining the hash function H(key) = key % p according to the division-remainder method, where key denotes the multidimensional data item to be hashed, p denotes the largest prime number not greater than n, and n denotes the number of multidimensional data items in the multidimensional data table;
and calculating, item by item, the hash value of each multidimensional data item in the multidimensional data table according to the hash function, and establishing an index for each multidimensional data item.
For the above implementation, in a specific implementation, for a multidimensional data table containing multiple multidimensional data items, the largest prime number p not greater than the number n of items in the table may first be selected. Then, using the division-remainder method, the hash function H(key) = key % p is set for the multidimensional data table, and the hash value H[i] of the multidimensional data item represented by each row of the table is calculated with it, where 0 ≤ i < n; an index is established over these hash values, with index range [1:n]. It will be appreciated that n equally represents the number of rows of the multidimensional data table.
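The index construction described above can be sketched in Python as follows. The source does not say how a multidimensional data item is reduced to a single integer key, so `item_key` (a base-31 fold over integer-valued dimensions) is purely an assumption of this sketch, as are all function names and the sample table.

```python
from typing import List, Sequence, Tuple

Item = Tuple[int, ...]


def largest_prime_not_greater_than(n: int) -> int:
    """Return the largest prime p with p <= n (requires n >= 2)."""
    if n < 2:
        raise ValueError("n must be at least 2")
    for candidate in range(n, 1, -1):
        # Trial division is enough for a sketch; candidate is prime if no divisor is found.
        if all(candidate % d for d in range(2, int(candidate ** 0.5) + 1)):
            return candidate
    raise AssertionError("unreachable: 2 is prime")


def item_key(item: Item) -> int:
    """Fold an item into one non-negative integer (hypothetical; not from the source)."""
    key = 0
    for value in item:
        key = key * 31 + value
    return abs(key)


def build_hash_index(table: Sequence[Item]) -> Tuple[int, List[int]]:
    """Return p and the list H, where H[i] = item_key(table[i]) % p."""
    p = largest_prime_not_greater_than(len(table))
    return p, [item_key(row) % p for row in table]


table = [(1, 2, 3), (4, 5, 6), (1, 2, 3), (7, 8, 9), (2, 4, 6)]
p, H = build_hash_index(table)
print(p, H)  # identical rows (0 and 2) receive identical hash values
```

Whatever key derivation is used, it must map identical items to identical keys; otherwise the later exact-match lookup would miss true matches.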
Based on the foregoing implementation, in some examples, referring to FIG. 2, determining the hash value of the multidimensional data item to be matched according to the hash function, and searching the multidimensional data table for a set number of first target multidimensional data items according to that hash value, includes:
S201: establishing a hash bucket structure of size p for storing hash collisions;
S202: calculating the hash value of the multidimensional data item to be matched based on the hash function;
S203: searching, item by item, the hash values of all multidimensional data items in the multidimensional data table for the hash value of the multidimensional data item to be matched, and storing a searched multidimensional data item in the hash bucket structure when its hash value is the same as that of the multidimensional data item to be matched;
S204: after the item-by-item search is finished, traversing the multidimensional data items stored in the hash bucket structure and determining them to be the first target multidimensional data items.
For the above example, in a specific implementation, when exact matching is performed a hash bucket structure of size p may be constructed on the basis of the hash value index of the multidimensional data table, and the linked lists of the hash bucket structure may be used to store the elements for which hash collisions occur. Once the hash bucket structure is built, exact matching can be carried out according to the matching strategy selected by the user or by an instruction. Specifically, the hash value hash of the multidimensional data item to be matched is calculated with the hash function, and the hash value index of the multidimensional data table is then searched item by item for hash: for the i-th multidimensional data item in the table, if hash equals H[i], the i-th item is inserted at the tail of the linked list of the hash bucket structure, and this continues until the item-by-item search is complete. At that point, the multidimensional data items stored in the hash bucket structure can be regarded as exact matches of the multidimensional data item to be matched. If the hash bucket holds exactly one element, the item to be matched has an exact match and the result is unique; if it holds more than one element, the item to be matched has exact matches and the result is not unique; and if it holds no element, the multidimensional data table contains no data item that exactly matches the item to be matched.
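Steps S201 to S204 can be sketched as below. This is a minimal illustration under stated assumptions: `item_key` is the same hypothetical integer fold that the source leaves unspecified, Python lists stand in for the linked lists, and the optional `verify` pass that compares full values is an addition of this sketch, since equal hash values alone do not imply equal items.

```python
from typing import List, Sequence, Tuple

Item = Tuple[int, ...]


def item_key(item: Item) -> int:
    """Fold an item into one non-negative integer (hypothetical; not from the source)."""
    key = 0
    for value in item:
        key = key * 31 + value
    return abs(key)


def exact_match(table: Sequence[Item], query: Item, p: int, verify: bool = True) -> List[int]:
    """Return the row indices collected in the query's hash bucket."""
    hash_index = [item_key(row) % p for row in table]  # hash value index from S101
    buckets: List[List[int]] = [[] for _ in range(p)]  # S201: p buckets
    h = item_key(query) % p                            # S202: hash of the item to be matched
    for i, h_i in enumerate(hash_index):               # S203: item-by-item scan of the index
        if h_i == h:
            buckets[h].append(i)                       # append at the tail of the bucket's list
    hits = buckets[h]                                  # S204: traverse the bucket
    if verify:
        # Extra check added by this sketch: drop rows that merely collide with the query.
        hits = [i for i in hits if table[i] == query]
    return hits


table = [(1, 2, 3), (4, 5, 6), (1, 2, 3), (7, 8, 9), (2, 4, 6)]
print(exact_match(table, (1, 2, 3), p=5))                # two matching rows: result not unique
print(exact_match(table, (9, 9, 9), p=5, verify=False))  # a colliding row with different values
print(exact_match(table, (9, 9, 9), p=5))                # empty once full values are compared
```

The second and third calls show why a confirmation pass can matter in practice: with only p distinct hash values, unrelated items can share a bucket.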
For example, take the multidimensional data table of size 2 × 8 shown in FIG. 3 and the multidimensional data item to be matched of size 1 × 8 shown in FIG. 4, with the 8 dimensions labeled A, B, C, D, E, F, G and H.
Applying the exact matching scheme set forth in the above example, it is determined that the first multidimensional data item in the multidimensional data table does not exactly match the multidimensional data item to be matched, whereas the second multidimensional data item in the multidimensional data table exactly matches it.
Comparing the exact matching scheme set forth in the foregoing example with a conventional exact matching scheme: for a data table having m dimensions and containing n data items, the conventional scheme matches item by item and dimension by dimension and stores the multidimensional data table sequentially during matching, which occupies a larger storage space, whereas the exact matching scheme described above can store data in a hash table built on the hash value index, which significantly saves storage space. Furthermore, the time complexity of exact matching in the conventional scheme is O(mn), whereas for the exact matching scheme set forth in the preceding example it can be reduced to
Figure BDA0003027323790000071
Moreover, hash collisions are handled with hash buckets: elements whose hash values collide are stored in the linked list of the same bucket, which does not affect the insertion of other elements during the hash search. The time complexity of exact matching is thus reduced, and its efficiency improved, while keeping the same accuracy as exact matching in the conventional scheme.
For the solution shown in FIG. 1, in some possible implementations, obtaining, item by item within the multidimensional data table, the similarity between the multidimensional data item to be matched and each multidimensional data item based on the set weighted Euclidean distance strategy includes:
performing the following steps item by item for each multidimensional data item in the multidimensional data table:
for the i-th multidimensional data item in the multidimensional data table, obtaining the weighted Euclidean distance between the multidimensional data item to be matched y and the i-th multidimensional data item $x_i$ according to the following formula:
$$ad(x_i, y) = \sqrt{\sum_{j=1}^{m} a_j \,(x_{i,j} - y_j)^2}$$
where $1 \le i \le n$, n denotes the number of multidimensional data items in the multidimensional data table, m denotes the number of dimensions of the multidimensional data table or of the multidimensional data item to be matched, $a_j$ denotes the weight corresponding to the j-th dimension, $x_{i,j}$ denotes the data value of the j-th dimension of the i-th multidimensional data item $x_i$, and $y_j$ denotes the data value of the j-th dimension of the multidimensional data item to be matched y;
and, from the weighted Euclidean distance between the multidimensional data item to be matched y and the i-th multidimensional data item $x_i$, obtaining the similarity value $\theta(x_i, y)$ between $x_i$ and y according to the following formula:
Figure BDA0003027323790000073
For the above implementation, and in light of the definitions given earlier, the weighted Euclidean distance between the multidimensional data item to be matched and each multidimensional data item is calculated item by item, starting from the first multidimensional data item in the multidimensional data table. In multidimensional data matching the dimensions are not equally important: a user may care more about some attributes of the data and less about others. For this reason, each dimension is assigned a weight that expresses how important that dimension is to the user. While the weighted Euclidean distances are being calculated item by item, the similarity value between the multidimensional data item to be matched and each item can then be calculated from the corresponding weighted Euclidean distance. The larger the similarity value, the smaller the difference between the two items and the more similar they are. If the similarity value θ between a multidimensional data item in the multidimensional data table and the multidimensional data item to be matched is 1, the two are completely identical, that is, they match exactly.
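The effect of the weights on the similarity value can be sketched as follows. The patent's own formula for θ is not reproduced in this text, so the mapping θ = 1 / (1 + ad) used here is only an assumption, chosen because it equals 1 at distance 0 and decreases as the distance grows; the function names and data are likewise illustrative.

```python
import math
from typing import List, Sequence


def similarity(x: Sequence[float], y: Sequence[float], a: Sequence[float]) -> float:
    """Similarity derived from the weighted Euclidean distance between x and y."""
    distance = math.sqrt(sum(aj * (xj - yj) ** 2 for aj, xj, yj in zip(a, x, y)))
    # ASSUMED mapping, not the patent's formula: 1 for identical items, smaller when farther apart.
    return 1.0 / (1.0 + distance)


def similarities(table: Sequence[Sequence[float]], y: Sequence[float],
                 a: Sequence[float]) -> List[float]:
    """Similarity of the item to be matched y to every row of the table, in row order."""
    return [similarity(row, y, a) for row in table]


table = [[1.0, 2.0], [4.0, 6.0]]
y = [1.0, 2.0]
print(similarities(table, y, [1.0, 1.0]))  # first row is identical to y
print(similarities(table, y, [1.0, 0.0]))  # second dimension ignored: second row looks closer
```

Lowering the weight of a dimension the user cares little about raises the similarity of rows that differ mainly in that dimension, which is the behavior the paragraph above describes.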
For the above implementation, while similarity values are being obtained item by item, the multidimensional data items in the multidimensional data table that are most similar to the multidimensional data item to be matched must also be retained as the search proceeds. In some examples, referring to FIG. 5, selecting a set number of second target multidimensional data items with the highest similarity from the multidimensional data table includes:
S501: constructing a min-heap structure of size k for storing the second target multidimensional data items, and initializing the heap-top element value of the min-heap structure to the index value of the first multidimensional data item in the multidimensional data table;
S502: starting from the second multidimensional data item in the multidimensional data table, comparing, item by item, the similarity value of each multidimensional data item with the similarity value of the multidimensional data item corresponding to the heap-top element value; if the similarity value of the compared multidimensional data item is greater, inserting the index of the compared multidimensional data item at the heap top of the min-heap structure and re-sorting the min-heap structure after the insertion;
S503: after the item-by-item comparison over the multidimensional data table is complete, determining the multidimensional data items corresponding to the element values in the min-heap structure to be the second target multidimensional data items.
For the above example, specifically, a min-heap of size k may first be constructed, with its heap-top element defaulting to the index value of the first multidimensional data item in the multidimensional data table. Then, starting from the second multidimensional data item, each similarity value obtained in the subsequent item-by-item process is compared with the similarity value of the multidimensional data item corresponding to the heap-top element of the min-heap: if the similarity value of the compared item is larger, its index value is inserted at the heap top and the min-heap structure is re-sorted after the insertion. When the item-by-item traversal of the multidimensional data table is complete, the elements recorded in the min-heap structure identify the k multidimensional data items in the table that are most similar to the multidimensional data item to be matched, namely the second target multidimensional data items.
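The selection step can be sketched with Python's standard `heapq` module. This is the usual bounded min-heap formulation of top-k selection and may differ in detail from the insertion and re-sorting procedure of S501 to S503; the function name and sample similarity values are illustrative assumptions.

```python
import heapq
from typing import List, Sequence, Tuple


def top_k_similar(similarity_values: Sequence[float], k: int) -> List[int]:
    """Return the row indices of the k largest similarity values, most similar first."""
    if k <= 0:
        return []
    heap: List[Tuple[float, int]] = []  # min-heap of (similarity, row index)
    for index, value in enumerate(similarity_values):
        if len(heap) < k:
            heapq.heappush(heap, (value, index))     # fill the heap up to k entries
        elif value > heap[0][0]:
            # More similar than the weakest kept row: replace the heap top.
            heapq.heapreplace(heap, (value, index))
    return [index for _, index in sorted(heap, reverse=True)]


values = [0.2, 0.9, 0.5, 1.0, 0.1]
print(top_k_similar(values, k=2))
```

Because the heap never holds more than k entries, each row costs at most one O(log k) heap operation, so selecting the k best rows from n similarity values takes O(n log k) time in this formulation.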
For example, again taking the multidimensional data table shown in FIG. 3 and the multidimensional data item to be matched shown in FIG. 4, the similarity matching scheme set forth in the foregoing implementation and its examples shows that the similarity value between the first multidimensional data item in the table and the multidimensional data item to be matched is smaller than 1, which indicates that the two differ to some extent, while the similarity value between the second multidimensional data item in the table and the multidimensional data item to be matched is 1, which indicates that the data contents of the two are identical.
Compared with a conventional similarity matching scheme, which performs similarity matching item by item and dimension by dimension using an unweighted Euclidean distance, the similarity matching scheme described in the above implementation and its examples obtains the similarity from a weighted Euclidean distance, so the second target multidimensional data items finally obtained are closer to the user's expectation and the matching accuracy is improved. Furthermore, for a data table having m dimensions and containing n data items, the time complexity of the conventional scheme is O(mn), whereas the similarity matching scheme described above reduces the operation complexity to O(n), thereby improving the efficiency of similarity matching.
Based on the same inventive concept as the foregoing technical solution, referring to FIG. 6, an overall design-oriented multidimensional data matching apparatus 60 according to an embodiment of the present invention is shown. The apparatus 60 includes an establishing part 601, an exact matching part 602 and a similarity matching part 603, wherein:
the establishing part 601 is configured to establish a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function;
the exact matching part 602 is configured to, when the matching strategy is exact matching, determine the hash value of the multidimensional data item to be matched according to the hash function, and search the multidimensional data table for a set number of first target multidimensional data items according to that hash value, wherein the first target multidimensional data items exactly match the multidimensional data item to be matched;
the similarity matching part 603 is configured to, when the matching strategy is similarity matching, obtain, item by item within the multidimensional data table, the similarity between the multidimensional data item to be matched and each multidimensional data item based on a set weighted Euclidean distance strategy, and select from the multidimensional data table a set number of second target multidimensional data items with the highest similarity, wherein the second target multidimensional data items are similarity-matched to the multidimensional data item to be matched.
In the above scheme, the establishing part 601 is configured to:
determining the hash function H(key) = key % p according to the division-remainder method, where key represents a multidimensional data item to be hashed, p represents the largest prime number not greater than n, and n represents the number of multidimensional data items in the multidimensional data table;
and calculating the hash value corresponding to each multidimensional data item in the multidimensional data table item by item according to the hash function, and establishing an index for each multidimensional data item.
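The index-building step can be sketched as follows. This is a minimal illustration and not the patented implementation: the text does not say how a multidimensional data item is turned into the integer `key`, so the helper `item_key` (Python's built-in `hash` of the item as a tuple) is an assumption, as are all function names.

```python
from collections import defaultdict


def largest_prime_not_greater_than(n):
    """Return the largest prime p with p <= n (requires n >= 2)."""
    def is_prime(v):
        if v < 2:
            return False
        i = 2
        while i * i <= v:
            if v % i == 0:
                return False
            i += 1
        return True

    for candidate in range(n, 1, -1):
        if is_prime(candidate):
            return candidate
    raise ValueError("n must be at least 2")


def item_key(item):
    # Assumption: the integer key of an item is the built-in hash of its tuple.
    return hash(tuple(item))


def build_hash_index(table):
    """Map each hash value H(key) = key % p to the row numbers that produce it."""
    p = largest_prime_not_greater_than(len(table))
    index = defaultdict(list)
    for row_number, item in enumerate(table):
        # Python's % returns a value in [0, p) even when the key is negative.
        index[item_key(item) % p].append(row_number)
    return p, index
```

Choosing p as a prime close to n is the usual motivation for the division-remainder method, since it tends to spread keys more evenly across the p possible hash values.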
In the above scheme, the exact matching part 602 is configured to:
establishing a hash bucket structure which is used for storing hash conflicts and has the size of p;
calculating the hash value of the multidimensional data item to be matched based on the hash function;
searching hash values corresponding to all multidimensional data items in the multidimensional data table item by item according to the hash values of the multidimensional data items to be matched, and storing the searched multidimensional data items in the hash bucket structure when the hash values of the multidimensional data items to be matched are the same as the hash values corresponding to the searched multidimensional data items;
and after the item-by-item search is finished, traversing the multidimensional data items stored in the hash bucket structure, and determining the multidimensional data items stored in the hash bucket structure as the first target multidimensional data items.
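The exact-matching steps above can be sketched as follows. This is a simplified illustration under stated assumptions: a single list stands in for the size-p hash bucket structure, the key of an item is assumed to be Python's built-in `hash` of its tuple, and the final full-item comparison is an added safeguard against hash collisions that the text does not spell out. The function name and the `limit` parameter (the "set number") are hypothetical.

```python
def exact_match(table, query, p, limit):
    """Return up to `limit` row numbers whose items exactly match `query`."""
    def hash_value(item):
        # Assumed key derivation: built-in hash of the tuple, reduced modulo p.
        return hash(tuple(item)) % p

    query_hash = hash_value(query)

    # Scan the table item by item and collect the rows whose hash value
    # equals the hash value of the item to be matched.
    bucket = []
    for row_number, item in enumerate(table):
        if hash_value(item) == query_hash:
            bucket.append(row_number)

    # Traverse the bucket; comparing the full items (an assumption added here)
    # discards rows that merely collide with the query's hash value.
    matches = [r for r in bucket if tuple(table[r]) == tuple(query)]
    return matches[:limit]
```

For example, with `table = [(1, 2, 3), (4, 5, 6), (1, 2, 3)]`, calling `exact_match(table, (1, 2, 3), 3, 5)` returns the row numbers `[0, 2]`.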
In the above scheme, the similarity matching part 603 is configured to:
for each multidimensional data item within the multidimensional data table, performing the following steps item by item:
for the ith multidimensional data item x_i in the multidimensional data table, obtaining the weighted Euclidean distance between the multidimensional data item y to be matched and the ith multidimensional data item x_i according to the following formula:
Figure BDA0003027323790000101
where 1 ≤ i ≤ n, n represents the number of multidimensional data items in the multidimensional data table, m represents the number of dimensions of the multidimensional data table or of the multidimensional data item to be matched, a_j represents the weight corresponding to the jth dimension, x_{i,j} represents the data value of the jth dimension of the ith multidimensional data item x_i, and y_j represents the data value of the jth dimension of the multidimensional data item y to be matched;
and, according to the weighted Euclidean distance between the multidimensional data item y to be matched and the ith multidimensional data item x_i, obtaining the similarity value θ(x_i, y) between the ith multidimensional data item x_i and the multidimensional data item y to be matched according to the following formula:
Figure BDA0003027323790000111
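The two formula images referenced above are not reproduced in this text, so the exact expressions are not known here. As a minimal sketch, the code below assumes the commonly used weighted Euclidean form d(x_i, y) = sqrt(sum over j of a_j * (x_{i,j} - y_j)^2) and, purely as an illustrative assumption, the mapping θ = 1 / (1 + d) from distance to similarity. Both choices, and the function names, are assumptions and may differ from the patent's formulas.

```python
import math


def weighted_euclidean_distance(x, y, weights):
    """Assumed form: sqrt(sum_j a_j * (x_j - y_j) ** 2)."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same number of dimensions")
    return math.sqrt(sum(a * (xj - yj) ** 2 for a, xj, yj in zip(weights, x, y)))


def similarity(x, y, weights):
    """Assumed mapping: 1 / (1 + d), so identical items score 1.0."""
    return 1.0 / (1.0 + weighted_euclidean_distance(x, y, weights))
```

Under these assumptions, a larger weight a_j makes differences in the jth dimension count more, and a weight of zero removes that dimension from the comparison entirely.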
In the above scheme, the similarity matching part 603 is configured to:
constructing a min-heap structure of size k for storing the second target multidimensional data items, and initializing the heap top element value of the min-heap structure to the index value of the first multidimensional data item in the multidimensional data table;
in the multidimensional data table, starting from the second multidimensional data item, comparing the similarity value of each multidimensional data item, item by item, with the similarity value of the multidimensional data item corresponding to the heap top element value; if the similarity value of the compared multidimensional data item is greater than that of the multidimensional data item corresponding to the heap top element value, inserting the index of the compared multidimensional data item at the heap top of the min-heap structure and re-sorting the min-heap structure after the insertion;
and, after the item-by-item comparison in the multidimensional data table is completed, determining the multidimensional data items corresponding to the element values in the min-heap structure as the second target multidimensional data items.
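The top-k selection with a min-heap of size k can be sketched with Python's standard `heapq` module. This is an illustration rather than the patented procedure: it fills the heap with the first k items before comparing against the heap top, which is a slight variation on the initialization described above, and the function name is hypothetical.

```python
import heapq


def top_k_indices(similarities, k):
    """Return the indices of the k largest similarity values, best first."""
    if k <= 0:
        return []
    heap = []  # min-heap of (similarity, index); heap[0] is the weakest kept item
    for index, value in enumerate(similarities):
        if len(heap) < k:
            heapq.heappush(heap, (value, index))
        elif value > heap[0][0]:
            # The new item beats the weakest kept item: replace the heap top
            # and let heapq restore the heap order.
            heapq.heapreplace(heap, (value, index))
    # Order the survivors by decreasing similarity, then by index for ties.
    return [index for value, index in sorted(heap, key=lambda e: (-e[0], e[1]))]
```

Each heap operation costs O(log k), so a single scan over n similarity values takes O(n log k) time, which is linear in n for a fixed k, and only k entries are held in memory at any time.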
It is understood that in this embodiment, a "part" may be part of a circuit, part of a processor, part of a program or software, and so on; it may also be a unit, and may be modular or non-modular.
In addition, each component in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit. The integrated unit can be realized in a form of hardware or a form of a software functional module.
Based on this understanding, the technical solution of the present embodiment, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) or a processor to execute all or part of the steps of the method of the present embodiment. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
Therefore, the present embodiment provides a computer storage medium, where the computer storage medium stores a multi-dimensional data matching program for overall design, and the multi-dimensional data matching program for overall design, when executed by at least one processor, implements the steps of the multi-dimensional data matching method for overall design in the foregoing technical solution.
Referring to fig. 7, a specific hardware structure of a computing device 70 capable of implementing the above overall design-oriented multidimensional data matching apparatus 60 according to an embodiment of the present invention is shown. The computing device 70 can be a wireless device, a mobile or cellular phone (including a so-called smart phone), a Personal Digital Assistant (PDA), a video game console (including a video display, a mobile video game apparatus, a mobile video conference unit), a laptop computer, a desktop computer, a television set-top box, a tablet computing apparatus, an e-book reader, a fixed or mobile media player, etc. The computing device 70 includes: a communication interface 701, a memory 702, and a processor 703; the components are coupled together by a bus system 704. It is understood that the bus system 704 is used to enable communication among the components. In addition to a data bus, the bus system 704 includes a power bus, a control bus, and a status signal bus. For clarity of illustration, however, the various buses are all labeled as the bus system 704 in fig. 7. In the computing device 70:
the communication interface 701 is configured to receive and transmit signals in a process of receiving and transmitting information with other external network elements;
the memory 702 is used for storing a computer program capable of running on the processor 703;
the processor 703 is configured to, when running the computer program, perform the following steps:
establishing a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function;
when the matching strategy is exact matching, determining a hash value corresponding to the multidimensional data item to be matched according to the hash function, and searching a set number of first target multidimensional data items from the multidimensional data table according to that hash value; the first target multidimensional data items exactly match the multidimensional data item to be matched;
when the matching strategy is similarity matching, obtaining, item by item, the similarity between the multidimensional data item to be matched and each multidimensional data item in the multidimensional data table based on a set weighted Euclidean distance strategy, and selecting a set number of second target multidimensional data items with the highest similarity from the multidimensional data table; the second target multidimensional data items are similarity-matched with the multidimensional data item to be matched.
It is to be understood that the memory 702 in embodiments of the present invention may be either volatile memory or non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The volatile memory may be a Random Access Memory (RAM), which acts as an external cache. By way of illustration and not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), and Direct Rambus RAM (DRRAM). The memory 702 of the systems and methods described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
The processor 703 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be completed by hardware integrated logic circuits in the processor 703 or by instructions in the form of software. The processor 703 may be a general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components, and may implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be directly performed by a hardware decoding processor, or performed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 702, and the processor 703 reads the information in the memory 702 and performs the steps of the above method in combination with its hardware.
It is to be understood that the embodiments described herein may be implemented in hardware, software, firmware, middleware, microcode, or any combination thereof. For a hardware implementation, the Processing units may be implemented within one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), general purpose processors, controllers, micro-controllers, microprocessors, other electronic units configured to perform the functions described herein, or a combination thereof.
For a software implementation, the techniques described herein may be implemented with modules (e.g., procedures, functions, and so on) that perform the functions described herein. The software codes may be stored in a memory and executed by a processor. The memory may be implemented within the processor or external to the processor.
Specifically, the processor 703 is further configured to execute, when running the computer program, the steps of the overall design-oriented multidimensional data matching method in the foregoing technical solution, which are not described herein again.
It should be understood that the exemplary technical solutions of the overall design-oriented multidimensional data matching apparatus 60 and the computing device 70 described above belong to the same concept as the technical solution of the overall design-oriented multidimensional data matching method. Therefore, for details not described in the technical solutions of the apparatus 60 and the computing device 70, reference may be made to the description of the technical solution of the method; they are not repeated here.
It should be noted that: the technical schemes described in the embodiments of the present invention can be combined arbitrarily without conflict.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (10)

1. An overall design-oriented multidimensional data matching method, the method comprising:
establishing a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function;
when the matching strategy is exact matching, determining a hash value corresponding to the multidimensional data item to be matched according to the hash function, and searching a set number of first target multidimensional data items from the multidimensional data table according to the hash value corresponding to the multidimensional data item to be matched; the first target multidimensional data items exactly match the multidimensional data item to be matched;
when the matching strategy is similarity matching, obtaining, item by item, the similarity between the multidimensional data item to be matched and each multidimensional data item in the multidimensional data table based on a set weighted Euclidean distance strategy, and selecting a set number of second target multidimensional data items with the highest similarity from the multidimensional data table; the second target multidimensional data items are similarity-matched with the multidimensional data item to be matched.
2. The method according to claim 1, wherein the establishing a corresponding hash value index for each multidimensional data item in the multidimensional data table according to the set hash function comprises:
determining the hash function H(key) = key % p according to the division-remainder method, where key represents a multidimensional data item to be hashed, p represents the largest prime number not greater than n, and n represents the number of multidimensional data items in the multidimensional data table;
and calculating the hash value corresponding to each multidimensional data item in the multidimensional data table item by item according to the hash function, and establishing an index for each multidimensional data item.
3. The method according to claim 2, wherein the determining a hash value corresponding to the multidimensional data item to be matched according to the hash function, and searching a set number of first target multidimensional data items from the multidimensional data table according to the hash value corresponding to the multidimensional data item to be matched comprises:
establishing a hash bucket structure which is used for storing hash conflicts and has the size of p;
calculating the hash value of the multidimensional data item to be matched based on the hash function;
searching hash values corresponding to all multidimensional data items in the multidimensional data table item by item according to the hash values of the multidimensional data items to be matched, and storing the searched multidimensional data items in the hash bucket structure when the hash values of the multidimensional data items to be matched are the same as the hash values corresponding to the searched multidimensional data items;
and after the item-by-item search is finished, traversing the multidimensional data items stored in the hash bucket structure, and determining the multidimensional data items stored in the hash bucket structure as the first target multidimensional data items.
4. The method according to claim 1, wherein the obtaining the similarity between the multidimensional data item to be matched and each multidimensional data item by item in the multidimensional data table based on the set weighted euclidean distance strategy comprises:
for each multidimensional data item within the multidimensional data table, performing the following steps item by item:
for the ith multidimensional data item x_i in the multidimensional data table, obtaining the weighted Euclidean distance between the multidimensional data item y to be matched and the ith multidimensional data item x_i according to the following formula:
Figure FDA0003027323780000021
where 1 ≤ i ≤ n, n represents the number of multidimensional data items in the multidimensional data table, m represents the number of dimensions of the multidimensional data table or of the multidimensional data item to be matched, a_j represents the weight corresponding to the jth dimension, x_{i,j} represents the data value of the jth dimension of the ith multidimensional data item x_i, and y_j represents the data value of the jth dimension of the multidimensional data item y to be matched;
and, according to the weighted Euclidean distance between the multidimensional data item y to be matched and the ith multidimensional data item x_i, obtaining the similarity value θ(x_i, y) between the ith multidimensional data item x_i and the multidimensional data item y to be matched according to the following formula:
Figure FDA0003027323780000022
5. The method of claim 4, wherein selecting a set number of second target multidimensional data items from the multidimensional data table with the highest similarity comprises:
constructing a min-heap structure of size k for storing the second target multidimensional data items, and initializing the heap top element value of the min-heap structure to the index value of the first multidimensional data item in the multidimensional data table;
in the multidimensional data table, starting from the second multidimensional data item, comparing the similarity value of each multidimensional data item, item by item, with the similarity value of the multidimensional data item corresponding to the heap top element value; if the similarity value of the compared multidimensional data item is greater than that of the multidimensional data item corresponding to the heap top element value, inserting the index of the compared multidimensional data item at the heap top of the min-heap structure and re-sorting the min-heap structure after the insertion;
and, after the item-by-item comparison in the multidimensional data table is completed, determining the multidimensional data items corresponding to the element values in the min-heap structure as the second target multidimensional data items.
6. An overall design-oriented multidimensional data matching apparatus, the apparatus comprising: an establishing part, an exact matching part, and a similarity matching part; wherein:
the establishing part is configured to establish a corresponding hash value index for each multidimensional data item in the multidimensional data table according to a set hash function;
the exact matching part is configured to, when the matching strategy is exact matching, determine a hash value corresponding to a multidimensional data item to be matched according to the hash function, and search a set number of first target multidimensional data items from the multidimensional data table according to the hash value corresponding to the multidimensional data item to be matched; the first target multidimensional data items exactly match the multidimensional data item to be matched;
the similarity matching part is configured to, when the matching strategy is similarity matching, obtain, item by item, the similarity between the multidimensional data item to be matched and each multidimensional data item in the multidimensional data table based on a set weighted Euclidean distance strategy, and select a set number of second target multidimensional data items with the highest similarity from the multidimensional data table; the second target multidimensional data items are similarity-matched with the multidimensional data item to be matched.
7. The apparatus of claim 6, wherein the exact matching part is configured to:
establishing a hash bucket structure which is used for storing hash conflicts and has the size of p;
calculating the hash value of the multidimensional data item to be matched based on the hash function;
searching hash values corresponding to all multidimensional data items in the multidimensional data table item by item according to the hash values of the multidimensional data items to be matched, and storing the searched multidimensional data items in the hash bucket structure when the hash values of the multidimensional data items to be matched are the same as the hash values corresponding to the searched multidimensional data items;
and after the item-by-item search is finished, traversing the multidimensional data items stored in the hash bucket structure, and determining the multidimensional data items stored in the hash bucket structure as the first target multidimensional data items.
8. The apparatus of claim 6, wherein the similarity matching part is configured to:
for each multidimensional data item within the multidimensional data table, performing the following steps item by item:
for the ith multidimensional data item x_i in the multidimensional data table, obtaining the weighted Euclidean distance between the multidimensional data item y to be matched and the ith multidimensional data item x_i according to the following formula:
Figure FDA0003027323780000041
where 1 ≤ i ≤ n, n represents the number of multidimensional data items in the multidimensional data table, m represents the number of dimensions of the multidimensional data table or of the multidimensional data item to be matched, a_j represents the weight corresponding to the jth dimension, x_{i,j} represents the data value of the jth dimension of the ith multidimensional data item x_i, and y_j represents the data value of the jth dimension of the multidimensional data item y to be matched;
and, according to the weighted Euclidean distance between the multidimensional data item y to be matched and the ith multidimensional data item x_i, obtaining the similarity value θ(x_i, y) between the ith multidimensional data item x_i and the multidimensional data item y to be matched according to the following formula:
Figure FDA0003027323780000042
9. The apparatus of claim 8, wherein the similarity matching part is configured to:
constructing a min-heap structure of size k for storing the second target multidimensional data items, and initializing the heap top element value of the min-heap structure to the index value of the first multidimensional data item in the multidimensional data table;
in the multidimensional data table, starting from the second multidimensional data item, comparing the similarity value of each multidimensional data item, item by item, with the similarity value of the multidimensional data item corresponding to the heap top element value; if the similarity value of the compared multidimensional data item is greater than that of the multidimensional data item corresponding to the heap top element value, inserting the index of the compared multidimensional data item at the heap top of the min-heap structure and re-sorting the min-heap structure after the insertion;
and, after the item-by-item comparison in the multidimensional data table is completed, determining the multidimensional data items corresponding to the element values in the min-heap structure as the second target multidimensional data items.
10. A computer storage medium, characterized in that the computer storage medium stores a multi-dimensional data matching program for overall design, and the multi-dimensional data matching program for overall design is executed by at least one processor to realize the steps of the multi-dimensional data matching method for overall design according to any one of claims 1 to 5.
CN202110419464.6A 2021-04-19 2021-04-19 Overall design-oriented multi-dimensional data matching method and device and computer storage medium Active CN113051302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110419464.6A CN113051302B (en) 2021-04-19 2021-04-19 Overall design-oriented multi-dimensional data matching method and device and computer storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110419464.6A CN113051302B (en) 2021-04-19 2021-04-19 Overall design-oriented multi-dimensional data matching method and device and computer storage medium

Publications (2)

Publication Number Publication Date
CN113051302A CN113051302A (en) 2021-06-29
CN113051302B true CN113051302B (en) 2022-04-29

Family

ID=76519685

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110419464.6A Active CN113051302B (en) 2021-04-19 2021-04-19 Overall design-oriented multi-dimensional data matching method and device and computer storage medium

Country Status (1)

Country Link
CN (1) CN113051302B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111352931A (en) * 2018-12-21 2020-06-30 中兴通讯股份有限公司 Hash collision processing method and device and computer readable storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9367454B2 (en) * 2013-08-15 2016-06-14 Applied Micro Circuits Corporation Address index recovery using hash-based exclusive or

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
基于局部敏感哈希的安全相似性查询方案;吴瑾等;《密码学报》;20180415(第02期);全文 *
应用多索引加法量化编码的近邻检索算法;刘恒;《中国图象图形学报》;20180516(第06期);全文 *
面向Top-k快速查询的层次化LSH索引方法;罗雄才等;《计算机研究与发展》;20151015;全文 *
面向高维图像特征匹配的多次随机子向量量化哈希算法;杨恒等;《计算机辅助设计与图形学学报》;20100315(第03期);全文 *
高维分布式局部敏感哈希索引方法;林朝晖等;《计算机科学与探索》;20130528(第09期);全文 *

Also Published As

Publication number Publication date
CN113051302A (en) 2021-06-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant