CN115935200B

CN115935200B - Mass data similarity calculation method based on hash and Hamming distance

Info

Publication number: CN115935200B
Application number: CN202310038988.XA
Authority: CN
Inventors: 金震; 张京日; 穆宇浩
Original assignee: Beijing SunwayWorld Science and Technology Co Ltd
Current assignee: Beijing SunwayWorld Science and Technology Co Ltd
Priority date: 2023-01-12
Filing date: 2023-01-12
Publication date: 2023-09-08
Anticipated expiration: 2043-01-12
Also published as: CN115935200A

Abstract

The invention provides a mass data similarity calculation method based on hash and Hamming distance, comprising the following steps: S1: determining the locality-sensitive hash value of each character in each of two groups of inspection and detection data tables; S2: bucketing the locality-sensitive hash values of all characters in each group of inspection and detection data tables to obtain a plurality of bucket sets for each group; S3: mapping the locality-sensitive hash values in each bucket set to a high-dimensional space based on a kernel method, to obtain a plurality of first high-dimensional bucket sets for the first group of inspection and detection data tables and a plurality of second high-dimensional bucket sets for the second group; S4: determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances between all the first high-dimensional bucket sets and all the second high-dimensional bucket sets. The method reduces space consumption and the number of data comparisons, and improves calculation efficiency while maintaining the accuracy of mass data similarity calculation.

Description

Mass data similarity calculation method based on hash and Hamming distance

Technical Field

The invention relates to the field of enhanced data management, in particular to a mass data similarity calculation method based on hash and Hamming distance.

Background

At present, when judging whether inspection and detection data meets requirements, a similarity calculation is performed between the inspection and detection data and standard inspection and detection data to judge whether the inspected object meets the inspection requirements. For small batches of data, the common practice of calculating the similarity pairwise is adequate.

However, for mass data, pairwise similarity calculation consumes a large amount of computing resources, and the huge amount of computation makes it relatively inefficient. Existing similarity calculation methods for mass data must trade accuracy against computational efficiency according to actual needs, and therefore cannot meet the requirements of judgment accuracy and judgment efficiency at the same time when judging whether the inspection and detection data meets the requirements.

Therefore, the invention provides a mass data similarity calculation method based on hash and Hamming distance.

Disclosure of Invention

The invention provides a mass data similarity calculation method based on hash and Hamming distance. When the similarity of mass inspection and detection data is calculated, the data is hashed and then bucketed, and is mapped into a high-dimensional space, which reduces space consumption and the number of data comparisons. The similarity is calculated using the Hamming distance, which is more computationally efficient, so that the requirements of judgment accuracy and judgment efficiency are met at the same time when judging whether the inspection and detection data meets the requirements.

The invention provides a mass data similarity calculation method based on hash and Hamming distance, comprising the following steps:

S1: determining the locality-sensitive hash value of each character in each of the two groups of inspection and detection data tables;

S2: bucketing the locality-sensitive hash values of all characters in each group of inspection and detection data tables to obtain a plurality of bucket sets for each group of inspection and detection data tables;

S3: mapping the locality-sensitive hash values in each bucket set to a high-dimensional space based on a kernel method, to obtain a plurality of first high-dimensional bucket sets for the first group of inspection and detection data tables and a plurality of second high-dimensional bucket sets for the second group of inspection and detection data tables;

S4: determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances between all the first high-dimensional bucket sets and all the second high-dimensional bucket sets.

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, S1: determining the locality-sensitive hash value of each character in each of the two groups of inspection and detection data tables, comprises:

S101: determining the data attributes contained in each data record in the inspection and detection data table;

S102: sorting the inspection and detection data table by attribute based on the preset attribute weight of each data attribute, to obtain a first inspection and detection data table;

S103: sorting the first inspection and detection data table by object based on the preset object weight of each object in the first inspection and detection data table, to obtain a second inspection and detection data table;

S104: constructing a personalized hash function based on the data structure similarity evaluation function and the data content similarity evaluation function of the two groups of inspection and detection data tables;

S105: determining the locality-sensitive hash value of each character in the two groups of inspection and detection data tables based on the feature data of each second inspection and detection data table and the personalized hash function.

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, S104: constructing a personalized hash function based on the data structure similarity evaluation function and the data content similarity evaluation function of the two groups of inspection and detection data tables, comprises:

performing feature extraction on the second inspection and detection data table based on a linear feature extraction algorithm to obtain the feature data of the second inspection and detection data table, and calculating the feature complexity based on the feature data in the second inspection and detection data table;

determining a unified dividing dimension based on the feature complexity of each second inspection and detection data table, and constructing the data structure similarity evaluation function and the data content similarity evaluation function based on the unified dividing dimension;

constructing the personalized hash function of the two groups of inspection and detection data tables based on the data structure similarity evaluation function and the data content similarity evaluation function.

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, constructing the data structure similarity evaluation function and the data content similarity evaluation function based on the unified dividing dimension comprises:

arbitrarily combining all the sub-data contained in the two groups of inspection and detection data tables based on the unified dividing dimension, to obtain all the sub-data sets of each inspection and detection data table under the unified dividing dimension;

determining the bit ordinal of each sub-data in each sub-data set, and generating the bit ordinal sequence of each sub-data over all sub-data sets based on the bit ordinals;

calculating the integrated cosine similarity of the two groups of inspection and detection data based on all bit ordinal sequences of all sub-data contained in the two groups of inspection and detection data tables;

constructing the data structure similarity evaluation function and the data content similarity evaluation function based on the integrated cosine similarity.

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, S105: determining the locality-sensitive hash value of each character in the two groups of inspection and detection data tables based on the feature data of each second inspection and detection data table and the personalized hash function, comprises:

dividing and summarizing the second inspection and detection data tables based on the unified dividing dimension to obtain the feature data blocks of each second inspection and detection data table, and building a feature data block matrix of the second inspection and detection data table based on the feature data blocks;

calculating a feature data integrated value for each feature data character based on the feature data block matrix of the second inspection and detection data table;

determining the locality-sensitive hash value of each character in the corresponding inspection and detection data table based on the feature data integrated value of each feature data character in the second inspection and detection data table, the co-occurrence matrix of the feature data of the corresponding second inspection and detection data table, and the personalized hash function.

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, determining the locality-sensitive hash value of each character in the corresponding inspection and detection data table based on the feature data integrated value of each feature data character in the second inspection and detection data table, the co-occurrence matrix of the feature data of the corresponding second inspection and detection data table, and the personalized hash function comprises:

constructing a co-occurrence matrix of the feature data of each second inspection and detection data table, determining the co-occurrence frequency vector between every two feature data characters based on the co-occurrence matrix, and collecting all co-occurrence frequency vectors of each feature data character to obtain the co-occurrence frequency vector sequence of each feature data character;

substituting the feature data integrated value and the co-occurrence frequency vector sequence of each feature data character into the personalized hash function to determine the locality-sensitive hash value of each feature data character in each second inspection and detection data table;

determining the locality-sensitive hash value of each character in the corresponding inspection and detection data table based on the storage index of each feature data character in the second inspection and detection data table.

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, S2: bucketing the locality-sensitive hash values of all characters in each group of inspection and detection data tables to obtain a plurality of bucket sets for each group of inspection and detection data tables, comprises:

S201: determining all local aggregation centers of the locality-sensitive hash values of all characters of the inspection and detection data table, and the corresponding local aggregation degrees;

S202: determining all bucket ranges based on the character scale of the inspection and detection data table, the maximum locality-sensitive hash value, all local aggregation centers and the corresponding local aggregation degrees;

S203: bucketing the locality-sensitive hash values of all characters in the corresponding group of inspection and detection data tables based on the bucket ranges, and sorting the data in each resulting bucket, to obtain a plurality of bucket sets for the corresponding group of inspection and detection data tables.

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, determining all bucket ranges based on the character scale of the inspection and detection data table, the maximum locality-sensitive hash value, all local aggregation centers and the corresponding local aggregation degrees comprises:

determining the minimum number of buckets based on the preset bucket capacity and the character scale of the inspection and detection data table;

calculating the average aggregation degree of all local aggregation degrees of the inspection and detection data table;

determining the final number of buckets based on the average aggregation degree, the local aggregation degrees and the minimum number of buckets;

determining all bucket ranges based on the final number of buckets and the maximum locality-sensitive hash value of the inspection and detection data table.
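The text does not say how the bucket ranges are laid out once the final number of buckets and the maximum hash value are known. The following is a minimal sketch that assumes equal-width ranges over [0, max_hash] and non-negative numeric hash values; the function names `bucket_ranges` and `split_into_buckets` are hypothetical and are not taken from the patent.

```python
def bucket_ranges(final_bucket_count, max_hash):
    """Assumed layout: equal-width ranges covering [0, max_hash]."""
    width = max_hash / final_bucket_count
    return [(k * width, (k + 1) * width) for k in range(final_bucket_count)]


def split_into_buckets(hash_values, final_bucket_count, max_hash):
    """Place each hash value in its range, then sort the data inside each bucket."""
    width = max_hash / final_bucket_count
    buckets = [[] for _ in range(final_bucket_count)]
    for h in hash_values:
        # Clamp the index so that h == max_hash lands in the last bucket.
        idx = min(int(h // width), final_bucket_count - 1)
        buckets[idx].append(h)
    return [sorted(b) for b in buckets]


# Illustrative values only.
hashes = [3, 97, 41, 100, 58, 12, 76]
ranges = bucket_ranges(4, max(hashes))
buckets = split_into_buckets(hashes, final_bucket_count=4, max_hash=max(hashes))
```

Ranges that follow the local aggregation centers more closely would be an equally valid reading; the equal-width choice here is only the simplest one.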

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, determining the final number of buckets based on the average aggregation degree, the local aggregation degrees and the minimum number of buckets comprises:

judging whether there is any local aggregation degree that is not less than 2 times the average aggregation degree; if so, taking the sum of the number of local aggregation degrees that are not less than 2 times the average aggregation degree and the total number of local aggregation centers as a first number of buckets, and taking the larger of the first number of buckets and the minimum number of buckets as the final number of buckets;

otherwise, taking the larger of the total number of local aggregation centers and the minimum number of buckets as the final number of buckets.
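The rule for the final number of buckets can be sketched as follows. The decision logic follows the text; the way the minimum number of buckets is derived from the bucket capacity and character scale (a ceiling division) is an assumption, as are the function names and the premise of one local aggregation degree per local aggregation center.

```python
import math


def minimum_bucket_count(character_count, bucket_capacity):
    # Assumption: enough buckets of the preset capacity to hold every character.
    return math.ceil(character_count / bucket_capacity)


def final_bucket_count(local_aggregation_degrees, min_buckets):
    # Assumption: one aggregation degree per local aggregation center,
    # and at least one center exists.
    center_count = len(local_aggregation_degrees)
    average = sum(local_aggregation_degrees) / center_count
    # Count the degrees that are not less than 2 times the average.
    dense = sum(1 for d in local_aggregation_degrees if d >= 2 * average)
    if dense > 0:
        first_bucket_count = dense + center_count
        return max(first_bucket_count, min_buckets)
    return max(center_count, min_buckets)


# Illustrative values only.
min_buckets = minimum_bucket_count(character_count=10, bucket_capacity=4)
with_dense_center = final_bucket_count([1, 1, 1, 9], min_buckets)
without_dense_center = final_bucket_count([2, 3, 2], min_buckets)
```

In the first call the average degree is 3, one center reaches twice that value, and the result is max(1 + 4, 3). In the second call no center qualifies, so the result is max(3, 3).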

Preferably, in the mass data similarity calculation method based on hash and Hamming distance, S4: determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances between all the first high-dimensional bucket sets and all the second high-dimensional bucket sets, comprises:

sorting all the first high-dimensional bucket sets to obtain a first bucket sequence, and sorting all the second high-dimensional bucket sets to obtain a second bucket sequence;

determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances between the first high-dimensional bucket set and the second high-dimensional bucket set having the same ordinal in the first bucket sequence and the second bucket sequence.
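As an illustration of the comparison by ordinal, the sketch below assumes that each high-dimensional bucket set has already been encoded as a fixed-length bit string, that the per-pair score is one minus the normalized Hamming distance, and that the overall similarity is the mean of those scores. None of these three choices is stated in the text, and the function names are hypothetical.

```python
def hamming_distance(bits_a, bits_b):
    """Number of positions at which two equal-length bit strings differ."""
    if len(bits_a) != len(bits_b):
        raise ValueError("bit strings must have the same length")
    return sum(x != y for x, y in zip(bits_a, bits_b))


def table_similarity(first_bucket_sequence, second_bucket_sequence):
    """Assumed aggregation: mean of (1 - normalized Hamming distance) per ordinal."""
    # zip pairs buckets with the same ordinal; unmatched trailing buckets are ignored.
    pairs = list(zip(first_bucket_sequence, second_bucket_sequence))
    scores = [1 - hamming_distance(a, b) / len(a) for a, b in pairs]
    return sum(scores) / len(scores)


# Illustrative encodings only.
first_sequence = ["1010", "1111"]
second_sequence = ["1000", "1111"]
similarity = table_similarity(first_sequence, second_sequence)
```

Only buckets with the same ordinal are compared, so the number of comparisons grows with the number of buckets rather than with the number of character pairs.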

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.

The technical scheme of the invention is further described in detail through the drawings and the embodiments.

Drawings

The accompanying drawings are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention. In the drawings:

FIG. 1 is a flowchart of a mass data similarity calculation method based on hash and Hamming distance in an embodiment of the invention;

FIG. 2 is a flowchart of another mass data similarity calculation method based on hash and Hamming distance in an embodiment of the invention;

FIG. 3 is a flowchart of yet another mass data similarity calculation method based on hash and Hamming distance in an embodiment of the invention.

Detailed Description

The preferred embodiments of the present invention will be described below with reference to the accompanying drawings, it being understood that the preferred embodiments described herein are for illustration and explanation of the present invention only, and are not intended to limit the present invention.

Example 1:

The invention provides a mass data similarity calculation method based on hash and Hamming distance, which comprises steps S1 to S4 as set out above.

In this embodiment, the inspection and detection data table is a table containing the inspection and detection data (for example, operating data of laboratory equipment) acquired when an inspection object (for example, a laboratory quality management item) is inspected; the inspection and detection data are numerical values acquired periodically.

In this embodiment, the two sets of inspection data tables are two sets of inspection data tables having mass data and requiring calculation of data similarity.

In this embodiment, the locality-sensitive hash value is a value obtained by hashing the data in the inspection and detection data tables; it is used to determine the similarity between the two groups of inspection and detection data tables, and it is determined by a hash function designed specifically for the data features of the two groups of inspection and detection data tables.

In this embodiment, a bucket set is a set containing the locality-sensitive hash values of a plurality of characters, obtained after bucketing the locality-sensitive hash values of all characters in each group of inspection and detection data tables.

In this embodiment, the first high-dimensional bucket set is the set obtained after mapping the locality-sensitive hash values in a bucket set of the first group of inspection and detection data tables to the high-dimensional space.

In this embodiment, the second high-dimensional bucket set is the set obtained after mapping the locality-sensitive hash values in a bucket set of the second group of inspection and detection data tables to the high-dimensional space.

In this embodiment, the similarity is a comprehensive value representing the degree of similarity of both data structure and data content between the two groups of inspection and detection data tables; the greater the similarity, the more similar the two groups of inspection and detection data tables are, and vice versa.

The beneficial effects of the technology are as follows: when the similarity of mass inspection and detection data is calculated, the data is hashed and then bucketed, and is mapped into a high-dimensional space, which reduces space consumption and the number of data comparisons; the similarity is calculated using the Hamming distance, which is more computationally efficient; the requirements of judgment accuracy and judgment efficiency are thus met at the same time when judging whether the inspection and detection data meets the requirements.

Example 2:

Based on embodiment 1, in the mass data similarity calculation method based on hash and Hamming distance, S1: determining the locality-sensitive hash value of each character in each of the two groups of inspection and detection data tables, referring to FIG. 2, comprises steps S101 to S105 as set out above.

In this embodiment, the data record is a record in the inspection and detection data, for example: each data record holds the inspection and detection values of a different inspection object, and contains the inspection and detection values of every kind (data attribute) for that object, such as real-time operating voltage, real-time operating current, and number of operating hours.

In this embodiment, the data attribute is the kind (or attribute) of the different inspection detection data.

In this embodiment, the preset attribute weight is a weight preset for each data attribute, which characterizes the importance degree of the inspection detection data of the corresponding data attribute.

In this embodiment, the first inspection data table is a data table obtained after attribute sorting (i.e. reordering column data in the inspection data table) is performed on the inspection data table based on a preset attribute weight of each data attribute.

In this embodiment, the object is an object to which the inspection data belongs, for example: laboratory or laboratory equipment.

In this embodiment, the preset object weight is a preset weight characterizing the importance of the inspection and detection data of each inspection object.

In this embodiment, the second inspection and detection data table is the data table obtained after sorting the first inspection and detection data table by object (i.e., reordering the rows of the first inspection and detection data table) based on the preset object weight of each object in the first inspection and detection data table.
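The two sorting passes (columns by attribute weight, then rows by object weight) can be sketched as below. The descending order, the list-of-rows table layout, the example weights and the function name `sort_table` are all assumptions made for illustration; the text only says that the sorting is based on the preset weights.

```python
def sort_table(attributes, objects, rows, attribute_weights, object_weights):
    """Reorder columns by attribute weight, then rows by object weight (both descending)."""
    # Column pass: yields the "first" table.
    col_order = sorted(range(len(attributes)),
                       key=lambda c: -attribute_weights[attributes[c]])
    first_table = [[row[c] for c in col_order] for row in rows]
    # Row pass: yields the "second" table.
    row_order = sorted(range(len(objects)),
                       key=lambda r: -object_weights[objects[r]])
    second_table = [first_table[r] for r in row_order]
    sorted_attributes = [attributes[c] for c in col_order]
    sorted_objects = [objects[r] for r in row_order]
    return sorted_attributes, sorted_objects, second_table


# Illustrative data: one row per inspection object, one column per data attribute.
attributes = ["voltage", "current", "hours"]
objects = ["device_a", "device_b"]
rows = [[220, 5, 100],
        [110, 3, 250]]
attribute_weights = {"voltage": 0.2, "current": 0.5, "hours": 0.3}
object_weights = {"device_a": 0.4, "device_b": 0.6}

new_attributes, new_objects, second_table = sort_table(
    attributes, objects, rows, attribute_weights, object_weights)
```

Applying the same weights to both groups of tables puts corresponding attributes and objects at corresponding positions, which is what allows local characters of the two tables to be hashed against each other.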

In this embodiment, the data structure similarity evaluation function is an evaluation function determined based on the data features of the two sets of inspection and detection data tables and used for evaluating the similarity of the data structures of the two sets of inspection and detection data tables.

In this embodiment, the data content similarity evaluation function is an evaluation function for evaluating the similarity of the data content of the two sets of inspection and detection data tables based on the data characteristics of the two sets of inspection and detection data tables.

In this embodiment, the personalized hash function is a hash function configured based on the data structure similarity evaluation function and the data content similarity evaluation function of the two sets of inspection and detection data tables, and is used for evaluating the comprehensive similarity of the data structure and the data content between the two sets of inspection and detection data tables.

In this embodiment, the feature data is data obtained after feature extraction of the second inspection data table based on a linear feature extraction algorithm.

The beneficial effects of the technology are as follows: sorting the columns and rows of the inspection and detection data tables based on the preset attribute weights and the preset object weights makes it easier to hash corresponding local characters of the two groups of inspection and detection data; the personalized hash function, constructed from the data structure similarity evaluation function and the data content similarity evaluation function of the two groups of inspection and detection data tables, is determined in a targeted way and can evaluate both the data structure similarity and the data content similarity between the two groups of inspection and detection data; as a result, locality-sensitive hash values that represent the local features of the characters when evaluating the similarity between the two groups of inspection and detection data are determined, which improves the accuracy of the similarity calculation.

Example 3:

Based on embodiment 2, in the mass data similarity calculation method based on hash and Hamming distance, S104: constructing a personalized hash function based on the data structure similarity evaluation function and the data content similarity evaluation function of the two groups of inspection and detection data tables, comprises the sub-steps set out above for S104.

In this embodiment, the linear feature extraction algorithm is, for example, the PCA algorithm or the LDA algorithm.

In this embodiment, calculating the feature complexity based on the feature data in the second inspection detection data table includes:

calculating the feature complexity of the second inspection data table based on the correlation coefficient between the feature data characters having correlation in the feature data of the second inspection data table:

where $\gamma$ is the feature complexity of the second inspection and detection data table; $i$ indexes the $i$-th feature data character in the feature data of the second inspection and detection data table; $n$ is the total number of feature data characters contained in the feature data of the second inspection and detection data table; $m_i$ is the total number of feature data characters that are correlated with the $i$-th feature data character in the feature data of the second inspection and detection data table; $j_i$ indexes the $j$-th feature data character that is correlated with the $i$-th feature data character; and $\sigma_{ij}$ is the (preset) correlation coefficient between the $i$-th feature data character and the $j$-th feature data character correlated with it;

the feature complexity of the second inspection detection data table can be accurately calculated based on the above formula.

In this embodiment, the unified dividing dimension is the average of the values obtained by rounding up the quotient of 1 divided by the feature complexity of each second inspection and detection data table.
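A small numeric sketch of this definition follows, assuming that the rounding applied to each quotient is a ceiling and that the feature complexities are positive. The function name and the example complexities are hypothetical; whether the resulting average is itself rounded to an integer is not stated in the text, so the sketch leaves it as is.

```python
import math


def unified_dividing_dimension(feature_complexities):
    """Mean of ceil(1 / gamma) over the second tables (assumed reading)."""
    rounded = [math.ceil(1 / gamma) for gamma in feature_complexities]
    return sum(rounded) / len(rounded)


# Illustrative feature complexities for the two second tables.
dimension = unified_dividing_dimension([0.4, 0.25])
```

Here the two quotients are 2.5 and 4, which round up to 3 and 4, giving an average of 3.5.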

In this embodiment, constructing the personalized hash function of the two sets of inspection data tables based on the data structure similarity evaluation function and the data content similarity evaluation function includes:

$y = y_1 + y_2$

where $y$ is the personalized hash function of the two groups of inspection and detection data tables, $y_1$ is the data structure similarity evaluation function, and $y_2$ is the data content similarity evaluation function.

The beneficial effects of the technology are as follows: the feature data with the lowest correlation in the second inspection and detection data table can be extracted based on the linear feature extraction algorithm; the complexity of the data features of the second inspection and detection data table can be calculated accurately based on the feature data; the unified dividing dimension is determined based on the feature complexity; and the data structure similarity evaluation function and the data content similarity evaluation function are determined for the data features of the two groups of inspection and detection data, so that a targeted personalized hash function is determined based on the data features of the two groups of inspection and detection data.

Example 4:

Based on embodiment 3, in the mass data similarity calculation method based on hash and Hamming distance, constructing the data structure similarity evaluation function and the data content similarity evaluation function based on the unified dividing dimension comprises the sub-steps set out above for that construction.

In this embodiment, the sub data is unit data in the inspection data table, for example, a value of the laboratory device a acquired at a certain time.

In this embodiment, all the sub-data included in the two sets of inspection detection data tables are arbitrarily combined based on the unified partitioning dimension, so as to obtain all the sub-data sets of each inspection detection data table in the unified partitioning dimension, for example:

all the sub-data are a, b, c and d, and the unified dividing dimension is 2; any two of the sub-data a, b, c, d are then combined, and the resulting sub-data sets are: (a, b), (a, c), (a, d), (b, c), (b, d), (c, d).
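The example above corresponds to taking all combinations of a given size. A minimal sketch using the Python standard library is shown below; the function name `sub_data_sets` is hypothetical, and the sketch assumes the unified dividing dimension is an integer.

```python
from itertools import combinations


def sub_data_sets(sub_data, dividing_dimension):
    """All combinations of `dividing_dimension` sub-data, in input order."""
    return list(combinations(sub_data, dividing_dimension))


sets = sub_data_sets(["a", "b", "c", "d"], 2)
```

`itertools.combinations` emits the tuples in lexicographic order of the input positions, which reproduces the order listed in the example.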

In this embodiment, the sub data set is a set obtained by arbitrarily combining all sub data included in the two sets of inspection and detection data tables based on the unified partition dimension.

In this embodiment, the bit ordinal is the position of the sub-data within the sub-data set, for example: the bit ordinal of a in (a, d) is 1 and the bit ordinal of d in (a, d) is 2; when the sub-data does not occur in a sub-data set, the corresponding bit ordinal is 0.

In this embodiment, the bit ordinal sequence of each sub-data in all sub-data sets is generated based on the bit ordinals, which is:

the bit ordinals of the corresponding sub-data in each sub-data set are arranged in the order of the sub-data sets to obtain the bit ordinal sequence, for example: the bit ordinal sequence of a over (a, b), (a, c), (a, d), (b, c), (b, d), (c, d) is: 1, 1, 1, 0, 0, 0.
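Applying the definition of the bit ordinal (1-based position within the set, 0 when the sub-data is absent) to every sub-data set in order gives the sequence directly. The sketch below uses a hypothetical helper `bit_ordinal_sequence` and the four sub-data of the example.

```python
from itertools import combinations


def bit_ordinal_sequence(item, sets):
    """1-based position of `item` in each set, or 0 where the set does not contain it."""
    return [s.index(item) + 1 if item in s else 0 for s in sets]


sets = list(combinations(["a", "b", "c", "d"], 2))
sequence_a = bit_ordinal_sequence("a", sets)
sequence_d = bit_ordinal_sequence("d", sets)
```

For a, which appears first in (a, b), (a, c) and (a, d) and is absent from the other three sets, the sequence is 1, 1, 1, 0, 0, 0; for d it is 0, 0, 2, 0, 2, 2.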

In this embodiment, the calculation of the integrated cosine similarity of the two sets of inspection data based on all bit number sequences of all sub-data contained in the two sets of inspection data tables includes:

generating a corresponding column vector based on each bit ordinal sequence, and calculating the integrated cosine similarity of the two groups of inspection and detection data based on all column vectors of all sub-data contained in the two groups of inspection and detection data tables:

where $\cos_{sim}$ is the integrated cosine similarity of the two groups of inspection and detection data; $a$ is the larger of the total numbers of sub-data contained in the two groups of inspection and detection data tables; $p$ indexes the $p$-th sub-data contained in the two groups of inspection and detection data tables; $q$ indexes the $q$-th column vector of each sub-data; $b$ is the maximum total number of column vectors over all sub-data contained in the two groups of inspection and detection data tables; $A_{pq}$ is the $q$-th column vector of the $p$-th sub-data contained in the first group of inspection and detection data tables; $B_{pq}$ is the $q$-th column vector of the $p$-th sub-data contained in the second group of inspection and detection data tables; $\|A_{pq}\|$ is the norm of $A_{pq}$ (i.e., the square root of the sum of squares of all elements in the column vector); and $\|B_{pq}\|$ is the norm of $B_{pq}$;

the comprehensive cosine similarity of the two groups of inspection and detection data can be accurately calculated based on the formula.

When the p-th sub data does not exist in the two groups of inspection detection data, all column vectors of the p-th sub data are zero vectors, and when the q-th column vector does not exist in a certain sub data, the q-th column vector is set to 0.
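The formula itself is not reproduced in this text, so the following is only a sketch of one plausible reading: the cosine similarity of each pair of column vectors with the same indices p and q is averaged over all a × b index pairs, and a pair in which either vector is missing or zero contributes 0. The averaging, the treatment of zero vectors and the function names are assumptions, not the patent's stated formula.

```python
import math


def cosine(u, v):
    """Cosine similarity; defined here as 0.0 when either vector is zero or missing."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def integrated_cosine_similarity(A, B):
    """A[p][q] and B[p][q] are the q-th column vectors of the p-th sub-data."""
    a = max(len(A), len(B))
    b = max(len(vectors) for vectors in A + B)
    total = 0.0
    for p in range(a):
        for q in range(b):
            # A missing sub-data or column vector is treated as a zero vector.
            u = A[p][q] if p < len(A) and q < len(A[p]) else []
            v = B[p][q] if p < len(B) and q < len(B[p]) else []
            total += cosine(u, v)
    return total / (a * b)


# Illustrative vectors: the second table lacks the second sub-data.
first_vectors = [[[1, 0]], [[0, 1]]]
second_vectors = [[[1, 0]]]
cos_sim = integrated_cosine_similarity(first_vectors, second_vectors)
```

With these inputs the first pair matches exactly and the second pair contributes 0, so the average over the two pairs is 0.5.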

In this embodiment, constructing a data structure similarity evaluation function and a data content similarity evaluation function based on the integrated cosine similarity includes:

$y_1 = (1 - \cos_{sim}) \cdot x_1$

where $y_1$ is the data structure similarity evaluation function; $\cos_{sim}$ is the integrated cosine similarity of the two groups of inspection and detection data; $x_1$ is the feature data integrated value of the feature data character to be input; $y_2$ is the data content similarity evaluation function; $X_2$ is the column vector corresponding to the co-occurrence frequency vector sequence of the feature data character to be input (i.e., the column vector obtained by arranging the values contained in the co-occurrence frequency vector sequence in order as a column); $X_0$ is a preset standard normalization vector (a vector whose dimension matches that of the column vector corresponding to the co-occurrence frequency vector sequence of the feature data character to be input, and whose values are all 1); $\|X_2\|$ is the norm of the column vector corresponding to the co-occurrence frequency vector sequence of the feature data character to be input; and $\|X_0\|$ is the norm of the preset standard normalization vector;

and determining the characteristic data comprehensive value of the characteristic data character to be input and the weight of the co-occurrence frequency vector sequence based on the comprehensive cosine similarity, so that the built personalized hash function can better reflect the characteristics of similarity of two groups of detection data.

The beneficial effects of the technology are as follows: all the sub data contained in the two groups of inspection and detection data tables are randomly combined based on the unified dividing dimension; the integrated cosine similarity of the two groups of inspection and detection data is calculated based on the bit sequences determined by the bit numbers in the resulting sub data sets; and the data structure similarity evaluation function and the data content similarity evaluation function are constructed based on the integrated cosine similarity, so that both functions are tailored to the characteristics of the two groups of inspection and detection data.

Example 5:

Based on embodiment 3, in the mass data similarity calculation method based on hash and Hamming distance, S105: determining a locality-sensitive hash value for each character in the two groups of inspection and detection data tables based on the feature data of each second inspection and detection data table and the personalized hash function, includes:

In this embodiment, the feature data block is a data block obtained after the second inspection data table is divided and summarized based on the unified dividing dimension.

In this embodiment, each second inspection and detection data table is divided and summarized based on the unified dividing dimension to obtain the feature data blocks of that table. For example, if the unified dividing dimension is 3, the second inspection and detection data table is divided into data blocks arranged in 3 rows and 3 columns, in order from left to right and from top to bottom.

In this embodiment, building the feature data block matrix of the second inspection and detection data table based on the feature data blocks means:

taking the average value of all data contained in each data block as the value at the position of that data block, and building the feature data block matrix from the values at the positions of all data blocks.

In this embodiment, the feature data block matrix is a matrix built based on all feature data blocks in the second inspection and detection data table.

In this embodiment, calculating the feature data integrated value of each feature data character based on the feature data block matrix of the second inspection and detection data table means: taking the value at the position of a data block in the feature data block matrix as the feature data integrated value of every feature data character contained in that data block.

In this embodiment, a feature data character is a character in the feature data of the second inspection and detection data table.

In this embodiment, the feature data integrated value is a value that characterizes the data block in which the corresponding character is located, namely the average value of that feature data block in the feature data block matrix.
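The block division and averaging described in this embodiment can be illustrated with a minimal Python sketch. It is not the patented implementation: it assumes the table is a rectangular list of numeric values with at least d rows and d columns, and the function names `split_indices` and `feature_block_matrix` are hypothetical.

```python
def split_indices(length, parts):
    """Split range(length) into `parts` contiguous, near-equal index groups."""
    base, extra = divmod(length, parts)
    groups, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        groups.append(range(start, start + size))
        start += size
    return groups


def feature_block_matrix(table, d):
    """Divide a numeric table into d x d blocks and average each block.

    Assumption: `table` is a rectangular list of lists of numbers with at
    least d rows and d columns, so that no block is empty.
    """
    row_groups = split_indices(len(table), d)
    col_groups = split_indices(len(table[0]), d)
    matrix = []
    for rows in row_groups:          # top to bottom
        matrix_row = []
        for cols in col_groups:      # left to right
            cells = [table[r][c] for r in rows for c in cols]
            matrix_row.append(sum(cells) / len(cells))
        matrix.append(matrix_row)
    return matrix


table = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
blocks = feature_block_matrix(table, 2)
```

Under these assumptions, every character located in the top-left block of `table` would take `blocks[0][0]` as its feature data integrated value.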

The beneficial effects of the technology are as follows: the second inspection and detection data table is divided and summarized based on the determined unified dividing dimension to obtain feature data blocks; the feature data integrated value of each feature data character, determined from the feature data blocks, gives a numerical value representing the data content of the block; and this value is combined with the co-occurrence matrix of the feature data of the second inspection and detection data table to determine the locality-sensitive hash value of each character in the corresponding inspection and detection data table.

Example 6:

Based on embodiment 5, in the mass data similarity calculation method based on hash and Hamming distance, determining the locality-sensitive hash value of each character in the corresponding inspection and detection data table, based on the feature data integrated value of each feature data character in the second inspection and detection data table, the co-occurrence matrix of the feature data corresponding to the second inspection and detection data table, and the personalized hash function, includes:

In this embodiment, building the co-occurrence matrix of the feature data of each second inspection and detection data table means:

determining that the size of the co-occurrence matrix is n × n, where n is the total number of feature data characters in the feature data of the second inspection and detection data table;

setting each entry from the number of times the two corresponding feature data characters are adjacent: for example, when feature data character a and feature data character b are adjacent 2 times in the feature data of the second inspection and detection data table, the values at row a, column b and at row b, column a of the co-occurrence matrix are both set to 2, thereby obtaining the co-occurrence matrix.

In this embodiment, determining the co-occurrence frequency vector between every two feature data characters based on the co-occurrence matrix means:

determining the sum M of all values in the co-occurrence matrix and the number of adjacent occurrences x between every two feature data characters, and taking 2 times the ratio of x to M, that is, 2x/M, as the co-occurrence frequency of the corresponding two feature data characters;
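The co-occurrence matrix and the co-occurrence frequency 2x/M can be sketched in Python as follows. This is an illustration under stated assumptions only: the feature data is treated as a linear sequence in which "adjacent" means consecutive positions, a character adjacent to itself is ignored, and the function names are hypothetical.

```python
def cooccurrence_matrix(chars, sequence):
    """Build the symmetric n x n matrix of adjacency counts.

    Assumptions: `chars` lists the n distinct feature data characters,
    `sequence` is the feature data as a linear sequence, and two characters
    are "adjacent" when they occupy consecutive positions.
    """
    index = {ch: i for i, ch in enumerate(chars)}
    n = len(chars)
    matrix = [[0] * n for _ in range(n)]
    for left, right in zip(sequence, sequence[1:]):
        if left == right:
            continue  # assumption: self-adjacency is not counted
        i, j = index[left], index[right]
        matrix[i][j] += 1  # row a, column b
        matrix[j][i] += 1  # row b, column a
    return matrix


def cooccurrence_frequency(matrix, i, j):
    """Return 2x/M, where x = matrix[i][j] and M is the sum of all entries."""
    total = sum(sum(row) for row in matrix)
    return 2 * matrix[i][j] / total if total else 0.0


m = cooccurrence_matrix(["a", "b", "c"], "ababc")
freq_ab = cooccurrence_frequency(m, 0, 1)
freq_bc = cooccurrence_frequency(m, 1, 2)
```

Because the matrix is symmetric, each adjacency is counted twice in M, so the factor 2 makes the frequencies over all unordered character pairs sum to 1 in this sketch.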

determining, based on the row and column numbers of the two feature data characters in the co-occurrence matrix and following the order from top to bottom and from left to right, the co-occurrence direction vector of the two feature data characters (for example, when a < b, the direction vector of feature data character a and feature data character b is (a, b) rather than (b, a)); the co-occurrence frequency vector between the corresponding two feature data characters is then determined from the co-occurrence frequency and this direction vector.

In this embodiment, summarizing all co-occurrence frequency vectors of each feature data character to obtain the co-occurrence frequency vector sequence of each feature data character means:

collecting the co-occurrence frequency vectors between each feature data character and every other feature data character to obtain the co-occurrence frequency vector sequence of the corresponding feature data character.

In this embodiment, the co-occurrence frequency vector sequence is a sequence obtained by integrating all co-occurrence frequency vectors of the corresponding feature data characters.

In this embodiment, the storage index is an index for associating each characteristic data character in the second inspection data table with a character in the second inspection data table, and based on the storage index, a storage position of the corresponding character of each characteristic data character in the second inspection data table can be retrieved.

The beneficial effects of the technology are as follows: based on the co-occurrence matrix of the feature data of the second inspection and detection data table, the co-occurrence frequency vector between every two feature data characters, which represents the data structure of the table, is determined; it is combined with the feature data integrated values of all feature data characters, which represent the data content of the characters in the table, and substituted into the determined personalized hash function, so that the locality-sensitive hash value of each character in the corresponding inspection and detection data table is determined in a targeted manner.

Example 7:

Based on embodiment 1, in the mass data similarity calculation method based on hash and Hamming distance, S2: performing bucket-splitting processing on the locality-sensitive hash values of all characters in each group of inspection and detection data tables to obtain a plurality of bucket sets of each group of inspection and detection data tables, referring to fig. 3, includes:

In this embodiment, determining all local aggregation centers of the locality-sensitive hash values of all characters of the inspection and detection data table, and the corresponding local aggregation degrees, means:

sorting the locality-sensitive hash values of all characters in the inspection and detection data table, determining the differences between adjacent locality-sensitive hash values, calculating the average of all the differences, and taking the product of a preset aggregation degree threshold (namely, a preset threshold for screening the aggregation degree of locality-sensitive hash values that meet the aggregation requirement) and the average of all the differences as the aggregation group screening threshold (namely, the threshold for screening aggregation groups, an aggregation group being a cluster formed by a plurality of locality-sensitive hash values that meet the aggregation requirement).
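The threshold computation above can be sketched in Python. The threshold function follows the text directly; the grouping rule, however, is an assumption made only for illustration (a new group starts whenever the gap between adjacent sorted values exceeds the screening threshold), since this passage does not spell out how the threshold is applied. All function names are hypothetical.

```python
def screening_threshold(hash_values, aggregation_threshold):
    """Preset aggregation degree threshold times the mean adjacent difference."""
    ordered = sorted(hash_values)
    gaps = [b - a for a, b in zip(ordered, ordered[1:])]
    if not gaps:
        return 0.0
    return aggregation_threshold * (sum(gaps) / len(gaps))


def aggregation_groups(hash_values, aggregation_threshold):
    """Group sorted values; ASSUMED rule: split where a gap exceeds the threshold."""
    ordered = sorted(hash_values)
    threshold = screening_threshold(ordered, aggregation_threshold)
    groups = [[ordered[0]]] if ordered else []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev <= threshold:
            groups[-1].append(cur)
        else:
            groups.append([cur])
    return groups


def local_aggregation_centers(groups):
    """The local aggregation center is the average value of each group."""
    return [sum(g) / len(g) for g in groups]


values = [12, 1, 30, 2, 11, 3, 10]
threshold = screening_threshold(values, 1.0)
groups = aggregation_groups(values, 1.0)
centers = local_aggregation_centers(groups)
```

With a preset aggregation degree threshold of 1.0 (a hypothetical value), the sample values fall into three groups whose averages serve as the local aggregation centers.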

In this embodiment, the local aggregation center is the average value of the aggregation group of the local sensitive hash values.

In this embodiment, the local aggregation degree is a numerical value representing the aggregation degree of the aggregation group of the local sensitive hash values.

In this embodiment, the formula for determining the local aggregation degree of the local aggregation center for checking the local sensitive hash values of all the characters of the detection data table is:

wherein ρ is the local aggregation degree of the local aggregation center of the locality-sensitive hash values of all characters of the inspection and detection data table; max is the maximum locality-sensitive hash value in the aggregation group where the local aggregation center is located; min is the minimum locality-sensitive hash value in that aggregation group; MAX is the maximum value among the locality-sensitive hash values of all characters of the inspection and detection data table; MIN is the minimum value among them; n is the total number of locality-sensitive hash values in the aggregation group where the local aggregation center is located; and N is the total number of locality-sensitive hash values of all characters of the inspection and detection data table;

based on the above formula, the local aggregation degree of the local aggregation center of the locality-sensitive hash values of all characters of the inspection and detection data table can be accurately calculated from the total number of locality-sensitive hash values contained in the aggregation group and the distribution range of those values.

In this embodiment, the character scale is the total number of all characters contained in the inspection and detection data table.

In this embodiment, a bucket division range is the preset value range of locality-sensitive hash values that a bucket can accommodate when bucket-splitting processing is performed on the locality-sensitive hash values of all characters in each group of inspection and detection data tables.

In this embodiment, performing bucket-splitting processing on the locality-sensitive hash values of all characters in the corresponding group of inspection and detection data tables based on the bucket division ranges, and sorting the data in each resulting bucket to obtain a plurality of bucket sets of the corresponding group, includes:

when the bucket division ranges are [0,10), [10,20), [20,30), [30,40], assigning the locality-sensitive hash value of each character in the corresponding group of inspection and detection data tables to the bucket whose range contains it, and sorting all locality-sensitive hash values contained in each resulting bucket in ascending order to obtain a plurality of bucket sets.
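The bucket-splitting and sorting step can be sketched as follows. This is a minimal illustration, assuming the bucket division ranges are given as a list of ascending boundaries, with every range half-open except the last, which is closed as in the example above; the function name `split_into_buckets` is hypothetical.

```python
import bisect


def split_into_buckets(hash_values, boundaries):
    """Assign each hash value to its range and sort every bucket ascending.

    `boundaries` such as [0, 10, 20, 30, 40] describes the ranges
    [0,10), [10,20), [20,30), [30,40]; only the last range is closed.
    """
    buckets = [[] for _ in range(len(boundaries) - 1)]
    for value in hash_values:
        if value == boundaries[-1]:
            buckets[-1].append(value)  # closed upper end of the last range
            continue
        idx = bisect.bisect_right(boundaries, value) - 1
        if not 0 <= idx < len(buckets):
            raise ValueError(f"hash value {value} is outside all bucket ranges")
        buckets[idx].append(value)
    return [sorted(bucket) for bucket in buckets]


bucket_sets = split_into_buckets([35, 3, 10, 40, 7, 22], [0, 10, 20, 30, 40])
```

A binary search over the boundaries keeps each assignment logarithmic in the number of buckets, which matters when the number of characters is large.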

The beneficial effects of the technology are as follows: the numerical distribution (that is, the aggregation) of the locality-sensitive hash values of all characters in the inspection and detection data table is analyzed, the bucket division ranges are determined from the analysis results, and the locality-sensitive hash values of all characters in the corresponding group of inspection and detection data tables are split into buckets based on those ranges, so that the number of locality-sensitive hash values in each bucket is uniform, which further improves the accuracy of the similarity of the two groups of inspection and detection data calculated based on the Hamming distance.

Example 8:

Based on embodiment 7, in the mass data similarity calculation method based on hash and Hamming distance, determining all bucket division ranges based on the character scale of the inspection and detection data table, the maximum locality-sensitive hash value, all local aggregation centers and the corresponding local aggregation degrees includes:

In this embodiment, the preset bucket capacity is the preset total number of locality-sensitive hash values that each bucket can contain; constraining the bucket capacity in advance ensures that data are distributed uniformly across the buckets, and thus ensures the accuracy of the similarity of the two groups of inspection and detection data calculated based on the Hamming distance.

In this embodiment, the minimum bucket number is the value obtained by rounding up the ratio of the character scale of the inspection and detection data table to the preset bucket capacity.

In this embodiment, the average aggregation level is the average value of all local aggregation levels of the inspection data table.

In this embodiment, determining all bucket division ranges based on the final bucket number and the maximum locality-sensitive hash value of the inspection and detection data table means:

taking the value obtained by rounding up the ratio of the maximum locality-sensitive hash value of the inspection and detection data table to the final bucket number as the bucket division interval Δx;

determining all bucket division ranges based on the bucket division interval Δx, for example:

[0, Δx), [Δx, 2Δx), [2Δx, 3Δx), …, [(k-1)Δx, kΔx), where kΔx is equal to or greater than MAX, the maximum value among the locality-sensitive hash values of all characters of the inspection and detection data table.
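The arithmetic of this embodiment can be sketched in Python. The sketch assumes that k equals the final bucket number and that the final bucket number is already known (its derivation is the subject of embodiment 9); the function names are hypothetical.

```python
import math


def minimum_bucket_count(character_scale, bucket_capacity):
    """Round up the ratio of the character scale to the preset bucket capacity."""
    return math.ceil(character_scale / bucket_capacity)


def bucket_division_ranges(max_hash, final_bucket_count):
    """Return the (lower, upper) bounds [i*dx, (i+1)*dx) of every bucket.

    dx is the maximum hash value divided by the final bucket number,
    rounded up, so that k*dx >= max_hash (assuming k = final_bucket_count).
    """
    dx = math.ceil(max_hash / final_bucket_count)
    return [(i * dx, (i + 1) * dx) for i in range(final_bucket_count)]


min_buckets = minimum_bucket_count(95, 10)
ranges = bucket_division_ranges(37, 4)
```

Note that when kΔx equals MAX exactly, the value MAX sits on the open upper end of the last range, so an implementation would need to treat the last range as closed, as in the example ranges of embodiment 7.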

The beneficial effects of the technology are as follows: based on all local aggregation centers and the corresponding local aggregation degrees of the locality-sensitive hash values of the inspection and detection data table, combined with the minimum bucket number determined from the preset bucket capacity and the character scale of the inspection and detection data table, the final bucket number and then the bucket division ranges are determined, which further ensures that the number of locality-sensitive hash values in each bucket is uniform and that the similarity of the two groups of inspection and detection data calculated based on the Hamming distance is accurate.

Example 9:

Based on embodiment 8, in the mass data similarity calculation method based on hash and Hamming distance, determining the final bucket number based on the average aggregation degree, the local aggregation degrees and the minimum bucket number includes:

In this embodiment, the first bucket number is the sum of the number of local aggregation degrees that are not less than 2 times the average aggregation degree and the total number of local aggregation centers.

In this embodiment, the final bucket number is the finally determined total number of buckets required for bucket-splitting processing of the locality-sensitive hash values of all characters in the inspection and detection data table.

The beneficial effects of the technology are as follows: based on all local aggregation centers and the corresponding local aggregation degrees of the locality-sensitive hash values of the inspection and detection data table, combined with the minimum bucket number determined from the preset bucket capacity and the character scale of the inspection and detection data table, the final bucket number is determined, which ensures that the number of locality-sensitive hash values in each bucket is uniform and that the similarity of the two groups of inspection and detection data calculated based on the Hamming distance is accurate.

Example 10:

Based on embodiment 1, in the mass data similarity calculation method based on hash and Hamming distance, S4: determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances of all the first high-dimensional bucket sets and all the second high-dimensional bucket sets, includes:

In this embodiment, the first bucket sequence is a sequence obtained by sorting all the first high-dimensional bucket sets.

In this embodiment, the second bucket sequence is a sequence obtained by sorting all the second high-dimensional bucket sets.

In this embodiment, determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances between the first high-dimensional bucket sets and the second high-dimensional bucket sets with the same ordinal number in the first bucket sequence and the second bucket sequence means:

for each ordinal number, taking the ratio of the Hamming distance between the first high-dimensional bucket set and the second high-dimensional bucket set with that ordinal number to half of the total number of characters in those two bucket sets as the deviation degree of the two bucket sets;

taking the difference between 1 and the deviation degree as the sub-similarity between the first high-dimensional bucket set and the second high-dimensional bucket set with the same ordinal number;

taking the average of the sub-similarities over all ordinal numbers in the first bucket sequence and the second bucket sequence as the similarity of the two groups of inspection and detection data tables.
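The similarity computation of this embodiment can be sketched in Python under several assumptions that are not taken from the source: each high-dimensional bucket set is represented as a sequence of comparable values, the sub-similarity is read as 1 minus the deviation degree, positions present in only one of two bucket sets of unequal length count as mismatches, and two empty bucket sets are treated as identical. The function names are hypothetical.

```python
from itertools import zip_longest

_MISSING = object()  # sentinel for positions present in only one bucket set


def hamming_distance(first, second):
    """Count the positions at which two sequences differ.

    Assumption: when lengths differ, the unmatched positions are mismatches.
    """
    return sum(
        1 for x, y in zip_longest(first, second, fillvalue=_MISSING) if x != y
    )


def table_similarity(first_buckets, second_buckets):
    """Average, over same-ordinal bucket pairs, of 1 - deviation degree."""
    if len(first_buckets) != len(second_buckets) or not first_buckets:
        raise ValueError("both bucket sequences need the same non-zero length")
    sub_similarities = []
    for first, second in zip(first_buckets, second_buckets):
        half_total = (len(first) + len(second)) / 2
        # deviation degree: Hamming distance over half the total element count
        deviation = hamming_distance(first, second) / half_total if half_total else 0.0
        sub_similarities.append(1 - deviation)
    return sum(sub_similarities) / len(sub_similarities)


similarity = table_similarity([[1, 2, 3], [4, 5]], [[1, 2, 4], [4, 5]])
```

In the sample call the first pair of bucket sets differs in one of three positions and the second pair is identical, so the sub-similarities are 2/3 and 1. Only bucket sets with the same ordinal number are compared, which is what keeps the number of comparisons small.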

The beneficial effects of the technology are as follows: the similarity of the two groups of inspection and detection data tables is determined based on the Hamming distances of all the first high-dimensional bucket sets and all the second high-dimensional bucket sets, so that the data comparison quantity is greatly reduced, the calculation rate is improved, and the calculation accuracy is also ensured.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A mass data similarity calculation method based on hash and Hamming distance, characterized by comprising the following steps:

S4: determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances of all the first high-dimensional bucket sets and all the second high-dimensional bucket sets;

step S1: determining a locality sensitive hash value for each character in the two sets of inspection and detection data tables, respectively, comprising:

S105: determining a local sensitive hash value of each character in the two groups of detection data tables based on the characteristic data and the personalized hash function of each second detection data table;

step S2: performing bucket-splitting processing on the local sensitive hash values of all characters in each group of inspection data tables to obtain a plurality of bucket sets of each group of inspection data tables, wherein the bucket sets comprise:

2. The mass data similarity calculation method based on hash and Hamming distance according to claim 1, wherein S104: constructing a personalized hash function based on the data structure similarity evaluation function and the data content similarity evaluation function of the two groups of inspection and detection data tables, includes:

3. The mass data similarity calculation method based on hash and Hamming distance according to claim 2, wherein constructing the data structure similarity evaluation function and the data content similarity evaluation function based on the unified dividing dimension includes:

4. The mass data similarity calculation method based on hash and Hamming distance according to claim 2, wherein S105: determining a locality-sensitive hash value for each character in the two groups of inspection and detection data tables based on the feature data of each second inspection and detection data table and the personalized hash function, includes:

5. The mass data similarity calculation method based on hash and Hamming distance according to claim 4, wherein determining the locality-sensitive hash value of each character in the corresponding inspection and detection data table based on the feature data integrated value of each feature data character in the second inspection and detection data table, the co-occurrence matrix of the feature data corresponding to the second inspection and detection data table, and the personalized hash function includes:

6. The mass data similarity calculation method based on hash and Hamming distance according to claim 1, wherein determining all bucket division ranges based on the character scale of the inspection and detection data table, the maximum locality-sensitive hash value, all local aggregation centers and the corresponding local aggregation degrees includes:

7. The mass data similarity calculation method based on hash and Hamming distance according to claim 6, wherein determining the final bucket number based on the average aggregation degree, the local aggregation degrees and the minimum bucket number includes:

8. The mass data similarity calculation method based on hash and Hamming distance according to claim 1, wherein S4: determining the similarity of the two groups of inspection and detection data tables based on the Hamming distances of all the first high-dimensional bucket sets and all the second high-dimensional bucket sets, includes: