CN109739999A - An efficient Drosophila-neural-network hash method for searching WMSN data - Google Patents
An efficient Drosophila-neural-network hash method for searching WMSN data
- Publication number: CN109739999A
- Application number: CN201910039907.1A
- Authority: CN (China)
- Prior art keywords: data, matrix, FJLT, data set, search
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The present invention discloses an efficient Drosophila-neural-network hash method for searching WMSN data. First, the data set is preprocessed so that its features are converted into numeric data. Second, a Fast Johnson–Lindenstrauss Transform (FJLT) projection matrix projects the data into a higher-dimensional metric space, preserving the content similarity between the original data objects and thereby supporting accurate search. Finally, the winner-take-all feature-selection strategy of the locality-sensitive hashing performed by the Drosophila olfactory circuit reduces the dimensionality of the data set, improving search efficiency. By simulating locality-sensitive hashing with the Drosophila olfactory circuit, the present invention generalizes well, improves the accuracy of search results, effectively solves the approximate nearest-neighbor query problem for high-dimensional big data, and can be applied efficiently to WMSN data-search tasks in WMSN application systems.
Description
Technical field
The present invention relates to wireless multimedia sensor networks, and more particularly to an efficient Drosophila-neural-network hash method for searching WMSN data.
Background technique
A wireless multimedia sensor network (Wireless Multimedia Sensor Network, WMSN) is a novel wireless sensor network, grown out of wireless sensor networks (WSN), that carries multimedia information such as video, audio, and images. To date, WMSNs have been applied very widely to security surveillance, intelligent transportation, environmental monitoring, and the like, and WMSN multimedia data query is a core technology in the development of WMSN application systems. WMSN multimedia data are typically high-dimensional, large-scale, and multi-typed, and conventional nearest-neighbor search algorithms cannot meet system demands. In recent years, researching effective WMSN data-search methods to improve search performance has become a hot topic of common concern to both industry and academia.
In general, WMSN multimedia data must undergo dimensionality reduction, and there are two broad classes of dimensionality-reduction methods: feature selection and feature extraction. Feature selection, used in supervised learning, chooses the features most representative of the data points according to the correlation between features and labels. It handles classification problems effectively, but because it must compare the contribution of each feature of every data object to the classification in order to select the most influential features, it usually has high time complexity and requires labeled data, so it is rarely applied to nearest-neighbor search. Feature extraction [1] differs from feature selection most of all in that it transforms the features of the data objects by projection, creating an entirely new feature set to represent the data. It needs no labeled data, and the original dimensionality of the data set can be suitably reduced during projection; it is therefore widely used in nearest-neighbor search, including nearest-neighbor search over multimedia data. The most common feature-extraction methods include principal component analysis (PCA), linear discriminant analysis (LDA), locally linear embedding (LLE), and Laplacian eigenmaps (LE).
There is currently much research on search methods for large-scale high-dimensional data, with remarkable progress. In August 2017, Sanjoy Dasgupta, Charles F. Stevens, and Saket Navlakha of the University of California proposed in Science a novel random-projection-based locality-sensitive hashing method, Fly Local Sensitive Hash (FLSH) [2], which simulates the hashing of a data set with the olfactory circuit of Drosophila. Their method provides a very effective scheme for integrating the processing of locality-sensitive hashing (LSH) with the responses of perceptual neurons; it broke the mold of long-standing LSH research and has greatly influenced work in related fields. Before FLSH was proposed, people were accustomed to developing novel search methods within the general framework of LSH.
In general, nearest-neighbor search over large-scale high-dimensional data incurs large time and space overheads, so dimensionality reduction is needed to lower query time complexity. In 1984, Johnson and Lindenstrauss of Texas A&M University proposed the famous J-L theorem [3], providing theoretical support for dimensionality-reduction techniques for high-dimensional data. In 1998, Piotr Indyk and Rajeev Motwani of Stanford University, building on the J-L theorem, proposed the locality-sensitive hashing algorithm (LSH) [4], a landmark solution for nearest-neighbor search over large-scale high-dimensional data. Since its proposal, LSH has attracted wide attention from academia and industry for its superior performance in high-dimensional nearest-neighbor search; through two decades of development and refinement, it has produced many successful applications in nearest-neighbor search and in related application fields such as pattern recognition and anomaly detection. LSH nonetheless has clear limitations: first, hashing the features of a data set inevitably loses part of the similarity; second, the hashing process itself consumes time. How to minimize the loss of similarity while guaranteeing time complexity has always been a research goal. Finally, how to build a bionic link between biological perceptual circuits and the hashing of data sets likewise has a major impact on the development of locality-sensitive hashing.
Although the FLSH method proposed by Sanjoy Dasgupta et al. in Science in 2017 has many advantages for large-scale high-dimensional search, the random-projection-based locality-sensitive hash-search process it contains has certain problems, mainly in the following three respects:
Problem 1 (trade-off between result accuracy and time efficiency): in high-dimensional nearest-neighbor search, accuracy must frequently be weighed against search time; the random-projection LSH strategy reduces the time complexity of nearest-neighbor search, but loses some precision in the process.
Problem 2 (the hidden link between data-set processing and neural perception): traditional approximate nearest-neighbor search based on LSH processes the data objects of the search target (the data set), while the activity of the perceptual neurons of the querying subject that recognizes the data objects is usually ignored by researchers.
Problem 3 (universality of the J-L theorem for dimensionality reduction of high-dimensional data sets): the J-L theorem is widely cited in the field of dimensionality reduction, but the distribution of the new features produced when FJLT spatially transforms a data set, and the influence of those new features on the data points, remain to be studied.
The nearest-neighbor search problem for high-dimensional data has long been a research hotspot in data mining. As early as 1984, William B. Johnson and Joram Lindenstrauss of Texas A&M University proposed the J-L theorem [3], providing theoretical support for dimensionality-reducing transformations of high-dimensional data; however, they did not give an effective nearest-neighbor search method for high-dimensional data sets. In 1998, Piotr Indyk et al. of Stanford University, building on the J-L theorem, proposed a locality-sensitive hashing method based on the Hamming space [4], giving an effective solution to nearest-neighbor search over high-dimensional data sets and to the so-called "curse of dimensionality". From then on, locality-sensitive hashing came to the fore in high-dimensional nearest-neighbor search and attracted wide attention from academia and industry. The first-generation LSH still had defects: data had to be mapped one by one into the Hamming space, greatly reducing search efficiency, and the method also occupied considerable space.
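The Hamming-space mechanism mentioned above can be sketched as follows. This is a minimal illustration of first-generation bit-sampling LSH under our own simplifications, not the exact construction of [4]; all names (`make_tables`, `hash_keys`) are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_tables(d, n_tables=4, bits_per_key=8):
    """Each table samples a fixed set of bit positions, shared by all vectors."""
    return [rng.choice(d, size=bits_per_key, replace=False) for _ in range(n_tables)]

def hash_keys(bits, tables):
    """Hash a binary vector: its key in each table is the sampled bit pattern."""
    return [tuple(int(bits[i]) for i in idx) for idx in tables]

tables = make_tables(d=16)
x = rng.integers(0, 2, size=16)
y = x.copy()
y[0] ^= 1                        # flip one bit: most sampled patterns still match
same = sum(kx == ky for kx, ky in zip(hash_keys(x, tables), hash_keys(y, tables)))
```

Identical vectors always collide in every table, while nearby vectors collide in most tables, which is the locality-sensitive property the text describes.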
In 2004, Mayur Datar, Nicole Immorlica, Piotr Indyk, et al. of the Massachusetts Institute of Technology proposed a locality-sensitive hashing method based on p-stable distributions [5]. It avoids mapping the data set into the Hamming space, reduces the space complexity of the first-generation LSH strategy, and improves on earlier algorithms in both time complexity and query precision. It has since been widely applied to image recognition, plagiarism detection, anomaly detection, and other fields. However, it still requires a great deal of memory and is unsuited to indexing large-scale data; moreover, the index codes generated from a probabilistic model are unstable, and even as the number of code bits increases, query accuracy improves only very slowly.
In 2006, Alexandr Andoni and Piotr Indyk of the Massachusetts Institute of Technology proposed a new locality-sensitive hashing method [6] that combines random projection with the Leech lattice, reducing both time and space complexity; it shortens the time spent projecting the data set and improves the execution efficiency of the algorithm. However, its search accuracy is largely determined by the choice of lattice and by the random projection, so accuracy is hard to guarantee. In 2006, Nir Ailon and Bernard Chazelle of Princeton University proposed, with theoretical proof, the FJLT projection method based on the J-L theorem [7]. Combining sparse matrices with the Heisenberg principle of the Fourier transform, it speeds up data-set projection while effectively guaranteeing the similarity between data objects, so that the projected data set retains, as far as possible, the similarities that existed between data objects before projection. But Ailon et al. only demonstrated the superiority of the method theoretically; they did not further prove its superiority in practical applications by experiment. At this stage, research on locality-sensitive hashing developed greatly, and many scholars focused on optimizing the processing of the data. In 2007, Qin Lv, William Josephson, et al. of Princeton University proposed multi-probe LSH [8], which probes multiple buckets holding parts of the data set simultaneously, further improving nearest-neighbor accuracy. In 2009, Brian Kulis of the University of California, Berkeley, and Kristen Grauman of the University of Texas at Austin proposed the kernelized locality-sensitive hashing (KLSH) algorithm [9], which takes random projections of the input vectors as its main support [10]; it achieves sublinear approximate nearest-neighbor search time and reduces the time cost of search. In 2012, Venu Satuluri and Srinivasan Parthasarathy of Ohio State University proposed BayesLSH [11], a fast Bayesian similarity-query method that is easy to tune and needs no manual setting of the hash-code length.
In the last two years there have been many optimization schemes for locality-sensitive hashing. In 2016, Alexandr Andoni of Columbia University, Thijs Laarhoven of the IBM research institute, and Ilya Razenshteyn et al. of the Massachusetts Institute of Technology weighed time complexity against space complexity in approximate nearest-neighbor search [12], seeking an optimal solution. In 2017, Ilya Razenshteyn of the Massachusetts Institute of Technology and Alexandr Andoni of Columbia University proposed the LSH-FOREST algorithm [13], which changes the structure of traditional locality-sensitive hashing. In 2017, Teresa Nicole Brooks and Rania Almajalid of Pace University, building on [8], proposed a parameter-free multi-probe ball-type locality-sensitive hash to optimize the time complexity of nearest-neighbor search [14]. In 2018, Karthekeyan Chandrasekaran et al. of the University of Illinois proposed a lattice-based locality-sensitive hashing algorithm [15], which improves the query precision of nearest-neighbor search by imposing control conditions on the partitioning of the data set. In 2018, Xiao Ruliang et al. [16] of Fujian Normal University (this paper's research group) addressed approximate nearest-neighbor search over large-scale data, using LSH to build an efficient search method for mixed-modality data. Undeniably, this series of methods, including the optimizations proposed in [17,18,19,20], further improved the performance of locality-sensitive hashing and extended its capability. But throughout this development, few scholars considered combining bionic intelligent methods with the locality-sensitive hashing process so as to extend it further.
In August 2017, Sanjoy Dasgupta, Charles F. Stevens, and Saket Navlakha of the University of California proposed a novel locality-sensitive hashing method in an article published in Science [2]. For the first time, this method connected the process of locality-sensitive hashing with the perceptual circuits of a cognitive subject, simulating locality-sensitive hashing through the olfactory circuit of Drosophila. It uses a random sparse binary projection matrix to expand the feature dimensionality of the data set by random projection, simulating the activation of neurons, a point on which it differs sharply from earlier locality-sensitive hashing methods; finally, it reduces the feature dimensionality of the projected data with the winner-take-all (WTA) neural inhibition strategy (Anterior Paired Lateral, APL). Their experimental results show that the random sparse binary projection matrix performs well compared with a dense Gaussian projection matrix, demonstrating that the bionic olfactory hashing of Drosophila can be combined effectively with locality-sensitive hashing algorithms. The greatest contribution of their strategy is that it changes how traditional locality-sensitive hashing processes data: it links the perceptual circuits of a cognitive subject with approximate nearest-neighbor search and constructs a relationship between neuron-activation patterns and data projection, thereby opening a new research direction for approximate nearest-neighbor search.
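A minimal sketch of the sparse-binary-projection-plus-WTA scheme described above; this is our own simplification, not the authors' code, and the fan-in and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sparse_binary_projection(d, m, fan_in=6):
    """Random sparse binary matrix: each of the m output rows has fan_in ones."""
    S = np.zeros((m, d))
    for i in range(m):
        S[i, rng.choice(d, size=fan_in, replace=False)] = 1.0
    return S

def wta(y, k):
    """Winner-take-all (APL-style inhibition): keep the k most active units."""
    return set(np.argsort(y)[-k:].tolist())

S = sparse_binary_projection(d=16, m=64)   # expand 16 features to 64 dims
x = rng.normal(size=16)
h = wta(S @ x, k=8)                        # hash = indices of the top-8 activations
```

The expansion-then-inhibition order mirrors the fly circuit described in the text: projection increases the dimensionality, and WTA sparsifies the result into a compact tag.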
The locality-sensitive hashing method based on the Drosophila olfactory circuit (Fly Local Sensitive Hashing, FLSH) still has some problems. We found that the sparse binary random matrix it uses is unsuitable for low-distortion data projection [7]: although it speeds up data projection, it does so at the cost of part of the similarity between data objects, and therefore cannot meet the query-precision requirements of WMSN application systems. Our research also found that the stability of the FLSH algorithm needs further improvement: its query precision varies considerably across data sets, making it unsuitable for the query mechanism of a WMSN application system. Finally, the feasibility of the algorithm lacks strong theoretical support: although the sparse binary matrix can simulate neuron-activation patterns well, its processing of the data set lacks a corresponding theoretical basis.
Summary of the invention
The purpose of the present invention is to provide an efficient Drosophila-neural-network hash method for searching WMSN data.
The technical solution adopted by the present invention is as follows:
An efficient Drosophila-neural-network hash method for searching WMSN data comprises the following steps:
Step 1: perform feature extraction on the multimedia data, converting the multimedia data into feature-vector data through characterization;
Step 2: build a search index by projecting the feature-vector data of the data set with the FJLT fast-transform matrix;
Step 3: map the query object into the search-index structure; the given data object to be queried is projected with the FJLT fast-transform matrix to form the index of the query data;
Step 4: using the index of the query data, perform approximate nearest-neighbor search in the search index to find the data objects most similar to the query point.
Further, in Step 1, text data are converted into feature vectors in Euclidean space with the TF-IDF method or a word-frequency method; image data are characterized by extracting SIFT feature values.
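As an illustration of the text-featurization option, a toy TF-IDF computation; this is our own minimal version (a real system would use a library implementation and would typically also normalize the vectors).

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists -> list of {term: tf-idf weight} dicts."""
    n = len(docs)
    df = Counter(t for doc in docs for t in set(doc))   # document frequency
    vecs = []
    for doc in docs:
        tf = Counter(doc)
        vecs.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()})
    return vecs

vecs = tfidf_vectors([["sensor", "video"], ["sensor", "audio"]])
```

A term appearing in every document (here "sensor") gets weight 0, while terms distinctive of one document get positive weight, which is what makes the resulting vectors useful for similarity search.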
Further, Step 2 specifically comprises the following steps:
Step 2.1: project each data object of the data set into the new metric space with the FJLT fast-transform matrix;
Step 2.2: apply the active-neuron winner-take-all (WTA) strategy of the Drosophila-neural-network locality-sensitive hashing algorithm FLSH to select among the new features of each projected data object, using the retained features as the index of that data object.
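Step 2.2 can be sketched as follows (the names are ours, and k, the number of retained features, is illustrative): the index of an object is the set of positions of its k largest projected features, and two indices are compared by the size of their intersection.

```python
import numpy as np

def wta_index(y, k):
    """Keep the indices of the k largest projected features as the object's index."""
    return frozenset(np.argsort(y)[-k:].tolist())

def overlap(a, b):
    """Similarity between two indices: the size of their intersection."""
    return len(a & b)

idx = wta_index(np.array([0.1, 5.0, 2.0, -1.0, 3.5]), k=2)   # -> {1, 4}
```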
Further, the calculation formula of the FJLT fast-transform projection in Step 2 is as follows:

Y = P × H × D × X

where Y is the projected data and X is the feature-vector data of the data set.

P is a sparse matrix theoretically supported by the J-L theorem. Each element of P follows, with probability q, a normal distribution with mean 0 and variance q⁻¹, i.e. P_ij ~ N(0, q⁻¹) with probability q; elements of the matrix not drawn from N(0, q⁻¹) take the value 0. Here q is computed, following the standard FJLT construction, as

q = min{Θ(ε^(p−2) · log^p(n) / d), 1}

where d is the original dimensionality of the data set, n is the size of the data set, ε is the distance-preservation performance parameter, and p is the norm used.

H is a Hadamard matrix determined by the original dimensionality d, whose elements are given by

H_ij = d^(−1/2) · (−1)^⟨i−1, j−1⟩

where ⟨i−1, j−1⟩ is the dot product of the binary representations of i−1 and j−1, and d is the original dimensionality of the data set.

D is a d × d diagonal matrix whose diagonal entries are −1 or 1, each with probability 1/2, where d is the original dimensionality of the data set.
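A minimal numeric sketch of the Y = P × H × D × X projection. This is our own illustration under the assumption that d is a power of two (so a Sylvester-type Hadamard matrix exists), and q is taken as a plain parameter rather than computed from ε, p, and n.

```python
import numpy as np

rng = np.random.default_rng(2)

def hadamard(d):
    """Sylvester construction of the d x d Hadamard matrix (d a power of two)."""
    H = np.array([[1.0]])
    while H.shape[0] < d:
        H = np.block([[H, H], [H, -H]])
    return H

def fjlt_matrix(d, k, q=0.2):
    """Build the k x d transform P @ H @ D from the three factors in the text."""
    D = np.diag(rng.choice([-1.0, 1.0], size=d))              # random sign diagonal
    H = hadamard(d) / np.sqrt(d)                              # normalized Hadamard
    mask = rng.random((k, d)) < q                             # P_ij ~ N(0, 1/q) w.p. q
    P = np.where(mask, rng.normal(0.0, np.sqrt(1.0 / q), size=(k, d)), 0.0)
    return P @ H @ D

T = fjlt_matrix(d=16, k=64)    # k > d: an expansion, as used in FJLT-FLSH
X = rng.normal(size=(5, 16))   # 5 data objects with 16 features each (rows)
Y = X @ T.T                    # projected data, shape (5, 64)
```

Note that the patent writes X as a column vector; here the data objects are rows, so the transform is applied as `X @ T.T`.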
With the above technical scheme, the invention replaces the simple sparse matrix of the original FLSH locality-sensitive hashing method with the P × H × D transform matrix of FJLT; it effectively combines the "winner-take-all" strategy of Drosophila olfaction with the FJLT transform based on the J-L theorem, obtaining more accurate experimental results without increasing time consumption; and the FJLT fast-transform matrix is optimized in time complexity. The invention changes the prior art, in which similarity recognition is performed only on the recognized object: using the similarity recognition of the Drosophila olfactory circuit, it shifts from the recognized object to the cognitive subject, uncovering the latent link between similarity retrieval and the cognitive subject, and, while controlling time complexity, further improves the accuracy of high-dimensional nearest-neighbor search.
Detailed description of the invention
The present invention is described in further detail below in conjunction with the drawings and specific embodiments.
Fig. 1 is the overall framework of the FJLT-FLSH method of the present invention;
Fig. 2 compares the distance-preservation performance of FJLT-FLSH and FLSH under varying expansion sizes on the data set MNIST;
Fig. 3 compares the distance-preservation performance of FJLT-FLSH and FLSH under varying expansion sizes on the data set SIFT; the green and blue lines indicate that the FJLT-FLSH and FLSH algorithms expand the data-set features to M dimensions while maintaining query precision at every feature size;
Fig. 4 compares the projection time of the FJLT matrix and the sparse binary matrix on the data set MNIST;
Fig. 5 compares the projection time of the FJLT matrix and the sparse binary matrix on the data set SIFT;
Figs. 6-10 compare the query precision of the FJLT-FLSH (WTA), FLSH (WTA), FJLT-FLSH (LTA), and FLSH (LTA) algorithms on the data set SIFT for hash-code lengths M = 100, 200, 300, 600, and 1000, respectively;
Figs. 11-15 compare the query precision of the WTA, LTA, and random feature-retention strategies applied to the FJLT projection matrix on the data set SIFT for hash-code lengths M = 100, 200, 300, 600, and 1000, respectively.
Specific embodiment
As shown in Figs. 1-15, the invention discloses an efficient Drosophila-neural-network hash method for searching WMSN data. In WMSN Internet-of-Things application systems, building a WMSN blockchain system [25,26,27,28] on the currently popular blockchain and IPFS technologies is a very promising technical approach, one that can protect highly sensitive key data while keeping it open and shared. Such a scheme is typically built on Ethereum and IPFS: WMSN data are stored in the IPFS distributed architecture and then put on-chain, and query operations on the on-chain data over IPFS are performed through smart contracts. Thus, search and query over WMSN multimedia data will be a basic design element in the development of WMSN blockchain systems. Because multimedia data usually have high feature dimensionality, conventional nearest-neighbor search algorithms cannot meet system demands. Combining the blockchain search-and-query schemes proposed in [29,30,31,32], we propose a new WMSN high-dimensional data search method called FJLT-FLSH; on this basis, we can construct smart contracts in subsequent work to provide query-search applications for WMSN blockchain users.
The FJLT-FLSH method integrates the main ideas of the Drosophila neural network and the FJLT matrix transform; it is a new method of locality-sensitive hash search over WMSN data. The key steps of its high-dimensional-data search process are shown in Fig. 1.
(1) carry out feature extraction to multi-medium data: this step is data preprocessing phase, in the original more of input WMSN
After categorical data, multi-medium data is converted into feature vector by characterizing;In this process, text data is carried out special
Sign is extracted, and generallys use TF-IDF method or word frequency method, text data is converted to the feature vector under theorem in Euclid space;It is right
When image data carries out feature extraction, characterization is carried out by extracting SIFT feature value.
(2) Building a search index over the feature data of the data set: after feature extraction, the feature dimensionality of the data objects is very high, and a direct brute-force search would consume large amounts of memory and time. The data set is therefore projected, and a hash index structure is built over the projected data set so that searches can be executed quickly. This is completed in two steps. First, each data object in the data set is projected into a new metric space via the FJLT projection matrix. Second, the active-neuron winner-take-all (WTA) strategy from the drosophila-neural-network locality-sensitive hashing FLSH algorithm is applied to select among the new features of each projected data object, and the retained features serve as that data object's index.
(3) Mapping the query object into the search index structure: in the query phase, the given data object to be queried is first featurized numerically, then projected with the same FJLT matrix; its new features are selected with the WTA feature-retention strategy, and the retained features form the index of the query data.
(4) In the new metric space, approximate neighbor search is carried out over the generated index to find the data objects most similar to the query point.
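The four search steps above can be sketched end to end as follows. This is a minimal illustration with toy data, assuming numpy; for brevity, the FJLT projection is stood in for by a dense random matrix, and WTA retention is expressed as the set of winning feature indices:

```python
import numpy as np

rng = np.random.default_rng(0)

def wta_hash(v, k):
    """Winner-take-all: the index set of the k largest projected features."""
    return frozenset(np.argpartition(v, -k)[-k:].tolist())

d, m, k = 64, 512, 16            # input dim, expanded dim, hash length (toy values)
proj = rng.standard_normal((m, d)) / np.sqrt(m)   # stand-in for the FJLT projection

data = rng.standard_normal((100, d))              # step 1: featurized data set
index = [wta_hash(proj @ x, k) for x in data]     # step 2: build the search index

query = data[0] + 0.01 * rng.standard_normal(d)   # a query close to data[0]
qhash = wta_hash(proj @ query, k)                 # step 3: hash the query

# step 4: approximate neighbor search by ranking hash overlap
best = max(range(len(data)), key=lambda i: len(index[i] & qhash))
print(best)   # 0: the perturbed source point wins the overlap ranking
```

The overlap ranking stands in for the approximate neighbor search of step (4); near-identical inputs keep almost all of their winning indices, while unrelated points share only a few by chance.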
Next, we develop the proposed FJLT-FLSH algorithm framework stage by stage.
Random projection based on the FJLT matrix: the FJLT transform matrix first appeared in the 2006 STOC paper [7] of Princeton University scholars Nir Ailon and Bernard Chazelle. The main contribution of the algorithm is to combine the uncertainty principle of the Fourier transform with sparse-matrix projection, reducing the distortion during data projection while accelerating the projection of the data set. The "low distortion" and "fast projection" of the FJLT transform matrix have been proven theoretically [7], and its superiority in practice has also been demonstrated [34]. The input of this method is preprocessed image or text data whose features are numerical; the output is a new set of numerical data features representing the mapping of the original data set into the FJLT projection space. When selecting the projection matrix for locality-sensitive hashing (LSH), the distance-preservation property of the matrix and the speed of the projection are the two deciding factors for whether an LSH method is feasible. The properties of the FJLT matrix satisfy both the distance-preservation requirement of LSH during projection and the low-time-complexity requirement; improving the FLSH method with the FJLT matrix can therefore raise the search efficiency of the query mechanism in WMSN.
The FJLT-FLSH locality-sensitive hashing method fusing the drosophila olfactory network with FJLT: the FJLT-based drosophila algorithm proposed by this invention consists of three parts: preprocessing of the data set, projection transformation of the data set, and feature selection. In the preprocessing stage, the main work is to convert image data sets, text data sets, and other non-numeric data sets into numeric data sets. In the second stage, the data set is projected. Sanjoy Dasgupta et al. [1] used a random sparse binary matrix to project the raw data set; the sparse binary matrix improves the projection speed but sacrifices some precision. Here we replace their sparse binary matrix with the FJLT fast transform matrix to project the data set.
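The FJLT projection Φ = P·H·D can be sketched as follows (a minimal numpy illustration, assuming the input dimension d is a power of two; the density q of P is fixed to a constant here purely for illustration, whereas [7] derives it from ε, p, n, and d):

```python
import numpy as np

rng = np.random.default_rng(1)

def fjlt(x, m, q=0.1):
    """Compute Phi·x with Phi = P·H·D (d must be a power of two)."""
    d = x.shape[0]
    D = rng.choice([-1.0, 1.0], size=d)        # diagonal of random signs, +-1 w.p. 1/2
    H = np.array([[1.0]])
    while H.shape[0] < d:                      # Sylvester construction of the
        H = np.block([[H, H], [H, -H]])        # Walsh-Hadamard matrix
    h = (H / np.sqrt(d)) @ (D * x)             # H·D·x
    mask = rng.random((m, d)) < q              # P: N(0, 1/q) entries w.p. q, else 0
    P = np.where(mask, rng.normal(0.0, 1.0 / np.sqrt(q), (m, d)), 0.0)
    return P @ h

y = fjlt(np.ones(64), m=32)
print(y.shape)   # (32,)
```

The dense Hadamard matrix here is for clarity only; in practice H·D·x is applied in O(d log d) with a fast transform, which is where the speed advantage over a plain dense projection comes from.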
The hashing algorithm based on the drosophila olfactory network places high demands on the projection matrix. First, the phenomenon of "distortion" can appear when the data set is projected, so the transform matrix must preserve the pairwise distances after projection to the greatest extent. Second, since the feature dimensionality of the data objects must be amplified during the experiments, the time complexity of the transform matrix during projection must meet the needs of the experiments. The FJLT transform matrix performs excellently in both distance preservation and time complexity [7,34]. In summary, this invention combines the FLSH algorithm of the drosophila olfactory network with the FJLT fast projection algorithm to propose a new locality-sensitive hashing method, FJLT-FLSH.
When using the FJLT transform matrix, the parameters that require special attention are ε and p. The former bounds the fluctuation of relative distances during the projection and can be derived from formula (8): after the data set is projected with FJLT, the relative distances of the transformed data fluctuate within [(1-ε)α_p, (1+ε)α_p], where α_p is a scaling constant (for p = 2, α² = k) and k denotes the dimension of the data objects after projection. According to the J-L theorem, k is related only to the size of the data set and to ε, and is unrelated to the original dimension d of the data set [3]. The parameter p denotes the norm, with usual value range {1, 2}.
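The distortion bound can be checked empirically. The sketch below uses a plain Gaussian projection as a stand-in for the FJLT map (an assumption made for brevity) and measures how far ||Φx||₂ / (α₂·||x||₂) strays from 1, with α₂ = √k:

```python
import numpy as np

rng = np.random.default_rng(2)
d, k, n = 256, 64, 50
X = rng.standard_normal((n, d))          # n points in R^d
Phi = rng.standard_normal((k, d))        # Gaussian stand-in for the FJLT map
Y = X @ Phi.T

alpha2 = np.sqrt(k)                      # scaling constant for p = 2
ratios = np.linalg.norm(Y, axis=1) / (alpha2 * np.linalg.norm(X, axis=1))
eps_emp = np.abs(ratios - 1.0).max()     # empirical distortion over all points
print(eps_emp < 0.5)                     # True: norms concentrate near alpha_2*||x||_2
```

Larger k tightens the concentration, matching the statement that k depends on the data set size and ε but not on d.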
The FJLT-FLSH algorithm embeds the FJLT transform matrix, which satisfies the following two properties:
(1) For any set X of n points in R^d, with ε < 1, p ∈ {1, 2}, and Φ ~ FJLT, the following two events occur with probability at least 2/3:
1. For any x ∈ X, (1-ε)α_p·||x||_2 ≤ ||Φx||_p ≤ (1+ε)α_p·||x||_2 (12)
where α_p is the scaling constant defined above; this formula is a variant of the J-L theorem.
2. The time complexity of the mapping Φ: R^d → R^k is:
O(d·log d + min{d·ε^(-2)·log n, ε^(p-4)·log^(p+1) n}) (13)
The above properties follow the FJLT embedding method based on the J-L theorem proposed in 2006 by Princeton University scholars Nir Ailon and Bernard Chazelle [7].
(2) Time behavior of the projection matrix in the FJLT-FLSH algorithm:
The projection operator in the FJLT-FLSH algorithm is determined by FJLT. The matrix D_{d×d} is diagonal, so the time complexity of D(x) is O(d); the matrix H is a Walsh-Hadamard matrix, so the time complexity of H(Dx) is O(d·log d); finally, the time complexity of P(HDx) is O(|P|), where |P| denotes the number of nonzero elements of matrix P. Since |P| obeys the distribution B(nk, q), it can be derived that
E[|P|] = O(ε^(p-4)·log^(p+1) n) (14)
From the choice of q in the P matrix, the time complexity of applying P is min{d·ε^(-2)·log n, ε^(p-4)·log^(p+1) n}, so the time complexity of the FJLT transform matrix is O(d·log d + min{d·ε^(-2)·log n, ε^(p-4)·log^(p+1) n}). A Markov bound in the proof of the FJLT lemma guarantees this projection time.
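The O(d·log d) term for H(Dx) comes from applying the Walsh-Hadamard matrix with a butterfly recursion rather than a dense O(d²) multiply; a minimal fast Walsh-Hadamard transform sketch:

```python
import numpy as np

def fwht(a):
    """Fast Walsh-Hadamard transform in O(d log d) (unnormalized, d a power of two)."""
    a = a.copy()
    h = 1
    while h < len(a):
        for i in range(0, len(a), h * 2):      # butterfly over blocks of size 2h
            for j in range(i, i + h):
                a[j], a[j + h] = a[j] + a[j + h], a[j] - a[j + h]
        h *= 2
    return a

x = np.array([1.0, 0.0, 1.0, 0.0])
print(fwht(x))   # [2. 2. 0. 0.], identical to the dense H_4 @ x product
```

Each of the log d passes touches all d entries once, which is exactly the O(d·log d) cost attributed to the H(Dx) step.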
The above property is similar in spirit to the new locality-sensitive hashing method proposed in 2006 by MIT scholars Alexandr Andoni and Piotr Indyk [6], which combines random projection with the Leech lattice to reduce time complexity and space complexity. The property theoretically characterizes, when the FJLT matrix is used to reduce the dimensionality of high-dimensional data objects, the relationship among the original dimension d, the data set size n, the distance-preservation parameter ε, and the norm p used. It is of great significance for parameter selection during the dimensionality transformation of this invention.
Proof of the time-complexity property:
Without loss of generality, let ε be a number greater than 0 and sufficiently close to 0. Define the variable h = HD(x) = (h_1, ..., h_d)^T and assume, w.l.o.g., that ||x||_2 = 1. From the generating formulas (10) and (11) of the H and D matrices it can be derived that each h_i is a sum of terms with coefficients a_i = ±d^(-1/2); by the properties of the H and D matrices, it can be concluded that the a_i are independent and uniformly distributed.
A Chernoff-type bound shows that formula (15) holds with probability at least 19/20; clearly, by elementary calculation, formula (16) certainly holds; therefore, by Markov's inequality, the bound can be derived. Formula (14) follows from the union bound (P(A_1 ∪ A_2 ∪ ... ∪ A_k) ≤ P(A_1) + P(A_2) + ... + P(A_k)); then, since nd < n² for arbitrary n and d, the conclusion follows.
Since the matrices H and D are both isometries, h = HD(x) is an isometric transform, i.e., ||h||_2 = ||x||_2. Let y = (y_1, y_2, ..., y_k)^T = Ph = Φx. From the definitions of the three matrices in FJLT, each y_i is a weighted sum with coefficients c_j r_j over the coordinates of h, where c_j takes the value 1 with probability q and the c_j are random, independent, and identically distributed, and the parameter r_j obeys N(0, q^(-1)). Letting Z denote the corresponding conditional variance term, by the 2-stability of the normal distribution, (y_1 | Z = z) ~ N(0, q^(-1)·z). Since y_1, y_2, ..., y_k are independent and identically distributed, Z_1, Z_2, ..., Z_k are likewise independent and identically distributed, with E[Z] = q. The 2006 paper of Princeton University scholars Nir Ailon and Bernard Chazelle [7], which proposed the FJLT embedding based on the J-L theorem, gives the proof of the FJLT lemma under the l_2 norm.
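The mixture structure asserted in the proof can be sanity-checked numerically. The sketch below simulates y_1 = Σ_j c_j·r_j·h_j under the stated assumptions (||h||_2 = 1, c_j ~ Bernoulli(q), r_j ~ N(0, q⁻¹)); the mean of Z should then be q·||h||² = q and the overall variance of y_1 should be E[Z]/q = 1:

```python
import numpy as np

rng = np.random.default_rng(5)
d, q, trials = 128, 0.25, 20000

h = rng.standard_normal(d)
h /= np.linalg.norm(h)                               # ||h||_2 = 1, as assumed w.l.o.g.

c = rng.random((trials, d)) < q                      # c_j = 1 with probability q
r = rng.normal(0.0, 1.0 / np.sqrt(q), (trials, d))   # r_j ~ N(0, q^-1)
y1 = (c * r * h).sum(axis=1)                         # y_1 = sum_j c_j r_j h_j

Z = (c * h**2).sum(axis=1)                           # Z = sum_j c_j h_j^2
print(abs(Z.mean() - q) < 0.01)                      # True: E[Z] = q * ||h||^2 = q
print(abs(np.var(y1) - 1.0) < 0.1)                   # True: Var(y_1) = E[Z]/q = 1
```

This matches the 2-stability argument: conditioned on Z = z, y_1 is Gaussian with variance q⁻¹·z, so averaging over Z recovers unit variance.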
This invention uses two data sets: the first is the handwritten-digit recognition data set Mnist, the second the image data set Sift. The feature dimensionality of the Mnist data set is 784, and that of the Sift data set is 128. In the experiments, a subset of size 10000 is chosen from each data set.
Our experiments are divided into three groups:
Experiment one: first, each of the two data sets is expanded to M dimensions (M = 100, 200, 300, 600, 1000); all M dimensions after expansion are retained and neighbor search is carried out directly, comparing the distance-preservation performance of FJLT and the sparse binary matrix when the dimensionality of the different data sets is expanded;
Experiment two: after the data sets are expanded to different dimensions M, the two projection matrices, FJLT and the sparse binary projection matrix, are applied with two feature-retention strategies: winner-take-all (WTA) and loser-take-all (LTA). The same hash code length k (k = cM, c = 0.01, 0.02, 0.03, 0.05, 0.1, 0.2, 0.3, 0.5) of feature dimensions is retained, and the query precision of the FJLT-WTA, FLY-WTA, FJLT-LTA, and FLY-LTA methods is compared to judge the relative merit of the two algorithms in query precision;
Experiment three: to further analyze the projection properties of the FJLT projection matrix, we choose different feature-retention strategies, WTA, LTA, and RANDOM, and analyze the features of the data set transformed by the FJLT projection matrix to find their regularities.
In the experiments, query precision and time (seconds) are the main evaluation criteria. First, the scale N of the data set is set, and the S points closest to the query point in the Euclidean space of the original data are found; then the S points closest to the query in the Euclidean space of the projected data set are found, and the two sets are intersected. The number of repeated neighbor points divided by S gives the query precision of that query point, as shown in formula (18), where x ∈ X and X denotes the data set; P denotes the probability of querying neighbor points; F denotes the S nearest points around query point x in the Euclidean space of the original data; R denotes the S nearest points around query point x in the Euclidean space of the projected data; |·| returns the size of a set.
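The evaluation criterion can be sketched as follows, reading formula (18) as |F ∩ R| / S (a minimal numpy illustration; the projection matrix is an illustrative stand-in, not the experiments' actual FJLT matrix):

```python
import numpy as np

rng = np.random.default_rng(3)

def knn_set(P, x, S):
    """Index set of the S Euclidean nearest neighbors of x within the rows of P."""
    return set(np.argsort(np.linalg.norm(P - x, axis=1))[:S].tolist())

X = rng.standard_normal((1000, 32))      # original data set, N = 1000
proj = rng.standard_normal((32, 16))     # stand-in projection
Y = X @ proj                             # projected data set

x = X[0]                                 # query point drawn from the data
S = 20
F = knn_set(X, x, S)                     # S nearest neighbors in the original space
R = knn_set(Y, x @ proj, S)              # S nearest neighbors in the projected space
precision = len(F & R) / S               # formula (18), read as |F ∩ R| / S
print(0.0 < precision <= 1.0)            # True: x itself appears in both F and R
```

A precision of 1.0 means the projection perfectly preserved the neighborhood of x; lower values quantify the neighbors lost to projection distortion.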
To make the experimental results more convincing, each parameter setting in the experiments of this invention is repeated 20 times, each time querying 1000 random points; the average of the query precision over these 20 runs of 1000 points is taken as the evaluation standard of this invention.
Experiments and analysis of results:
First, in the case where all expanded feature dimensions are retained, we compared the query precision on the Mnist and Sift data sets of the FJLT projection matrix we use against the sparse binary random matrix used in the drosophila algorithm.
Experiment one: the purpose of this experiment is to judge the distance-preservation performance of the two projection matrices when the feature dimensionality of the data set is expanded. First, the two projection matrices, FJLT and the sparse binary matrix, are used to expand the feature dimensionality of the data set to M; all feature dimensions of the expanded data objects are retained and neighbor search is carried out directly, comparing the "distance-preservation" performance of the two matrices during dimension expansion. In the concrete experiments, S takes the value 200: the 200 points around the query point before and after the mapping are compared as our measure of similarity.
As shown in Figures 2 and 3, as the amplification of the data set's feature dimensionality grows, the advantage in query precision of the FJLT fast transform matrix used by this invention grows as well. This shows that the FJLT transform matrix has better experimental performance during the amplification of feature dimensionality on the data sets we used. The core strategy of the drosophila algorithm is to simulate the olfactory neural network of the fly: the input signal is first amplified, and then the winner-take-all (WTA) strategy of the drosophila algorithm FLSH keeps the k features with the largest values while suppressing the relatively small remaining feature values, yielding the hash code of the locality-sensitive hash. As Figures 2 and 3 show, the FJLT method proposed by this invention attains higher query precision in the feature-amplified space; our FJLT projection matrix is therefore better suited to algorithms simulating the drosophila olfactory neural network.
Table 1. Comparison of the distance-preservation performance of FJLT-FLSH and FLSH when expanding the dimensionality on the two data sets.
Experiment two: the purpose is to compare the time taken by the two projection matrices to project the data set, comparing the time efficiency and time trends of the two matrices. In the experiment, the data set is first amplified to M dimensions, and then the time taken by the projection process alone is measured. We convert matrices and data sets into arrays during projection; this step is necessary and greatly shortens the projection time. Only one program is run per trial, to prevent other programs occupying computer resources and affecting that trial's result; the average over many trials is taken to reduce the interference of abnormal results on normal ones. Our experimental environment is Windows 10, an i5-6500 processor, 8 GB of memory, and the PyCharm IDE.
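The timing methodology described above (project, measure only the projection's wall-clock time, average over runs) can be sketched as follows; the matrices here are illustrative stand-ins, not the experiment's actual data:

```python
import time
import numpy as np

rng = np.random.default_rng(6)
d, m, n = 256, 1024, 500
X = rng.standard_normal((n, d))                           # data set, already arrays

dense_proj = rng.standard_normal((m, d))                  # stand-in projection matrix
sparse_binary = (rng.random((m, d)) < 0.1).astype(float)  # sparse binary matrix

def projection_seconds(M, X, runs=5):
    """Average wall-clock time of the projection step alone."""
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        _ = X @ M.T                                       # only the projection is timed
        times.append(time.perf_counter() - t0)
    return sum(times) / runs

t1 = projection_seconds(dense_proj, X)
t2 = projection_seconds(sparse_binary, X)
print(t1 > 0.0 and t2 > 0.0)   # True: both projection times are measured
```

Averaging over several runs with `time.perf_counter` mirrors the paper's precaution of repeating trials to damp out interference from other processes.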
As shown in Figures 4 and 5, the experimental results show that the FJLT fast transform matrix of this invention has better time performance than the sparse binary random matrix [2] used in the drosophila paper, which indicates that the FJLT matrix is better suited to processing large-scale data sets. Meanwhile, the results also show that as the expansion dimension M increases, the projection time of both projection matrices grows linearly, which is consistent with the actual situation; but the growth rate of the FJLT projection matrix is far below that of the sparse binary matrix. Since the growth rate of the FJLT projection time is smaller, the FJLT matrix is better suited to expanding the dimensionality of the data set.
Table 2. Comparison of the projection time of the FJLT matrix and the sparse binary matrix on the two data sets.
Experiment three: the purpose is to compare the query precision of the two algorithms under the same hash code. First the data set is preprocessed with the divisive normalization of neural perception [23]; the Sift data set is amplified to M dimensions (M = 100, 200, 300, 600, 1000), and then the query precision of the two algorithms under winner-take-all (WTA) and loser-take-all (LTA) is computed. After transformation, each algorithm retains respectively the k largest features (k = cM) among the M feature values and the k smallest features among the M feature values as the hash code length, and the query precision of the two algorithms under the different feature-retention strategies is compared.
As shown in Figures 6-10, this invention first compares the query precision under WTA (winner-take-all) of the FJLT matrix against the sparse binary matrix used in BayesLSH [11], the Bayes-based fast similarity query method proposed in 2012 by Ohio State University scholars Venu Satuluri and Srinivasan Parthasarathy. The comparison shows that when k takes its minimum value, the sparse binary projection matrix has higher accuracy than the FJLT fast transform matrix; however, at such code lengths the overall query precision of both algorithms is low. As the code length increases, the query precision of the FJLT-FLSH algorithm starts to overtake that of the FLSH algorithm and can reach at most 83%. The experimental comparison leads to the conclusion that as the amplitude of the data set's dimension amplification increases, the superiority of the FJLT-FLSH algorithm becomes more prominent.
Table 3. Query precision of the FJLT-FLSH (WTA), FLSH (WTA), FJLT-FLSH (LTA), and FLSH (LTA) algorithms when the hash code length takes M = 600.
After comparing the query precision of the two algorithms under the winner-take-all (WTA) feature-retention strategy, we also compare the query precision of the new features of the transformed data set under other feature-retention rules. First, we compare the precision of the two algorithms under loser-take-all (LTA). We find that the query precision of the FLY algorithm on LTA is relatively low, while the LTA query-precision curve of the FJLT matrix used by this invention essentially coincides with its WTA curve. This suggests that, for neighbor search, the features chosen by LTA and those chosen by WTA make the same contribution. To verify our conjecture that all features after the FJLT transform are equivalent, we designed an experiment: in the FJLT-transformed data set, k features are chosen at random as the hash code, and the resulting query precision is compared with that of the features retained by WTA and LTA.
Experiment four: to verify the contribution distribution of the features of the FJLT-transformed data set, we use three feature-retention strategies. The first is winner-take-all (WTA), choosing the k largest features as our hash code; the second is loser-take-all (LTA), choosing the k smallest feature values as our hash code; the third chooses k feature values at random as our hash code (k = cM, c = 0.01, 0.02, 0.03, 0.05, 0.1, 0.13, 0.15, 0.2, 0.3, 0.5). Neighbor search is then carried out, and the query precision of the three strategies is computed, compared, and analyzed. First, following experiment one, the FJLT transform matrix expands the dimensionality of the data set to M; then the k largest feature values, the k smallest feature values, and k random feature values are retained respectively, neighbor search is carried out, and the query precision is computed. To bring the results closer to the true situation, different query points are generated and the query precision of the three feature-retention strategies is computed.
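The three retention strategies of experiment four can be sketched as follows (assuming numpy; `retain` is an illustrative helper name, not from the source):

```python
import numpy as np

rng = np.random.default_rng(4)

def retain(v, k, strategy):
    """Return k retained feature indices of a projected object v."""
    if strategy == "WTA":        # winner-take-all: indices of the k largest values
        idx = np.argsort(v)[-k:]
    elif strategy == "LTA":      # loser-take-all: indices of the k smallest values
        idx = np.argsort(v)[:k]
    elif strategy == "RANDOM":   # k indices chosen uniformly at random
        idx = rng.choice(len(v), size=k, replace=False)
    else:
        raise ValueError(strategy)
    return np.sort(idx)

v = rng.standard_normal(600)     # one object after expansion to M = 600
k = int(0.01 * 600)              # hash code length k = cM with c = 0.01
for s in ("WTA", "LTA", "RANDOM"):
    idx = retain(v, k, s)
    assert len(idx) == k and len(set(idx.tolist())) == k
print("ok")   # each strategy keeps exactly k distinct feature indices
```

Only the selection rule differs between the three strategies; the downstream neighbor search and the precision computation are identical, which is what makes the comparison isolate the contribution of the retained features.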
As shown in Figures 11-15, it can be concluded that in the data set projected by the FJLT-FLSH transform, the features of the data have identical influence on query precision: taking any k features of the projected data set yields approximately the same query precision. This property is inspiring: if an entirely new feature-retention strategy can be found, query precision and query time may be further improved under it.
Table 4. Query precision of the FJLT projection matrix under the three feature-retention rules (WTA, LTA, RANDOM) when the hash code length takes M = 600.
The experiments show that the FJLT projection matrix is better suited to expanding the data feature dimensionality in the query mechanism of a WMSN blockchain system. Under the same, moderate hash code length, the FJLT-WTA algorithm proposed by this invention has superior query precision; meanwhile, the time spent by the FJLT matrix in the projection process is much smaller than that of the ordinary sparse binary matrix, which shows that the FJLT matrix is better suited to large-scale data neighbor search in WMSN systems. Finally, because the FJLT matrix normalizes by construction, the data features after the FJLT projection contribute equally to the data. Therefore, when selecting feature dimensions, we can use a strategy of lower time complexity, reducing the time of our query process and further improving time efficiency.
Compared with the prior art, this invention is more efficient: it improves the query precision of high-dimensional data in WMSN application systems while guaranteeing time complexity, and meets our demands for querying multimedia data in a WMSN blockchain system. The experimental results show that when the hash strategy of this invention expands the feature dimensionality of odor information, as the drosophila olfactory nerve does during smell recognition, it has better distance-preservation performance and higher query precision while greatly reducing projection time, and is therefore better suited to expanding data feature dimensionality in a biomimetic hashing process. Meanwhile, when retaining feature dimensions of the same length under the same feature-retention strategy, this invention has higher query precision, meeting the WMSN blockchain system's requirements for query precision; and the projection method used makes all features of the projected data set contribute equally to query precision, so a hash code of the specified length can be retained at random, without the secondary processing of the data features required by the winner-take-all (WTA) strategy, thereby improving the time efficiency of the query process.
Claims (7)
1. An efficient drosophila-neural-network hash-search WMSN data method, characterized in that it comprises the following steps:
Step 1: perform feature extraction on multimedia data, converting the multimedia data into feature-vector data through featurization;
Step 2: build a search index over the feature-vector data of the data set after projecting it with the FJLT fast transform matrix;
Step 3: map the query object into the search index structure; project the given data object to be queried with the FJLT fast transform matrix to form the index of the query data;
Step 4: based on the index of the query data, perform approximate neighbor search in the search index to find the data objects most similar to the query point.
2. The efficient drosophila-neural-network hash-search WMSN data method according to claim 1, characterized in that: in step 1, text data is converted into feature vectors in Euclidean space using the TF-IDF method or the term-frequency method; image data is featurized by extracting SIFT feature values.
3. The efficient drosophila-neural-network hash-search WMSN data method according to claim 1, characterized in that step 2 specifically comprises the following steps:
Step 2.1: project each data object of the data set into a new metric space with the FJLT fast transform matrix;
Step 2.2: apply the active-neuron winner-take-all strategy from the drosophila-neural-network locality-sensitive hashing FLSH algorithm to select among the new features of each projected data object, taking the retained features as that data object's index.
4. The efficient drosophila-neural-network hash-search WMSN data method according to claim 1, characterized in that the FJLT fast transform projection of step 2 is computed as follows:
Y = P × H × D × X
where Y is the projected data, X is the feature-vector data of the data set, P is a sparse matrix theoretically supported by the J-L theorem, H is a Hadamard matrix related to the original dimension d, and D is a diagonal matrix whose diagonal values are -1 or 1, each with probability 1/2.
5. The efficient drosophila-neural-network hash-search WMSN data method according to claim 4, characterized in that: each element of the P matrix satisfies a normal distribution with mean 0 and variance q^(-1) with probability q, i.e., each element of the P matrix satisfies P_ij ~ N(0, q^(-1)) with probability q, and the elements of the matrix not satisfying the P_ij ~ N(0, q^(-1)) distribution take the value 0, where the computation of q is as follows:
where d is the original dimension of the data set, n is the size of the data set, ε is the distance-preservation parameter, and p is the norm used.
6. The efficient drosophila-neural-network hash-search WMSN data method according to claim 4, characterized in that each element H_ij of the H matrix can be computed by the following formula:
where d is the original dimension of the data set.
7. The efficient drosophila-neural-network hash-search WMSN data method according to claim 4, characterized in that the D matrix of dimension d is expressed as:
where d is the original dimension of the data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910039907.1A CN109739999A (en) | 2019-01-16 | 2019-01-16 | A kind of efficient drosophila neural network Hash Search WMSN data method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109739999A true CN109739999A (en) | 2019-05-10 |
Family
ID=66365008
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910039907.1A Pending CN109739999A (en) | 2019-01-16 | 2019-01-16 | A kind of efficient drosophila neural network Hash Search WMSN data method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109739999A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102609441A (en) * | 2011-12-27 | 2012-07-25 | 中国科学院计算技术研究所 | Local-sensitive hash high-dimensional indexing method based on distribution entropy |
US20170161271A1 (en) * | 2015-12-04 | 2017-06-08 | Intel Corporation | Hybrid nearest neighbor search tree with hashing table |
Non-Patent Citations (2)
Title |
---|
NIR AILON等: "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform", 《PROCEEDINGS OF THE THIRTY-EIGHTH ANNUAL ACM SYMPOSIUM ON THEORY OF COMPUTING》 * |
SANJOY DASGUPTA等: "A neural algorithm for a fundamental computing problem", 《SCIENCE》 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113364884A (en) * | 2021-06-28 | 2021-09-07 | 福建师范大学 | Industrial Internet of things recessive anomaly detection method based on local sensitive bloom filter |
CN113535717A (en) * | 2021-06-28 | 2021-10-22 | 福建师范大学 | Retrieval method and system based on Laplacian operator and LSH technology |
CN113535717B (en) * | 2021-06-28 | 2023-07-18 | 福建师范大学 | Retrieval method and system based on Laplacian operator and LSH technology |
CN113364884B (en) * | 2021-06-28 | 2023-06-30 | 福建师范大学 | Industrial Internet of things recessive anomaly detection method based on local sensitive bloom filter |
CN113313212B (en) * | 2021-06-29 | 2023-06-06 | 福建师范大学 | Implicit abnormality detection method and device based on drosophila olfactory neural network |
CN113313212A (en) * | 2021-06-29 | 2021-08-27 | 福建师范大学 | Fruit fly olfactory neural network-based recessive anomaly detection method and device |
CN114003635A (en) * | 2021-09-27 | 2022-02-01 | 中国科学院自动化研究所 | Recommendation information acquisition method, device, equipment and product |
CN114003266B (en) * | 2021-10-14 | 2022-05-27 | 红石阳光(深圳)科技有限公司 | Method and device for generating multiple differential packets based on Android flash firmware |
CN114003266A (en) * | 2021-10-14 | 2022-02-01 | 红石阳光(深圳)科技有限公司 | Method and device for generating multiple differential packets based on Android flash firmware |
CN114020839B (en) * | 2021-10-29 | 2022-07-22 | 哈尔滨工业大学 | Academic achievement publishing and right authentication system and method based on block chain |
CN114020839A (en) * | 2021-10-29 | 2022-02-08 | 哈尔滨工业大学 | Academic achievement publishing and right authentication system and method based on block chain |
CN115085941A (en) * | 2022-07-25 | 2022-09-20 | 深圳麦客存储科技有限公司 | Computer data information processing method based on block chain network |
CN115085941B (en) * | 2022-07-25 | 2022-11-15 | 深圳麦客存储科技有限公司 | Computer data information processing method based on block chain network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20190510 |