CN114357220A

CN114357220A - Similar medical image calculation method based on locality sensitive hashing algorithm

Info

Publication number: CN114357220A
Application number: CN202210019188.9A
Authority: CN
Inventors: 刘万里; 杨晓辉; 张武; 王逸文; 王泽廷; 徐雷; 李鑫
Original assignee: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL; Nanjing University of Science and Technology
Current assignee: NANJING INTEGRATED TRADITIONAL CHINESE AND WESTERN MEDICINE HOSPITAL; Nanjing University of Science and Technology
Priority date: 2022-01-07
Filing date: 2022-01-07
Publication date: 2022-04-15

Abstract

The invention provides a similar medical image calculation method based on a locality sensitive hashing algorithm. The method comprises the steps of medical image vectorization, hash bucket calculation, hash bucket list construction, medical image vector similarity calculation and the like. The method is mainly applied to a massive high-dimensional vector space formed by medical images, can quickly classify the medical images and quickly screen out similar items of target images in the high-dimensional vector space, and has better time complexity compared with the existing classification calculation method.

Description

Similar medical image calculation method based on locality sensitive hashing algorithm

Technical Field

The invention relates to the field of big data, lies at the intersection of medical imaging and computer science, and particularly relates to a calculation method for finding similar vectors in a high-dimensional vector space.

Background

Big data is now widely applied in the medical field, which has accumulated very rich data resources. Medical images, including X-ray, magnetic resonance imaging, and ultrasound images, are key links in the medical process. Radiologists often need to view each examination individually, which creates an unrealistically large workload and may delay the optimal treatment time for the patient. Big data can fundamentally change this way of analysis. In the past, large volumes of medical image data were checked one by one and classified manually, which consumes labor and time, while human resources in the medical field are precious. It is therefore very important to help medical workers classify medical image data by means of big data technology and similarity retrieval. The invention provides a similar medical image calculation method based on a locality sensitive hashing algorithm, which improves on the existing locality sensitive hashing algorithm and can help medical workers quickly complete the classification and retrieval of medical image data. The method can quickly classify massive medical image data and screen out items similar to a target image, thereby greatly saving labor cost and relieving the data classification burden on medical workers.

Disclosure of Invention

The invention provides a similar medical image calculation method based on a locality sensitive hashing algorithm. The method is mainly used for the classification and retrieval of medical images: in a space containing a massive number of medical images, it can quickly complete image classification and screen out items similar to a target image.

The technical solution for realizing the purpose of the invention is as follows:

Step 1, the medical image picture is flattened into a one-dimensional vector.

For a medical image picture of size (N × N) in a massive medical image space, a pixel matrix P of the picture is obtained, wherein the dimension of the pixel matrix is (N × N) and each element value in the matrix is 0 or 1. The matrix is expanded by rows to obtain an N²-dimensional vector. All pictures in the medical image library are flattened in the same way to obtain a high-dimensional vector space Q.
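The row-wise expansion in step 1 can be sketched as follows. This is a minimal illustration only; the function name `flatten_image` and the 3 × 3 sample matrix are assumptions made for the example and do not come from the source.

```python
def flatten_image(pixels):
    """Expand an N x N binary pixel matrix by rows into an N*N vector."""
    n = len(pixels)
    if any(len(row) != n for row in pixels):
        raise ValueError("expected a square N x N pixel matrix")
    # Row-major order: row 0 first, then row 1, and so on.
    return [value for row in pixels for value in row]


# Hypothetical 3 x 3 binary image, so the vector has 3 * 3 = 9 components.
P = [[0, 1, 0],
     [1, 1, 0],
     [0, 0, 1]]
v = flatten_image(P)
print(v)  # [0, 1, 0, 1, 1, 0, 0, 0, 1]
```

Applying the same function to every picture in the library would give the vector space Q.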

Step 2, constructing a redundant hash table set of the high-dimensional vector space Q.

The projection values of all vectors in the vector space Q are calculated using two hash functions, and each vector is projected into one or two hash buckets. The specific hash functions are shown in equations 1.1 and 1.2.

$$ h(v) = \left\lfloor \frac{d(v, x)}{w} \right\rfloor \tag{1.1} $$

$$ r = \operatorname{mod}\big(d(v, x),\, w\big) \tag{1.2} $$

Equation 1.1 calculates the central bucket, where v is a vector in the vector space Q, x is a random vector of the same dimension whose elements each follow a Gaussian distribution and which is used as the reference vector, and d(v, x) is the projection distance of the vector v in the direction of the vector x. w is the width of a hash bucket; the choice of width largely determines the number of vectors falling into the same hash bucket and the sparsity of the hash buckets. The floor operator in equation 1.1 rounds d(v, x)/w down to an integer. The hash bucket corresponding to this rounded-down value h(v) is taken as the central bucket, and the vector v is always placed into the central bucket first.

Equation 1.2 is used to calculate the redundant bucket, where mod() computes the remainder r of d(v, x) divided by w. As shown in fig. 1, when the hash value of the vector v is close to the left (or right) boundary of its bucket, points similar to the target item may fall in the hash bucket on its left (or right). To obtain the similar set of the target item sufficiently, the boundary conditions are handled case by case.

The specific calculation method is as follows, where h'(v) denotes the redundant bucket index:

$$ h'(v) = \begin{cases} \left\lfloor d(v, x)/w \right\rfloor - 1, & r < c \times w \\ \left\lfloor d(v, x)/w \right\rfloor, & c \times w \le r \le (1 - c) \times w \\ \left\lfloor d(v, x)/w \right\rfloor + 1, & r > (1 - c) \times w \end{cases} $$

c is a hyper-parameter. If the redundant bucket obtained is equal to the central bucket, no additional put operation is performed; if it is not equal to the central bucket, the vector is also placed into the redundant bucket. These steps are repeated to complete the projection calculation for all vectors in the vector space Q, and each vector is placed into its central bucket and, where applicable, its redundant bucket. The concept of locality sensitive hashing with redundancy distinguishes this method from current mainstream locality sensitive hashing algorithms and makes fuller use of the information contained in the feature vectors. The redundant similarity calculation avoids the influence of boundary errors and thus helps ensure the accuracy of the similarity results.
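The central and redundant bucket calculation for one vector can be sketched as below. This is a hedged illustration, not the patented implementation: the source does not reproduce the formula for d(v, x), so the sketch assumes it is the plain dot product of v and x; the names `projection` and `bucket_indices` are hypothetical; the reference vector is fixed by hand (rather than drawn from a Gaussian) so the arithmetic can be followed; and a remainder exactly equal to (1 − c) × w is treated as belonging to the central range.

```python
import math


def projection(v, x):
    # Assumption: d(v, x) is taken to be the dot product of v and x.
    return sum(vi * xi for vi, xi in zip(v, x))


def bucket_indices(v, x, w, c):
    """Return (central_bucket, redundant_bucket); the latter is None
    when the remainder lies in the central range [c*w, (1-c)*w]."""
    d = projection(v, x)
    center = math.floor(d / w)
    r = d - center * w  # remainder of d divided by w, in [0, w)
    if r < c * w:
        return center, center - 1  # close to the left boundary
    if r > (1 - c) * w:
        return center, center + 1  # close to the right boundary
    return center, None


x = [0.5, 2.0, 3.0, 5.0]  # hand-picked reference vector for the example
print(bucket_indices([1, 0, 1, 1], x, w=4.0, c=0.25))  # (2, 1)
print(bucket_indices([1, 1, 0, 1], x, w=4.0, c=0.25))  # (1, 2)
print(bucket_indices([0, 1, 1, 1], x, w=4.0, c=0.25))  # (2, None)
```

With w = 4 and c = 0.25 the thresholds are 1 and 3, so remainders of 0.5, 3.5 and 2 fall in the left, right and central ranges respectively.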

A reference vector x is then reselected, and all projection operations of step 2 are repeated in the direction of the new vector x to obtain a new hash bucket list. In total, n reference vectors are selected and all projection operations in the n directions are completed, yielding n hash bucket lists.
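Building the n hash bucket lists might look like the following sketch. It assumes, as a labeled simplification, that d(v, x) is the dot product and that each bucket stores the indices of the vectors it contains; `build_bucket_lists`, `seed`, and the sample data are hypothetical names and values chosen for illustration.

```python
import math
import random
from collections import defaultdict


def build_bucket_lists(vectors, n_refs, w, c, seed=0):
    """Return (reference_vectors, bucket_lists), one bucket list per reference."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Each element of a reference vector is drawn from a Gaussian distribution.
    refs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_refs)]
    bucket_lists = []
    for x in refs:
        buckets = defaultdict(set)
        for idx, v in enumerate(vectors):
            d = sum(vi * xi for vi, xi in zip(v, x))  # assumed d(v, x)
            center = math.floor(d / w)
            r = d - center * w
            buckets[center].add(idx)  # always placed in the central bucket
            if r < c * w:
                buckets[center - 1].add(idx)  # left redundant bucket
            elif r > (1 - c) * w:
                buckets[center + 1].add(idx)  # right redundant bucket
        bucket_lists.append(dict(buckets))
    return refs, bucket_lists


vectors = [[0, 1, 1, 0], [0, 1, 1, 0], [1, 0, 0, 1]]
refs, bucket_lists = build_bucket_lists(vectors, n_refs=5, w=2.0, c=0.2)
print(len(bucket_lists))  # 5
```

Identical vectors always share the same buckets in every list, while each vector occupies one or two buckets per list.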

Step 3, calculating the similar vectors of a single target medical image vector.

After the above redundant hash bucket set is established, the similar vectors of a single target medical image vector are calculated as follows. Assuming a target vector g, m reference vectors are randomly selected from the n reference vectors. For each selected reference vector x, the algorithm described in step 2 is used to calculate

$$ i = \left\lfloor \frac{d(g, x)}{w} \right\rfloor $$

to obtain the central bucket i, and to calculate

$$ r = \operatorname{mod}\big(d(g, x),\, w\big) $$

to obtain the remainder r. The candidate vectors are then extracted according to the following cases.

If c × w ≤ r ≤ (1 − c) × w, there is no redundant bucket number. In this case the scheme considers that all similar items of the target item are in the central bucket, and the algorithm extracts only the vectors in the central bucket. If and only if the number of candidates in the central bucket is lower than a certain threshold, vectors are randomly extracted from the hash buckets close to it on both sides, as shown in fig. 2; to ensure that the candidates remain relevant, the method extends the search by a distance of at most two hash buckets to the left and to the right.

If r < c × w, the algorithm searches for the hash bucket closest to the left side of the central bucket, searching leftwards over a distance of at most two hash buckets. If a valid redundant bucket is found, the vectors in the central bucket and the redundant bucket are extracted together as candidate vectors. If no valid bucket is found within a distance of two hash buckets to the left, the algorithm considers that the central bucket has no left similar hash bucket and extracts only the candidates in the central bucket.

If r > (1 − c) × w, the algorithm searches for the right similar hash bucket in the same way as for the left side and extracts the candidate vectors.

Because of the redundant hash buckets, duplicates may exist in the candidate vector set obtained, so duplicate vectors must be removed from the candidate vector set.
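Candidate extraction from a single hash bucket list can be sketched as follows. The sketch makes several labeled assumptions: a bucket list is a dictionary mapping bucket numbers to sets of vector indices, a "valid" bucket is taken to mean a non-empty one, and when the central bucket is too small the sketch takes every vector of the nearest neighbouring buckets instead of sampling them at random. The names `candidates`, `min_count`, and `max_steps` are hypothetical.

```python
def candidates(buckets, center, r, w, c, min_count=1, max_steps=2):
    """Collect candidate indices for a target whose central bucket is
    `center` and whose remainder is `r` in one hash bucket list."""
    found = set(buckets.get(center, ()))

    def nearest(direction):
        # Look at most `max_steps` buckets away on one side.
        for step in range(1, max_steps + 1):
            members = buckets.get(center + direction * step)
            if members:
                return set(members)
        return set()

    if r < c * w:
        found |= nearest(-1)  # left similar bucket
    elif r > (1 - c) * w:
        found |= nearest(+1)  # right similar bucket
    elif len(found) < min_count:
        # Central range but too few candidates: widen to both sides.
        found |= nearest(-1) | nearest(+1)
    return found  # a set, so duplicates are already eliminated


buckets = {1: {7}, 3: {1, 2}, 4: {9}}
print(sorted(candidates(buckets, center=3, r=0.5, w=4.0, c=0.25)))  # [1, 2, 7]
print(sorted(candidates(buckets, center=3, r=3.5, w=4.0, c=0.25)))  # [1, 2, 9]
print(sorted(candidates(buckets, center=3, r=2.0, w=4.0, c=0.25)))  # [1, 2]
```

Returning a set is one simple way to perform the duplicate elimination described above.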

After step 3, the algorithm obtains a candidate vector set x′ = {x′₁, x′₂, x′₃, ..., x′_m}. For each vector y in the candidate vector set x′, the vector y and the target vector g are said to have a projection overlap, and the overlap count of the vector pair (g, y) formed by the target vector and the vector y is incremented by one. Applying this method to each of the m selected redundant hash bucket lists yields m candidate vector sets, and overlap counting is performed on each of them. For a vector y, if its overlap count is not less than a specified threshold t, the algorithm considers y to be a similar vector of the target vector. The threshold t determines, to a certain extent, the number of similar vectors of the target vector.

Step 4, sorting the candidate vector set.

The similar vector set obtained in step 3 is sorted by distance to the target vector, and the N closest vectors form the most similar vector set. The first-ranked vector is selected as the most similar vector of the target vector, which completes the similar vector retrieval.
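The distance sorting of step 4 can be illustrated with the short sketch below. The source does not name the distance measure, so Euclidean distance is used here purely as an assumption; `rank_by_distance` and `top_n` are hypothetical names.

```python
import math


def rank_by_distance(target, similar_vectors, top_n):
    """Return the top_n vectors closest to target (Euclidean distance assumed)."""
    def distance(a, b):
        return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

    ranked = sorted(similar_vectors, key=lambda v: distance(target, v))
    return ranked[:top_n]


g = [0, 1, 1, 0]
similar = [[1, 0, 0, 1], [0, 1, 1, 1], [0, 1, 1, 0]]
most_similar = rank_by_distance(g, similar, top_n=2)
print(most_similar[0])  # [0, 1, 1, 0]
```

For 0/1 vectors the Euclidean ordering coincides with the ordering by number of differing pixels, so either choice would rank the candidates identically.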

Drawings

FIG. 1 illustrates the redundant similarity calculation of the present invention.

FIG. 2 illustrates the similar hash buckets of the present invention.

Detailed Description

For a better understanding of the present disclosure, reference is made to the following description taken in conjunction with the accompanying drawings.

The invention discloses a similar medical image calculation method based on locality sensitive hashing, which comprises the following steps:

Step 1, for a medical image picture of size (N × N) in a massive medical image space, a pixel matrix P of the picture is obtained, wherein the dimension of the pixel matrix is (N × N) and each element value in the matrix is 0 or 1. The matrix is expanded by rows to obtain an N²-dimensional vector, and all pictures in the medical image library are flattened in the same way to obtain a high-dimensional vector space Q.

Step 2, calculating the hash values of all vectors in the vector space Q, and determining the hash bucket number corresponding to each vector according to its hash value. The specific steps of the complete process are as follows:

(1a) A random vector x of dimension k is constructed, wherein each element of x follows a Gaussian distribution, and this random vector is used as the reference vector.

(1b) Each vector in the vector space Q is projected into one or two hash buckets using the following equations:

$$ h(v) = \left\lfloor \frac{d(v, x)}{w} \right\rfloor \tag{1.4} $$

$$ r = \operatorname{mod}\big(d(v, x),\, w\big) \tag{1.5} $$

where v is a vector in the vector space Q, x is the random k-dimensional vector, d(v, x) is the projection distance of the vector v in the direction of the vector x, and w is the width of a hash bucket. The choice of width largely determines the number of vectors falling into the same hash bucket and the sparsity of the hash buckets. The formula for d(v, x) is expressed in terms of the vector dimension k and the i-th components v_i and x_i of the vectors v and x, respectively.

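(1c) In equation (1.4), the floor operator denotes rounding d(v, x)/w down to an integer; the hash bucket with this number is the central bucket of the vector v.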

(1d) Equation (1.5) is used to calculate the redundant similar bucket, where mod() computes the remainder r of d(v, x) divided by w. As shown in fig. 1, when the hash value of the target item is close to the left (or right) boundary of its bucket, points similar to the target vector may fall in the hash bucket on its left (or right). To obtain the similar set of the target vector sufficiently, the boundary conditions are handled case by case. The specific calculation formula, with h'(v) denoting the redundant bucket number, is as follows:

$$ h'(v) = \begin{cases} \left\lfloor d(v, x)/w \right\rfloor - 1, & r < c \times w \\ \left\lfloor d(v, x)/w \right\rfloor, & c \times w \le r \le (1 - c) \times w \\ \left\lfloor d(v, x)/w \right\rfloor + 1, & r > (1 - c) \times w \end{cases} $$

c is a hyper-parameter and w is the width of the hash bucket. The cases are explained below.

If r < c × w, the redundant bucket number is one less than the central bucket number. The redundant bucket is located to the left of the central bucket, and the vector v is placed into the left redundant bucket.

If c × w ≤ r ≤ (1 − c) × w, the redundant bucket number equals the central bucket number. The redundant bucket is the central bucket, and no additional put operation is performed.

If r > (1 − c) × w, the redundant bucket number is one more than the central bucket number. The redundant bucket is located to the right of the central bucket, and the vector v is placed into the right redundant bucket.
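As a worked illustration of the three cases, assume (purely for the example; these values do not come from the source) a bucket width w = 4 and a hyper-parameter c = 0.25, so that c × w = 1 and (1 − c) × w = 3. Three hypothetical projection distances then give:

```latex
\begin{aligned}
d(v,x) = 8.5  &:\quad \lfloor 8.5/4 \rfloor = 2,\quad r = 8.5 - 2 \cdot 4 = 0.5 < 1
  &&\Rightarrow\ \text{central bucket } 2,\ \text{left redundant bucket } 1 \\
d(v,x) = 10   &:\quad \lfloor 10/4 \rfloor = 2,\quad r = 10 - 2 \cdot 4 = 2,\ 1 \le 2 \le 3
  &&\Rightarrow\ \text{central bucket } 2 \text{ only} \\
d(v,x) = 11.5 &:\quad \lfloor 11.5/4 \rfloor = 2,\quad r = 11.5 - 2 \cdot 4 = 3.5 > 3
  &&\Rightarrow\ \text{central bucket } 2,\ \text{right redundant bucket } 3
\end{aligned}
```

All three vectors share central bucket 2, but only the first and third are also copied into a neighbouring bucket, because their projections lie within c × w of a bucket boundary.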

Step 3, reselecting a reference vector x, and repeating the projection operation of step 2 in the direction of the new vector x to obtain a new hash bucket list. In total, n reference vectors are selected and all projection operations in the n directions are completed, yielding n hash bucket lists.

Step 4, calculating the hash bucket of the target vector. Assuming a target vector g, m reference vectors are randomly selected from the n reference vectors. For each selected reference vector x, the algorithm of step 2 is used to calculate

$$ i = \left\lfloor \frac{d(g, x)}{w} \right\rfloor $$

to obtain the central bucket i, and to calculate

$$ r = \operatorname{mod}\big(d(g, x),\, w\big) $$

to obtain the remainder r. The hash bucket number of the target vector is then determined according to the following cases.

If c × w ≤ r ≤ (1 − c) × w, there is no redundant bucket number. The algorithm considers that all similar items of the target vector are in the central bucket and extracts only the candidate vectors in the central bucket. If and only if the number of candidate vectors in the central bucket is lower than a certain threshold, vectors are randomly extracted from the hash buckets close to it on both sides; to ensure that the candidates remain relevant, the search extends over a distance of at most two hash buckets to the left and to the right.

If r < c × w, the algorithm searches for the hash bucket closest to the left side of the central bucket, searching leftwards over a distance of at most two hash buckets. If a valid redundant bucket is found, the candidate vectors in the central bucket and the redundant bucket are extracted together. If no valid bucket is found within a distance of two hash buckets to the left, the algorithm considers that the central bucket has no left similar hash bucket and extracts only the candidate vectors in the central bucket.

If r > (1 − c) × w, the algorithm searches for the right similar hash bucket in the same way as for the left side and extracts the candidate vectors.

Because of the redundant hash buckets, duplicate vectors may exist in the candidate vector set obtained, so duplicate vectors must be removed from the candidate vector set.

Step 5, through the above steps, a candidate vector set x′ = {x′₁, x′₂, x′₃, ..., x′_m} is obtained. Each vector x′_i in the candidate vector set x′ is said to have a projection overlap with the target vector g, and the overlap count of the vector pair (g, x′_i) formed by the target vector and the vector x′_i is incremented by one. Applying the above method to one redundant hash bucket list yields one candidate vector set.

Step 6, as can be seen from step 3, n hash bucket lists are constructed in the method. The operation of step 5 is repeated on each hash bucket list to obtain a projection overlap count table. For each vector pair (g, x′_i) in the projection overlap count table, if its overlap count is not less than the specified threshold t, the algorithm considers the vector x′_i to be a similar vector of the target vector. The threshold t determines, to a certain extent, the number of similar vectors of the target vector.
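The projection overlap counting of steps 5 and 6 can be sketched as follows, assuming the candidate set from each hash bucket list is already available as a set of vector identifiers. The name `similar_by_overlap` and the sample data are hypothetical.

```python
from collections import Counter


def similar_by_overlap(candidate_sets, t):
    """Count in how many candidate sets each vector appears and keep
    those whose overlap count is not less than the threshold t."""
    counts = Counter()
    for candidate_set in candidate_sets:
        # set() guards against duplicates inside a single candidate set.
        counts.update(set(candidate_set))
    similar = {vec_id for vec_id, count in counts.items() if count >= t}
    return similar, counts


# Four hypothetical candidate sets, one per hash bucket list.
candidate_sets = [{1, 2}, {2, 3}, {1, 2}, {4}]
similar, counts = similar_by_overlap(candidate_sets, t=2)
print(sorted(similar))  # [1, 2]
print(counts[2])        # 3
```

Raising t keeps only vectors that collide with the target under more reference directions, which is how the threshold controls the number of similar vectors returned.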

Step 7, sorting all the similar vectors obtained in step 6 by distance to obtain the closest vector, and regenerating a medical image picture of size (N × N) from this vector to complete the similar image retrieval.
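Regenerating the (N × N) picture from the closest vector is the inverse of the row-wise expansion in step 1. A minimal sketch, with the hypothetical function name `unflatten`:

```python
import math


def unflatten(vector):
    """Rebuild the N x N pixel matrix from an N*N row-major vector."""
    n = math.isqrt(len(vector))
    if n * n != len(vector):
        raise ValueError("vector length is not a perfect square")
    return [vector[i * n:(i + 1) * n] for i in range(n)]


print(unflatten([0, 1, 0, 1, 1, 0, 0, 0, 1]))
# [[0, 1, 0], [1, 1, 0], [0, 0, 1]]
```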

The present invention will be described in further detail with reference to examples.

Example 1

Step 1, for a medical image picture of size (256 × 256) in a massive medical image space, a pixel matrix P of the picture is obtained, wherein the dimension of the pixel matrix is (256 × 256) and each element value in the matrix is 0 or 1. The matrix is expanded by rows to obtain a 65536-dimensional vector, and all pictures in the medical image library are flattened in the same way to obtain a high-dimensional vector space Q.

Step 2, calculating the hash values of all vectors in the vector space Q, and determining the hash bucket number corresponding to each vector according to its hash value.

Step 3, reselecting a reference vector x, and repeating the projection operation of step 2 in the direction of the new vector x to obtain a new hash bucket list. In total, 100 reference vectors are selected and all projection operations in the 100 directions are completed, yielding 100 hash bucket lists.

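Step 4, calculating the hash bucket of the target vector. Assuming a target vector g, 50 reference vectors are randomly selected from the 100 reference vectors. For each selected reference vector x, the algorithm of step 2 is used to calculate

$$ i = \left\lfloor \frac{d(g, x)}{w} \right\rfloor $$

to obtain the central bucket i, and to calculate

$$ r = \operatorname{mod}\big(d(g, x),\, w\big) $$

to obtain the remainder r.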

Step 5, through the above steps, a candidate vector set x′ = {x′₁, x′₂, x′₃, ..., x′_m} is obtained. Each vector x′_i in the candidate vector set x′ is said to have a projection overlap with the target vector g, and the overlap count of the vector pair (g, x′_i) formed by the target vector and the vector x′_i is incremented by one. Applying the above method to one redundant hash bucket list yields one candidate vector set.

Step 6, as can be seen from step 3, n hash bucket lists are constructed in the method. The operation of step 5 is repeated on each hash bucket list to obtain a projection overlap count table. For each vector pair (g, x′_i) in the projection overlap count table, if its overlap count is not less than the specified threshold t = 20, the algorithm considers the vector x′_i to be a similar vector of the target vector. The threshold t determines, to a certain extent, the number of similar vectors of the target vector.

Step 7, sorting all the similar vectors obtained in step 6 by distance to obtain the closest vector, and regenerating a medical image picture of size (256 × 256) from this vector to complete the similar image retrieval.

Claims

1. A similar medical image calculation method based on a locality sensitive hashing algorithm is characterized by comprising the following steps:

Step 1, flattening the medical image pictures: a flattening operation is performed on each image picture of size (N × N), tiling it into an N²-dimensional vector, to construct a high-dimensional vector space.

Step 2, constructing a random reference vector, and projecting all medical image vectors in the vector space into central buckets and redundant buckets.

Step 3, repeating the operation of step 2 to construct n hash bucket lists, thereby completing the classification of the medical images.

Step 4, calculating the similar vectors of a single medical image vector in the n hash bucket lists, and selecting the similar vectors by a projection counting method.

Step 5, sorting the similar vectors obtained in step 4 by distance to obtain the most similar medical image vector.

Step 6, regenerating a medical image picture of size (N × N) from the most similar medical image vector obtained in step 5.

2. The method for computing a similar medical image based on a locality sensitive hashing algorithm of claim 1, wherein the flattening of the medical image picture in step 1 comprises: for a medical image picture of size (N × N), obtaining a pixel matrix P of dimension (N × N), wherein each element value in the matrix is 0 or 1; expanding the matrix by rows to obtain an N²-dimensional vector; and flattening all pictures in the medical image library in the same way to obtain a high-dimensional vector space Q.

3. The method for computing similar medical images based on a locality sensitive hashing algorithm according to claim 1, wherein step 2 computes the central buckets and redundant buckets of all medical image vectors in the vector space as follows:

(1) Step 1 yields a high-dimensional vector space Q whose vectors have dimension N². An N²-dimensional random vector x is constructed, each element of which follows a Gaussian distribution, and this random vector is used as the reference vector.

(2) All vectors in the above vector space are projected into one or two hash buckets using the following equations:

$$ h(v) = \left\lfloor \frac{d(v, x)}{w} \right\rfloor $$

$$ r = \operatorname{mod}\big(d(v, x),\, w\big) $$

where v is a vector in the vector space Q, x is the above random N²-dimensional vector, d(v, x) is the projection distance of the vector v in the direction of the vector x, and w is the width of a hash bucket. The choice of width largely determines the number of vectors falling into the same hash bucket and the sparsity of the hash buckets. The formula for d(v, x) is expressed in terms of the vector dimension k and the i-th components v_i and x_i of the vectors v and x, respectively.

(3) In the first formula above, the floor operator denotes rounding d(v, x)/w down to an integer, which gives the central bucket.

(4) The second formula is used to calculate the redundant similar bucket, where mod() computes the remainder r of d(v, x) divided by w, and c is a hyper-parameter. If r < c × w, the redundant bucket is located to the left of the central bucket, and the vector v is placed into the left redundant bucket. If c × w ≤ r ≤ (1 − c) × w, the redundant bucket equals the central bucket and no additional put operation is performed. If r > (1 − c) × w, the redundant bucket is located to the right of the central bucket, and the vector v is placed into the right redundant bucket.

4. The method for computing similar medical images based on locality sensitive hashing algorithm according to claim 1, wherein the computing of the similar vectors of the single medical image vector in step 4 is as follows:

(1) Calculating the hash bucket of a single target medical image vector: assuming a target vector g, m reference vectors are randomly selected from the n reference vectors, and for each selected reference vector x the algorithm of step 2 is used to calculate

$$ i = \left\lfloor \frac{d(g, x)}{w} \right\rfloor $$

to obtain the central bucket i, and to calculate

$$ r = \operatorname{mod}\big(d(g, x),\, w\big) $$

to obtain the remainder r.

(2) Through the above steps, a candidate vector set x′ = {x′₁, x′₂, x′₃, ..., x′_m} is obtained. Each vector x′_i in the candidate vector set x′ is said to have a projection overlap with the target vector g, and the overlap count of the vector pair (g, x′_i) formed by the target vector and the vector x′_i is incremented by one. Applying the above method to one redundant hash bucket list yields one candidate vector set.

(3) The method constructs n hash bucket lists, and operations (1) and (2) are repeated on each hash bucket list to obtain a projection overlap count table. For each vector pair (g, x′_i) in the projection overlap count table, if its overlap count is not less than the specified threshold t, the algorithm considers the vector x′_i to be a similar vector of the target vector.