CN106777090A - Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching - Google Patents

Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching

Info

Publication number
CN106777090A
CN106777090A (application CN201611150453.8A)
Authority
CN
China
Prior art keywords
image
feature
skyline
vector
medical
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201611150453.8A
Other languages
Chinese (zh)
Inventor
李媛媛
季长清
肖鹏
邓武
张雪
杨书惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian Jiaotong University
Original Assignee
Dalian Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian Jiaotong University filed Critical Dalian Jiaotong University
Priority to CN201611150453.8A priority Critical patent/CN106777090A/en
Publication of CN106777090A publication Critical patent/CN106777090A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content

Landscapes

  • Engineering & Computer Science (AREA)
  • Library & Information Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Medical Treatment And Welfare Office Work (AREA)
  • Measuring And Recording Apparatus For Diagnosis (AREA)

Abstract

A Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching, belonging to the intersection of smart healthcare and big data processing. The method applies metric-space Skyline queries to content-based medical image retrieval. Its technical essentials are: extract low-level features of medical images, such as SIFT and Color; fuse the multiple low-level features of an image with a distributed Skyline operation, with the similarity on each feature serving as a Skyline evaluation objective, so that the returned results are candidate images that are fairly similar to the query image on all feature dimensions or extremely similar on some single dimension; finally, perform stream processing on the Spark cloud-computing system and obtain query and processing results in real time. Effect: the corresponding information of an image acquired at the user terminal is uploaded and saved to the cloud server, the cloud server then processes it, and the optimal medical image clustering scheme is obtained and fed back to the user.

Description

Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching

Technical Field

The present invention belongs to the intersection of smart healthcare and big data processing. It is a Skyline-based medical big data retrieval system built on a visual vocabulary and multi-feature matching; the system applies metric-space Skyline queries to content-based medical image retrieval and involves large-scale medical data analysis, massive data processing in cloud-computing environments, and intelligent data processing and application development.

Background Art

With the development of the Internet and the spread of digital medical equipment, medical image data has grown exponentially, and retrieval techniques for such image data have drawn increasing attention. Massive data is not only large in volume; it also carries enormous commercial value. For example, analyzing tumor growth in cancer patients can guide doctors in recommending personalized treatment plans, and analyzing brain activity and heart-rate records can provide hospitals, device makers and patients with diagnostic guidance or early warnings for home monitoring. However, the explosive growth of medical image data means that traditional single-machine analysis and processing techniques are increasingly unable to meet the demands of data-intensive analysis and processing. To improve the efficiency of medical image retrieval while preserving retrieval accuracy, the metric-space Skyline query (Metric Skyline Query) algorithm has been applied successfully in image processing: it improves retrieval efficiency by pruning data in the metric space.

Most existing metric-space Skyline algorithms for image data build the metric space on general textual semantics. In semantic image retrieval for medicine, although images carry rich semantic information, that information is complex, its interpretation is subjective, and its extraction and expression are difficult; these shortcomings degrade both metric-space modeling and medical image retrieval. In addition, because semantic information is ambiguous, most algorithms select multiple images to take part in a query in order to improve accuracy, which greatly increases the computation required. This computational cost has become a major bottleneck of metric-space Skyline queries, and it is especially pronounced when processing massive medical image data.

In recent years, content-based image retrieval has developed rapidly and has gradually become the mainstream technology in image retrieval. To address the weakness of existing metric-space algorithms for medical image data, which rely on image semantics for retrieval, this work starts from the content of medical images and takes their low-level features in the metric space as the object of study. To improve retrieval accuracy while saving computation and speeding up similarity-distance calculation, a metric-space Skyline algorithm is designed from the angle of multi-feature fusion; on this basis we designed and implemented the present invention.

Summary of the Invention

In view of the defects and deficiencies in the background above, the present invention applies metric-space Skyline queries to content-based large-scale medical image retrieval and proposes a medical large-scale image retrieval method based on a visual vocabulary and Skyline multi-feature fusion (Big Feature Fusion by Skyline, BSKFF). It uses the Skyline operation to fuse multiple features and designs a new visual-vocabulary-based medical big data retrieval system, thereby better solving the problem of retrieving large-scale medical image data.

To achieve the above object, the technical solution adopted by this patent is:

A Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching, characterized by comprising the following steps:

S1. Extract the low-level features of medical images and cluster each low-level feature set separately to build a visual vocabulary; on this basis, quantize every image in the image library into a vector of visual-word occurrence frequencies, obtaining partition feature vectors;

S2. Compute the similarity distance, on each feature, between the query image and every image in the image library, so as to construct image similarity vectors for the different features;

S3. Invoke the Skyline-based multi-feature fusion method to perform distributed retrieval computation and decision making.

Further, step S1, extracting the feature data of a medical image (given a query image, extract its low-level features), comprises the following steps:

S1.1. Extraction of Color features;

S1.2. Extraction of SIFT features;

S1.3. Construction of the visual vocabulary;

S1.4. Quantized image representation.

Further, the method for constructing the image similarity vectors of the different features in step S2 is as follows. Given an image library I containing n medical images and a query image q, each medical image is expressed as a feature vector, and the similarity distance between the query image q and any image oi in library I on the t-th feature is the L1 distance of the two vectors:

dist(oi.xt, q.xt) = Σj=1..k |oi.xt(j) − q.xt(j)|    (formula 1.3)

where oi.xt denotes the t-th feature descriptor vector of image oi, i.e., the k-dimensional vector of the t-th low-level feature of image oi;

Based on formula 1.3, the similarity distance between the query medical image q and any image oi in the medical image library I is obtained on every feature, and the similarity vector of images q and oi is given by Definition 1.2:

Definition 1.2: Let I be an image library containing n images and q be the query image. The similarity vector between the query image q and any image oi in library I is expressed as the m-dimensional vector:

Vecti(oi, q) = <dist(oi.x1, q.x1), dist(oi.x2, q.x2), ..., dist(oi.xm, q.xm)>

where i ∈ [1, n], m is the number of low-level features, Vecti(oi, q) is the similarity vector between image q and image oi, and dist(oi.xk, q.xk) is the similarity distance of the k-th (k ≤ m) feature of the two images. Every image in library I is compared with the query image q on each feature dimension, producing n similarity vectors.
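As a concrete illustration of Definition 1.2, the sketch below (not the patent's implementation; plain NumPy on a single machine, with hypothetical container names) computes the L1 distance of formula 1.3 between visual-word histograms and assembles the n similarity vectors.

```python
# A minimal sketch of formula 1.3 and Definition 1.2, assuming every image has
# already been quantized into one visual-word frequency histogram per feature.
# `library_features[i][t]` and `query_features[t]` are hypothetical containers
# holding the k-dimensional histogram of feature t (e.g. t=0 SIFT, t=1 Color).
import numpy as np

def l1_distance(hist_a: np.ndarray, hist_b: np.ndarray) -> float:
    """L1 distance between two visual-word frequency histograms (formula 1.3)."""
    return float(np.abs(hist_a - hist_b).sum())

def similarity_vectors(library_features, query_features) -> np.ndarray:
    """Return an (n, m) array whose row i is Vect_i(o_i, q) over the m features."""
    n, m = len(library_features), len(query_features)
    vectors = np.zeros((n, m))
    for i in range(n):
        for t in range(m):
            vectors[i, t] = l1_distance(library_features[i][t], query_features[t])
    return vectors
```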

Further, the specific method of step S3 is:

Given a medical image library I containing n images and a query image q, let the set R be the query result of the multi-feature fusion method. Consider the m low-level feature vectors <oi.x1, oi.x2, ..., oi.xm> of each image.

An image oi belongs to R if and only if the following condition holds: there is no other image oj ∈ I (j ≠ i) such that dist(oj.xt, q.xt) ≤ dist(oi.xt, q.xt) for every t ∈ [1, m] and dist(oj.xt, q.xt) < dist(oi.xt, q.xt) for at least one t;

The set R then contains all images whose similarity vectors Vecti(oi, q) = <dist(oi.x1, q.x1), dist(oi.x2, q.x2), ..., dist(oi.xm, q.xm)> with the query image q in the vector space X are not dominated by the similarity vector of any other image in the medical image library I;

Further, the result set of the Skyline-based multi-feature fusion method is a subset of the medical image library, namely the set of images that are not dominated by any image of the set in the multi-feature metric space. The SIFT and Color similarity-distance values between the query image q and any image oi form a point: the abscissa of the point is the similarity distance of the SIFT feature between image oi and the query image q, and the ordinate is the similarity distance of the Color feature between image oi and the query image q. These similarity distances in the multi-feature metric space are all computed with the bag-of-words model; the smaller the similarity distance, the more similar the two images.
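The dominance test above is the standard Skyline condition (smaller distance is better on every axis). The following single-machine sketch applies it to the similarity vectors; it is a naive O(n²) filter for illustration, not the distributed Spark operation described by the patent.

```python
# A minimal Skyline filter over the (n, m) similarity-vector array from the
# previous sketch: image i is kept iff no other image is at least as close on
# every feature and strictly closer on at least one.
import numpy as np

def dominates(u: np.ndarray, v: np.ndarray) -> bool:
    """True if similarity vector u dominates v (u <= v everywhere, u < v somewhere)."""
    return bool(np.all(u <= v) and np.any(u < v))

def skyline(sim_vectors: np.ndarray) -> list:
    """Indices of images whose similarity vectors are not dominated by any other."""
    n = sim_vectors.shape[0]
    return [i for i in range(n)
            if not any(dominates(sim_vectors[j], sim_vectors[i])
                       for j in range(n) if j != i)]
```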

Further, Spark is used for stream processing: the streaming computation is decomposed into a series of short batch jobs, and the fusion and the recommended decision results are produced incrementally.

Further, the method for extracting Color features in step S1.1 is as follows:

The Color feature is represented by the color-attribute CN descriptor, which covers the colors red, black, blue, green, brown, gray, pink, orange, white, purple and yellow; the color attribute CN is therefore defined as an 11-dimensional variable, and every pixel in the image is assigned a color-attribute label. This label serves as a principal factor in the Skyline multi-factor analysis; Spark is used for stream processing, and the results are refined and output incrementally;
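As a rough single-machine illustration of the 11-dimensional CN representation, the sketch below builds an 11-bin color-name histogram for an image. The real Color Names descriptor uses a learned RGB-to-color-name mapping; the nearest-prototype lookup and the RGB anchor values here are stand-in assumptions, not the patent's mapping.

```python
# A simplified sketch of assigning each pixel one of the 11 color-name (CN)
# labels and summarizing the image as an 11-bin histogram. The RGB prototypes
# below are illustrative assumptions standing in for a learned CN mapping.
import numpy as np

CN_NAMES = ["red", "black", "blue", "green", "brown", "grey",
            "pink", "orange", "white", "purple", "yellow"]
CN_PROTOTYPES = np.array([          # rough RGB anchors, one per color name
    [255, 0, 0], [0, 0, 0], [0, 0, 255], [0, 128, 0], [139, 69, 19],
    [128, 128, 128], [255, 192, 203], [255, 165, 0], [255, 255, 255],
    [128, 0, 128], [255, 255, 0]], dtype=float)

def cn_histogram(image_rgb: np.ndarray) -> np.ndarray:
    """Normalized 11-bin color-name histogram for an HxWx3 RGB image."""
    pixels = image_rgb.reshape(-1, 3).astype(float)
    # distance of every pixel to every prototype color, then nearest prototype
    dists = np.linalg.norm(pixels[:, None, :] - CN_PROTOTYPES[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    hist = np.bincount(labels, minlength=len(CN_NAMES)).astype(float)
    return hist / hist.sum()
```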

Further, the method for extracting SIFT features in step S1.2 is as follows:

SIFT extraction consists of two parts, detecting feature points and describing feature points. The original image is scale-transformed to obtain its scale-space representation sequence, the image is then processed to obtain feature points, and each feature point is represented by a 128-dimensional descriptor vector, giving 128-dimensional SIFT feature vectors. Using the feature points generated during SIFT extraction, each feature point together with its surrounding region is taken as a local region, the CN vector of every pixel in the local region is extracted, and SIFT and CN local feature vectors are obtained. This vector serves as a principal factor in the Skyline multi-factor analysis; Spark is used for stream processing, and the results are refined and output incrementally;
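For the detection-and-description step, a minimal OpenCV sketch is shown below; the patent does not name a library, so using cv2.SIFT_create (available in OpenCV 4.4 and later) is an assumption. Each keypoint yields the 128-dimensional descriptor mentioned above.

```python
# A minimal sketch of SIFT keypoint detection and 128-dim description with OpenCV.
import cv2

def extract_sift(image_path: str):
    """Return (keypoints, descriptors); descriptors is an (N, 128) array or None."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    sift = cv2.SIFT_create()
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```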

Further, the method for constructing the visual vocabulary in step S1.3 is as follows:

Using the Spark-based multi-layer k-means clustering algorithm with its variants and oversampling correction, the Spark system trains on the images of the image library in a streaming fashion and incrementally generates visual vocabularies for the SIFT and Color feature vectors respectively. When generating the vocabularies, the data is first partitioned and then processed in a distributed, streaming manner on the Spark system, with the result set exported incrementally;

Here, the multi-layer k-means clustering algorithm looks for k cluster centers C = {c1, c2, ..., ck} in a feature-point set X = {x1, x2, ..., xn} of some dimensionality, minimizing the sum of squared errors from every feature point to the center of its cluster; these cluster centers partition X into k disjoint clusters Y = {Y1, Y2, ..., Yk} such that Yi ∩ Yj = ∅ for any 1 ≤ i ≠ j ≤ k. For a cluster Yi, its center point is:

ci = (1 / |Yi|) Σx∈Yi x

Here, the oversampling correction algorithm uses a single Spark job to perform center-point selection and the computation of the global error (unlike traditional MapReduce, we use Spark with a distributed cache to speed up the iterations, and the results are produced in a streaming, incremental way). Its objective function is:

φX(C) = Σx∈X minc∈C ||x − c||²

The goal of the OnR clustering algorithm produced in each decomposition stage is to find an optimal partition C that minimizes Spark's final global clustering error φX(C), where φX(C) is the global clustering error produced by partitioning the feature set X with the center point set C, and || || is the Euclidean distance. The SIFT and CN feature sets are clustered separately, and the k cluster centers obtained for each are its visual vocabulary.
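As a single-machine stand-in for this vocabulary-building step, the sketch below clusters a pooled descriptor matrix with scikit-learn's MiniBatchKMeans; it replaces the Spark-based multi-layer k-means with oversampling correction, but produces the same kind of output, namely k cluster centers used as the visual words for one feature type.

```python
# A simplified, single-machine sketch of building one feature's visual vocabulary:
# pool the local descriptors of many images and cluster them into k visual words.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

def build_vocabulary(descriptors: np.ndarray, k: int = 1000) -> np.ndarray:
    """Cluster a (num_descriptors, dim) matrix into k visual words (cluster centers)."""
    kmeans = MiniBatchKMeans(n_clusters=k, batch_size=4096, random_state=0)
    kmeans.fit(descriptors)
    return kmeans.cluster_centers_      # (k, dim) visual vocabulary
```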

Further, the method of quantized image representation in step S1.4 is as follows:

Based on the visual vocabulary generated by the clustering algorithm, the SIFT descriptors of each image are quantized into a bag of words. In the bag-of-visual-words model, given the visual vocabulary of the j-th feature, where j = 1, ..., m and k is the number of words in the vocabulary, every image in the library is quantized into a k-dimensional vector of visual-word occurrence frequencies. The Color feature is quantized in the same way, and every image is quantized into the corresponding feature vector. For the quantization of multiple features this continues analogously until all features are quantized, yielding the feature vector of Definition 1.1;

Definition 1.1: Within each data partition, consider an image library I containing n images. Assume every image oi has a set of low-level features oi.x1, ..., oi.xm, where m is the number of low-level features; the feature vector of each image oi is then expressed as <oi.x1, oi.x2, ..., oi.xm>.
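A minimal sketch of this quantization step follows: each local descriptor of an image is assigned to its nearest visual word, and the image becomes the k-dimensional word-frequency vector of Definition 1.1. Nearest-word assignment by Euclidean distance is the usual bag-of-visual-words choice and is assumed here.

```python
# A minimal sketch of quantizing one image against a k-word vocabulary.
import numpy as np

def quantize(descriptors: np.ndarray, vocabulary: np.ndarray) -> np.ndarray:
    """Map an (N, dim) descriptor set to a normalized k-dim word-frequency vector."""
    k = vocabulary.shape[0]
    if descriptors is None or len(descriptors) == 0:
        return np.zeros(k)
    dists = np.linalg.norm(descriptors[:, None, :] - vocabulary[None, :, :], axis=2)
    words = dists.argmin(axis=1)            # nearest visual word per descriptor
    hist = np.bincount(words, minlength=k).astype(float)
    return hist / hist.sum()
```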

Beneficial effects: the medical big data retrieval system acquires the relevant information of an image at the user terminal, uploads it and saves it to the cloud server; the cloud server then performs distributed processing, obtains the best medical image clustering scheme and feeds it back to the user incrementally.

Brief Description of the Drawings

Figure 1: system model of the feature fusion method of the present invention;

Figure 2: the Skyline-based feature fusion process of the present invention;

Figure 3: pseudocode of the SKFF algorithm of the present invention.

Detailed Description

Embodiment 1: with reference to Figure 1, this is a Skyline-based medical big data retrieval system using a visual vocabulary and multi-feature matching. The system consists of a cloud center service system and an intelligent mobile client software system on a mobile phone. The cloud service system is responsible for extracting, in a distributed and incremental way, feature data such as SIFT and Color from medical images and for fusing the multiple low-level features of an image with the Skyline operation; the similarity on each feature serves as a Skyline evaluation objective, results are returned step by step after Spark computation, and the final results are candidate images that are fairly similar to the query image on all feature dimensions or extremely similar on some single dimension. Our mobile medical client software sends the medical images that require large-scale hierarchical clustering to the cloud center service system as needed and receives requests from the cloud.

As one embodiment, the execution flow of this Skyline-based medical big data retrieval system using a visual vocabulary and multi-feature matching is as follows: after a mobile user acquires images with a medical imaging scanner and issues a retrieval request for related medical images, the cloud system extracts feature data such as SIFT and Color from the medical images, fuses the multiple low-level features of the images with the Skyline operation, obtains the best clustering scheme and returns it to the user step by step; if enough time is available, the final result is delivered to the user, and in between the mobile communication platform can be used for step-by-step confirmation of the task and confirmation of the final, complete result.

The processing steps of the SIFT and Color feature-data algorithms are specifically as follows. The Color feature is represented by the Color Names (CN) descriptor; the color attribute CN is defined as an 11-dimensional variable, and every pixel in the image is assigned a color-attribute label, which serves as a principal factor in the Skyline multi-factor analysis. SIFT feature extraction scale-transforms the original image to obtain its scale-space representation sequence and then represents the feature points with 128-dimensional descriptor vectors, giving 128-dimensional SIFT feature vectors. Using the feature points generated during SIFT extraction, each feature point together with its surrounding region is taken as a local region, the CN vector of every pixel in the local region is extracted, and the SIFT and CN local feature vectors are obtained; this vector serves as a principal factor in the Skyline multi-factor analysis. The collected CN labels and feature vectors are then stream-processed with Spark, and the results are refined and output incrementally. Based on this extraction of SIFT and CN feature vectors, the Spark-based multi-layer k-means clustering algorithm with its variants and oversampling correction is used on the Spark system to train on the images of the large-scale medical image library in a streaming fashion and to incrementally generate visual vocabularies for the SIFT and Color feature vectors; the data is first partitioned and then processed in a distributed, streaming manner on the Spark system, with the result set exported incrementally. The multi-layer k-means clustering algorithm looks for k cluster centers in a set of feature points of some dimensionality (for instance on a grid or in a higher-dimensional space) so that the sum of squared errors from each feature point to the center of its cluster (a lesion region) is minimized. These cluster centers divide the feature-point set into k disjoint clusters (lesion regions), so that for each cluster (lesion region) the lesion point can be computed.

Based on the visual vocabulary generated by the clustering algorithm, the SIFT descriptors of each image are quantized into a bag of words. In the bag-of-visual-words model, given the visual vocabulary of one feature, where j = 1, ..., m and k is the number of words in the vocabulary (i.e., the number of cluster centers), every medical image in the medical image library is quantized into a vector of visual-word occurrence frequencies (a k-dimensional vector). The Color feature is quantized in the same way, and every image is quantized into the corresponding feature vector. For the quantization of multiple features (m ≥ 2) this continues analogously until all features are quantized.

As another embodiment, the oversampling correction algorithm is defined as follows: in each iteration, Oversampling and Refining (OnR) uses a single Spark job to perform center-point selection and the computation of the global error (unlike traditional MapReduce, we use Spark with a distributed cache to speed up the iterations, and the results are produced in a streaming, incremental way). The OnR method is inspired by the scalable k-means++ method; in addition to the oversampling factor, it uses a second oversampling factor to further increase the number of center points selected in the Map stage.
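To illustrate the oversampling idea that OnR is said to be inspired by, the sketch below follows the scalable k-means++ (k-means||) seeding pattern: in each round, points are sampled with probability proportional to ell·d²(x, C), so many candidate centers are picked per pass instead of one. The Spark job structure and the second oversampling factor of the patent are not reproduced; ell, rounds and the final reclustering step are assumptions of this sketch.

```python
# A rough, single-machine sketch of oversampled center seeding in the spirit of
# scalable k-means++ (k-means||); the candidate set is later reclustered to k centers.
import numpy as np

def oversampled_seeding(X: np.ndarray, ell: float = 2.0, rounds: int = 5, rng=None) -> np.ndarray:
    """Return a candidate center set picked by oversampling proportional to d^2."""
    rng = rng or np.random.default_rng(0)
    centers = [X[rng.integers(len(X))]]          # start from one random point
    for _ in range(rounds):
        c = np.array(centers)
        d2 = np.min(np.linalg.norm(X[:, None, :] - c[None, :, :], axis=2) ** 2, axis=1)
        probs = np.minimum(1.0, ell * d2 / d2.sum())     # sample many points per round
        centers.extend(X[rng.random(len(X)) < probs])
    return np.array(centers)
```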

Within each data partition, consider an image library containing n medical images and the query medical image q; according to S1, the medical images are expressed as feature vectors. The similarity distance between the query image q and any image oi in the image library I on the t-th feature can then be expressed as the L1 distance of the two vectors; from the formula we obtain the similarity distance between q and any image oi in library I on every feature, and the similarity vector of images q and oi can then be expressed through the similarity distances of the k-th (k ≤ m) features of the two images. Every image in library I is compared with the query image q on each feature dimension, producing n similarity vectors.

With reference to Figure 3, the similarity between every image in the image library and the query image is computed on the SIFT and Color features, giving a set of two-dimensional image similarity vectors. The SIFT and Color similarity-distance values between the query image q and any image oi form a point, and the distributed computation decision is made with the Skyline-based multi-feature fusion method; the smaller the similarity distance, the more similar the two images. We use Spark for stream processing, the results are fused and recommended incrementally, and the results the user receives become progressively more precise over time.

Embodiment 2: a Skyline-based medical big data retrieval system using a visual vocabulary and multi-feature matching. It mainly extracts feature data such as SIFT and Color from medical images and fuses the multiple low-level features of an image with a distributed Skyline operation; the similarity on each feature serves as a Skyline evaluation objective, and the returned results are candidate images that are fairly similar to the query image on all feature dimensions or extremely similar on some single dimension. Finally, the Spark cloud-computing system performs stream processing, and query or processing results are obtained in real time. The method can be divided into the following three stages:

Stage 1: extract the image features. Given a query image, extract its low-level features. The steps are as follows:

S1. Extraction of Color features;

S2. Extraction of SIFT features;

S3. Construction of the visual vocabulary;

S4. Quantized image representation.

Further, in step S1 the Color feature is represented by the Color Names (CN) descriptor, which consists of 11 basic colors, namely red, black, blue, green, brown, gray, pink, orange, white, purple and yellow; the color attribute CN is therefore defined as an 11-dimensional variable, and every pixel in the image is assigned a color-attribute label, which serves as a principal factor in the Skyline multi-factor analysis. We use Spark for stream processing, and the results are refined and output incrementally.

Further, the SIFT feature extraction of step S2 consists of two parts, detecting feature points and describing feature points. The original image is scale-transformed to obtain its scale-space representation sequence, and the image is then processed to obtain feature points. Each feature point is represented by a 128-dimensional descriptor vector, giving 128-dimensional SIFT feature vectors. Using the feature points generated during SIFT extraction, each feature point together with its surrounding region is taken as a local region, the CN vector of every pixel in the local region is extracted, and SIFT and CN local feature vectors are obtained; this vector serves as a principal factor in the Skyline multi-factor analysis. We use Spark for stream processing, and the results are refined and output incrementally;

Further, in step S3, based on the SIFT and CN feature-vector extraction above, the Spark-based multi-layer k-means clustering algorithm with its variants and oversampling correction is used on the Spark system to train on the images of the image library in a streaming fashion and to incrementally generate visual vocabularies for the SIFT and Color feature vectors. Unlike previous visual vocabularies, we first partition the data and use the Spark system to process it in a distributed, streaming manner, exporting the result set incrementally;

Here, the multi-layer k-means clustering algorithm looks for k cluster centers C = {c1, c2, ..., ck} in a feature-point set X = {x1, x2, ..., xn} of some dimensionality (for instance on a grid or in a higher-dimensional space), minimizing the sum of squared errors (SSE) from every feature point to the center of its cluster (in tumor images, these cluster centers represent tumor lesion regions, or possible lesion regions). These cluster centers partition X into k disjoint clusters Y = {Y1, Y2, ..., Yk} such that Yi ∩ Yj = ∅ for any 1 ≤ i ≠ j ≤ k. For a cluster Yi, its center point (i.e., its centroid) is:

ci = (1 / |Yi|) Σx∈Yi x

Here, the oversampling correction algorithm uses a single Spark job to perform center-point selection and the computation of the global error (unlike traditional MapReduce, we use Spark with a distributed cache to speed up the iterations, and the results are produced in a streaming, incremental way). Its objective function is:

φX(C) = Σx∈X minc∈C ||x − c||²

The goal of the OnR clustering algorithm produced in each decomposition stage is to find an optimal partition C that minimizes Spark's final global clustering error φX(C), where φX(C) is the global clustering error produced by partitioning the feature set X with the center point set C and || || is the Euclidean distance. The SIFT and CN feature sets are clustered separately, and the k cluster centers obtained for each are its visual vocabulary.

Further, in step S4, based on the visual vocabulary generated by the clustering algorithm, the SIFT descriptors of each image are quantized into a bag of words. In the bag-of-visual-words model, given the visual vocabulary of one feature, where j = 1, ..., m and k is the number of words in the vocabulary (i.e., the number of cluster centers), every image in the library is quantized into a vector of visual-word occurrence frequencies (a k-dimensional vector). The Color feature is quantized in the same way, and every image is quantized into the corresponding feature vector. For the quantization of multiple features (m ≥ 2) this continues analogously until all features are quantized, yielding the feature vector of Definition 1.1.

Definition 1.1 (partition feature vector): Within each data partition, consider an image library I containing n images. Assume every image oi has a set of low-level features oi.x1, ..., oi.xm, where m is the number of low-level features; the feature vector of each image oi is then expressed as <oi.x1, oi.x2, ..., oi.xm>.

Stage 2: feature matching. Compute, in a distributed way, the SIFT and Color similarities between the query image and the images in every data partition of the image library. The steps are as follows:

S1. Given a medical image, use Spark to incrementally extract its SIFT and Color features, then quantize their feature descriptors into feature vectors according to the generated visual vocabularies; we use Spark for stream processing, so extraction and quantization results are produced incrementally;

S2. Compute the similarity of each feature between medical images;

Further, in step S2, given an image library containing n medical images and a query image q, and with the medical images expressed as feature vectors according to S1, the similarity distance between the query image q and any image oi in the image library I on the t-th feature can be expressed as the L1 distance of the two vectors:

dist(oi.xt, q.xt) = Σj=1..k |oi.xt(j) − q.xt(j)|    (formula 1.3)

where oi.xt denotes the t-th feature descriptor vector of image oi, i.e., the k-dimensional vector representing the t-th low-level feature of image oi.

Based on formula 1.3, we obtain the similarity distance between the query medical image q and any image oi in the medical image library I on every feature. The similarity vector of images q and oi is then given by Definition 1.2:

Definition 1.2 (image similarity vector): Let I be an image library containing n images and q be the query image. The similarity vector between the query image q and any image oi in library I can be expressed as the m-dimensional vector:

Vecti(oi, q) = <dist(oi.x1, q.x1), dist(oi.x2, q.x2), ..., dist(oi.xm, q.xm)>

where i ∈ [1, n], m is the number of low-level features, Vecti(oi, q) is the similarity vector between image q and image oi, and dist(oi.xk, q.xk) is the similarity distance of the k-th (k ≤ m) feature of the two images.

Every image in library I is compared with the query image q on each feature dimension, producing n similarity vectors.

Stage 3: feature fusion. The similarity vectors of the different features are assembled into a new vector, and the Skyline-based multi-feature fusion method (SKFF) is invoked for the distributed computation decision. Finally, we use Spark for stream processing, the results are fused and recommended incrementally, and the results the user receives become progressively more precise over time.

S1. Compute, in a distributed way, the similarity between every image in the image library and the query image on the SIFT and Color features, obtaining a set of two-dimensional image similarity vectors;

S2. Perform feature fusion with the Skyline multi-feature fusion; the results of the preceding feature matching serve as the input of the Skyline operation;

S3. Use the Spark cloud-computing system for stream processing and obtain query or processing results in real time.

Further, the definition of the Skyline-based multi-feature fusion method is given below (Definition 1.4).

Definition 1.4 (Skyline-based multi-feature fusion method): Given a medical image library I containing n images and a query image q, let the set R be the query result of the multi-feature fusion method. Over the m low-level feature vectors <oi.x1, oi.x2, ..., oi.xm> of each image, R contains all images whose similarity vectors Vecti(oi, q) = <dist(oi.x1, q.x1), dist(oi.x2, q.x2), ..., dist(oi.xm, q.xm)> with the query image q in the vector space X are not dominated by the similarity vector of any other image in the medical image library I; that is, an image oi belongs to R if and only if there is no other image oj ∈ I (j ≠ i) with dist(oj.xt, q.xt) ≤ dist(oi.xt, q.xt) for every t ∈ [1, m] and dist(oj.xt, q.xt) < dist(oi.xt, q.xt) for at least one t.

Further, the result set of the Skyline-based multi-feature fusion method (SKFF) is a subset of the medical image library, namely the set of images that are not dominated by any image of the set in the multi-feature metric space. The SIFT and Color similarity-distance values between the query image q and any image oi form a point, as shown in Figure 2: for example, the abscissa of point p1 is the similarity distance of the SIFT feature between image o1 and the query image q, and the ordinate is the similarity distance of the Color feature between them. These distances in the multi-feature metric space are all computed with the bag-of-words model.

Further, the smaller the similarity distance, the more similar the two images. Therefore {p1, p2, p3, p4} is the final Skyline result, meaning that no other image is more similar to the query image than {o1, o2, o3, o4} on both the SIFT and Color features; that is, no image in the library has a similarity vector with the query image that dominates theirs on the SIFT and Color features.

S3. Spark performs stream processing, and the fusion and the recommended decision results are produced incrementally.

Further, step S2 yields {p1, p2, p3, p4} as the final Skyline result.

Further, Spark is used for stream processing, decomposing the streaming computation into a series of short batch jobs. Depending on business needs, the whole streaming computation can accumulate the intermediate results or store them on external devices, feeding the best medical clustering scheme back to the user step by step.
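A minimal PySpark Structured Streaming sketch of this "stream as a series of small batches" idea is given below: each micro-batch of incoming similarity records is parsed and handed to a sink, so partial results can be pushed back as they accumulate. The socket source, the comma-separated record layout and the console sink are illustrative assumptions; a real deployment would apply the Skyline filter per batch (for example via foreachBatch) and write to its own sink.

```python
# A minimal micro-batch streaming sketch with PySpark Structured Streaming.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("bskff-stream").getOrCreate()

# each incoming line is assumed to carry "image_id,sift_dist,color_dist"
raw = (spark.readStream.format("socket")
       .option("host", "localhost").option("port", 9999).load())

parsed = raw.select(F.split("value", ",").alias("f")).select(
    F.col("f")[0].alias("image_id"),
    F.col("f")[1].cast("double").alias("sift_dist"),
    F.col("f")[2].cast("double").alias("color_dist"))

# every ~2 seconds a short batch job processes whatever has arrived
query = (parsed.writeStream.outputMode("append").format("console")
         .trigger(processingTime="2 seconds").start())
query.awaitTermination()
```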

The above is only a preferred embodiment of the present invention, and the scope of protection of the present invention is not limited to it. Any equivalent substitution or modification of the technical solution and inventive concept disclosed herein that a person skilled in the art can make within the technical scope of this disclosure shall fall within the scope of protection of the present invention.

Claims (10)

1. A Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching, characterized by comprising the following steps:
S1. extracting the low-level features of medical images, clustering each low-level feature set separately, and building a visual vocabulary; on this basis, quantizing every image in the image library into a vector of visual-word occurrence frequencies to obtain partition feature vectors;
S2. computing the similarity distance, on each feature, between the query image and every image in the image library, so as to construct image similarity vectors for the different features;
S3. invoking the Skyline-based multi-feature fusion method to perform distributed retrieval computation and decision making.
2. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 1, characterized in that step S1, extracting the feature data of a medical image (given a query image, extracting its low-level features), comprises the following steps:
S1.1. extraction of Color features;
S1.2. extraction of SIFT features;
S1.3. construction of the visual vocabulary;
S1.4. quantized image representation.
3. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 1, characterized in that the method for constructing the image similarity vectors of the different features in step S2 is: given an image library I containing n medical images and a query image q, each medical image is expressed as a feature vector, and the similarity distance between the query image q and any image oi in the image library I on the t-th feature is expressed as the L1 distance of the two vectors:
dist(oi.xt, q.xt) = Σj=1..k |oi.xt(j) − q.xt(j)|    (formula 1.3)
where oi.xt denotes the t-th feature descriptor vector of image oi, i.e., the k-dimensional vector of the t-th low-level feature of image oi;
based on formula 1.3, the similarity distance between the query medical image q and any image oi in the medical image library I is obtained on every feature, and the similarity vector of images q and oi is given by Definition 1.2:
Definition 1.2: let I be an image library containing n images and q be the query image; the similarity vector between the query image q and any image oi in library I is expressed as the m-dimensional vector:
Vecti(oi, q) = <dist(oi.x1, q.x1), dist(oi.x2, q.x2), ..., dist(oi.xm, q.xm)>
where i ∈ [1, n], m is the number of low-level features, Vecti(oi, q) is the similarity vector between image q and image oi, and dist(oi.xk, q.xk) is the similarity distance of the k-th (k ≤ m) feature of the two images; every image in library I is compared with the query image q on each feature dimension, producing n similarity vectors.
4. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 1, characterized in that the specific method of step S3 is:
given a medical image library I containing n images and a query image q, the set R is the query result of the multi-feature fusion method; over the m low-level feature vectors <oi.x1, oi.x2, ..., oi.xm> of each image,
an image oi belongs to R if and only if there is no other image oj ∈ I (j ≠ i) with dist(oj.xt, q.xt) ≤ dist(oi.xt, q.xt) for every t ∈ [1, m] and dist(oj.xt, q.xt) < dist(oi.xt, q.xt) for at least one t;
the set R then contains all images whose similarity vectors Vecti(oi, q) = <dist(oi.x1, q.x1), dist(oi.x2, q.x2), ..., dist(oi.xm, q.xm)> with the query image q in the vector space X are not dominated by the similarity vector of any other image in the medical image library I.
5. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 4, characterized in that the result set of the Skyline-based multi-feature fusion method is a subset of the medical image library, namely the set of images not dominated by any image of the set in the multi-feature metric space; the SIFT and Color similarity-distance values between the query image q and any image oi form a point, whose abscissa is the similarity distance of the SIFT feature between image oi and the query image q and whose ordinate is the similarity distance of the Color feature between image oi and the query image q; these similarity distances in the multi-feature metric space are all computed with the bag-of-words model, and the smaller the similarity distance, the more similar the two images.
6. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 5, characterized in that Spark is used for stream processing, the streaming computation is decomposed into a series of short batch jobs, and the fusion and the recommended decision results are produced incrementally.
7. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 2, characterized in that the method for extracting Color features in step S1.1 is as follows:
the Color feature is represented by the color-attribute CN descriptor, composed of the colors red, black, blue, green, brown, gray, pink, orange, white, purple and yellow; the color attribute CN is defined as an 11-dimensional variable, and every pixel in the image is assigned a color-attribute label, which serves as a principal factor in the Skyline multi-factor analysis; Spark is used for stream processing, and the results are refined and output incrementally.
8. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 2, characterized in that the method for extracting SIFT features in step S1.2 is as follows:
it consists of two parts, detecting feature points and describing feature points; the original image is scale-transformed to obtain its scale-space representation sequence, the image is then processed to obtain feature points, and each feature point is represented by a 128-dimensional descriptor vector, giving 128-dimensional SIFT feature vectors; using the feature points generated during SIFT extraction, each feature point together with its surrounding region is taken as a local region, the CN vector of every pixel in the local region is extracted, and SIFT and CN local feature vectors are obtained; this vector serves as a principal factor in the Skyline multi-factor analysis, Spark is used for stream processing, and the results are refined and output incrementally.
9. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 2, characterized in that the method for building the visual vocabulary in step S1.3 is as follows:
using the Spark-based multi-layer k-means clustering algorithm with its variants and oversampling correction, the Spark system trains on the images of the image library in a streaming fashion and incrementally generates visual vocabularies for the SIFT and Color feature vectors respectively; when generating the vocabularies, the data is first partitioned and then processed in a distributed, streaming manner on the Spark system, with the result set exported incrementally;
the multi-layer k-means clustering algorithm looks for k cluster centers C = {c1, c2, ..., ck} in a feature-point set X = {x1, x2, ..., xn} of some dimensionality, minimizing the sum of squared errors from every feature point to the center of its cluster; these cluster centers partition X into k disjoint clusters Y = {Y1, Y2, ..., Yk} such that Yi ∩ Yj = ∅ for any 1 ≤ i ≠ j ≤ k, and for a cluster Yi its center point is ci = (1 / |Yi|) Σx∈Yi x;
the oversampling correction algorithm uses a single Spark job to perform center-point selection and the computation of the global error (unlike traditional MapReduce, Spark with a distributed cache is used to speed up the iterations, and the results are produced in a streaming, incremental way); its objective function is φX(C) = Σx∈X minc∈C ||x − c||²;
the goal of the OnR clustering algorithm produced in each decomposition stage is to find an optimal partition C that minimizes Spark's final global clustering error φX(C), where φX(C) is the global clustering error produced by partitioning the feature set X with the center point set C and || || is the Euclidean distance; the SIFT and CN feature sets are clustered separately, and the k cluster centers obtained for each are its visual vocabulary.
10. The Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching of claim 2, characterized in that the method of quantized image representation in step S1.4 is as follows:
based on the visual vocabulary generated by the clustering algorithm, the SIFT descriptors of each image are quantized into a bag of words; in the bag-of-visual-words model, given the visual vocabulary of one feature, where j = 1, ..., m and k is the number of words in the vocabulary, every image in the library is quantized into a k-dimensional vector of visual-word occurrence frequencies; the Color feature is quantized in the same way, and every image is quantized into the corresponding feature vector; for the quantization of multiple features this continues analogously until all features are quantized, yielding the feature vector of Definition 1.1;
Definition 1.1: within each data partition, consider an image library I containing n images; assume every image oi has a set of low-level features oi.x1, ..., oi.xm, where m is the number of low-level features, so the feature vector of each image oi is expressed as <oi.x1, oi.x2, ..., oi.xm>.
CN201611150453.8A 2016-12-14 2016-12-14 Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching Pending CN106777090A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611150453.8A CN106777090A (en) Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611150453.8A CN106777090A (en) Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching

Publications (1)

Publication Number Publication Date
CN106777090A true CN106777090A (en) 2017-05-31

Family

ID=58876961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611150453.8A Pending CN106777090A (en) Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching

Country Status (1)

Country Link
CN (1) CN106777090A (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315663A (en) * 2008-06-25 2008-12-03 中国人民解放军国防科学技术大学 A Natural Scene Image Classification Method Based on Regional Latent Semantic Features
CN101923653A (en) * 2010-08-17 2010-12-22 北京大学 An Image Classification Method Based on Multi-level Content Description
CN102073748A (en) * 2011-03-08 2011-05-25 武汉大学 Visual keyword based remote sensing image semantic searching method
CN105469096A (en) * 2015-11-18 2016-04-06 南京大学 Feature bag image retrieval method based on Hash binary code
CN106203507A (en) * 2016-07-11 2016-12-07 上海凌科智能科技有限公司 A kind of k means clustering method improved based on Distributed Computing Platform

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766472A (en) * 2017-10-09 2018-03-06 中国人民解放军国防科技大学 Contour hierarchical query parallel processing method based on multi-core processor
CN107766472B (en) * 2017-10-09 2020-09-04 中国人民解放军国防科技大学 A Parallel Processing Method for Contour Hierarchy Query Based on Multi-core Processor
CN108446740A (en) * 2018-03-28 2018-08-24 南通大学 A kind of consistent Synergistic method of multilayer for brain image case history feature extraction
CN110362663A (en) * 2018-04-09 2019-10-22 国际商业机器公司 Adaptive multi-sensing similarity detection and resolution
CN110362663B (en) * 2018-04-09 2023-06-13 国际商业机器公司 Adaptive multi-perceptual similarity detection and analysis
CN110516040A (en) * 2019-08-14 2019-11-29 出门问问(武汉)信息科技有限公司 Semantic Similarity comparative approach, equipment and computer storage medium between text
CN111859004A (en) * 2020-07-29 2020-10-30 书行科技(北京)有限公司 Retrieval image acquisition method, device, equipment and readable storage medium
CN112115446A (en) * 2020-07-29 2020-12-22 航天信息股份有限公司 Identity authentication method and system based on Skyline inquiry biological characteristics
CN112287315A (en) * 2020-07-29 2021-01-29 航天信息股份有限公司 Skyline-based identity authentication method and system by inquiring biological characteristics
CN112115446B (en) * 2020-07-29 2024-02-09 航天信息股份有限公司 Skyline query biological feature-based identity authentication method and system
CN115258963A (en) * 2022-07-27 2022-11-01 山东中衡光电科技有限公司 Safety protection system for underground hydraulic hoisting device and setting method for dangerous area

Similar Documents

Publication Publication Date Title
CN106777090A (en) Skyline-based medical big data retrieval method using a visual vocabulary and multi-feature matching
Panda et al. Diversity-aware multi-video summarization
CN110399895A (en) The method and apparatus of image recognition
Huang et al. MultiSpectralNet: Spectral clustering using deep neural network for multi-view data
WO2023108995A1 (en) Vector similarity calculation method and apparatus, device and storage medium
CN116703531B (en) Article data processing method, apparatus, computer device and storage medium
Zhang et al. Boosting cross-media retrieval via visual-auditory feature analysis and relevance feedback
Deng et al. Selective clustering for representative paintings selection
Etezadifar et al. Scalable video summarization via sparse dictionary learning and selection simultaneously
CN110956213A (en) Method and device for generating remote sensing image feature library and method and device for retrieving remote sensing image
Hamreras et al. Content based image retrieval by convolutional neural networks
Asadi Amiri et al. A novel content-based image retrieval system using fusing color and texture features
Prasomphan Toward Fine-grained Image Retrieval with Adaptive Deep Learning for Cultural Heritage Image.
Sharma et al. A survey of image data indexing techniques
CN106777094A (en) Skyline-based medical big data retrieval system using a visual vocabulary and multi-feature matching
Yu et al. Visual query processing for efficient image retrieval using a SOM-based filter-refinement scheme
Bhardwaj et al. A futuristic hybrid image retrieval system based on an effective indexing approach for swift image retrieval
Zhu et al. Cross-modal contrastive learning with spatio-temporal context for correlation-aware multi-scale remote sensing image retrieval
Zou et al. Local pattern collocations using regional co-occurrence factorization
Parseh et al. Scene representation using a new two-branch neural network model
CN114708449B (en) Similar video determination method, and training method and device of example characterization model
Shabbir et al. Tetragonal Local Octa-Pattern (T-LOP) based image retrieval using genetically optimized support vector machines
CN102369525A (en) System for searching visual information
CN106570127B (en) Remote sensing image retrieval method and system based on object attribute association rule
Zhang et al. Improved image retrieval algorithm of GoogLeNet neural network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20170531)