CN104317867A - System for carrying out entity clustering on web pictures returned by search engine - Google Patents

System for carrying out entity clustering on web pictures returned by search engine

Publication number
CN104317867A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
system
clustering
context
concept
layer
Prior art date
Application number
CN 201410554684
Other languages
Chinese (zh)
Other versions
CN104317867B (en)
Inventor
朱其立
赵凯祺
蔡智源
隋清宇
魏恩勋
Original Assignee
上海交通大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRICAL DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30861 Retrieval from the Internet, e.g. browsers
    • G06F17/30864 Retrieval from the Internet, e.g. browsers, by querying, e.g. search engines or meta-search engines, crawling techniques, push systems
    • G06F17/30867 Retrieval from the Internet, e.g. browsers, by querying, e.g. search engines or meta-search engines, crawling techniques, push systems, with filtering and personalisation

Abstract

The invention relates to a system for entity clustering of web pictures returned by a search engine. The system comprises an offline system and an online system. The offline system preprocesses the source web pages in which the pictures are located. The online system receives a query, submits it to the search engine, and receives multiple pages of picture results. For each page of results, it looks up the conceptualized metadata and text of the source web pages and extracts a query context and a picture context from the conceptualized text. The online system then performs three-layer clustering on the metadata, the context, and the expanded context obtained by concept expansion of the context, and automatically labels each category with relevant descriptive concepts so that the entity of each category can be identified. The three-layer clustering algorithm has the same time complexity as an ordinary hierarchical clustering algorithm; because the features are subdivided, the input to each layer, i.e. the output of the previous layer, is more precise, which effectively improves the clustering result and yields accurate descriptive concepts.

Description

System for carrying out entity clustering on web pictures returned by a search engine

Technical Field

[0001] The present invention relates to natural language processing and text mining in the field of computer technology, and in particular to a system for entity clustering of web pictures returned by a search engine.

Background

[0002] With the spread of the Internet and the growing number of web pictures, web picture search has become a major everyday application for Internet users. Current picture search engines mainly return pictures related to the query keywords, and these pictures often cover several entities that share the same name. To find the desired picture in the search results, a user has to browse every returned picture. To improve the readability of search results, separating the results by entity has become one direction for improving picture search engines.

[0003] Image clustering is a method for automatically distinguishing different entities. In previous work, D. Cai (see Cai, D., He, X., Ma, W. Y., Wen, J. R., Zhang, H.: Organizing WWW images based on the analysis of page layout and web link structure. ICME 2004) used vision-based page segmentation to extract the context of web pictures, and clustered using that context together with web link information. However, because visual segmentation is unstable and the context contains noisy data, the clustering precision is severely limited. Z. Fu (see Fu, Z., Ip, H. H. S., Lu, H., Lu, Z.: Multi-modal constraint propagation for heterogeneous image clustering. MultiMedia 2011) proposed a framework that combines multiple modalities, such as image tags and visual features, and clusters images by propagating class constraints across multiple graphs. Because visual feature extraction is currently not precise enough, the framework propagates the errors contained in the visual features. Moreover, the method requires constraint propagation over multiple graphs, which makes clustering inefficient and unsuitable for clustering online picture search results. Existing image clustering methods also cannot provide descriptive concepts for labeling each cluster.

Summary

[0004] To address the shortcomings of the prior art, the present invention provides a system for entity clustering of web pictures returned by a search engine, so that picture search results are better organized by entity, each entity cluster has high precision, and different entities are clearly distinguished. The invention splits the overall framework into an online part and an offline part, which greatly reduces the time cost of online clustering.

[0005] To achieve the above object, the invention adopts the following technical solution:

[0006] A system for entity clustering of web pictures returned by a search engine, comprising an offline system and an online system, wherein:

[0007] The offline system preprocesses the source web pages of all pictures. This includes extracting web page metadata and conceptualizing the original page text and the metadata into a set of weighted concepts (a concept vector). The conceptualized metadata and page content are made available for the online system to query.

[0008] The online system receives a query, submits it to the search engine, and receives multiple pages of picture results. For each page of results, it looks up the conceptualized metadata and text of the source web pages, and extracts from the conceptualized text the context of the query keyword (the query context) and the picture context. The online system then performs three-layer clustering using, in turn, the metadata, the context, and the expanded context obtained by concept expansion of the context through Wikipedia, and automatically labels each cluster with relevant descriptive concepts so that the entity of each cluster can be identified.

[0009] Metadata extraction in the offline system covers the valid terms in the URL and the picture ALT attribute. To extract valid URL terms, a binary classifier classifies terms as valid or invalid and returns the valid ones. The picture ALT attribute can be obtained directly from the HTML source.

[0010] The offline system includes a conceptualization module that conceptualizes the metadata and the original page text of each picture. Conceptualization maps the words in the metadata and text onto Wikipedia concepts, turning the metadata and text into a set of weighted concepts that can be used to compute similarity for the clustering algorithm. The weight of each concept is its importance to the picture, defined as follows:

[0011]

$$\mathrm{CF\text{-}IDF}(c,d) = \mathrm{CF}(c,d) \times \log\frac{|D|}{\mathrm{DF}(c)}$$

[0012] Here, CF-IDF(c, d) is the importance of concept c to picture d. It is the product of two parts: the frequency CF(c, d) with which the concept appears in the picture's context, and the inverse context frequency, which is inversely related to DF(c), the number of contexts in which the concept has appeared. D is the set of contexts of all pictures.
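The weighting in [0012] can be illustrated with a minimal sketch, following the formula given in claim 3, CF(c, d) × log(|D| / DF(c)). The function name `cf_idf_vectors`, the use of raw counts for CF, and the toy contexts are assumptions made for illustration; the patent does not specify how CF is normalized.

```python
import math
from collections import Counter

def cf_idf_vectors(contexts):
    """Map each picture id to a {concept: CF-IDF weight} vector.

    contexts: dict of picture id -> list of concept names in that picture's context.
    """
    n_contexts = len(contexts)  # |D|
    # DF(c): number of contexts in which concept c appears at least once.
    df = Counter()
    for concepts in contexts.values():
        df.update(set(concepts))
    vectors = {}
    for picture_id, concepts in contexts.items():
        cf = Counter(concepts)  # CF(c, d) as a raw count (assumption)
        vectors[picture_id] = {
            c: count * math.log(n_contexts / df[c]) for c, count in cf.items()
        }
    return vectors

# Hypothetical toy contexts, not taken from the patent.
contexts = {
    "img1": ["Mr. Bean", "Rowan Atkinson", "Mr. Bean"],
    "img2": ["Mr. Bean", "Comedy"],
    "img3": ["Bean", "Legume"],
}
weights = cf_idf_vectors(contexts)
```

A concept that occurs in every context gets weight zero under this formula, which is the usual behavior of an inverse-frequency term.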

[0013] The online system includes a text context extraction module that extracts context information from the already conceptualized page text, covering both the picture context and the query context. Both are cut out with a fixed-size window, for example the 50 concepts before and after the picture or the query keyword. The extracted text context forms a concept vector that is used to compute picture similarity.
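The fixed-size window in [0013] might be sketched as follows, assuming the conceptualized page is a flat list of concepts and the position of the picture or query keyword is already known. The function name `context_window` and the choice to exclude the anchor itself are assumptions.

```python
def context_window(concepts, anchor_index, radius=50):
    """Return the concepts within `radius` positions before and after the anchor."""
    start = max(0, anchor_index - radius)
    before = concepts[start:anchor_index]
    after = concepts[anchor_index + 1:anchor_index + 1 + radius]
    return before + after

# Hypothetical page of 200 placeholder concepts with the anchor at position 120.
page = ["c%d" % i for i in range(200)]
window = context_window(page, anchor_index=120)
near_start = context_window(page, anchor_index=10)
```

Near the start or end of a page the window is simply truncated, so `near_start` holds fewer than 100 concepts.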

[0014] The online system includes a three-layer clustering module consisting of three sub-modules: metadata clustering, text context clustering, and expanded-context clustering, wherein:

[0015] First-layer clustering: agglomerative hierarchical clustering is performed on the concept vectors of the conceptualized metadata, yielding clusters with high within-cluster precision. The concept vectors of all pictures in each cluster are then merged to form the cluster's concept vector.

[0016] The agglomerative hierarchical clustering algorithm uses the conceptualization of each cluster to compute similarity between clusters. A cluster is conceptualized by summing the concept vectors of its pictures and removing the concepts with relatively low values, which gives a high-precision cluster concept vector. Cluster conceptualization is defined by the following formula:

[0017]

Figure CN104317867AD00052

[0018] Here, c is a concept, C is a cluster, d is a picture in the cluster, and CF-IDF(c, d) is the importance of the concept to the picture.
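The formula itself is only available as an image, so the following is a sketch of the prose in [0016] only: sum the picture vectors of a cluster, then drop low-valued concepts. The function name and the fixed `min_weight` cutoff are assumptions; the patent does not state how "relatively low" is decided.

```python
from collections import Counter

def cluster_concept_vector(picture_vectors, min_weight=0.5):
    """Sum the concept vectors of a cluster's pictures and prune weak concepts."""
    total = Counter()
    for vector in picture_vectors:
        for concept, weight in vector.items():
            total[concept] += weight
    # Keep only concepts whose summed weight reaches the (hypothetical) cutoff.
    return {c: w for c, w in total.items() if w >= min_weight}

# Hypothetical picture vectors for one cluster.
cluster_vector = cluster_concept_vector(
    [{"Mr. Bean": 1.0, "Car": 0.25}, {"Mr. Bean": 0.5}]
)
```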

[0019] Second-layer clustering: the concept vector of the conceptualized context is added to each picture's concept vector, the concept vectors of all clusters obtained from the first layer are updated, and agglomerative hierarchical clustering is applied again to those clusters.

[0020] Third-layer clustering: each picture's vector is replaced by its expanded concept vector, the concept vectors of all clusters obtained from the second layer are updated, and agglomerative hierarchical clustering is applied again to these concept vectors.

[0021] Vector expansion uses the Wikipedia description pages of concepts: related concepts are added to the picture's concept vector, and the concept vector of each cluster is updated. The update is defined by the following formula:

[0022]

Figure CN104317867AD00053

[0023] Here, TF-IDF(c, d_{c_i}) is the importance of concept c to the Wikipedia description page of concept c_i, and c_i is a concept in the current cluster concept vector. The context expansion process filters noisy data by keeping only the k concepts with the largest values.
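As a rough illustration of the top-k filtering described in [0023], the sketch below scores candidate concepts from the description pages of the concepts already in a vector and keeps the k best. The update formula is only available as an image, so the scoring rule used here (existing weight of c_i multiplied by TF-IDF(c, d_{c_i}), summed over c_i) is an assumption, as are the function name and the `page_tfidf` lookup table.

```python
import heapq

def expand_vector(vector, page_tfidf, k=2):
    """Add the k highest-scoring related concepts to a concept vector.

    page_tfidf: concept c_i -> {concept c: TF-IDF of c on c_i's description page}.
    """
    scores = {}
    for ci, weight in vector.items():
        for c, tfidf in page_tfidf.get(ci, {}).items():
            if c not in vector:  # only score concepts that are new to the vector
                scores[c] = scores.get(c, 0.0) + weight * tfidf
    top = heapq.nlargest(k, scores.items(), key=lambda item: item[1])
    expanded = dict(vector)
    expanded.update(top)
    return expanded

# Hypothetical TF-IDF values for one description page.
page_tfidf = {
    "Mr. Bean": {"Rowan Atkinson": 0.5, "Blackadder": 0.25, "Teddy bear": 0.125}
}
expanded = expand_vector({"Mr. Bean": 2.0}, page_tfidf, k=2)
```

With k = 2 the weakest candidate ("Teddy bear") is discarded, which is the noise-filtering effect the paragraph describes.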

[0024] The cluster concept vectors produced by the three-layer clustering are used to label each picture cluster with relevant descriptive concepts: the few highest-valued concepts in each cluster's concept vector are selected to describe the entity that the cluster represents.
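The labeling step in [0024] amounts to ranking a cluster's concepts by weight. A minimal sketch, assuming a dictionary representation of the concept vector; the function name, the default of three labels, and the alphabetical tie-break are assumptions.

```python
def describe_cluster(concept_vector, n=3):
    """Return the n highest-valued concepts of a cluster concept vector."""
    ranked = sorted(concept_vector.items(), key=lambda item: (-item[1], item[0]))
    return [concept for concept, _ in ranked[:n]]

# Hypothetical cluster concept vector.
labels = describe_cluster(
    {"Mr. Bean": 4.0, "Rowan Atkinson": 2.5, "Comedy": 1.0, "Blackadder": 0.5}
)
```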

[0025] The technical problems solved by the invention include:

[0026] 1. Extracting picture context information and representing it as a vector in concept space, providing features for computing picture similarity.

[0027] 2. Because some pictures have too little context information, the invention provides a mechanism for expanding context information: the context concept vector is expanded through Wikipedia or another knowledge base.

[0028] 3. Different features differ in how relevant they are to a picture, and more relevant features carry higher confidence. To exploit features of differing relevance and improve clustering precision, the invention expands the picture concept vectors and clusters in successive stages.

[0029] The technical features of the invention are illustrated below by comparison with related prior art found by search.

[0030] Related search result 1:

[0031] Application (patent) No.: 2012101444570; title: Method and device for picture clustering

[0032] This patent document clusters pictures twice based on their visual features, including global and local features. The second clustering splits the clusters produced by the first.

[0033] Comparison of key technical points:

[0034] 1. That patent clusters pictures by picture content, i.e. visual features, whereas the present invention clusters using features of the picture context.

[0035] 2. That patent's second clustering splits large clusters into small ones, whereas the present invention aggregates small clusters into large ones, using each expansion of the concept vectors to select features and filter noisy data.

[0036] 3. The concept vector representation adopted by the present invention makes it possible to label each cluster with descriptive concepts, whereas clustering based on picture content cannot provide concept descriptions.

[0037] Related search result 2:

[0038] Application (patent) No.: 2013106111554; title: Massive image retrieval system based on clustering compact features

[0039] This patent document clusters the images in an image library by their local features. At search time, the query keywords first retrieve picture clusters, and the corresponding images are then returned.

[0040] Comparison of key technical points:

[0041] 1. That patent generates clustering compact features from the local features of pictures and clusters pictures on them, whereas the present invention clusters using features of the picture context.

[0042] 2. That patent uses image clustering to speed up retrieval, whereas the present invention clusters and conceptualizes the search results in order to present search results separated by category.

[0043] Related search result 3:

[0044] Application (patent) No.: 201210545637X; title: Balanced image clustering method based on hierarchical clustering

[0045] This patent document uses picture clustering to reduce the number of pictures that must be traversed during search. The clustering is based on high-dimensional image feature data.

[0046] Comparison of key technical points:

[0047] 1. That patent clusters pictures by their high-dimensional features, whereas the present invention clusters using features of the picture context.

[0048] 2. That patent uses image clustering to reduce the pictures traversed during retrieval, and its clustering method is hierarchical clustering. The present invention is based on three different context features and improves clustering precision through three-layer clustering.

[0049] Related search result 4:

[0050] Application (patent) No.: 201210163641X; title: Image clustering method

[0051] That patent obtains the time data and position data of images from the capture device, and clusters using time, position, and speed data as features.

[0052] Comparison of key technical points:

[0053] 1. That patent mainly clusters captured photographs, whereas the present invention clusters web pictures. Captured photographs have no context information, while web pictures are not necessarily photographs and most of them lack a capture time and location. The features of the two differ.

[0054] 2. That patent clusters on event sequences, whereas the present invention is based on concept vectors. Concept vectors can be used to generate descriptive concepts.

[0055] Related search result 5:

[0056] Application (patent) No.: 2009801523973; title: Arranging images into pages using content-based filtering and theme-based clustering

[0057] That patent clusters pictures captured by a device by theme, based on picture content, i.e. visual features, and maps the clustering results to corresponding albums.

[0058] Comparison of key technical points:

[0059] 1. That patent clusters using the visual features of pictures, whereas the present invention clusters using the context of web pictures.

[0060] 2. That patent lays pictures out onto different pages, whereas the present invention provides users with categorized search results and the corresponding descriptive concepts.

[0061] Related search result 6:

[0062] Application (patent) No.: 2010105171639; title: Image clustering method and system

[0063] That patent builds a directed graph of images by parameter estimation and clusters images by partitioning the directed graph. The partitioning yields multiple subgraphs, and the images of each subgraph are assigned to one cluster.

[0064] Comparison of key technical points:

[0065] 1. That patent clusters using a graph, with the image library represented as a directed graph. The present invention forms picture clusters by aggregating pictures from small clusters to large ones, with each clustering layer considering a different picture context feature.

[0066] Related search result 7:

[0067] Application (patent) No.: 2005800393866; title: Image clustering method and system

[0068] That patent clusters images by event using time and location features. Its clustering algorithm performs clustering at different layers according to different time ranges.

[0069] Comparison of key technical points:

[0070] 1. The layers in that patent's multi-layer clustering are different time ranges, whereas the layers of the present invention are defined by different features.

[0071] 2. That patent clusters by event sequence, whereas the present invention separates picture clusters by entity.

[0072] Compared with the prior art, the present invention inventively uses three different kinds of features and a corresponding three-layer clustering algorithm to cluster pictures, and provides concept labels for each cluster, so that picture search results are better organized by entity, each entity cluster has high precision, and different entities are clearly distinguished. The invention splits the overall framework into an online part and an offline part, which greatly reduces the time cost of online clustering.

Brief Description of the Drawings

[0073] Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the drawings:

[0074] FIG. 1 shows the system framework of the invention;

[0075] FIG. 2 shows an example of the three-layer clustering algorithm of the invention.

Detailed Description

[0076] An embodiment of the invention is described in detail below with reference to the drawings. The embodiment is implemented on the premise of the technical solution of the invention, and a detailed implementation and a specific operating procedure are given, but the scope of protection of the invention is not limited to the embodiment below.

[0077] The task of this embodiment is, for the user query keyword "bean", to obtain the search engine's picture search results, cluster the different instances of "bean" in the results so as to tell the different entities apart, and provide different concept labels for each distinct "bean".

[0078] As shown in FIG. 1, the metadata extraction module of the offline system extracts the metadata context from all original web pages related to "bean" in this embodiment. For example, the URL of one web page is:

[0079] "http://domain.com/53C316-C2oJ5/mr_bean.jpg"

[0080] The metadata extraction module splits the words on delimiters and uses the binary classifier to detect the valid strings, for example "mr bean". The conceptualization module of the offline system conceptualizes the metadata and related web pages for "bean", producing metadata concept vectors and text concept vectors.
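The URL handling in [0080] could look roughly like the sketch below. The delimiter set and the `looks_valid` rule are assumptions: the patent uses a trained binary classifier whose features are not described, so a simple alphabetic check stands in for it here purely for illustration.

```python
import re
from urllib.parse import urlparse

def url_terms(url):
    """Split the path of a URL into lowercase candidate terms."""
    path = urlparse(url).path
    path = re.sub(r"\.[A-Za-z0-9]+$", "", path)  # drop a trailing file extension
    return [t.lower() for t in re.split(r"[/_\-.+]+", path) if t]

def looks_valid(term):
    """Hypothetical stand-in for the binary valid/invalid term classifier."""
    return term.isalpha() and len(term) >= 2

url = "http://domain.com/53C316-C2oJ5/mr_bean.jpg"
candidates = url_terms(url)
valid_terms = [t for t in candidates if looks_valid(t)]
```

On the example URL, the hash-like path segments are rejected and the terms "mr" and "bean" remain, matching the "mr bean" result described in the paragraph.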

[0081] After receiving the user's query keyword "bean", the text context extraction module of the online system locates the picture and the query keyword "bean" in the conceptualized text and extracts the 50 concepts before and after each as the text context concept vector. Using the metadata concept vectors and the text context concept vectors, the online system performs three-layer clustering.

[0082] As shown in FIG. 2, the three-layer clustering module of the online system first computes picture similarity from the metadata concept vectors and performs agglomerative hierarchical clustering (the concept vectors of picture 1 and picture 2 both contain the concept "Mr. Bean", while no valid metadata concept was found for picture 3 or picture 4). In agglomerative hierarchical clustering, the similarity between clusters is computed from the cluster concept vectors. The system computes the cluster concept vectors from the result of the first layer; for example, picture 1 and picture 2 form one cluster, whose concept vector contains the concept "Mr. Bean".

[0083] Second-layer clustering builds on the first layer and clusters further after extending the picture concept vectors. In FIG. 2, the concept "Rowan Atkinson" is added to the concept vector of the cluster formed by pictures 1 and 2, "Rowan Atkinson" and "Comedy" are added to the concept vector of picture 3, and "Blackadder" is added to picture 4. Because the extended vectors share more concepts, the second round of hierarchical clustering merges some similar clusters into larger ones. In FIG. 2, pictures 1, 2, and 3 form a new cluster, whose concept vector is extended to "Mr. Bean", "Rowan Atkinson", "Comedy".

[0084] Third-layer clustering first expands the vector of each cluster or picture using Wikipedia. In FIG. 2, "Blackadder" is added to the concept vector of the cluster made up of pictures 1, 2, and 3, and "Rowan Atkinson" is added to picture 4. After the Wikipedia-based expansion, the cluster vectors are more similar to one another. Through the third round of hierarchical clustering, the online system further merges clusters that previously went unmerged for lack of information. In FIG. 2, picture 4 can be merged, via its expanded vector, into the cluster containing pictures 1, 2, and 3.
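The three rounds described for FIG. 2 can be replayed on toy data. This is a sketch under stated assumptions, not the patented implementation: cosine similarity, the greedy merge loop, the 0.25 threshold, and all concept weights are hypothetical, and the pruning of low-valued concepts is omitted for brevity.

```python
import math

def cosine(u, v):
    """Cosine similarity between two sparse concept vectors (dicts)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def cluster_vector(cluster, picture_vectors):
    """Sum the concept vectors of all pictures in a cluster."""
    total = {}
    for picture in cluster:
        for c, w in picture_vectors[picture].items():
            total[c] = total.get(c, 0.0) + w
    return total

def agglomerate(clusters, picture_vectors, threshold):
    """Merge the most similar pair of clusters until no pair reaches threshold."""
    clusters = [set(c) for c in clusters]
    while len(clusters) > 1:
        vecs = [cluster_vector(c, picture_vectors) for c in clusters]
        best, i, j = max(
            (cosine(vecs[a], vecs[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[i] |= clusters[j]
        del clusters[j]
    return clusters

# Layer 1: metadata only. Layer 2: plus text context. Layer 3: expanded vectors.
layer1 = {"p1": {"Mr. Bean": 1.0}, "p2": {"Mr. Bean": 1.0}, "p3": {}, "p4": {}}
layer2 = {"p1": {"Mr. Bean": 1.0, "Rowan Atkinson": 1.0}, "p2": {"Mr. Bean": 1.0},
          "p3": {"Rowan Atkinson": 1.0, "Comedy": 1.0}, "p4": {"Blackadder": 1.0}}
layer3 = {"p1": {"Mr. Bean": 1.0, "Rowan Atkinson": 1.0, "Blackadder": 0.5},
          "p2": {"Mr. Bean": 1.0},
          "p3": {"Rowan Atkinson": 1.0, "Comedy": 1.0},
          "p4": {"Blackadder": 1.0, "Rowan Atkinson": 0.5}}

after1 = agglomerate([[p] for p in layer1], layer1, threshold=0.25)
after2 = agglomerate(after1, layer2, threshold=0.25)
after3 = agglomerate(after2, layer3, threshold=0.25)
```

Each layer starts from the clusters left by the previous one, so later, noisier features can only merge clusters and never split the high-precision groups formed from metadata.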

[0085] After the three-layer clustering algorithm finishes, the online system separates the clusters and presents all entities and their pictures to the user. Each entity is described by the top few most representative (highest-valued) concepts in its concept vector. For example, the cluster in FIG. 2 can be described by concepts such as "Mr. Bean", "Rowan Atkinson", "Comedy", and "Blackadder", which describe pictures of the comedian known as Mr. Bean.

[0086] Specific embodiments of the invention have been described above. It should be understood that the invention is not limited to the specific embodiments described; those skilled in the art may make various variations or modifications within the scope of the claims without affecting the substance of the invention.

Claims (8)

1. 一种对搜索引擎返回的网页图片进行实体聚类的系统,其特征在于,包括离线系统和在线系统,其中: 离线系统,用于对所有图片所在的源网页进行预处理,包括抽取网页元数据,把原网页文本和元数据概念化成一组带权概念的集合,即,概念向量,概念化后的元数据和网页内容供在线系统查询使用; 在线系统,用于接收查询,提交到搜索引擎并接收返回的多页图片结果,对于每一个页的返回结果,找到源网页的概念化元数据和文本,并在概念化的文本中抽取查询关键词的上下文以及图片上下文,在线系统分别利用元数据,上下文,以及对上下文进行概念扩展后的扩展上下文进行三层聚类,并为每一个类别自动标注相关的描述性概念,以了解每一个类别的实体。 A search engine returns the page image to the entity clustering system, wherein the system includes an offline and online systems, wherein: the off-line system for preprocessing the source images where all pages, including page extraction meta data, and the original text of the page metadata concepts into a set of weighted set of concepts, i.e., the vector concept, metadata and web content for the conceptual query using the online system; on-line system for receiving queries submitted to a search engine and receive multi-page picture of the results returned for each return a page of results, find the conceptualization source metadata and text pages, and extract text query keywords in the conceptualization of the context and the context of the picture, respectively, using the online system metadata , context, expansion and after expansion of the concept of context clustering three contexts, and for each type automatic annotation associated descriptive term, for each type of entity.
2. 根据权利要求1所述的对搜索引擎返回的网页图片进行实体聚类的系统,其特征在于,所述离线系统进行元数据抽取,包括对URL中有效词条的抽取,图片ALT属性,其中对URL有效词条的抽取,是利用二类分类器对有效和无效词条进行分类,并返回有效词条。 2. The system according to physical clustering of images on a web search engine returns according to claim 1, wherein the metadata extraction offline systems, including the extraction of valid entries in the URL, ALT attribute, wherein the effective extraction of URL entries, using second-class classifier classifying valid and invalid entries, and returns a valid entry.
3. 根据权利要求1所述的对搜索引擎返回的网页图片进行实体聚类的系统,其特征在于,所述离线系统包括概念化模块,用于对上下文进行概念扩展,文本通过概念化模块,转换成带权概念的集合,每个概念的权值为该概念对图片的重要性,其定义如下: |D| CF-IDF(c,d) =CF(c,d)x\og-^-^ 其中,CF-IDF(c,d)为概念c对图片d的重要性,包括两部分的乘积:概念在图片上下文出现的频率CF(c,d),以及反向上下文频率,其中反向上下文频率反比于概念出现过的上下文的数量DF(C),D为所有图片的上下文的集合。 The search engine returns the page image according to claim 1 for the entity clustering system, wherein said system comprises a conceptualization off module, configured to extend the concept of contexts, text by module conceptualization converted into weighted set of concepts, each concept is the weight the importance of the concept of the picture, which is defined as follows: | D | CF-IDF (c, d) = CF (c, d) x \ og - ^ - ^ wherein, CF-IDF (c, d) is a conceptual importance of c d picture, comprising the product of two parts: frequency CF2 (c, d) emerging concepts in the context of images, and the inverse context frequency, wherein the reverse context frequency is inversely proportional to the concept of a context number appeared DF (C), D is the set of all images of the context.
4. 根据权利要求1所述的对搜索引擎返回的网页图片进行实体聚类的系统,其特征在于,在线系统包括文本上下文抽取模块,用于对所输入的查询关键词,抽取其概念化查询上下文和图片上下文。 The search engine returns the page image according to claim 1 for the entity clustering system, wherein the system includes a line text context extraction module configured to query the inputted keyword, the query context extraction conceptualized and pictures context.
5. The system for performing entity clustering on web pictures returned by a search engine according to claim 4, characterized in that the online system comprises a three-layer clustering algorithm module which, based on three kinds of features, namely the extracted metadata, the context, and the expanded context, performs clustering at three levels, proceeding from the metadata, which has the highest confidence, to the context and then to the expanded context, wherein: in the first clustering layer, agglomerative hierarchical clustering is performed on the concept vectors of the conceptualized metadata to obtain clusters with high intra-class precision, and the concept vectors of all pictures in each class are merged to form the concept vector of the class; in the second clustering layer, the concept vector of the conceptualized context is added to the concept vector of each picture, the concept vectors of all classes obtained from the first layer are updated, and agglomerative hierarchical clustering is further performed on those classes; in the third clustering layer, the vector of each picture is replaced with the expanded concept vector, the concept vectors of all classes obtained from the second layer are updated, and agglomerative hierarchical clustering is further performed on those concept vectors.
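The three layers of claim 5 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: cosine similarity as the class similarity measure, a fixed similarity threshold as the stopping rule, and all function names and sample vectors are hypothetical. The claim itself fixes only the order of the layers and which features each layer uses.

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse concept vectors (assumed measure)."""
    dot = sum(w * v.get(c, 0.0) for c, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def merge(vectors):
    """Sum sparse concept vectors into one class concept vector."""
    total = {}
    for vec in vectors:
        for c, w in vec.items():
            total[c] = total.get(c, 0.0) + w
    return total

def agglomerate(clusters, pic_vectors, threshold):
    """Merge the most similar pair of classes until none reaches threshold."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        vecs = [merge(pic_vectors[p] for p in c) for c in clusters]
        best, pair = threshold, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                sim = cosine(vecs[i], vecs[j])
                if sim >= best:
                    best, pair = sim, (i, j)
        if pair is None:
            break
        i, j = pair
        absorbed = clusters.pop(j)      # j > i, so index i stays valid
        clusters[i].extend(absorbed)
    return clusters

def three_layer(meta, context, expanded, threshold=0.5):
    pics = list(meta)
    # Layer 1: cluster on metadata concept vectors only.
    clusters = agglomerate([[p] for p in pics], meta, threshold)
    # Layer 2: add each picture's context vector, then cluster the classes.
    layer2 = {p: merge([meta[p], context[p]]) for p in pics}
    clusters = agglomerate(clusters, layer2, threshold)
    # Layer 3: replace picture vectors with expanded vectors, cluster again.
    return agglomerate(clusters, expanded, threshold)

# Hypothetical vectors for four pictures returned for the query "apple".
meta = {"a": {"apple inc": 1.0}, "b": {"apple inc": 1.0},
        "c": {"fruit": 1.0}, "d": {"orchard": 1.0}}
context = {"a": {"iphone": 1.0}, "b": {"iphone": 1.0},
           "c": {"tree": 1.0}, "d": {"farm": 1.0}}
expanded = {"a": {"technology": 1.0, "company": 1.0},
            "b": {"technology": 1.0, "company": 1.0},
            "c": {"fruit": 1.0, "plant": 1.0},
            "d": {"fruit": 1.0, "plant": 1.0, "orchard": 1.0}}
result = three_layer(meta, context, expanded)
```

In this toy run, pictures `a` and `b` already merge on metadata in the first layer, while `c` and `d` only merge in the third layer, once their expanded vectors share concepts.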
6. The system for performing entity clustering on web pictures returned by a search engine according to claim 5, characterized in that the agglomerative hierarchical clustering algorithm used computes the similarity between classes from the conceptualization of each class; a class is conceptualized by summing the concept vectors of the pictures in the class and removing the concepts with relatively low values, yielding a high-precision class concept; the conceptualization of a class is defined by the following formula:
Figure CN104317867AC00031
wherein c is a concept, C is a class, d is a picture in the class, and CF-IDF(c,d) is the importance of concept c to picture d.
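The class conceptualization described in claim 6 (sum the picture vectors, then drop low-valued concepts) might look like the following Python sketch. The exact formula appears only as a figure in the source, so the pruning rule here, a fixed cut-off named `min_weight`, is an assumption, as are the function name and the sample data.

```python
def class_concept_vector(picture_vectors, min_weight=1.0):
    """Sum the concept vectors of a class and drop low-valued concepts.

    picture_vectors: iterable of {concept: CF-IDF weight} dicts, one per
    picture in the class.
    min_weight: hypothetical cut-off; the source only says that concepts
    with relatively low values are removed.
    """
    total = {}
    for vec in picture_vectors:
        for concept, weight in vec.items():
            total[concept] = total.get(concept, 0.0) + weight
    return {c: w for c, w in total.items() if w >= min_weight}

# Two pictures assumed to belong to the same class.
cluster = [
    {"fruit": 0.75, "tree": 0.5},
    {"fruit": 0.5, "recipe": 0.25},
]
class_vec = class_concept_vector(cluster, min_weight=0.5)
```

Here `recipe` is removed because its summed weight falls below the assumed cut-off, while `fruit` and `tree` remain in the class vector.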
7. The system for performing entity clustering on web pictures returned by a search engine according to claim 5, characterized in that the third clustering layer expands the context by means of Wikipedia, replaces the concept vector of each picture with the expanded concept vector, and updates the concept vector of each class, the update being defined by the following formula:
Figure CN104317867AC00032
wherein CF-IDF(c, d_{c_i}) is the importance of concept c to the Wikipedia description page of concept c_i, V_C is the set of all concepts in the current class concept vector, and c_i is a concept in the current class concept vector; the context expansion process filters out noisy data by selecting the k concepts with the largest values.
8. The system for performing entity clustering on web pictures returned by a search engine according to claim 1, characterized in that the class concept vectors obtained after the three-layer clustering are used to annotate each picture class with relevant descriptive concepts, the several highest-valued concepts in each class's concept vector being selected to describe the entity that the class represents.
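The labeling step of claim 8 amounts to picking the highest-weighted concepts from each class concept vector. A minimal Python sketch follows; the function name `label_class`, the default of three labels, and the sample vector are hypothetical, since the claim only says "the several highest-valued concepts".

```python
import heapq

def label_class(class_vector, n=3):
    """Return the n concepts with the largest weights, highest first."""
    return heapq.nlargest(n, class_vector, key=class_vector.get)

# Hypothetical class concept vector after the three clustering layers.
class_vector = {"fruit": 1.25, "tree": 0.5, "recipe": 0.25, "food": 0.9}
labels = label_class(class_vector, n=2)
```

With this sample vector the two selected labels are `fruit` and `food`, in that order.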
CN 201410554684 2014-10-17 2014-10-17 System for carrying out entity clustering on web pictures returned by search engine CN104317867B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201410554684 CN104317867B (en) 2014-10-17 2014-10-17 System for carrying out entity clustering on web pictures returned by search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201410554684 CN104317867B (en) 2014-10-17 2014-10-17 System for carrying out entity clustering on web pictures returned by search engine

Publications (2)

Publication Number Publication Date
CN104317867A (en) 2015-01-28
CN104317867B CN104317867B (en) 2018-02-09

Family

ID=52373099

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201410554684 CN104317867B (en) 2014-10-17 2014-10-17 System for carrying out entity clustering on web pictures returned by search engine

Country Status (1)

Country Link
CN (1) CN104317867B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105279264A (en) * 2015-10-26 2016-01-27 深圳市智搜信息技术有限公司 Semantic relevancy calculation method of document

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090094020A1 (en) * 2007-10-05 2009-04-09 Fujitsu Limited Recommending Terms To Specify Ontology Space
CN101751439A (en) * 2008-12-17 2010-06-23 中国科学院自动化研究所 Image retrieval method based on hierarchical clustering
CN102902821A (en) * 2012-11-01 2013-01-30 北京邮电大学 Methods for labeling and searching advanced semantics of images based on network hot topics and device
CN103577537A (en) * 2013-09-24 2014-02-12 上海交通大学 Image sharing website picture-oriented multi-pairing similarity determining method

Also Published As

Publication number Publication date Type
CN104317867B (en) 2018-02-09 grant

Similar Documents

Publication Publication Date Title
Hu et al. Toward scalable systems for big data analytics: A technology tutorial
Li et al. Tag-based social interest discovery
US7917514B2 (en) Visual and multi-dimensional search
US20090265338A1 (en) Contextual ranking of keywords using click data
US20080005091A1 (en) Visual and multi-dimensional search
Hu et al. Semantic link network-based model for organizing multimedia big data
Mei et al. Multimedia search reranking: A literature survey
US20070098266A1 (en) Cascading cluster collages: visualization of image search results on small displays
Jiang et al. Learning and inferencing in user ontology for personalized Semantic Web search
Chen et al. Collabseer: a search engine for collaboration discovery
US20080147642A1 (en) System for discovering data artifacts in an on-line data object
Chau et al. Personalized spiders for web search and analysis
US20080010291A1 (en) Techniques for clustering structurally similar web pages
US20050050086A1 (en) Apparatus and method for multimedia object retrieval
US20110173197A1 (en) Methods and apparatuses for clustering electronic documents based on structural features and static content features
Hua et al. Clickage: Towards bridging semantic and intent gaps via mining click logs of search engines
US20110113047A1 (en) System and method for publishing aggregated content on mobile devices
Ma et al. Efficiently finding web services using a clustering semantic approach
US20090112830A1 (en) System and methods for searching images in presentations
Jansen Searching for digital images on the web
Au Yeung et al. Contextualising tags in collaborative tagging systems
US20110082859A1 (en) Information theory based result merging for searching hierarchical entities across heterogeneous data sources
US20080147588A1 (en) Method for discovering data artifacts in an on-line data object
Zhao et al. Event detection from evolution of click-through data
US20110264651A1 (en) Large scale entity-specific resource classification

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
GR01