CN103412883A - Semantic intelligent information publishing and subscribing method based on P2P technology - Google Patents
Semantic intelligent information publishing and subscribing method based on P2P technology
- Publication number
- CN103412883A (application CN201310302187A)
- Authority
- CN
- China
- Prior art keywords
- node
- information
- data
- subscription
- semantic
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
A semantic intelligent information publish/subscribe method based on P2P technology, comprising the following steps: (1) construct the system topology and designate a super node; (2) normalize the high-dimensional attribute space; (3) partition the high-dimensional data and build and maintain a global index tree; (4) reduce the dimensionality of the high-dimensional data: each node organizes the data aggregation region assigned to it into a multi-dimensional hypercube in the high-dimensional attribute space, then uses the pyramid dimensionality-reduction method to map the high-dimensional data objects to a one-dimensional data space, which is indexed with a B+ tree; (5) manage the data objects with the i-Chord method; (6) implement semantics-based intelligent information subscription and publication. The invention is simple in principle, easy to implement and deploy, and improves the fault tolerance, dynamic adaptability, and information distribution efficiency of the system.
Description
Technical Field
The invention relates to semantics-based intelligent information interaction in large-scale information networks, and in particular to a semantic intelligent information publish/subscribe method based on P2P technology.
Background Art
With the rapid development and wide adoption of computer networks, digital resources on the network are growing exponentially and taking increasingly diverse forms, while users' demands on information retrieval keep rising. How to efficiently obtain the information a user is actually interested in from this mass of heterogeneous network resources has become a problem of growing concern.
Network environments are large in scale, decentralized, loosely coupled, autonomous, and dynamic. To cope with this, researchers proposed publish/subscribe (hereinafter P/S) technology. A P/S system consists of three parts: publishers, subscribers, and event brokers. A publisher is an object that produces events, that is, an information producer. A subscriber is an object that consumes events, that is, an information consumer. The event broker is the publish/subscribe middleware: publishers publish information to the event broker in the form of events, subscribers register the events they are interested in with the event broker, and the event broker routes each published event promptly and reliably to the interested subscribers. P/S is an asynchronous communication paradigm that supports both one-to-many and many-to-many information interaction. It fully decouples the participants in time, space, and control flow, and it supports anonymous communication, so it is well suited to the loosely coupled communication needs of large-scale distributed network systems.
Research on publish/subscribe technology is still developing. Existing systems have problems with reliability and information distribution efficiency, and many key techniques remain unsolved. In terms of topology, for example, existing P/S middleware is usually designed as either a centralized system or an unstructured P2P system. A centralized topology relies on a single server to mediate between publishers and subscribers (for example SIENA from the University of Colorado, Gryphon from IBM Research, and JEDI); its drawback is that the server easily becomes a performance bottleneck, and if the server fails the whole system stops working. An unstructured P2P topology (for example Hermes, proposed at the University of Cambridge) typically routes information by flooding, gossiping, or random walk; because the overlay has no structure and nodes join and leave dynamically, event routes are hard to maintain and the system scales poorly.
Moreover, information resources in an open network environment take many forms, and two kinds of heterogeneity are common. Structural heterogeneity means that different users represent the same event with different structures; for example, some events are in Map format and others in XML format. Semantic heterogeneity means that different users describe the same event with different vocabulary (terms), or use the same term for different concepts. Existing P/S systems (such as CORBA, Scribe, Bayeux, and JEDI) are weak in expressiveness: they describe an event by its structural information and lack any understanding of the semantics of the event itself. Matching between events and subscriptions is exact matching, which is easily disturbed by synonyms and near-synonyms, may return many wrong results that deviate from what the user meant, and cannot provide intelligent matching based on information semantics.
To strengthen the semantic expressiveness of the system and enable intelligent matching based on information semantics, the heterogeneous digital resources in an information network can be abstracted as points or feature vectors in a high-dimensional attribute space, and the semantic similarity between data objects can be measured by the distance between the points or the angle between the feature vectors. A high-dimensional attribute space, however, brings the "curse of dimensionality": data in such a space is sparse and tends to concentrate near the surface of the space, so semantic similarity search becomes expensive and inefficient. Dimensionality reduction maps data objects from a hard-to-manage high-dimensional space to a low-dimensional space, which shrinks the search space and improves retrieval efficiency, and is one of the effective ways to address the curse of dimensionality.
For example, the Chinese patent application titled "An Embedded Dimensionality Reduction Method Based on Image Data Structure Preservation" divides the vectors of the original multi-dimensional data set into similar and dissimilar subsets according to the pairwise distances between vectors, applies a different embedding operation to each kind of subset to transform the distances, and then projects the new distance matrix to reduce the dimensionality. In a real information network, however, data objects are complex in composition and numerous in kind, and their forms and semantic attributes keep changing, so it is hard to abstract all of them uniformly as vectors of fixed dimension and fixed type. In addition, many kinds of attributes can be defined for digital resources in a high-dimensional attribute space, but many of them are irrelevant to a given search (a medical concept, for example, is unlikely to appear in computer science). It is therefore necessary to map the various data objects into an attribute space with a fixed structure and to remove attributes that are irrelevant to search, which reduces the computation in semantic similarity search and further improves search efficiency.
In summary, existing publish/subscribe systems fall short in dynamic adaptability, fault tolerance, and self-organization. They are also limited in expressiveness: they lack an understanding of the semantics of the events themselves and cannot support semantics-based intelligent information interaction between users.
Summary of the Invention
The technical problem to be solved by the invention is as follows: in view of the problems in the prior art, the invention provides a semantic intelligent information publish/subscribe method based on P2P technology that is simple in principle, easy to implement and deploy, and improves the fault tolerance, dynamic adaptability, and information distribution efficiency of the system.
To solve this technical problem, the invention adopts the following technical solution:
A semantic intelligent information publish/subscribe method based on P2P technology, comprising the following steps:
(1) Construct the system topology and designate a super node: using structured P2P technology, organize the multiple event brokers of the P/S system into a Chord ring, and designate one super node on the ring. The super node extracts the attributes of the data resources in the information network and constructs the attribute space; the data resources are abstracted as points or vectors in a high-dimensional attribute space.
(2) Normalize the high-dimensional attribute space: on the super node, use the vector space model to represent each piece of data in the network as a point vector in the high-dimensional attribute space, so that all data in the network is mapped into one high-dimensional attribute space, represented mathematically as a high-order matrix. Use latent semantic indexing to remove the attributes that are only weakly relevant to information retrieval, so that an attribute subspace approximates the original high-dimensional attribute space.
(3) Partition the high-dimensional data and build and maintain a global index tree: the super node SN divides the high-dimensional data in the attribute space into data aggregation regions and assigns each region to a different node. The super node also maintains an index tree built from the information of all nodes, called the global index tree, which is used to assign event information to the nodes of the Chord ring and to determine which proxy nodes a subscription request must visit.
(4) Reduce the dimensionality of the high-dimensional data: each node organizes the data aggregation region assigned to it into a multi-dimensional hypercube in the high-dimensional attribute space, then uses the pyramid dimensionality-reduction method to map the high-dimensional data objects to a one-dimensional data space, which is indexed with a B+ tree.
(5) Manage data objects with the i-Chord method: after the high-dimensional data set has been mapped to the one-dimensional data space, use the Chord protocol to organize and maintain the network data. An order-preserving function maps the values in the one-dimensional data space to the Chord resource identifier space. Each Chord node corresponds to one data aggregation region and stores and manages the high-dimensional data of that region, and each node maintains a routing table for fast lookup.
(6) Implement semantics-based intelligent information subscription and publication: each node in the system maintains a subscription table that records the subscriptions semantically related to that node. When a subscriber sends a subscription request to the system, the global index tree is searched to determine the proxy nodes semantically related to the request, and the request is sent to those proxy nodes. Each proxy node adds a record to its subscription table, registering the association between the subscription request and the node. The proxy node then derives a precise search range in the high-dimensional attribute space from the subscription request and its conditions, searches that range for event information with the same semantics as the request, and returns it to the subscriber. When a publisher sends event information to the system, the global index tree is searched to determine the host node semantically related to the event, the event information is sent to the host node, and the host node's subscription table is consulted. If the event information semantically matches a subscription in the table, the event is pushed to the user.
As a further improvement of the invention, step (1) comprises the following sub-steps:
(1.1) A P/S system may contain multiple event brokers. The event brokers are organized into an event broker network with a Chord ring structure, each event broker being one node of the network. Each event broker stores data resources of the information network according to given rules and keeps information about some of the other nodes.
(1.2) The most capable node in the Chord ring is selected as the super node SN. The super node SN periodically checks the capability of the other nodes in the ring and selects candidate super nodes from them; the candidate super nodes back up the important information held on SN.
(1.3) The super node SN extracts the attributes of data resources in their different forms and constructs the multi-dimensional attribute space; the data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.
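Sub-step (1.2) can be illustrated with a minimal sketch. The patent does not say how "capability" is measured, so the `capability` score, the `Broker` class, and the `elect_super_nodes` helper below are all hypothetical; the sketch only shows the selection of the strongest node plus a few backup candidates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Broker:
    node_id: int        # identifier of the event broker on the Chord ring
    capability: float   # hypothetical score (e.g. bandwidth, CPU, uptime)


def elect_super_nodes(brokers, n_candidates=1):
    """Return (super_node, candidates): the strongest broker and its backups."""
    if not brokers:
        raise ValueError("the ring has no brokers")
    # Highest capability first; ties are broken by node_id so the result is deterministic.
    ranked = sorted(brokers, key=lambda b: (-b.capability, b.node_id))
    return ranked[0], ranked[1:1 + n_candidates]


ring = [Broker(8, 0.4), Broker(21, 0.9), Broker(42, 0.7), Broker(56, 0.7)]
super_node, candidates = elect_super_nodes(ring, n_candidates=2)
```

In a running system the super node would repeat this ranking periodically and replicate its important state to the returned candidates.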
As a further improvement of the invention, step (2) comprises the following sub-steps:
(2.1) Following the vector space model, each piece of high-dimensional data in the network is described as an attribute vector, and the system organizes all high-dimensional data in the network into a matrix. For example, $d$ pieces of high-dimensional data described by $t$ attributes form a $t \times d$ matrix $A$. Each column of the matrix represents one piece of high-dimensional data, and the element $a_{ij}$ is the value of attribute $i$ in data object $j$, representing the importance of attribute $i$ for object $j$. If data object $j$ does not have attribute $i$, then $a_{ij} = 0$.
(2.2) Most elements of the matrix are 0, which indicates that most attributes of any given data object are irrelevant for retrieval, so a low-rank matrix is used to approximate the original matrix. Let the set of high-dimensional data in the network be represented by the matrix $A$ with rank $r$. The singular value decomposition factors $A$ into the product of three matrices:
$A = U \Sigma V^{T}$
where $U = (u_1, u_2, \ldots, u_r)$ is a $t \times r$ matrix, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is an $r \times r$ diagonal matrix, $V = (v_1, \ldots, v_r)$ is a $d \times r$ matrix, and the $\sigma_i$ are the singular values of $A$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
(2.3) Only the $l$ largest singular values are kept ($l \le r$) and the others are dropped, so that the rank-$r$ matrix $A$ is approximated by the rank-$l$ matrix $A_l$:
$A_l = U_l \Sigma_l V_l^{T}$
where $U_l = (u_1, u_2, \ldots, u_l)$, $\Sigma_l = \mathrm{diag}(\sigma_1, \ldots, \sigma_l)$, and $V_l = (v_1, \ldots, v_l)$. The rows of $V_l \Sigma_l$ are the semantic vectors of the high-dimensional data objects.
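A small worked example may help make the truncation concrete. The numbers are illustrative only and are not taken from the patent: take $t = 3$ attributes and $d = 2$ data objects, and keep a single singular value. The matrix is chosen so that its decomposition can be read off by inspection.

```latex
\[
A = \begin{pmatrix} 2 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}
  = \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{pmatrix}}_{U}
    \underbrace{\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}}_{\Sigma}
    \underbrace{\begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}}_{V^{T}},
\qquad \sigma_1 = 2 \ge \sigma_2 = 1 .
\]
\[
A_1 = \sigma_1 u_1 v_1^{T}
    = \begin{pmatrix} 2 & 0 \\ 0 & 0 \\ 0 & 0 \end{pmatrix},
\qquad
V_1 \Sigma_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix} \begin{pmatrix} 2 \end{pmatrix}
             = \begin{pmatrix} 2 \\ 0 \end{pmatrix},
\qquad
\lVert A - A_1 \rVert_F = \sigma_2 = 1 .
\]
```

Here the first data object keeps the one-dimensional semantic vector $(2)$ and the second collapses to $(0)$. The approximation error in the Frobenius norm equals the dropped singular value, which is the general behaviour of a truncated singular value decomposition.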
As a further improvement of the invention, step (3) comprises the following sub-steps:
(3.1) The data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space. On the principle that data resources that are close together in the attribute space have similar semantics, the super node SN clusters the high-dimensional data distributed in the multi-dimensional attribute space, divides it into multiple mutually disjoint data aggregation regions, and assigns each region to a different node.
(3.2) Besides its own routing table, the super node SN maintains an index tree built from the resource identifier ranges of all nodes on the ring, called the global index GI. The global index serves two purposes:
(3.2.1) Fast assignment of event information to nodes. Each node in the ring is responsible for one data aggregation region, and all data in that region is assigned to that node. When a piece of event information asks to join, GI is queried to determine which data aggregation region it belongs to, which identifies the host node of the event information. The identifier of the host node is then used as the search key, and the Chord routing protocol delivers the event information to the host node.
(3.2.2) Determining the nodes a subscription request must visit. When a subscription request is entered at some node, the request is first sent to SN, which searches GI to determine which nodes are responsible for data aggregation regions that intersect the semantic space of the request, and returns the identifiers of those nodes. The returned nodes are called proxy nodes. Their identifiers are then used as search keys to route the subscription request to them, where the semantic matching is carried out.
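The two uses of the global index in (3.2.1) and (3.2.2) can be sketched as follows. This is a simplification under stated assumptions: the patent describes GI as a tree, whereas the sketch uses a sorted list with binary search as a stand-in, and it treats a subscription's semantic space as a range of one-dimensional identifiers. The `GlobalIndex` class and its method names are hypothetical.

```python
import bisect


class GlobalIndex:
    """Sorted-list stand-in for the global index GI kept by the super node."""

    def __init__(self, node_keys):
        # node_keys are ring identifiers Nkey_i; node i covers (Nkey_{i-1}, Nkey_i].
        self.keys = sorted(node_keys)

    def host_node(self, key):
        """Node whose identifier range contains `key` (wrapping around the ring)."""
        pos = bisect.bisect_left(self.keys, key)
        return self.keys[pos % len(self.keys)]

    def proxy_nodes(self, lo, hi):
        """Nodes whose ranges intersect the subscription key range [lo, hi]."""
        if lo > hi:
            raise ValueError("expected lo <= hi")
        first = bisect.bisect_left(self.keys, lo)
        last = bisect.bisect_left(self.keys, hi)
        hits = (self.keys[p % len(self.keys)] for p in range(first, last + 1))
        return list(dict.fromkeys(hits))   # drop duplicates, keep ring order


gi = GlobalIndex([8, 21, 42, 56])
host = gi.host_node(22)            # identifier 22 falls in (21, 42]
proxies = gi.proxy_nodes(20, 43)   # range touches the regions of three nodes
```

Once the host or proxy identifiers are known, the request itself would be delivered with ordinary Chord routing, using those identifiers as search keys.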
As a further improvement of the invention, step (4) comprises the following sub-steps:
(4.1) Each data aggregation region is organized into a $d$-dimensional hypercube in the high-dimensional data space, and the hypercube is normalized so that every side has length 1 and its center is the point $(0.5, 0.5, \ldots, 0.5)$. Taking the center of the hypercube as the apex and the $(d-1)$-dimensional bounding hyperplanes of the region as bases, each $d$-dimensional data aggregation region is divided into $2d$ pyramids.
(4.2) Each pyramid is divided into several slices parallel to its base, and each slice corresponds to one B+ tree data page. The high-dimensional data in the region is then mapped to one-dimensional values according to the distance from each data point to the pyramid base, and the reduced data is organized and managed with a B+ tree. The mapping is:
$y_v = i \cdot c + (j + h_v), \quad h_v = |0.5 - h_v'|$
where $i$ is the number of the hypercube containing the high-dimensional data object $v$, $j$ is the number of the pyramid containing it, $h_v'$ is the distance from $v$ to the base plane of that pyramid, and $h_v$ is therefore the distance from $v$ to the plane through the pyramid apex. After reduction, the one-dimensional values of the data objects in hypercube $i$ are confined to the interval $[i \cdot c, (i+1) \cdot c]$, where $c$ is a sufficiently large constant that ensures the data objects in each partition have index keys distinct from those of every other partition.
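The mapping in (4.2) can be sketched in a few lines. The patent does not state how the $2d$ pyramids are numbered, so the sketch assumes the usual convention of the pyramid technique: the dimension farthest from the center selects the pyramid, with numbers $0 \ldots d-1$ for the low side and $d \ldots 2d-1$ for the high side. The function name `pyramid_value` and the check `c >= 2 * d` are likewise assumptions of this sketch.

```python
def pyramid_value(point, region, c):
    """Map a point of the unit hypercube [0, 1]^d to y_v = i*c + (j + h_v)."""
    d = len(point)
    if d == 0 or any(not 0.0 <= x <= 1.0 for x in point):
        raise ValueError("point must lie in the unit hypercube")
    if c < 2 * d:
        raise ValueError("c must be at least 2*d so regions cannot overlap")
    # The dimension farthest from the centre 0.5 decides which pyramid holds the point.
    j_max = max(range(d), key=lambda k: abs(0.5 - point[k]))
    low_side = point[j_max] < 0.5
    j = j_max if low_side else j_max + d                        # pyramid number, 0 .. 2d-1
    h_base = point[j_max] if low_side else 1.0 - point[j_max]   # h_v': distance to the base
    h_v = abs(0.5 - h_base)                                     # distance to the apex plane
    return region * c + (j + h_v)


y_a = pyramid_value((0.25, 0.5), region=0, c=10)    # low side of dimension 0
y_b = pyramid_value((0.5, 0.875), region=2, c=10)   # high side of dimension 1
```

Because $j + h_v$ is always below $2d$, choosing $c \ge 2d$ keeps the values of region $i$ inside $[i \cdot c, (i+1) \cdot c)$, which is the separation property the text asks of $c$.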
As a further improvement of the invention, step (5) comprises the following sub-steps:
(5.1) Let the size of the Chord resource identifier space be $2^m$. An order-preserving function $h$ maps the one-dimensional values obtained by dimensionality reduction, in order, onto the interval $[0, 2^m)$. That is, for a high-dimensional data point $v$ in the attribute space, the i-Chord resource identifier of $v$ is:
$key_v = \mathrm{ichord}(v) = h(y_v) \in [0, 2^m)$
(5.2) Each node $N_i$ on the Chord ring stores and manages the high-dimensional data of one data aggregation region. If the identifier of node $N_i$ is $Nkey_i$, the resource identifiers of the data in that region lie in the interval $(Nkey_{i-1}, Nkey_i]$.
(5.3) Each node $N_i$ maintains a routing table, that is, a table of pointers to other nodes on the ring. The routing table has $m$ entries, and entry $k$ ($1 \le k \le m$) is the first node on the Chord ring whose identifier is equal to or greater than $(Nkey_i + 2^{k-1}) \bmod 2^m$, that is, $\mathrm{successor}((Nkey_i + 2^{k-1}) \bmod 2^m)$. When a node receives a request for the key $key$, it first checks whether its own identifier equals $key$ and, if so, answers directly. Otherwise it looks in its routing table for the entry with the largest identifier that does not pass $key$ and forwards the request to that node. This is repeated until the request reaches a node $N_k$ such that $key$ lies between $N_k$ and its successor $N_{k+1}$, at which point $N_k$ reports its successor $N_{k+1}$ as the answer to the request.
(5.4) When the high-dimensional data in a data aggregation region becomes so dense that its node can no longer manage it effectively, the region is split into two or more new regions, and idle nodes on the Chord ring are selected to share the load of the original node. When the data in adjacent data aggregation regions is so sparse that node resources are wasted, the regions are merged: one of the two responsible nodes is chosen as the node for the new region, and the other becomes an idle node.
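Sub-steps (5.1) and (5.3) can be sketched together. The patent does not define the order-preserving function $h$, so `ichord_key` below assumes a simple linear scaling, which is monotone but not strictly so (nearby values may share an identifier). The routing part follows the standard Chord lookup; the `Ring` class recomputes routing tables from a global node list purely for illustration, whereas real nodes would hold only their own table. All names are hypothetical.

```python
M = 6                # identifier bits (an illustrative value)
RING = 2 ** M        # size of the identifier space


def ichord_key(y, y_max):
    """Order-preserving map h: scale y in [0, y_max) linearly onto [0, 2**M)."""
    if not 0 <= y < y_max:
        raise ValueError("y must lie in [0, y_max)")
    return int(y / y_max * RING)


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)


def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


class Ring:
    def __init__(self, node_ids):
        self.nodes = sorted(node_ids)

    def successor(self, ident):
        """First node whose identifier is equal to or follows `ident`."""
        return next((n for n in self.nodes if n >= ident), self.nodes[0])

    def fingers(self, n):
        """Routing table of node n: entry k is successor((n + 2**(k-1)) mod 2**M)."""
        return [self.successor((n + 2 ** (k - 1)) % RING) for k in range(1, M + 1)]

    def lookup(self, start, key):
        """Route from `start` to the node responsible for `key`; return (node, hops)."""
        n, hops = start, 0
        while True:
            table = self.fingers(n)
            succ = table[0]
            if n == key:
                return n, hops
            if in_half_open(key, n, succ):
                return succ, hops
            # Forward to the farthest table entry that still precedes the key.
            n = next((f for f in reversed(table) if in_open(f, n, key)), succ)
            hops += 1


ring = Ring([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
owner, hops = ring.lookup(start=8, key=54)   # routed via nodes 42 and 51
```

With $R$ regions and the constant $c$ from step (4), `y_max` would be $R \cdot c$, so each region's block of one-dimensional values lands on a contiguous arc of the ring.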
As a further improvement of the invention, step (6) comprises the following sub-steps:
(6.1) Each node maintains a subscription table that records all subscriptions related to that node. When event information arrives at a node, the node's subscription table is queried; if the event information semantically matches a subscription in the table, the event is pushed to the user.
(6.2) When a subscriber submits a subscription request to some node $N_i$ on the ring, that node searches the global index tree to determine the proxy nodes semantically related to the request and routes the request to them. Each proxy node runs a similarity search based on the subscription semantics; if it finds event information with the same semantics as the subscription request, the event information is passed back to $N_i$. The proxy node also stores the subscription request, the subscription conditions, the routing path, and the search range in its subscription table.
(6.3) When a publisher publishes event information to some node $N_j$ on the ring, that node searches the global index tree to determine the host node semantically related to the event information, and the event information is routed to and stored on the host node using the Chord routing protocol. The host node then semantically matches the attributes of the event against the subscription requests in its subscription table and, if a match succeeds, pushes the event information directly to the subscriber.
(6.4) When a subscriber sends an unsubscribe message to some node on the ring, that node searches the global index tree to determine the proxy nodes semantically related to the subscription and routes the request to them; each proxy node finds the corresponding subscription in its subscription table and deletes it.
Compared with the prior art, the present invention has the following advantages:
The principle of the present invention is simple and is easy to implement and deploy. It builds a semantics-based publish/subscribe system on top of structured P2P technology, which improves the system's fault tolerance, its adaptability to dynamic change, and its information distribution efficiency. By mapping event information and subscription information into a multi-dimensional attribute space, the system can interpret the semantics of the information itself, which strengthens its expressive power and supports intelligent matching based on information semantics. In addition, by regularizing and simplifying the multi-dimensional attribute space and reducing the dimensionality of high-dimensional data, the invention eliminates the curse-of-dimensionality problem introduced by the multi-dimensional attribute space and efficiently achieves data organization and intelligent matching based on information semantics.
Description of the drawings
Fig. 1 is a schematic flow chart of the method of the present invention.
Fig. 2 is a schematic diagram of mapping high-dimensional data from the high-dimensional space to a one-dimensional space in the present invention.
Fig. 3 is a schematic diagram of the data lookup process and the subscription table in the present invention.
Fig. 4 is a schematic diagram of the subscription request handling process in the present invention.
Fig. 5 is a schematic diagram of processing range queries based on information semantics in the present invention.
Detailed description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 1, the semantic intelligent information publish/subscribe method based on P2P technology of the present invention comprises the following steps:
(1) Build the system topology and designate a super node: using structured P2P technology, the multiple event brokers of the P/S system are organized into a Chord ring, and one super node is designated on the ring to extract the attributes of the data resources in the information network and construct the attribute space. Data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.
(2) Normalize the high-dimensional attribute space: information networks today carry many kinds of information whose attributes are inconsistent with one another. To address this, the super node uses the vector space model (VSM) to represent each item of data in the network as a high-dimensional point vector, mapping all data in the network into one high-dimensional attribute space, which is expressed mathematically as a high-order matrix. To reduce noise and synonym interference during indexing, latent semantic indexing (LSI) removes attributes that have little relevance to information retrieval, so that an attribute subspace approximates the original high-dimensional attribute space.
(3) Partition the high-dimensional data and build and maintain a global index tree: the super node SN divides the high-dimensional data in the attribute space into separate data aggregation areas and assigns each area to a different node. The super node also maintains an index tree built from the information of all nodes, called the global index tree, which is used to assign event information to nodes on the Chord ring and to determine the proxy nodes that a subscription request must visit.
(4) Reduce the dimensionality of the high-dimensional data: each node regularizes the data aggregation area assigned to it into a multi-dimensional hypercube in the high-dimensional attribute space, then uses the pyramid dimensionality-reduction method to map the high-dimensional data objects to a one-dimensional data space, which is indexed with a B+ tree.
(5) Manage data objects with the i-Chord method: after the high-dimensional data set has been mapped to the one-dimensional data space, the Chord protocol is used to organize and maintain the network's data. Because the one-dimensional data space does not match the Chord resource identifier space, an order-preserving function maps data in the one-dimensional data space into the Chord resource identifier space. Each Chord node corresponds to one data aggregation area and stores and manages the high-dimensional data in that area, and each node maintains a routing table for fast lookup. When a node manages too much data, the system's load balance is maintained through operations such as node insertion and departure.
(6) Implement semantics-based intelligent information subscription and publication: each node in the system maintains a subscription table that records the subscriptions semantically related to that node. When a subscriber sends a subscription request to the system, the global index tree is first searched to determine the proxy nodes semantically related to the request. The request is sent to each proxy node, which adds a subscription record to its subscription table, registering the association between the request and the node. The proxy node then determines an exact search range in the high-dimensional attribute space from the subscription request and its conditions, searches that range for event information with the same semantics as the request, and returns it to the subscriber. When a publisher sends event information to the system, the global index tree is first searched to determine the host node semantically related to the event. The event information is sent to the host node, whose subscription table is consulted; if the event semantically matches a subscription in the table, the event is actively pushed to the user.
In this embodiment, step (1) specifically comprises:
(1.1) A P/S system may contain multiple event brokers. These are organized into an event broker network with a Chord ring structure, each event broker corresponding to one node of the network. Each event broker stores data resources of the information network according to a fixed rule (the semantic information of the events), serves a certain number of clients, and keeps information about some of the other nodes, so that subscriptions and event information reach semantically related nodes cheaply, efficiently, and reliably.
(1.2) Within the Chord ring, the most capable node (judged by computing power, storage space, online time, bandwidth, and similar factors) is selected as the super node SN. The super node SN periodically checks the capabilities of the other nodes in the ring and selects a candidate super node from among them. The candidate backs up the important information held by the SN so that it can take over as the new super node if the SN fails.
(1.3) To obtain the semantic information of data resources in the network, the super node SN extracts the attributes of data resources of different forms and constructs the multi-dimensional attribute space. Data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.
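The selection in step (1.2) can be sketched as follows. This is only a minimal illustration, assuming that each capability factor has been normalized to [0, 1] and that the factors are combined by a weighted sum. The `Node` fields, the weights, and the function names are hypothetical; the method itself only lists the factors to be considered and does not say how they are combined.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    node_id: int
    cpu: float        # each metric is assumed to be normalized to [0, 1]
    storage: float
    uptime: float
    bandwidth: float


# Hypothetical weights: the method does not prescribe how factors are combined.
WEIGHTS = {"cpu": 0.3, "storage": 0.2, "uptime": 0.3, "bandwidth": 0.2}


def capability(node: Node) -> float:
    """Weighted sum of the node's normalized capability metrics."""
    return sum(getattr(node, name) * weight for name, weight in WEIGHTS.items())


def elect_super_node(nodes):
    """Return (super_node, candidate): the strongest node and its backup."""
    if len(nodes) < 2:
        raise ValueError("need at least two nodes for a super node and a backup")
    # Highest capability first; ties are broken in favour of the smaller node_id.
    ranked = sorted(nodes, key=lambda n: (capability(n), -n.node_id), reverse=True)
    return ranked[0], ranked[1]
```

Re-running `elect_super_node` periodically corresponds to the SN's regular capability check; the runner-up plays the role of the candidate super node that holds the backup.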
In this embodiment, step (2) specifically comprises:
(2.1) Following the vector space model, each item of high-dimensional data in the network is described as an attribute vector. Attributes may be concepts, keywords, terms, or other elements related to the given item. The system then organizes all high-dimensional information in the network into a matrix: d items of high-dimensional data described by t attributes form a $t \times d$ matrix A, each column of which represents one item. The matrix element $a_{ij}$ is the value of attribute i in data object j and expresses the importance of attribute i in object j (for example, it can be computed from the frequency with which a term appears in a document). If item j does not have attribute i, $a_{ij}$ is 0.
(2.2) Most elements of the matrix are 0, which indicates that, for retrieval purposes, most attributes of a given item carry no useful information. To avoid unnecessary computation during semantic retrieval, a low-rank matrix is used to approximate the original matrix. Suppose the set of high-dimensional information in the network is represented by the matrix A with rank r. Using the singular value decomposition, A is decomposed into the product of three matrices:
$A = U \Sigma V^{T}$
where $U = (u_1, u_2, \ldots, u_r)$ is a $t \times r$ matrix, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is an $r \times r$ diagonal matrix, $V = (v_1, \ldots, v_r)$ is a $d \times r$ matrix, and the $\sigma_i$ are the singular values of A, with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$.
(2.3) To reduce computation, speed up retrieval, and avoid synonym interference during semantic matching, only the l largest singular values are kept and the others are discarded, so that the rank-r matrix A is approximated by the rank-l matrix $A_l$:
$A_l = U_l \Sigma_l V_l^{T}$
where $U_l = (u_1, u_2, \ldots, u_l)$, $\Sigma_l = \mathrm{diag}(\sigma_1, \ldots, \sigma_l)$, and $V_l = (v_1, \ldots, v_l)$. The rows of $V_l \Sigma_l$ are the semantic vectors of the high-dimensional information. Retrieval over the high-dimensional information can then be carried out within this low-rank semantic matrix.
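The matrix construction of step (2.1) can be illustrated with a small sketch. It assumes that the data objects are short texts, that attributes are whitespace-separated terms, and that attribute values are raw term frequencies, which is one of the weightings the text mentions. The function name `build_attribute_matrix` and the sample documents are hypothetical. The SVD truncation of steps (2.2) and (2.3) is not shown here; in practice it would normally be delegated to a numerical linear algebra library.

```python
from collections import Counter


def build_attribute_matrix(documents):
    """Return (attributes, A), where A is a t x d matrix stored as a list of rows.

    A[i][j] is the frequency of attribute i in data object j, and 0 when
    object j does not have attribute i.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    attributes = sorted(set().union(*counts)) if counts else []
    # Counter returns 0 for missing keys, which yields the zero entries.
    matrix = [[c[attr] for c in counts] for attr in attributes]
    return attributes, matrix


docs = ["chord ring routing", "semantic routing routing", "semantic index"]
attributes, A = build_attribute_matrix(docs)
for attr, row in zip(attributes, A):
    print(f"{attr:10s} {row}")
```

Each column of `A` is the attribute vector of one document, so the matrix here has t = 5 rows and d = 3 columns, and most entries are zero even in this tiny case.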
In this embodiment, step (3) specifically comprises:
(3.1) Data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space. On the principle that data resources lying close together in the attribute space have similar semantics, the super node SN clusters and partitions the high-dimensional data distributed in the multi-dimensional attribute space, dividing it into multiple mutually disjoint data aggregation areas and assigning each area to a different node.
(3.2) Besides its own routing table, the super node SN maintains an index tree built from the resource identifier ranges of all nodes on the ring, called the global index GI. The GI serves two purposes:
(3.2.1) Assigning event information to nodes quickly. Each node in the ring is responsible for one data aggregation area, and all data in that area are assigned to that node. When a new item of event information asks to join, the GI is queried to determine which data aggregation area the event belongs to, which identifies the event's host node. Then, with the host node's identifier as the search key, the Chord routing protocol delivers the event information to the host node.
(3.2.2) Determining the nodes that a subscription request must visit. When a subscription request is entered at a node, the request is first sent to the SN, which searches the GI to determine which nodes are responsible for data aggregation areas that intersect the semantic space of the request, and returns the identifiers of those nodes. The returned nodes, called proxy nodes, may hold results for the subscription. With the proxy nodes' identifiers as search keys, the subscription request is then routed to these nodes for further semantic matching.
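The two uses of the GI in steps (3.2.1) and (3.2.2) can be sketched as follows. This is a simplified illustration under several assumptions: each data aggregation area is represented by an axis-aligned bounding box in the attribute space, a subscription's semantic space is also a box, and the index is a flat list scanned linearly rather than a tree. The names `Region`, `GlobalIndex`, `host_node`, and `proxy_nodes` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Region:
    node_id: int                # identifier of the node responsible for the area
    low: Tuple[float, ...]      # lower corner of the area's bounding box
    high: Tuple[float, ...]     # upper corner of the area's bounding box


class GlobalIndex:
    """Flat stand-in for the global index GI (a real GI would be a tree)."""

    def __init__(self, regions: Sequence[Region]):
        self.regions = list(regions)

    def host_node(self, point: Sequence[float]) -> int:
        """Step (3.2.1): node whose area contains the event's attribute point.

        Boxes are treated as half-open, [low, high), so areas stay disjoint.
        """
        for r in self.regions:
            if all(lo <= x < hi for lo, x, hi in zip(r.low, point, r.high)):
                return r.node_id
        raise LookupError("point lies outside every data aggregation area")

    def proxy_nodes(self, q_low, q_high) -> List[int]:
        """Step (3.2.2): nodes whose areas intersect the subscription's range."""
        return [r.node_id for r in self.regions
                if all(lo < qh and ql < hi
                       for lo, hi, ql, qh in zip(r.low, r.high, q_low, q_high))]


gi = GlobalIndex([
    Region(8, (0.0, 0.0), (0.5, 1.0)),
    Region(20, (0.5, 0.0), (1.0, 0.5)),
    Region(35, (0.5, 0.5), (1.0, 1.0)),
])
print(gi.host_node((0.7, 0.2)))                 # host node for one event
print(gi.proxy_nodes((0.4, 0.4), (0.6, 0.6)))   # proxy nodes for a range
```

The returned node identifiers are what would then be used as search keys for Chord routing.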
In this embodiment, step (4) specifically comprises:
(4.1) Each data aggregation area is regularized into a d-dimensional hypercube in the high-dimensional data space, and the hypercube is normalized so that every side has length 1 and its centre is the point (0.5, 0.5, ..., 0.5). Then, taking the centre of the hypercube as the apex and each (d-1)-dimensional hyperplane bounding the data aggregation area as a base, each d-dimensional data aggregation area (hypercube) is divided into $2d$ pyramids.
(4.2) Each pyramid is divided into several slices parallel to its base, each slice corresponding to one B+ tree data page. The high-dimensional data in the data aggregation area are then mapped to one-dimensional values according to the distance from each data point to the pyramid base, and the reduced data are organized and managed with a B+ tree. The dimensionality-reduction formula is:
$y_v = i \cdot c + (j + h_v) = i \cdot c + (j + |0.5 - h_v'|)$
where, for a high-dimensional data object v, i is the cube number of the hypercube containing v (distinguishing the data aggregation areas), j is the pyramid number, $h_v'$ is the distance from v to the base of its pyramid, and $h_v = |0.5 - h_v'|$ is therefore the distance from v to the plane through the pyramid apex. After dimensionality reduction, the one-dimensional values of the data objects in each hypercube are confined to the interval $[i \cdot c, (i+1) \cdot c]$, where c is a sufficiently large constant, which ensures that the data objects in each data partition have index keys different from those of every other partition.
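The mapping of step (4.2) can be sketched as follows for a point that has already been normalized into the unit hypercube. The text does not say how the pyramid number j is assigned, so this sketch assumes the numbering of the standard Pyramid technique: the point belongs to the pyramid along the dimension in which it is farthest from the centre, with pyramids 0 to d-1 based on the faces $x_k = 0$ and pyramids d to 2d-1 based on the faces $x_k = 1$. The function name `pyramid_key` and the check `c >= 2d` are likewise assumptions made for this illustration.

```python
def pyramid_key(point, cube_number, c):
    """Map a point of the unit hypercube [0, 1]^d to y = i*c + (j + h).

    i is the cube number, j the pyramid number, and h the distance from the
    point to the plane through the cube centre parallel to the pyramid base.
    """
    d = len(point)
    if c < 2 * d:
        # j + h is at most 2d - 0.5, so c >= 2d keeps the cubes' key ranges apart.
        raise ValueError("c is too small: keys of different cubes could overlap")
    # Dimension along which the point is farthest from the centre (0.5, ..., 0.5).
    j_max = max(range(d), key=lambda k: abs(0.5 - point[k]))
    h = abs(0.5 - point[j_max])
    # Assumed numbering: low-side pyramids are 0..d-1, high-side ones d..2d-1.
    j = j_max if point[j_max] < 0.5 else j_max + d
    return cube_number * c + (j + h)


# A 2-D point in cube 3 with c = 10: farthest from the centre along dimension 0,
# on the low side, so j = 0 and h is about 0.3.
print(pyramid_key((0.2, 0.6), cube_number=3, c=10))
```

Because j + h never reaches 2d, every key produced for cube i falls in the range starting at i*c and ending before (i+1)*c, which is the separation property the text relies on.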
In this embodiment, step (5) specifically comprises:
(5.1) Suppose the Chord resource identifier space has size $2^m$. To manage the reduced one-dimensional data with Chord resource identifiers, an order-preserving function h maps the one-dimensional values, in order, into the interval $[0, 2^m)$. That is, for a high-dimensional data point v in the attribute space (v belonging to the hypercube with cube number i), the i-Chord resource identifier of v is:
$key_v = \mathrm{ichord}(v) = h(y_v) = h(i \cdot c + (j + h_v)) \in [0, 2^m)$
(5.2) Each node $N_i$ on the Chord ring stores and manages the high-dimensional data of one data aggregation area. If the node identifier of $N_i$ is $Nkey_i$, the resource identifiers of the high-dimensional data in that area lie in the interval $(Nkey_{i-1}, Nkey_i]$.
(5.3) Each node $N_i$ maintains a routing table, the finger table, whose entries point to other nodes on the ring. The table has m entries (m being the number of identifier bits). The k-th entry ($1 \le k \le m$) is the first node on the Chord ring whose identifier is equal to or greater than $(Nkey_i + 2^{k-1}) \bmod 2^m$, that is, $\mathrm{successor}((Nkey_i + 2^{k-1}) \bmod 2^m)$. When a node receives a request for a key, it first checks whether its own identifier equals the key and, if so, answers directly. Otherwise it searches its routing table for the node with the largest identifier that does not exceed the key and forwards the request to that node. This is repeated until the request reaches a node $N_k$ such that the key lies between $N_k$ and its successor $N_{k+1}$, at which point $N_k$ reports its successor $N_{k+1}$ as the answer to the request.
(5.4) The range of resource identifiers a node is responsible for is limited, so a node cannot maintain an unbounded amount of high-dimensional data. When the data in a data aggregation area become so dense that the node can no longer manage them effectively, the area is split into two or more new areas and idle nodes on the Chord ring are selected to share the original node's load. When the data in neighbouring data aggregation areas are so sparse that node resources are wasted, the areas are merged: one of the two responsible nodes is chosen as the node for the new area and the other is marked idle. These operations keep the system's load balanced.
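Steps (5.1) to (5.3) can be sketched together as follows. This is an illustration under stated assumptions, not the method's actual implementation: the order-preserving function h is taken to be a linear scaling of the one-dimensional value, the ring is simulated in one process with global knowledge of all node identifiers, and the node identifiers and the names `ichord_key`, `ChordRing`, and `lookup` are hypothetical.

```python
import bisect

M = 6                 # number of identifier bits (assumed for this example)
RING = 2 ** M         # size of the identifier space


def ichord_key(y, y_max):
    """Assumed order-preserving map h: scale y in [0, y_max) into [0, 2**M)."""
    if not 0 <= y < y_max:
        raise ValueError("y must lie in [0, y_max)")
    return min(int(y / y_max * RING), RING - 1)


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return a < x <= b if a < b else x > a or x <= b


def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    return a < x < b if a < b else x > a or x < b


class ChordRing:
    """Simulated ring; global knowledge is used only to build the finger tables."""

    def __init__(self, node_ids):
        self.nodes = sorted({n % RING for n in node_ids})
        if not self.nodes:
            raise ValueError("the ring needs at least one node")
        # Entry k (1 <= k <= M) is successor((n + 2**(k-1)) mod 2**M).
        self.fingers = {n: [self.successor((n + 2 ** (k - 1)) % RING)
                            for k in range(1, M + 1)] for n in self.nodes}

    def successor(self, ident):
        """First node whose identifier is equal to or greater than ident."""
        idx = bisect.bisect_left(self.nodes, ident % RING)
        return self.nodes[idx % len(self.nodes)]

    def lookup(self, start, key):
        """Route from node `start`; return (responsible node, nodes visited)."""
        key %= RING
        node, path = start, [start]
        while node != key:
            succ = self.fingers[node][0]
            if in_half_open(key, node, succ):
                return succ, path
            # Forward to the farthest finger that still precedes the key.
            node = next(f for f in reversed(self.fingers[node])
                        if in_open(f, node, key))
            path.append(node)
        return node, path


ring = ChordRing([1, 8, 14, 21, 32, 38, 42, 48, 51, 56])
print(ring.lookup(8, 54))                          # (56, [8, 42, 51])
print(ring.successor(ichord_key(30.3, 100.0)))     # node storing that object
```

A linear scaling is only one possible choice of h: it preserves order but may map nearby values to the same identifier, so a real deployment would choose h with the data distribution in mind.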
In this embodiment, step (6) specifically comprises:
(6.1) To provide semantics-based publication and subscription, each node maintains a subscription table that records all subscription information related to that node. When event information arrives at a node, the node queries its subscription table; if the event information semantically matches a subscription in the table, the event is actively pushed to the user.
(6.2) When a subscriber submits a subscription request to a node $N_i$ on the ring, that node searches the global index tree to determine the proxy nodes semantically related to the request and routes the request to those proxy nodes, where a similarity search algorithm based on the subscription semantics (such as a point query, range query, k-nearest-neighbour query, or nearest-neighbour query) is executed. If event information with the same semantics as the subscription request is found, it is passed back to $N_i$. At the same time, the proxy node stores the subscription request, subscription conditions, routing path, search range, and related information in its subscription table.
(6.3) When a publisher publishes event information to a node $N_j$ on the ring, that node searches the global index tree to determine the host node semantically related to the event, and the event information is routed to the host node by the Chord routing protocol and stored there. At the same time, the host node semantically matches the event's attribute information against the subscription requests in its subscription table and, if a match succeeds, pushes the event information directly to the subscriber.
(6.4) When a subscriber sends an unsubscribe message to a node on the ring, that node searches the global index tree to determine the proxy nodes semantically related to the subscription and routes the message to them; each proxy node finds the corresponding subscription in its subscription table and deletes it.
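The behaviour of a single node in steps (6.1) to (6.4) can be sketched as follows. The sketch assumes that a subscription is a range query, one of the query types named above, so that "semantic matching" reduces to checking whether an event's attribute vector lies inside the subscribed range. Routing through the global index tree and the Chord ring is left out, and the names `Subscription` and `BrokerNode` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence, Tuple


@dataclass
class Subscription:
    subscriber: str
    low: Tuple[float, ...]     # lower bound of the subscribed range, per attribute
    high: Tuple[float, ...]    # upper bound of the subscribed range, per attribute

    def matches(self, event: Sequence[float]) -> bool:
        """Assumed matching rule: the event lies inside the subscribed range."""
        return all(lo <= x <= hi for lo, x, hi in zip(self.low, event, self.high))


@dataclass
class BrokerNode:
    events: List[Tuple[float, ...]] = field(default_factory=list)
    table: Dict[int, Subscription] = field(default_factory=dict)
    next_id: int = 0

    def subscribe(self, sub: Subscription):
        """(6.2): register the subscription; return its id and stored matches."""
        sub_id = self.next_id
        self.next_id += 1
        self.table[sub_id] = sub
        return sub_id, [e for e in self.events if sub.matches(e)]

    def publish(self, event: Sequence[float]) -> List[str]:
        """(6.1)/(6.3): store the event; return the subscribers to push it to."""
        self.events.append(tuple(event))
        return [s.subscriber for s in self.table.values() if s.matches(event)]

    def unsubscribe(self, sub_id: int) -> bool:
        """(6.4): delete the subscription; False if it was not registered."""
        return self.table.pop(sub_id, None) is not None


node = BrokerNode()
node.publish((0.2, 0.3))                      # stored; nobody is subscribed yet
sid, hits = node.subscribe(Subscription("alice", (0.0, 0.0), (0.5, 0.5)))
print(hits)                                   # events already stored that match
print(node.publish((0.4, 0.1)))               # pushed to the matching subscriber
```

Keeping the subscription in the table after the initial search is what lets later events be pushed without the subscriber asking again.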
The present invention is described in detail below through a specific application example, whose implementation steps are:
1) Collect network data: the network now holds a vast number of digital resources in many forms (text, images, music, video, and so on), and the system cannot directly understand their semantic information. The digital resources in the network are therefore first collected and put into a formal representation so that machines can recognize and further process them.
2) Build the system topology and designate a super node: the event brokers are organized into an event broker network with a Chord ring structure, each event broker corresponding to one node of the network. Each event broker stores data resources of the information network according to a fixed rule (the semantic information of the events), serves a certain number of clients, and keeps information about some of the other nodes, so that subscriptions and event information reach semantically related nodes cheaply, efficiently, and reliably. The most capable node in the system (judged by computing power, online time, storage space, bandwidth, and similar factors) is designated as the super node SN, and a backup node is chosen for it so that the system keeps working if the super node fails. To obtain the semantic information of data resources in the network, the super node SN extracts the attributes of data resources of different forms to obtain the multi-dimensional attribute space, and the data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.
3) Normalize the high-dimensional attribute space: following the vector space model (VSM), the attribute information of each data object is extracted and the object is described as a high-dimensional attribute vector, so that the network's digital resources are mapped into the high-dimensional attribute space and the set of network data resources can be represented as a matrix. Each element of a vector expresses the importance of one attribute in that data object; for example, a set of d data objects described by t attributes forms a $t \times d$ matrix A.
During retrieval, the network data resources represented in the high-dimensional attribute space have many attributes that are irrelevant to the query (which shows up as many zero elements in the matrix). To speed up queries and avoid interference from synonyms and noise, the matrix is simplified by approximating it with a low-rank matrix. Suppose the set of data resources in the network is represented by the matrix A with rank r. Using the singular value decomposition, A is decomposed into the product of three matrices, $A = U \Sigma V^{T}$, where $U = (u_1, u_2, \ldots, u_r)$ is a $t \times r$ matrix, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is an $r \times r$ diagonal matrix, $V = (v_1, \ldots, v_r)$ is a $d \times r$ matrix, and the $\sigma_i$ are the singular values of A, with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$. Only the l largest singular values are then kept and the others are discarded, so that the rank-r matrix A is approximated by the rank-l matrix $A_l = U_l \Sigma_l V_l^{T}$, where $U_l = (u_1, u_2, \ldots, u_l)$, $\Sigma_l = \mathrm{diag}(\sigma_1, \ldots, \sigma_l)$, and $V_l = (v_1, \ldots, v_l)$. The rows of $V_l \Sigma_l$ are the semantic vectors of the high-dimensional information, so each network resource can be represented as a vector in an l-dimensional attribute space.
4) Cluster and partition the data and build the global index tree: on the principle that data objects lying close together in the attribute space are more likely to have similar semantics, the data objects in the attribute space are clustered and partitioned, dividing the high-dimensional data into multiple mutually disjoint data aggregation areas, with the data in each area kept as uniformly distributed as possible.
In addition, the super node maintains the global index tree GI, which holds the range of the data aggregation area belonging to each node. When a new data object joins the system, the GI uses the object's coordinates to determine quickly which node it belongs to, so that the object can be assigned to that node without delay. When a new subscription request arrives, the semantic space range of the request is determined, the GI is queried to identify the data aggregation areas that intersect it, and the request is sent to the corresponding nodes for a further exact search, which speeds up the query. The global index GI should also be refreshed periodically to avoid being affected by changes to the system's nodes.
5) Reduce the dimensionality of the high-dimensional data: to eliminate the effect of the "curse of dimensionality" on retrieval in the high-dimensional attribute space, the high-dimensional data are reduced in dimension, as shown in Fig. 2.
The data aggregation areas produced by partitioning may be irregular in shape, so each is regularized into a high-dimensional hypercube. Each hypercube is assigned a cube number i and normalized so that its side length is 1. Then, taking the centre of the hypercube as the apex and each (d-1)-dimensional hyperplane bounding the data aggregation area as a base, each d-dimensional data aggregation area (hypercube) is divided into $2d$ pyramids, and each pyramid is assigned a pyramid number j.
Each pyramid is divided into several slices parallel to its base, each slice corresponding to one B+ tree data page. The high-dimensional data in the data aggregation area are then mapped to one-dimensional values according to the distance from each data point to the pyramid base, as shown in Fig. 2, and the reduced data are organized and managed with a B+ tree. The dimensionality-reduction formula is $y_v = i \cdot c + (j + h_v) = i \cdot c + (j + |0.5 - h_v'|)$. After dimensionality reduction, the one-dimensional values of the data objects in each hypercube are confined to the interval $[i \cdot c, (i+1) \cdot c]$, where c is a sufficiently large constant, which ensures that the data objects in each data partition have index keys different from those of every other partition.
6)用i-Chord方法管理数据对象:利用Chord方法存储管理网络数据信息,在信息检索中实现了信息检索分布式处理,提高了系统的效率。首先依据第5)部生成的一维数据构造数据对象的资源标识符,假设Chord系统资源标识符区间大小为2m,利用一个保序函数h将一维数据值按照顺序映射到区间[0,2m)中,映射公式为:keyv=ichord(v)=h(yv)=h(i*c+(j+hv))∈[0,2m)。6) Manage data objects with the i-Chord method: use the Chord method to store and manage network data information, realize distributed processing of information retrieval in information retrieval, and improve system efficiency. First, construct the resource identifier of the data object based on the one-dimensional data generated in Part 5. Assume that the resource identifier interval size of the Chord system is 2 m , and use an order-preserving function h to map the one-dimensional data values to the interval [0, 2 m ), the mapping formula is: key v =ichord(v)=h(y v )=h(i*c+(j+h v ))∈[0,2 m ).
To simplify data queries, each node on the Chord ring stores and manages the data objects of one data aggregation area, and the data corresponding to a segment of resource identifiers on the ring is stored on the successor node of that segment. For example, if node N_i has node identifier Nkey_i, then the high-dimensional data objects whose resource identifiers fall in the interval (Nkey_{i-1}, Nkey_i] are stored on node N_i. Each node maintains a routing table, and the system uses the routing tables to locate the data object corresponding to a given resource identifier (Figure 3 shows the lookup of key=28 starting from node N4). To keep the system load balanced, when the data objects in a data aggregation area become too dense or too sparse, the data aggregation area is split or merged and the corresponding node operations are executed, which keeps system performance stable and resource usage optimized.
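The storage rule, under which a key in (Nkey_{i-1}, Nkey_i] is stored on node N_i, can be sketched as a successor lookup over sorted node identifiers. This sketch ignores routing tables and network hops and only shows which node owns a key. The ring used in the usage note is hypothetical and is not the ring of Figure 3.

```python
import bisect

def successor_node(node_keys, key):
    """Return the identifier of the node responsible for `key`.

    node_keys: sorted list of node identifiers on the ring.
    The owner is the first node whose identifier is >= key; keys larger
    than every identifier wrap around to the first node.
    """
    pos = bisect.bisect_left(node_keys, key)
    return node_keys[pos % len(node_keys)]
```

On a hypothetical ring with node identifiers 4, 12, 20, 30 and 45, key 28 belongs to node 30, and key 50 wraps around to node 4.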
7) Semantics-based publishing and subscribing: the processing of a subscription request is shown in Figure 4. Each node maintains a subscription table (as shown in Figure 3). When a user sends a subscription request to a node, the system checks the subscription table. If the table contains no identical request, the system searches the global index tree to determine the agent nodes that are semantically related to the request, routes the request to those agent nodes, and runs the similarity search algorithm on them. The event information that semantically matches the request is returned to the user, and the subscription request, the search results, and related information are added to the subscription tables of the agent nodes. If the subscription table already contains an identical request, the result is looked up directly in the subscription table and returned, and the table is updated. When new event information enters the system, it is stored on the host node determined by its resource identifier, and the subscription table of the host node is checked for subscriptions that semantically match it. If such subscriptions exist, the data object is actively pushed to the corresponding users. When a user sends an unsubscribe message, the system looks up the subscription table and deletes the corresponding subscription information.
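The subscribe, publish, and unsubscribe flow on a single node can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the class name `SubscriptionTable` is hypothetical, the global index tree and routing to agent nodes are collapsed into a caller-supplied `search` function, and semantic matching is reduced to a caller-supplied `matches` predicate.

```python
class SubscriptionTable:
    """Per-node subscription table (hypothetical, single-node sketch)."""

    def __init__(self, search, matches):
        self.search = search    # assumption: request -> list of events
        self.matches = matches  # assumption: (request, event) -> bool
        self.entries = {}       # request -> {"users": set, "results": list}

    def subscribe(self, user, request):
        entry = self.entries.get(request)
        if entry is None:
            # No identical request yet: run the similarity search once
            # and record the request together with its results.
            entry = {"users": set(), "results": list(self.search(request))}
            self.entries[request] = entry
        entry["users"].add(user)
        return list(entry["results"])

    def publish(self, event):
        """Store a new event and return the (user, event) pushes it triggers."""
        pushes = []
        for request, entry in self.entries.items():
            if self.matches(request, event):
                entry["results"].append(event)
                pushes.extend((user, event) for user in sorted(entry["users"]))
        return pushes

    def unsubscribe(self, user, request):
        entry = self.entries.get(request)
        if entry is None:
            return
        entry["users"].discard(user)
        if not entry["users"]:
            del self.entries[request]
```

A second user who subscribes with an identical request is answered from the table without a new search, which mirrors the fast path described above.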
Similarity search methods fall into four categories: point query (Point Query), which finds in the data space S the target object p identical to a given query point q; range query (Range Query), which, for a given query point q and threshold r, finds in the data space S all target objects p satisfying d(p,q)≤r; nearest neighbor query (Nearest Neighbor Query), which finds in the data space S the target object p closest to a given query point q; and k-nearest neighbor query (KNN Query), which finds in the data space S the k target objects p closest to a given query point q.
Take the range query as an example, as shown in Figure 5 (illustrated in two-dimensional space). First, the query range (the circular area in the figure) is derived from the coordinates of the query point and the query radius, and it is checked whether this range intersects any data aggregation area in the data space. If a data aggregation area intersects the query range, the two boundary index keys y_low and y_high are determined for each pyramid of that area that intersects the query range. A range search over these two index keys is performed in the B+ tree to obtain a candidate point set (the points in the shaded area of the figure). Each candidate point is then checked exactly against the query range, and the result is output.
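The filter-and-refine procedure for a range query can be sketched in Python for a single normalized hypercube. Several simplifications are assumptions of this sketch: a sorted list with `bisect` stands in for the B+ tree, the pyramid numbering follows the standard Pyramid technique, and y_low and y_high are computed as conservative bounds from the query ball's extent along each pyramid's axis rather than as the tightest possible bounds. All function names are hypothetical.

```python
import bisect
import math

def pyramid_key(v, cube_index, c):
    """One-dimensional key i*c + (j + h) of point v (standard Pyramid rule)."""
    d = len(v)
    offsets = [abs(0.5 - x) for x in v]
    j_max = max(range(d), key=lambda k: offsets[k])
    j = j_max if v[j_max] < 0.5 else j_max + d
    return cube_index * c + (j + offsets[j_max])

def range_query(points, q, r, cube_index=0, eps=1e-9):
    """Return all points p with d(p, q) <= r (points and q are tuples)."""
    d = len(q)
    c = 2 * d + 1  # large enough to separate the key ranges of cubes
    # Sorted (key, point) pairs stand in for the B+ tree.
    index = sorted((pyramid_key(p, cube_index, c), p) for p in points)
    keys = [k for k, _ in index]
    result = []
    for j in range(2 * d):
        k = j % d
        # Heights reachable inside the query ball along dimension k.
        if j < d:
            lo, hi = 0.5 - (q[k] + r), 0.5 - (q[k] - r)
        else:
            lo, hi = (q[k] - r) - 0.5, (q[k] + r) - 0.5
        lo, hi = max(lo, 0.0), min(hi, 0.5)
        if lo > hi:
            continue  # the query range does not reach this pyramid
        y_low = cube_index * c + j + lo - eps
        y_high = cube_index * c + j + hi + eps
        start = bisect.bisect_left(keys, y_low)
        stop = bisect.bisect_right(keys, y_high)
        # Refinement: exact distance check on the candidate set.
        result.extend(p for _, p in index[start:stop] if math.dist(p, q) <= r)
    return result
```

Because the bounds are conservative, the candidate set may contain points outside the query ball; the final distance check removes them, so the result is exact.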
The above is only a preferred embodiment of the present invention. The protection scope of the present invention is not limited to the above embodiment, and all technical solutions within the idea of the present invention fall within its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the present invention shall also be regarded as within the protection scope of the present invention.
Claims (7)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310302187.6A CN103412883B (en) | 2013-07-17 | 2013-07-17 | Semantic intelligent information distribution subscription method based on P2P technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310302187.6A CN103412883B (en) | 2013-07-17 | 2013-07-17 | Semantic intelligent information distribution subscription method based on P2P technology |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103412883A (en) | 2013-11-27 |
CN103412883B (en) | 2016-09-28 |
Family
ID=49605895
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310302187.6A Active CN103412883B (en) | 2013-07-17 | 2013-07-17 | Semantic intelligent information distribution subscription method based on P2P technology |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103412883B (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1625119A (en) * | 2004-12-09 | 2005-06-08 | 中国科学院软件研究所 | Routing method of pub/sub system built on structured P2P network |
CN102547471A (en) * | 2010-12-08 | 2012-07-04 | 中国科学院声学研究所 | Method and system for obtaining candidate cooperation node in P2P streaming media system |
US20120166556A1 (en) * | 2010-12-23 | 2012-06-28 | Electronics And Telecommunications Research Institute | Method, device and system for real-time publish subscribe discovery based on distributed hash table |
Non-Patent Citations (1)
Title |
---|
沈燕玉等: "基于结构化P2P的发布订阅系统", 《计算机系统应用》, vol. 21, no. 2, 15 February 2012 (2012-02-15), pages 130 - 134 * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106326295B (en) * | 2015-07-01 | 2021-12-14 | 中兴通讯股份有限公司 | Semantic data storage method and device |
CN106326295A (en) * | 2015-07-01 | 2017-01-11 | 中兴通讯股份有限公司 | Method and device for storing semantic data |
CN107404512B (en) * | 2016-05-19 | 2021-03-05 | 华为技术有限公司 | Resource subscription method, resource subscription device and resource subscription system |
US10637794B2 (en) | 2016-05-19 | 2020-04-28 | Huawei Technologies Co., Ltd. | Resource subscription method, resource subscription apparatus, and resource subscription system |
CN107404512A (en) * | 2016-05-19 | 2017-11-28 | 华为技术有限公司 | Resource subscription method, resource subscription device and resource subscription system |
CN106060154B (en) * | 2016-06-30 | 2019-04-19 | 江苏省现代企业信息化应用支撑软件工程技术研发中心 | Subscription publication matching process and device based on topic model |
CN106060154A (en) * | 2016-06-30 | 2016-10-26 | 江苏省现代企业信息化应用支撑软件工程技术研发中心 | Subscribing-publishing matching method and device based on topic model |
CN109558410A (en) * | 2018-12-14 | 2019-04-02 | 北京邮电大学 | Event matches algorithm based on multi-dimensional content in a kind of information distribution system |
CN115037624A (en) * | 2021-03-06 | 2022-09-09 | 瞻博网络公司 | Global network state management |
CN112765207A (en) * | 2021-04-07 | 2021-05-07 | 中国人民解放军国防科技大学 | Resource big data representation, storage and query method |
CN112765207B (en) * | 2021-04-07 | 2021-06-18 | 中国人民解放军国防科技大学 | Resource big data processing, storing and inquiring method |
CN114844948A (en) * | 2021-12-14 | 2022-08-02 | 合肥哈工轩辕智能科技有限公司 | Client cache optimization method and device of real-time distribution system |
CN114844948B (en) * | 2021-12-14 | 2024-05-31 | 合肥哈工轩辕智能科技有限公司 | Client cache optimization method and device of real-time distribution system |
Also Published As
Publication number | Publication date |
---|---|
CN103412883B (en) | 2016-09-28 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant |