CN103412883A

CN103412883A - Semantic intelligent information publishing and subscribing method based on P2P technology

Info

Publication number: CN103412883A
Application number: CN2013103021876A
Authority: CN
Inventors: 王小峰; 吴纯青; 任沛阁; 胡晓峰; 黄杰; 虞万荣; 彭伟; 陶静; 孙浩
Original assignee: National University of Defense Technology
Current assignee: National University of Defense Technology
Priority date: 2013-07-17
Filing date: 2013-07-17
Publication date: 2013-11-27
Anticipated expiration: 2033-07-17
Also published as: CN103412883B

Abstract

A semantic intelligent information publishing and subscribing method based on P2P technology, comprising the steps of: (1) constructing the system topology and setting up a super node; (2) normalizing the high-dimensional attribute space; (3) partitioning the high-dimensional data and establishing and maintaining a global index tree; (4) reducing the dimensionality of the high-dimensional data: each node organizes the data aggregation area assigned to it into a multi-dimensional hypercube in the high-dimensional attribute space, then uses the pyramid dimensionality reduction method to map the high-dimensional data objects to a one-dimensional data space, and indexes them with a B+ tree; (5) managing data objects with the i-Chord method; (6) implementing semantic-based intelligent information subscription and publication. The invention is simple in principle, easy to implement and popularize, and improves the fault tolerance, dynamicity and information distribution efficiency of the system.

Description

Semantic Intelligent Information Publishing and Subscribing Method Based on P2P Technology

Technical field

The invention relates to the field of semantic-based intelligent information interaction methods in large-scale information networks, and in particular to a semantic intelligent information publishing and subscribing method based on P2P technology.

Background art

With the rapid development and wide application of computer network technology, digital resources in networks are growing exponentially and taking increasingly diverse forms, and users' demands on information acquisition keep rising. How to efficiently obtain the information a user is "interested in" from massive and complex network resources has become a problem of growing concern.

Network environments are characterized by large scale, decentralized control, loose coupling, autonomy and dynamics, and researchers have proposed publish/subscribe (hereinafter P/S) technology in response. A P/S system consists of three parts: publishers, subscribers and event agents. A publisher is an object that generates events, i.e. an information producer; a subscriber is an object that consumes events, i.e. an information consumer; the event agent is the publish/subscribe middleware. Publishers publish information to the event agent in the form of "events", subscribers register the events they are interested in with the event agent, and the event agent routes published events to the interested subscribers in a timely and reliable manner. P/S is an asynchronous communication paradigm that supports both one-to-many and many-to-many information interaction. It fully decouples the participants in time, space and control flow and also offers features such as anonymous communication, so it meets the loose-communication needs of large-scale distributed network systems well.

Research on publish/subscribe technology is still at a developing stage: problems remain in reliability and information distribution efficiency, and many key technologies are yet to be solved. In terms of topology, for example, existing P/S middleware is usually designed in a centralized or an unstructured P2P form. A centralized topology relies on a single server to mediate between publishers and subscribers (e.g. SIENA from the University of Colorado, Gryphon from IBM Research, and JEDI), but it easily becomes a performance bottleneck, and if the server fails the whole system stops working. Unstructured P2P topologies (e.g. Hermes, proposed at the University of Cambridge) typically route information by flooding, gossiping or random walks, but because they lack structure and their nodes are dynamic, event routing is hard to maintain and the system scales poorly.

On the other hand, information resources in an open network environment take different forms, and two kinds of heterogeneity are common: structural heterogeneity (different users represent the same event with different structures, e.g. some events are in Map format and others in XML format) and semantic heterogeneity (different users use different words or terms for the same event, or the same word for different concepts). Existing P/S systems (e.g. CORBA, Scribe, Bayeux, JEDI) fall well short in expressiveness: they describe events by their structural information and lack an understanding of the semantics of the events themselves. Matching between events and subscriptions is exact matching, which is easily disturbed by synonyms and near-synonyms, may return many erroneous results that deviate from the user's meaning, and cannot achieve intelligent matching based on information semantics.

To strengthen the system's semantic expressiveness and achieve intelligent matching based on information semantics, the diverse digital resources in an information network can be abstracted as points or feature vectors in a high-dimensional attribute space, and the semantic similarity between data objects can be measured by the distance between the points or the angle between the feature vectors. The high-dimensional attribute space, however, brings with it the "curse of dimensionality": data in such a space is sparsely distributed and tends to lie near the surface of the space, which makes semantic similarity search expensive and inefficient. Dimensionality reduction maps data objects from a hard-to-manage high-dimensional space to a low-dimensional space, effectively shrinking the search space and improving retrieval efficiency, and is one of the effective means of dealing with the curse of dimensionality. For example, the technical scheme described in the Chinese patent application titled "An Embedded Dimensionality Reduction Method Based on Image Data Structure Protection" divides the vectors of the original multi-dimensional data set into similar and non-similar subsets according to the pairwise distances between vectors, applies different embedding operations to the different subsets to transform the distances, and then projects the new distance matrix to reduce the dimensionality. However, data objects in real information networks are complex in composition and varied in type, and their forms and semantic attributes keep changing, so it is hard to abstract them all into vectors of fixed dimension and fixed type. Moreover, many kinds of attributes can be defined for digital resources in a high-dimensional attribute space, but many of them are irrelevant to a given search (for example, concepts from medicine do not appear in computer science). It is therefore necessary to map the various data objects into an attribute space of fixed structure and to remove attributes that are irrelevant to the search, which reduces the computation in semantic similarity search and further improves search efficiency.
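As an illustration of the two similarity measures mentioned above (distance between points and angle between feature vectors), the following minimal Python sketch computes both for small attribute vectors. The vectors and the convention for all-zero vectors are assumptions made for the example and are not taken from the invention.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two attribute vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_similarity(a, b):
    """Cosine of the angle between two attribute vectors (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

# Hypothetical three-attribute vectors for three resources.
doc_a = [1.0, 2.0, 0.0]
doc_b = [2.0, 4.0, 0.0]  # same direction as doc_a, so the angle is zero
doc_c = [0.0, 0.0, 3.0]  # shares no attribute with doc_a, so the vectors are orthogonal

print(cosine_similarity(doc_a, doc_b), cosine_similarity(doc_a, doc_c))
print(euclidean_distance(doc_a, doc_b))
```

The angle-based measure ignores vector length, so `doc_a` and `doc_b` count as fully similar even though their Euclidean distance is non-zero; which measure suits a deployment depends on how attribute values are weighted.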

In summary, existing publish/subscribe systems have certain deficiencies in dynamics, fault tolerance and self-organization. They are also deficient in expressiveness, lack an understanding of the semantics of the events themselves, and cannot support semantic-based intelligent information interaction between users.

Summary of the invention

The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, the present invention provides a semantic intelligent information publishing and subscribing method based on P2P technology that is simple in principle, easy to implement and popularize, and improves the fault tolerance, dynamicity and information distribution efficiency of the system.

To solve the above technical problem, the present invention adopts the following technical solution:

A semantic intelligent information publishing and subscribing method based on P2P technology, comprising the following steps:

(1) Constructing the system topology and setting up a super node: structured P2P technology is used to organize the multiple event agents of the P/S system into a Chord ring, and a super node is designated on the ring to extract the attributes of the data resources in the information network and construct the attribute space; the data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space;

(2) Normalizing the high-dimensional attribute space: on the super node, the vector space model is used to represent the data in the network as high-dimensional point vectors, so that all data in the network is mapped into one high-dimensional attribute space, represented mathematically as a high-order matrix; latent semantic indexing is used to remove the attributes that have little relevance to information retrieval, and an attribute subspace approximately replaces the original high-dimensional attribute space;

(3) Partitioning the high-dimensional data and establishing and maintaining a global index tree: the super node SN divides the high-dimensional data in the attribute space into different data aggregation areas and assigns each data aggregation area to a different node; the super node also maintains an index tree built from the information of all nodes, called the global index tree, which is used to assign event information to the nodes in the Chord ring and to determine the proxy nodes that a subscription request needs to visit;

(4) Reducing the dimensionality of the high-dimensional data: each node organizes the data aggregation area assigned to it into a multi-dimensional hypercube in the high-dimensional attribute space, then uses the pyramid dimensionality reduction method to map the high-dimensional data objects to a one-dimensional data space, and indexes them with a B+ tree;

(5) Managing data objects with the i-Chord method: after the high-dimensional data set has been mapped to the one-dimensional data space, the Chord protocol is used to organize and maintain the network data; an order-preserving function maps the data in the one-dimensional data space to the Chord resource identifier space; each Chord node corresponds to one data aggregation area and stores and manages the high-dimensional data in that area, and each node maintains a routing table for fast lookup;

(6) Implementing semantic-based intelligent information subscription and publication: each node in the system maintains a subscription table that records the subscriptions semantically related to that node. When a subscriber sends a subscription request to the system, the global index tree is first searched to determine the proxy nodes semantically related to the request, and the request is sent to those proxy nodes; a subscription record is added to the subscription table of each proxy node, registering the association between the subscription request and the node; the proxy node then determines the exact search range in the high-dimensional attribute space from the subscription request and the subscription conditions, searches that range for event information with the same semantics as the request, and returns it to the subscriber. When a publisher sends event information to the system, the global index tree is first searched to determine the host node semantically related to the event, the event information is sent to the host node, and the host node's subscription table is consulted; if the event information semantically matches a subscription in the table, the event is actively pushed to the user.

As a further improvement of the present invention, step (1) specifically comprises:

(1.1) A P/S system may contain multiple event agents. The event agents are organized into an event agent network with a Chord ring structure, each event agent corresponding to one node of the network; each event agent stores data resources of the information network according to certain rules and keeps information about some of the other nodes;

(1.2) The most capable node in the Chord ring is selected as the super node SN. The super node SN periodically checks the capability of the other nodes in the ring and selects a candidate super node from them, and the candidate super node backs up the important information held on the SN;

(1.3) The super node SN is responsible for extracting the attributes of data resources of different forms and constructing the multi-dimensional attribute space; the data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.
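The selection in step (1.2) can be sketched as follows. This is only an illustrative Python sketch: the `capability` score and the rule of taking the second-ranked node as the candidate are assumptions, since the text does not say how capability is measured or how the candidate is picked.

```python
from dataclasses import dataclass

@dataclass
class Node:
    node_id: int
    capability: float  # hypothetical score, e.g. derived from bandwidth, CPU or uptime

def elect_super_node(nodes):
    """Return (super_node, candidate): the most and the second most capable nodes."""
    ranked = sorted(nodes, key=lambda n: n.capability, reverse=True)
    super_node = ranked[0]
    candidate = ranked[1] if len(ranked) > 1 else None  # no backup in a one-node ring
    return super_node, candidate

# Three hypothetical event agents on the ring.
agents = [Node(3, 0.4), Node(11, 0.9), Node(27, 0.7)]
sn, backup = elect_super_node(agents)
print(sn.node_id, backup.node_id)
```

Re-running the election periodically, as the step describes, lets the candidate take over with its backed-up copy of the important information if the super node leaves or fails.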

As a further improvement of the present invention, step (2) specifically comprises:

(2.1) Following the vector space model, each piece of high-dimensional data in the network is described as an attribute vector, and the system then organizes all of the data in the network into a matrix. For example, $d$ data objects described by $t$ attributes form a $t \times d$ matrix $A$; each column of the matrix represents one data object, and the element $a_{ij}$ is the value of attribute $i$ in data object $j$, representing the importance of attribute $i$ in object $j$. If object $j$ does not have attribute $i$, then $a_{ij} = 0$;

(2.2) Most elements of the matrix are 0, which indicates that most attributes of a given data object carry no useful information for retrieval, so the initial matrix is approximated by a low-rank matrix. Let the set of high-dimensional data in the network be represented by the matrix $A$ of rank $r$; singular value decomposition factors $A$ into the product of three matrices:

$$A = U \Sigma V^{T}$$

where $U = (u_1, u_2, \ldots, u_r)$ is a $t \times r$ matrix, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is an $r \times r$ diagonal matrix, $V = (v_1, \ldots, v_r)$ is a $d \times r$ matrix, and the $\sigma_i$ are the singular values of $A$, with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_r$;

(2.3) Only the $l$ largest singular values are retained and the others are discarded, so that the rank-$r$ matrix $A$ is approximated by the rank-$l$ matrix $A_l$:

$$A_l = U_l \Sigma_l V_l^{T}$$

where $U_l = (u_1, u_2, \ldots, u_l)$, $\Sigma_l = \mathrm{diag}(\sigma_1, \ldots, \sigma_l)$ and $V_l = (v_1, \ldots, v_l)$; the rows of $V_l \Sigma_l$ are the semantic vectors of the high-dimensional data.
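The truncation described in steps (2.2) and (2.3) can be illustrated with a small, dependency-free Python sketch. It computes the leading singular triples by power iteration with deflation, which is a simple stand-in chosen for the example and not the method of the invention; a practical system would more likely call a numerical library's SVD routine. The matrix `A_demo`, the iteration count and all function names are assumptions.

```python
import math

def matvec(M, x):
    return [sum(a * b for a, b in zip(row, x)) for row in M]

def transpose(M):
    return [list(col) for col in zip(*M)]

def norm(x):
    return math.sqrt(sum(a * a for a in x))

def top_singular_triple(M, iters=200):
    """Power iteration on M^T M: returns (sigma, u, v) for the largest singular value.
    Assumes the start vector is not orthogonal to the leading right singular vector."""
    Mt = transpose(M)
    d = len(M[0])
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = matvec(Mt, matvec(M, v))
        n = norm(w)
        if n == 0.0:  # nothing left to extract (M is numerically zero)
            return 0.0, [0.0] * len(M), v
        v = [x / n for x in w]
    Mv = matvec(M, v)
    sigma = norm(Mv)
    u = [x / sigma for x in Mv] if sigma > 0.0 else [0.0] * len(M)
    return sigma, u, v

def truncated_svd(A, rank_l):
    """Return the rank_l leading (sigma, u, v) triples of A, largest sigma first."""
    M = [row[:] for row in A]
    triples = []
    for _ in range(rank_l):
        sigma, u, v = top_singular_triple(M)
        triples.append((sigma, u, v))
        # Deflation: subtract the rank-one component that was just found.
        for i in range(len(M)):
            for j in range(len(M[0])):
                M[i][j] -= sigma * u[i] * v[j]
    return triples

def semantic_vectors(triples, d):
    """Row j of V_l * Sigma_l: the reduced vector of data object j."""
    return [[sigma * v[j] for sigma, _, v in triples] for j in range(d)]

# Hypothetical t = 3 attributes (rows) by d = 4 data objects (columns).
A_demo = [[2.0, 2.0, 0.0, 0.0],
          [2.0, 2.0, 0.0, 0.0],
          [0.0, 0.0, 1.0, 1.0]]
demo_triples = truncated_svd(A_demo, 2)
demo_vectors = semantic_vectors(demo_triples, 4)
```

In this example the first two objects share attributes and receive identical semantic vectors, as do the last two, so objects that use the same attributes end up close together in the reduced space.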

As a further improvement of the present invention, step (3) specifically comprises:

(3.1) The data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space. On the principle that data resources close to each other in the attribute space have similar semantics, the super node SN clusters the high-dimensional data distributed in the multi-dimensional attribute space, dividing it into several mutually disjoint data aggregation areas, and assigns each data aggregation area to a different node;

(3.2) Besides its own routing table, the super node SN also maintains an index tree built from the resource identifier ranges of all nodes on the ring, called the global index GI, which is used as follows:

(3.2.1) Quickly assigning event information to nodes. Each node in the ring is responsible for one data aggregation area, and all data in that area is assigned to that node. When a piece of event information asks to join, the GI is queried to determine which data aggregation area it belongs to, which identifies the host node of the event information; then, with the host node's identifier as the search key, the event information is delivered to the host node by the Chord routing protocol;

(3.2.2) Determining the nodes that a subscription request needs to visit. When a subscription request is entered at some node, the request is first sent to the SN, which searches the GI to determine which nodes are responsible for data aggregation areas that intersect the semantic space of the request, and returns the identifiers of those nodes. The returned nodes are called proxy nodes; with their identifiers as search keys, the subscription request is routed to them for further semantic matching.
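The two uses of the global index in step (3.2) can be sketched in Python as below. The sketch stands in for the index tree with a sorted list and binary search, and it treats a subscription's semantic space as a closed range of resource identifiers; both simplifications, and all names, are assumptions made for illustration. Wrap-around on the ring is not handled.

```python
from bisect import bisect_left

class GlobalIndex:
    """Sorted (upper_key, node_id) pairs; a node owns the keys in (previous upper, upper]."""

    def __init__(self, ranges):
        self.entries = sorted(ranges)
        self.uppers = [upper for upper, _ in self.entries]

    def host_node(self, key):
        """Node whose data aggregation area contains `key` (compare step 3.2.1)."""
        i = bisect_left(self.uppers, key)
        if i == len(self.uppers):
            raise KeyError(f"key {key} lies outside every aggregation area")
        return self.entries[i][1]

    def proxy_nodes(self, low, high):
        """Nodes whose key range intersects [low, high] (compare step 3.2.2)."""
        first = bisect_left(self.uppers, low)
        last = min(bisect_left(self.uppers, high), len(self.uppers) - 1)
        return [node for _, node in self.entries[first:last + 1]]

# Hypothetical ring with three nodes owning (0, 100], (100, 200] and (200, 300].
gi = GlobalIndex([(100, "N1"), (200, "N2"), (300, "N3")])
print(gi.host_node(150))         # event with key 150
print(gi.proxy_nodes(90, 210))   # subscription covering keys 90..210
```

A tree-shaped index gives the same logarithmic lookup cost and makes splits and merges of aggregation areas cheaper to apply than rebuilding a sorted list.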

As a further improvement of the present invention, step (4) specifically comprises:

(4.1) Each data aggregation area is organized into a hypercube of dimension $d$ in the high-dimensional data space, and the hypercube is normalized so that its side length in every dimension is 1 and its center point is $(0.5, 0.5, \ldots, 0.5)$. Then, with the center of the hypercube as the apex and the $(d-1)$-dimensional hyperplanes bounding the data aggregation area as the bases, each $d$-dimensional data aggregation area is divided into $2d$ pyramids;

(4.2) Each pyramid is divided into several slices parallel to its base, each slice corresponding to one B+ tree data page. The high-dimensional data in the data aggregation area is then mapped to one-dimensional values according to the distance from each data point to the base of its pyramid, and the reduced data is organized and managed with a B+ tree. The dimensionality reduction formula is:

$$y_v = i \cdot c + (j + h_v) = i \cdot c + (j + |0.5 - h_v'|)$$

where, for a high-dimensional data object $v$, $i$ is the number of the cube containing $v$, $j$ is the number of the pyramid containing $v$, and $h_v'$ is the distance from $v$ to the base plane of that pyramid, so that the distance from $v$ to the plane through the pyramid's apex is $h_v = |0.5 - h_v'|$. After reduction, the one-dimensional values of the data objects in cube $i$ are confined to the interval $[i \cdot c, (i+1) \cdot c]$, where $c$ is a sufficiently large constant that ensures the index keys of the data objects in one data partition differ from those in any other partition.
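A minimal Python sketch of the mapping in steps (4.1) and (4.2) follows. The pyramid numbering used here (dimension index for the lower face, dimension index plus d for the upper face) follows the usual formulation of the pyramid technique and is an assumption, as the text does not fix a numbering; the check `c >= 2 * d` is likewise an assumed way of making c "sufficiently large".

```python
def pyramid_value(point):
    """Return (j, h) for a point in the unit hypercube [0, 1]^d:
    j in [0, 2d) is the pyramid number and h is the distance to the apex plane."""
    d = len(point)
    # Dimension in which the point lies farthest from the center 0.5.
    j_max = max(range(d), key=lambda k: abs(0.5 - point[k]))
    h = abs(0.5 - point[j_max])
    # Lower-face pyramids are numbered 0..d-1, upper-face pyramids d..2d-1.
    j = j_max if point[j_max] < 0.5 else j_max + d
    return j, h

def one_dimensional_key(point, cube_index, c):
    """y_v = i*c + (j + h_v), with i = cube_index."""
    d = len(point)
    if c < 2 * d:
        raise ValueError("c must be at least 2d so that cubes get disjoint key ranges")
    j, h = pyramid_value(point)
    return cube_index * c + (j + h)

# Hypothetical 2-dimensional points, already normalized to the unit square.
print(pyramid_value([0.2, 0.7]))
print(pyramid_value([0.6, 0.9]))
print(one_dimensional_key([0.2, 0.7], cube_index=2, c=10))
```

Because j + h is always below 2d, every key of cube i stays inside [i*c, (i+1)*c) when c is at least 2d, which keeps the key ranges of different aggregation areas apart.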

As a further improvement of the present invention, step (5) specifically comprises:

(5.1) Assuming that the Chord resource identifier space has size $2^m$, an order-preserving function $h$ maps the one-dimensional values obtained by dimensionality reduction, in order, into the interval $[0, 2^m)$. That is, for a high-dimensional data point $v$ in the attribute space, the i-Chord resource identifier of $v$ is:

$$key_v = \mathrm{ichord}(v) = h(y_v) = h(i \cdot c + (j + h_v)) \in [0, 2^m)$$

(5.2) Each node $N_i$ on the Chord ring is responsible for storing and managing the high-dimensional data of one data aggregation area. If the identifier of node $N_i$ is $Nkey_i$, the resource identifiers of the data in that area lie in the interval $(Nkey_{i-1}, Nkey_i]$;

(5.3) Each node $N_i$ maintains a routing table, i.e. a pointer table, that points to other nodes on the ring. The table has $m$ entries; the $k$-th entry ($1 \le k \le m$) is the first node on the Chord ring whose identifier is equal to or greater than $(Nkey_i + 2^{k-1}) \bmod 2^m$, i.e. $\mathrm{successor}((Nkey_i + 2^{k-1}) \bmod 2^m)$. When a node receives a request for the key $key$, it first checks whether its own identifier equals $key$ and, if so, answers directly. Otherwise it searches its routing table for the entry with the largest identifier that does not exceed $key$ and forwards the request to that node. This is repeated until the request reaches a node $N_k$ such that $key$ lies between $N_k$ and its successor $N_{k+1}$, at which point $N_k$ reports its successor $N_{k+1}$ as the answer to the request;

(5.4) When the high-dimensional data in a data aggregation area becomes so dense that the node can no longer manage it effectively, the area is split into two or more new aggregation areas, and idle nodes on the Chord ring are selected to share the load of the original node. When the data in adjacent data aggregation areas is so sparse that node resources are wasted, the areas are merged: one of the two responsible nodes is chosen as the node responsible for the new data aggregation area, and the other is marked as idle.
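Steps (5.1) to (5.3) can be sketched as a single-process Python simulation. The linear scaling in `order_preserving_key` is an assumption (the text only requires that order be preserved), and the `ChordRing` class holds the whole ring in one object instead of exchanging messages between nodes, so it illustrates the routing rule without being a networked implementation.

```python
from bisect import bisect_left

def order_preserving_key(y, y_max, m):
    """Map y in [0, y_max) into [0, 2**m) so that a larger y never gets a smaller key."""
    if not 0 <= y < y_max:
        raise ValueError("y must lie in [0, y_max)")
    return min(int(y / y_max * 2 ** m), 2 ** m - 1)

class ChordRing:
    def __init__(self, node_ids, m):
        self.m = m
        self.size = 2 ** m
        self.nodes = sorted(node_ids)

    def successor(self, key):
        """First node whose identifier is >= key, wrapping around the ring."""
        i = bisect_left(self.nodes, key % self.size)
        return self.nodes[i % len(self.nodes)]

    def fingers(self, n):
        """Routing table of node n: entry k is successor((n + 2**(k-1)) mod 2**m)."""
        return [self.successor((n + 2 ** (k - 1)) % self.size)
                for k in range(1, self.m + 1)]

    def _dist(self, a, x):
        """Clockwise distance from a to x, in 1..size (a full lap when x == a)."""
        return (x - a) % self.size or self.size

    def lookup(self, start, key):
        """Return (node responsible for key, list of nodes visited on the way)."""
        n, path = start, [start]
        while True:
            if key == n:  # the node's own identifier equals the key
                return n, path
            succ = self.successor(n + 1)
            if self._dist(n, key) <= self._dist(n, succ):  # key in (n, succ]
                return succ, path
            # Otherwise forward to the farthest table entry that precedes the key.
            for f in reversed(self.fingers(n)):
                if self._dist(n, f) < self._dist(n, key):
                    n = f
                    break
            path.append(n)

# Hypothetical ring with m = 3 (identifiers 0..7) and nodes 0, 1 and 3.
demo_ring = ChordRing([0, 1, 3], m=3)
print(demo_ring.fingers(0), demo_ring.lookup(1, 6))
```

Each forwarding step moves the request strictly closer to the key on the ring, which is why the loop terminates; with a full routing table the number of hops grows logarithmically with the number of nodes.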

As a further improvement of the present invention, step (6) specifically comprises:

(6.1) Each node maintains a subscription table that records all subscriptions related to that node. When event information arrives at a node, the node's subscription table is queried; if the event information semantically matches a subscription in the table, the event is actively pushed to the user;

(6.2) When a subscriber submits a subscription request to some node $N_i$ on the ring, that node searches the global index tree to determine the proxy nodes semantically related to the request and routes the request to them. Each proxy node runs a similarity search based on the subscription semantics; if event information with the same semantics as the request is found, it is passed back to $N_i$. At the same time, the proxy node stores the subscription request, the subscription conditions, the routing path and the search range in its subscription table;

(6.3) When a publisher publishes event information to some node $N_j$ on the ring, that node searches the global index tree to determine the host node semantically related to the event information, and the event information is routed to and stored on the host node by the Chord routing protocol. At the same time, the host node semantically matches the attributes of the event against the subscription requests in its subscription table and, if a match succeeds, pushes the event information directly to the subscriber;

(6.4) When a subscriber sends an unsubscribe message to some node on the ring, that node searches the global index tree to determine the proxy nodes semantically related to the subscription and routes the request to them; each proxy node finds the corresponding subscription in its subscription table and deletes it.
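The subscription table handling in steps (6.1) to (6.4) can be sketched for a single node as follows. The matching rule used here (an event matches when it lies within a given radius of the subscription's point in the attribute space) is a hypothetical stand-in for the semantic matching described in the text, and routing between nodes is omitted; `outbox` merely records what would be pushed to subscribers.

```python
import math
from dataclasses import dataclass, field

@dataclass
class Subscription:
    subscriber: str
    center: tuple   # point in the attribute space describing the interest
    radius: float   # hypothetical similarity threshold

@dataclass
class AgentNode:
    events: list = field(default_factory=list)   # events stored on this node
    table: dict = field(default_factory=dict)    # subscription id -> Subscription
    outbox: list = field(default_factory=list)   # (subscriber, event) pairs pushed

    @staticmethod
    def _matches(sub, event):
        return math.dist(sub.center, event) <= sub.radius

    def subscribe(self, sub_id, sub):
        """Register the subscription and return the stored events that already match."""
        self.table[sub_id] = sub
        return [e for e in self.events if self._matches(sub, e)]

    def publish(self, event):
        """Store the event and push it to every subscriber whose subscription matches."""
        self.events.append(event)
        for sub in self.table.values():
            if self._matches(sub, event):
                self.outbox.append((sub.subscriber, event))

    def unsubscribe(self, sub_id):
        self.table.pop(sub_id, None)

node = AgentNode()
node.publish((0.1, 0.1))
hits = node.subscribe("s1", Subscription("alice", (0.0, 0.0), 0.5))
node.publish((0.2, 0.2))   # within the radius: pushed to alice
node.publish((0.9, 0.9))   # outside the radius: stored only
node.unsubscribe("s1")
node.publish((0.0, 0.0))   # no subscription left: stored only
```

Keeping the table on the node that hosts the matching data means a new event is compared only with the subscriptions registered for its own aggregation area, not with every subscription in the system.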

Compared with the prior art, the present invention has the following advantages:

The invention is simple in principle and easy to implement and popularize. It builds a semantic-based publish/subscribe system on structured P2P technology, which improves the fault tolerance, dynamicity and information distribution efficiency of the system. By mapping event information and subscription information into a multi-dimensional attribute space, the system supports an understanding of the semantics of the information itself, which strengthens its expressiveness and enables intelligent matching based on information semantics. In addition, by normalizing and simplifying the multi-dimensional attribute space and reducing the dimensionality of the high-dimensional data, it removes the curse-of-dimensionality problem brought by the multi-dimensional attribute space and efficiently achieves data organization and intelligent matching based on information semantics.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the method of the present invention.

Fig. 2 is a schematic diagram of mapping high-dimensional data from the high-dimensional space to the one-dimensional space in the present invention.

Fig. 3 is a schematic diagram of the data lookup process and the subscription table in the present invention.

Fig. 4 is a schematic diagram of the subscription request processing in the present invention.

Fig. 5 is a schematic diagram of processing range queries based on information semantics in the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

As shown in Fig. 1, the semantic intelligent information publishing and subscribing method based on P2P technology of the present invention comprises the following steps:

（1）构建系统拓扑结构并设立超级节点：利用结构化P2P技术将P/S系统中多个事件代理的拓扑构建成Chord环结构，并在环上设定一个超级节点，用于提取信息网络中数据资源的属性并构造属性空间，信息网络中数据资源抽象为高维属性空间中的点或向量。(1) Construct the system topology and set up super nodes: use structured P2P technology to construct the topology of multiple event agents in the P/S system into a Chord ring structure, and set a super node on the ring to extract information from the network The attributes of the data resources in the information network and construct the attribute space, and the data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.

（2）高维属性空间规格化处理：针对现在信息网络中信息种类繁多，不同信息属性各不一致的问题，在超级节点上利用向量空间模型（VSM）将网络中数据信息表示为高维属性空间中的高维点向量，将网络中所有数据信息映射到一个高维属性空间中，数学形式上表示为一个高阶矩阵；为了减少信息索引过程中的噪声及同义词干扰，利用潜在语义索引（LSI）去除与信息检索相关性很小的信息属性，用一个属性子空间近似代替原先的高维属性空间。(2) High-dimensional attribute space normalization processing: In view of the fact that there are many types of information in the current information network and different information attributes are inconsistent, the vector space model (VSM) is used on the super node to represent the data information in the network as a high-dimensional attribute space The high-dimensional point vector in the network maps all data information in the network to a high-dimensional attribute space, which is expressed as a high-order matrix in mathematical form; in order to reduce noise and synonym interference in the process of information indexing, latent semantic indexing (LSI ) to remove information attributes that have little correlation with information retrieval, and use an attribute subspace to approximately replace the original high-dimensional attribute space.

(3) Partition the high-dimensional data and build and maintain a global index tree: the super node SN divides the high-dimensional data in the attribute space into different data aggregation areas and assigns each data aggregation area to a different node. The super node also maintains an index tree built from the information of all nodes, called the global index tree, which is used to assign event information to the nodes of the Chord ring and to determine which proxy nodes a subscription request must visit.

(4) Reduce the dimensionality of the high-dimensional data: each node regularizes the data aggregation area assigned to it into a multi-dimensional hypercube in the high-dimensional attribute space, then uses the pyramid dimensionality reduction method to map the high-dimensional data objects to a one-dimensional data space, which is indexed with a B+ tree.

(5) Manage data objects with the i-Chord method: after the high-dimensional data set has been mapped to the one-dimensional data space, the Chord protocol is used to organize and maintain the network data. Because the one-dimensional data space does not match the Chord resource identifier space, an order-preserving function maps the data in the one-dimensional data space into the Chord resource identifier space. Each Chord node corresponds to one data aggregation area and stores and manages the high-dimensional data in that area, and each node maintains a routing table for fast lookup. When a node manages too much data, the load balance of the system is maintained through operations such as node insertion and departure.

(6) Implement semantic-based intelligent information subscription and publication: each node in the system maintains a subscription table that records the subscriptions semantically related to that node. When a subscriber sends a subscription request to the system, the global index tree is first searched to determine the proxy nodes semantically related to the request, and the request is sent to those proxy nodes. Each proxy node adds a record to its subscription table, registering the association between the subscription request and the node. The proxy node then determines the exact search range in the high-dimensional attribute space from the subscription request and the subscription conditions, searches that range for event information with the same semantics as the request, and returns it to the subscriber. When a publisher sends event information to the system, the global index tree is first searched to determine the host node semantically related to the event, the event information is sent to the host node, and the host node's subscription table is consulted. If the event information semantically matches a subscription in the table, the event is actively pushed to the user.

In this embodiment, step (1) above comprises the following sub-steps:

(1.1) A P/S system may contain multiple event brokers. These brokers are organized into an event broker network with a Chord ring structure, in which each event broker corresponds to one node. Each event broker stores data resources of the information network according to a given rule (the semantic information of the events), serves a certain number of clients, and keeps information about some of the other nodes, so that subscriptions and event information reach semantically related nodes cheaply, efficiently, and reliably.

(1.2) The most capable node in the Chord ring, judged on computing power, storage space, online time, bandwidth, and similar factors, is selected as the super node SN. The super node SN periodically checks the capabilities of the other nodes in the ring and selects a candidate super node from among them. The candidate super node backs up the important information held on the SN so that it can take over as the new super node if the SN fails.

(1.3) To obtain the semantic information of the data resources in the network, the super node SN extracts the attributes of data resources of different forms and constructs a multi-dimensional attribute space. Each data resource in the information network is abstracted as a point or vector in the high-dimensional attribute space.

In this embodiment, step (2) above comprises the following sub-steps:

(2.1) Following the vector space model, each high-dimensional data item in the network is described as an attribute vector. An attribute may be a concept, keyword, term, or other element related to the given item. The system then organizes all high-dimensional data items in the network into a matrix: d data items that can be described by t attributes form a t*d matrix A. Each column of the matrix represents one data item, and the matrix element a_ij is the value of attribute i in data object j, representing the importance of attribute i in object j (for example, computed from the frequency with which a term appears in a document). If data item j does not have attribute i, then a_ij is 0.
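The construction of the matrix A can be illustrated with a minimal sketch. It assumes that the data objects are short texts, that attributes are whitespace-separated lowercase terms, and that a_ij is the raw term frequency; the function name `build_term_document_matrix` and the sample texts are hypothetical and not taken from the patent.

```python
from collections import Counter


def build_term_document_matrix(documents):
    """Return (attributes, A), where A[i][j] is the frequency of attribute i in object j.

    Assumption: attributes are whitespace-separated lowercase terms and the
    importance a_ij is the raw term frequency.
    """
    counts = [Counter(doc.lower().split()) for doc in documents]
    # Rows of A: every distinct attribute, in a fixed (sorted) order.
    attributes = sorted(set().union(*counts))
    # Counter returns 0 for a missing term, so a_ij = 0 when object j lacks attribute i.
    A = [[c[term] for c in counts] for term in attributes]
    return attributes, A


docs = ["chord ring node", "node index tree", "chord chord index"]
attributes, A = build_term_document_matrix(docs)
for term, row in zip(attributes, A):
    print(term, row)
```

Here t = 5 attributes describe d = 3 objects, so A is a 5*3 matrix with one column per object.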

(2.2) Most elements of the matrix are 0, which indicates that most attributes of a given data item carry no useful information for retrieval. To avoid unnecessary computation during semantic retrieval, the initial matrix is approximated by a low-rank matrix. Suppose the set of high-dimensional data items in the network is represented by a matrix A of rank r. Singular value decomposition factors A into the product of three matrices:

A = U Σ V^T

where U = (u_1, u_2, ..., u_r) is a t*r matrix, Σ = diag(σ_1, ..., σ_r) is an r*r diagonal matrix, V = (v_1, ..., v_r) is a d*r matrix, and the σ_i are the singular values of A, with σ_1 ≥ σ_2 ≥ ... ≥ σ_r.

(2.3) To reduce computation, speed up retrieval, and avoid synonym interference during semantic matching, only the l largest singular values are retained and the others are discarded. The rank-r matrix A is thereby approximated by the rank-l matrix A_l:

A_l = U_l Σ_l V_l^T

where U_l = (u_1, u_2, ..., u_l), Σ_l = diag(σ_1, ..., σ_l), V_l = (v_1, ..., v_l), and the rows of V_l Σ_l are the semantic vectors of the high-dimensional data items. Retrieval over the high-dimensional data can then be carried out within this low-rank semantic matrix.
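For orientation, the truncation can be written as a sum of rank-one terms together with its approximation error. The error identity below is the standard Eckart-Young result for truncated SVD; it is not stated in the patent and is added here only as a reminder of what is lost when the smaller singular values are dropped.

```latex
A_l = U_l \Sigma_l V_l^{T} = \sum_{i=1}^{l} \sigma_i\, u_i v_i^{T},
\qquad
\lVert A - A_l \rVert_F^{2} = \sum_{i=l+1}^{r} \sigma_i^{2}.
% Toy example (t = d = r = 2, l = 1):
% A = \begin{pmatrix} 3 & 0 \\ 0 & 1 \end{pmatrix},\quad
% A_1 = \begin{pmatrix} 3 & 0 \\ 0 & 0 \end{pmatrix},\quad
% \lVert A - A_1 \rVert_F^{2} = \sigma_2^{2} = 1.
```

In the toy example the second attribute contributes only the small singular value σ_2 = 1, so discarding it changes A by a Frobenius norm of 1 while halving the dimension of the semantic vectors.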

In this embodiment, step (3) above comprises the following sub-steps:

(3.1) The data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space. On the principle that data resources lying close together in the attribute space have similar semantics, the super node SN clusters the high-dimensional data distributed in the multi-dimensional attribute space, dividing it into multiple mutually disjoint data aggregation areas, and assigns each data aggregation area to a different node.

(3.2) In addition to its own routing table, the super node SN maintains an index tree built from the resource identifier range information of all nodes on the ring, called the global index GI. The GI serves two purposes:

(3.2.1) Assigning event information to nodes quickly. Each node in the ring is responsible for one data aggregation area, and all data in that area are assigned to that node. When an item of event information asks to join, the GI is queried to determine which data aggregation area the event belongs to, which identifies the host node of the event. The event information is then delivered to the host node through the Chord routing protocol, using the host node's identifier as the search key.

(3.2.2) Determining the nodes that a subscription request must visit. When a subscription request is entered at some node, the request is first sent to the SN, which searches the GI to determine which nodes are responsible for data aggregation areas that intersect the semantic space of the request, and returns the identifiers of those nodes. The returned nodes are called proxy nodes and may hold results for the subscription. The subscription request is then routed to these nodes, using their identifiers as search keys, for further semantic matching.
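The two uses of the global index can be sketched as follows. This is a simplified illustration, not the patent's data structure: it assumes each data aggregation area is described by an axis-aligned bounding box, and it scans a flat list instead of walking an index tree. The names `Region`, `GlobalIndex`, `host_node`, and `proxy_nodes` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    node_id: int   # Chord identifier of the node that owns this data aggregation area
    low: tuple     # lower corner of the area's bounding box (assumed shape)
    high: tuple    # upper corner of the area's bounding box


class GlobalIndex:
    """Flat stand-in for the global index tree kept on the super node."""

    def __init__(self, regions):
        self.regions = list(regions)

    def host_node(self, point):
        """Use (3.2.1): find the node whose area contains a new event."""
        for r in self.regions:
            # Half-open boxes keep neighbouring areas disjoint.
            if all(lo <= x < hi for lo, x, hi in zip(r.low, point, r.high)):
                return r.node_id
        return None

    def proxy_nodes(self, q_low, q_high):
        """Use (3.2.2): find every node whose area intersects the query box."""
        return [
            r.node_id
            for r in self.regions
            if all(ql <= hi and lo <= qh
                   for lo, hi, ql, qh in zip(r.low, r.high, q_low, q_high))
        ]


gi = GlobalIndex([
    Region(node_id=8, low=(0.0, 0.0), high=(1.0, 1.0)),
    Region(node_id=20, low=(1.0, 0.0), high=(2.0, 1.0)),
])
print(gi.host_node((1.5, 0.2)))                  # node that stores the event
print(gi.proxy_nodes((0.8, 0.1), (1.2, 0.3)))    # nodes a subscription must visit
```

A real implementation would replace the linear scan with a tree over the regions so that both lookups stay logarithmic in the number of nodes.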

In this embodiment, step (4) above comprises the following sub-steps:

(4.1) Each data aggregation area is regularized into a d-dimensional hypercube in the high-dimensional data space, and the hypercube is normalized so that its side length in every dimension is 1; its center is then the point (0.5, 0.5, ..., 0.5). Taking the center of the hypercube as the apex and each (d-1)-dimensional hyperplane bounding the data aggregation area as a base, each d-dimensional data aggregation area (hypercube) is divided into 2d pyramids.

(4.2) Each pyramid is divided into several slices parallel to its base, and each slice corresponds to one B+ tree data page. The high-dimensional data in the data aggregation area are then mapped to one-dimensional values according to the distance from each data point to the base of its pyramid, and the reduced data are organized and managed with a B+ tree. The dimensionality reduction formula is:

y_v = i*c + (j + h_v) = i*c + (j + |0.5 - h_v'|)

where, for a high-dimensional data object v, i is the number of the cube containing v (distinguishing the different data aggregation areas), j is the number of the pyramid containing v, and h_v' is the distance from v to the base plane of that pyramid, so that the distance from v to the plane through the pyramid's apex is h_v = |0.5 - h_v'|. After the reduction, the one-dimensional values of the data objects in each high-dimensional cube are confined to the interval [i*c, (i+1)*c], where c is a sufficiently large constant, which ensures that the data objects of each data partition have index keys different from those of every other partition.
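The mapping can be sketched as below. The patent does not spell out how the pyramid number j is chosen; this sketch assumes the convention of the original pyramid technique, in which the dominant dimension k is the coordinate farthest from 0.5 and j = k or k + d depending on the side of the center. The function name `pyramid_key` is hypothetical, and c is assumed to be at least 2d so that keys of different cubes cannot overlap.

```python
def pyramid_key(v, cube_no, c):
    """Map a point v of the unit hypercube to y_v = i*c + (j + h_v).

    Assumptions: v has coordinates in [0, 1]; cube_no is the cube number i;
    c >= 2 * len(v) so that every key of cube i lies in [i*c, (i+1)*c).
    """
    d = len(v)
    # Dominant dimension: the coordinate farthest from the centre 0.5.
    k = max(range(d), key=lambda dim: abs(0.5 - v[dim]))
    # Pyramid number j in 0 .. 2d-1: low side of dimension k, or high side.
    j = k if v[k] < 0.5 else k + d
    # h_v': distance from v to the base (a face of the cube) of its pyramid.
    h_base = min(v[k], 1.0 - v[k])
    # h_v: distance from v to the plane through the apex, parallel to the base.
    h = abs(0.5 - h_base)
    return cube_no * c + (j + h)


print(pyramid_key((0.2, 0.6), cube_no=0, c=4))
print(pyramid_key((0.5, 0.9), cube_no=3, c=4))
```

Because j + h_v never exceeds 2d - 0.5, choosing c = 2d is already enough to keep the key ranges of different cubes apart.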

In this embodiment, step (5) above comprises the following sub-steps:

(5.1) Suppose the Chord resource identifier space has size 2^m. To manage the reduced one-dimensional data with Chord resource identifiers, an order-preserving function h maps the reduced one-dimensional values, in order, into the interval [0, 2^m). That is, for a high-dimensional data point v in the attribute space (where v belongs to the hypercube with cube number i), the i-Chord resource identifier of v is:

key_v = ichord(v) = h(y_v) = h(i*c + (j + h_v)) ∈ [0, 2^m)
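The patent does not specify the order-preserving function h. One simple candidate, shown here purely as an assumption, is linear scaling of the one-dimensional data space [0, num_cubes*c) onto the identifier space [0, 2^m) followed by truncation to an integer; `make_order_preserving_map` and its parameters are hypothetical names.

```python
def make_order_preserving_map(num_cubes, c, m):
    """Build h: [0, num_cubes*c) -> {0, ..., 2^m - 1} by linear scaling.

    Assumption: the one-dimensional data space is [0, num_cubes * c).
    The map is monotone (y1 <= y2 implies h(y1) <= h(y2)) but not injective:
    nearby values may share an identifier.
    """
    y_max = num_cubes * c
    ring_size = 1 << m

    def h(y):
        if not 0 <= y < y_max:
            raise ValueError("y lies outside the one-dimensional data space")
        return min(int(y * ring_size / y_max), ring_size - 1)

    return h


h = make_order_preserving_map(num_cubes=4, c=4, m=6)
print(h(0.3), h(15.4))
```

Since the map only preserves order weakly, a range of one-dimensional values still corresponds to a contiguous arc of identifiers, which is the property the range search relies on.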

(5.2) Each node N_i on the Chord ring stores and manages the high-dimensional data of one data aggregation area. If the identifier of node N_i is Nkey_i, the resource identifiers of the high-dimensional data in that area fall within the interval (Nkey_{i-1}, Nkey_i].

(5.3) Each node N_i maintains a routing table, the finger table, whose entries point to other nodes on the ring. The table has m entries, where m is the number of bits in an identifier. The k-th entry (1 ≤ k ≤ m) is the first node on the Chord ring whose identifier is equal to or greater than (Nkey_i + 2^(k-1)) mod 2^m, that is, successor((Nkey_i + 2^(k-1)) mod 2^m). When a node receives a request for a key, it first checks whether its own identifier equals the key; if so, it answers directly. Otherwise it looks up its routing table, selects the entry with the largest identifier that does not exceed the key, and forwards the request to that node. This process repeats until the request reaches a node N_k such that the key lies between N_k and its successor N_{k+1}; node N_k then reports its successor N_{k+1} as the answer to the request.
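The finger table and the lookup procedure can be sketched with a toy in-memory ring. This is an illustration under simplifying assumptions (all node identifiers known in one place, no joins, failures, or network messages); the class `ChordRing` and the ten sample node identifiers are hypothetical and are not the ring drawn in the patent's figures.

```python
from bisect import bisect_left


class ChordRing:
    """Toy in-memory model of a Chord ring with m-bit identifiers."""

    def __init__(self, node_ids, m):
        self.m = m
        self.size = 1 << m
        self.nodes = sorted(node_ids)

    def successor(self, ident):
        """First node whose identifier equals or follows ident on the ring."""
        idx = bisect_left(self.nodes, ident % self.size)
        return self.nodes[idx % len(self.nodes)]

    def finger_table(self, n):
        """Entry k (1 <= k <= m) is successor((n + 2^(k-1)) mod 2^m)."""
        return [self.successor(n + (1 << (k - 1))) for k in range(1, self.m + 1)]

    @staticmethod
    def _between(x, a, b, closed_right):
        """True if x lies in the ring interval (a, b), or (a, b] if closed_right."""
        if x == b:
            return closed_right
        if a < b:
            return a < x < b
        return x > a or x < b   # the interval wraps past zero

    def lookup(self, start, key):
        """Return (node responsible for key, list of nodes visited)."""
        n, path = start, [start]
        if key == n:
            return n, path
        while True:
            succ = self.successor(n + 1)
            if self._between(key, n, succ, closed_right=True):
                return succ, path + [succ]
            # Forward to the farthest finger that still precedes the key.
            n = next(f for f in reversed(self.finger_table(n))
                     if self._between(f, n, key, closed_right=False))
            path.append(n)


ring = ChordRing([1, 8, 14, 21, 32, 38, 42, 48, 51, 56], m=6)
print(ring.finger_table(8))
print(ring.lookup(8, 54))
```

Each hop at least halves the remaining identifier distance to the key, which is why a lookup needs on the order of log N hops on a ring of N nodes.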

(5.4) A node is responsible for a limited range of resource identifiers, so it cannot maintain an unlimited amount of high-dimensional data. When the data in a data aggregation area become too dense for its node to manage effectively, the area is split into two or more new areas, and idle nodes are selected on the Chord ring to share the load of the original node. When the data in adjacent data aggregation areas are so sparse that node resources are wasted, the areas are merged: one of the two executing nodes is chosen as the executing node of the new data aggregation area, and the other becomes an idle node. These operations keep the system load balanced.

In this embodiment, step (6) above comprises the following sub-steps:

(6.1) To provide semantic-based publish/subscribe, each node maintains a subscription table that records all subscriptions related to that node. When event information arrives at a node, the node's subscription table is queried; if the event information semantically matches a subscription in the table, the event is actively pushed to the user.

(6.2) When a subscriber submits a subscription request to a node N_i on the ring, that node searches the global index tree to determine the proxy nodes semantically related to the request and routes the request to them. A similarity search algorithm based on the subscription semantics (such as a point query, range query, k-nearest-neighbor query, or nearest-neighbor query) is executed on the proxy nodes. If event information with the same semantics as the subscription request is found, it is passed back to N_i. At the same time, each proxy node stores the subscription request, the subscription conditions, the routing path, the search range, and related information in its subscription table.

(6.3) When a publisher publishes event information to a node N_j on the ring, that node searches the global index tree to determine the host node semantically related to the event, and the event information is routed to and stored on the host node through the Chord routing protocol. At the same time, the host node semantically matches the event's attribute information against the subscription requests in its subscription table; if a match succeeds, the event information is pushed directly to the subscriber.
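The interplay between the subscription table and event arrival on a single node can be sketched as follows. The sketch assumes that a subscription is a query point plus a radius in the semantic space and that "semantic match" means Euclidean distance within that radius; the patent leaves the matching rule open. `BrokerNode` and its fields are hypothetical names, and routing between nodes is omitted.

```python
import math


class BrokerNode:
    """One node of the ring, acting as both proxy node and host node."""

    def __init__(self):
        self.events = []          # semantic vectors of the events stored here
        self.subscriptions = []   # subscription table: (subscriber, centre, radius)
        self.pushed = []          # (subscriber, event) notifications sent so far

    @staticmethod
    def matches(event, centre, radius):
        # Assumed matching rule: the event lies within radius of the query point.
        return math.dist(event, centre) <= radius

    def subscribe(self, subscriber, centre, radius):
        """Register the subscription and return the events that already match it."""
        self.subscriptions.append((subscriber, centre, radius))
        return [e for e in self.events if self.matches(e, centre, radius)]

    def publish(self, event):
        """Store the event and push it to every subscriber whose subscription matches."""
        self.events.append(event)
        for subscriber, centre, radius in self.subscriptions:
            if self.matches(event, centre, radius):
                self.pushed.append((subscriber, event))


node = BrokerNode()
node.publish((0.1, 0.1))                              # stored, nobody subscribed yet
print(node.subscribe("alice", (0.0, 0.0), 0.5))       # existing matches are returned
node.publish((0.2, 0.2))                              # matches: pushed to alice
node.publish((0.9, 0.9))                              # no match: stored only
print(node.pushed)
```

Keeping the subscription on the node is what turns later publications into pushes rather than repeated polls by the subscriber.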

The present invention is described in detail below by way of a specific application example, whose implementation steps are as follows:

1) Network data collection: networks today hold massive digital resources in many different forms (text, images, music, video, and so on), and the system cannot directly understand their semantic information. The digital resources in the network are therefore first collected and formalized so that they can be recognized and further processed by machine.

2) Construct the system topology and set up the super node: the multiple event brokers are organized into an event broker network with a Chord ring structure, in which each event broker corresponds to one node. Each event broker stores data resources of the information network according to a given rule (the semantic information of the events), serves a certain number of clients, and keeps information about some of the other nodes, so that subscriptions and event information reach semantically related nodes cheaply, efficiently, and reliably. The most capable node in the system (judged on computing power, online time, storage space, bandwidth, and similar factors) is set up as the super node SN, and a backup node for the super node is designated so that the system keeps working if the super node fails. To obtain the semantic information of the data resources in the network, the super node SN extracts the attributes of data resources of different forms and obtains the multi-dimensional attribute space; the data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.

3) Normalize the high-dimensional attribute space: following the vector space model (VSM), the attribute information of each data object is extracted and the object is described as a high-dimensional attribute vector, so that the network's digital resources are mapped into the high-dimensional attribute space. The set of network data resources can then be represented as a matrix, in which each element of a vector expresses the importance of one attribute in that data object. For example, a set of d data objects in the network described by t attributes forms a t*d matrix A.

During retrieval, network data resources represented in the high-dimensional attribute space carry many attributes that are irrelevant to the query (seen as the many zero elements of the matrix). To speed up queries while avoiding synonym and noise interference, the matrix is simplified by approximating the initial matrix with a low-rank matrix. Suppose the set of data resources in the network is represented by a matrix A of rank r. Singular value decomposition factors A into the product of three matrices, A = U Σ V^T, where U = (u_1, u_2, ..., u_r) is a t*r matrix, Σ = diag(σ_1, ..., σ_r) is an r*r diagonal matrix, V = (v_1, ..., v_r) is a d*r matrix, and the σ_i are the singular values of A, with σ_1 ≥ σ_2 ≥ ... ≥ σ_r. Only the l largest singular values are then retained and the others are discarded, so that the rank-r matrix A is approximated by the rank-l matrix A_l = U_l Σ_l V_l^T, where U_l = (u_1, u_2, ..., u_l), Σ_l = diag(σ_1, ..., σ_l), V_l = (v_1, ..., v_l), and the rows of V_l Σ_l are the semantic vectors of the data items. The network resources can thus be represented as vectors in an l-dimensional attribute space.

4) Cluster and partition the data and build the global index tree: on the principle that data objects lying close together in the attribute space are more likely to have similar semantics, the data objects in the attribute space are clustered and partitioned, dividing the high-dimensional data into multiple mutually disjoint data aggregation areas, while keeping the data within each area as evenly distributed as possible.

In addition, the super node maintains the global index tree GI, which holds the range of the data aggregation area corresponding to each node. When a new data object joins the system, the GI uses the object's coordinates to determine quickly which node the object belongs to, so that the object can be assigned to that node without delay. When a new subscription request arrives, the range of its semantic space is determined, the GI is queried to identify the data aggregation areas that intersect the request, and the request is sent to the corresponding nodes for a further exact query, which speeds up the search. The global index GI should also be refreshed periodically to avoid the effects of changes in the system's nodes.

5) Reduce the dimensionality of the high-dimensional data: to remove the effect of the "curse of dimensionality" on retrieval in the high-dimensional attribute space, the high-dimensional data are reduced in dimension, as shown in Figure 2.

The data aggregation areas produced by partitioning may be irregular in shape. Each data aggregation area is regularized into a high-dimensional hypercube, each hypercube is given a cube number i, and the hypercube is normalized so that its side length is 1. Then, taking the center of the hypercube as the apex and each (d-1)-dimensional hyperplane bounding the data aggregation area as a base, each d-dimensional data aggregation area (hypercube) is divided into 2d pyramids, and each pyramid is given a pyramid number j.

Each pyramid is divided into several slices parallel to its base, and each slice corresponds to one B+ tree data page. The high-dimensional data in the data aggregation area are then mapped to one-dimensional values according to the distance from each data point to the base of its pyramid, as shown in Figure 2, and the reduced data are organized and managed with a B+ tree. The dimensionality reduction formula is y_v = i*c + (j + h_v) = i*c + (j + |0.5 - h_v'|). After the reduction, the one-dimensional values of the data objects in each high-dimensional cube are confined to the interval [i*c, (i+1)*c], where c is a sufficiently large constant, which ensures that the data objects of each data partition have index keys different from those of every other partition.

6) Manage data objects with the i-Chord method: the Chord method is used to store and manage the network data, which makes retrieval a distributed process and improves the efficiency of the system. First, the resource identifiers of the data objects are constructed from the one-dimensional data generated in step 5). Suppose the Chord resource identifier space has size 2^m; an order-preserving function h maps the one-dimensional data values, in order, into the interval [0, 2^m). The mapping formula is key_v = ichord(v) = h(y_v) = h(i*c + (j + h_v)) ∈ [0, 2^m).

To simplify queries, each node on the Chord ring stores and manages the data objects of one data aggregation area, and the data corresponding to a segment of resource identifiers on the ring are stored on the successor node of that segment. For example, if the identifier of node N_i is Nkey_i, the high-dimensional data objects whose resource identifiers fall within the interval (Nkey_{i-1}, Nkey_i] are stored on node N_i. Each node maintains a routing table, and the system uses the routing tables to find the data object corresponding to a given resource identifier (Figure 3 illustrates a lookup of key = 28 starting from node N4). To keep the system load balanced, when the data objects in a data aggregation area become too dense or too sparse, data aggregation areas are split or merged and the corresponding node operations are carried out, which keeps system performance stable and resource use efficient.

7) Publish and subscribe information based on semantics: the handling of a subscription request is shown in Figure 4. Each node maintains a subscription table (as shown in Figure 3). When a user sends a subscription request to a node, the system checks the subscription table. If the table holds no identical request, the system searches the global index tree to determine the proxy nodes semantically related to the request, routes the request to them, and executes a similarity search algorithm on those nodes; event information with the same semantics as the request is returned to the user, and the user's subscription request, the search results, and related information are added to the subscription table of the proxy node. If the table already holds an identical request, the result is found quickly from the subscription table and returned, and the table is updated. When new event information joins the system, it is stored on the corresponding host node according to its resource identifier, and the host node's subscription table is checked for subscriptions that semantically match it; if one exists, the data object is actively pushed to the corresponding user. When a user sends an unsubscribe message, the system looks up the subscription table and deletes the corresponding subscription.

Similarity search methods fall into four categories: the point query, which finds in the data space S the target object p identical to a given query point q; the range query, which for a given query point q and threshold r finds all target objects p in S satisfying d(p, q) ≤ r; the nearest-neighbor query, which finds the target object p in S closest to a given query point q; and the k-nearest-neighbor (KNN) query, which finds the k target objects p in S closest to a given query point q.

Take the range query as an example, shown in Figure 5 for a two-dimensional space. The query range (the circular region in the figure) is first derived from the coordinates of the query point and the query radius, and it is checked whether this range intersects any data aggregation area in the data space. For each data aggregation area that intersects the query range, the two boundary index keys y_low and y_high are determined for every pyramid of that area that intersects the range. A range search is performed in the B+ tree with these two keys, yielding a candidate point set (the points in the shaded region of the figure). Each point in the candidate set is then checked exactly against the query range, and the result is output.
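The filter-and-refine procedure can be sketched for a single normalized data aggregation area. Several simplifications are assumptions of this sketch rather than the patent's method: a sorted list searched with `bisect` stands in for the B+ tree, the circular query is first enclosed in its bounding box, and the key interval [y_low, y_high] per pyramid is a conservative superset computed from that box. The names `pyramid_key` and `range_query` are hypothetical.

```python
import math
from bisect import bisect_left, bisect_right


def pyramid_key(v):
    """Pyramid value j + h_v of a point in the unit hypercube (cube number 0)."""
    d = len(v)
    k = max(range(d), key=lambda dim: abs(0.5 - v[dim]))
    j = k if v[k] < 0.5 else k + d
    return j + abs(0.5 - v[k])


def range_query(points, q, r, eps=1e-12):
    """Return all points p with d(p, q) <= r, using the one-dimensional keys as a filter."""
    d = len(q)
    index = sorted((pyramid_key(p), p) for p in points)   # stand-in for the B+ tree
    keys = [key for key, _ in index]
    # Bounding box of the query ball, clipped to the unit hypercube.
    lo = [max(0.0, x - r) for x in q]
    hi = [min(1.0, x + r) for x in q]
    result = []
    for j in range(2 * d):
        k = j % d
        if j < d:    # pyramid on the low side of dimension k (v_k < 0.5)
            a, b = lo[k], min(hi[k], 0.5)
            h_low, h_high = 0.5 - b, 0.5 - a
        else:        # pyramid on the high side of dimension k (v_k >= 0.5)
            a, b = max(lo[k], 0.5), hi[k]
            h_low, h_high = a - 0.5, b - 0.5
        if a > b:
            continue  # the query box does not reach this pyramid
        y_low, y_high = j + h_low - eps, j + h_high + eps
        # Filter: one-dimensional range scan over the sorted keys.
        for _, p in index[bisect_left(keys, y_low):bisect_right(keys, y_high)]:
            # Refine: exact distance test against the query ball.
            if math.dist(p, q) <= r:
                result.append(p)
    return sorted(result)


points = [(0.1, 0.1), (0.3, 0.35), (0.4, 0.2), (0.8, 0.8), (0.5, 0.55), (0.25, 0.6)]
print(range_query(points, q=(0.3, 0.3), r=0.2))
```

The key intervals used here are looser than the tight bounds of the original pyramid technique, so more candidates reach the refine step, but no true result is missed because every point of a pyramid that lies inside the query box has its height inside the scanned interval.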

The above is only a preferred embodiment of the present invention. The protection scope of the invention is not limited to the above embodiment; all technical solutions that follow the idea of the invention fall within its protection scope. It should be noted that, for those of ordinary skill in the art, improvements and refinements made without departing from the principle of the invention shall also be regarded as falling within its protection scope.

Claims

1. A semantic intelligent information publishing and subscribing method based on P2P technology, characterized in that the steps are:

(1) Construct the system topology and set up a super node: use structured P2P technology to organize the topology of the multiple event agents in the P/S system into a Chord ring, and designate one super node on the ring to extract the attributes of data resources in the information network and construct the attribute space; data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space;

(2) Normalize the high-dimensional attribute space: on the super node, use the vector space model to represent the data information in the network as high-dimensional point vectors in the high-dimensional attribute space, so that all data information in the network is mapped into one high-dimensional attribute space and is expressed mathematically as a high-order matrix; use latent semantic indexing to remove the information attributes that have little relevance to information retrieval, approximating the original high-dimensional attribute space by an attribute subspace;

(3) Partition the high-dimensional data and build and maintain the global index tree: the super node SN divides the high-dimensional data in the attribute space into different data aggregation areas and assigns each area to a different node; the super node also maintains an index tree constructed from the information of all nodes, called the global index tree, which is used to distribute event information to nodes in the Chord ring and to determine the agent nodes that a subscription request needs to visit;

(4) Reduce the dimensionality of the high-dimensional data: each node organizes the data aggregation area assigned to it into a multi-dimensional hypercube in the high-dimensional attribute space, then uses the pyramid dimensionality reduction method to map the high-dimensional data objects to a one-dimensional data space, and indexes them with a B+ tree;

(5) Manage data objects with the i-Chord method: after the high-dimensional data set has been mapped to the one-dimensional data space, use the Chord protocol to organize and maintain the network data information; define an order-preserving (isotone) function that maps data in the one-dimensional data space to the Chord resource identifier space; each Chord node corresponds to one data aggregation area and stores and manages the high-dimensional data in that area, and each node maintains a routing table to facilitate fast lookup of information;

(6) Realize semantic-based intelligent information subscription and publication: each node in the system maintains a subscription table that records the subscription information semantically related to that node; when a subscriber sends a subscription request to the system, the agent nodes semantically related to the request are first determined by searching the global index tree, and the request is sent to those agent nodes; a subscription record is added to the subscription table of each agent node, registering the association between the subscription request and that node; the agent node then determines the precise search range in the high-dimensional attribute space from the subscription request and the subscription conditions, searches that range exactly for event information whose semantics match the subscription request, and returns it to the subscriber; when a publisher sends event information to the system, the host node semantically related to the event is first determined by searching the global index tree, the event information is sent to the host node, and the subscription table of the host node is consulted; if the event information semantically matches a subscription in the table, the event is actively pushed to the user.

2. The semantic intelligent information publishing and subscribing method based on P2P technology according to claim 1, characterized in that the specific steps of step (1) are:

(1.1) A P/S system may contain multiple event agents; the event agents are organized into an event agent network with a Chord ring structure, each event agent corresponding to one node of the network; each event agent stores data resources of the information network according to certain rules and keeps information about some of the other nodes;

(1.2) The most capable node in the Chord ring is selected as the super node SN; the super node SN periodically checks the capability of the other nodes in the ring and selects a candidate super node from among them, and the candidate super node backs up the important information held on SN;

(1.3) The super node SN is responsible for extracting the attributes of data resources of different forms and constructing the multi-dimensional attribute space; data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space.

3. The semantic intelligent information publishing and subscribing method based on P2P technology according to claim 1, characterized in that the specific steps of step (2) are:

(2.1) Following the vector space model, each item of high-dimensional data information in the network is described as an attribute vector, and the system organizes all high-dimensional information in the network into a matrix: if the information network contains d items of high-dimensional data information described by t attributes, it is described by a $t \times d$ matrix A in which each column represents one item of high-dimensional information; the matrix element $a_{ij}$ is the value of attribute i in data object j and represents the importance of attribute i in object j; if high-dimensional information item j does not have attribute i, $a_{ij}$ is 0;

(2.2) Note that most elements of the matrix are 0: for the purposes of information retrieval, most of the attributes describing an item of high-dimensional information are irrelevant, so the original matrix is replaced by a low-rank approximation; suppose the set of high-dimensional information in the network is represented by the matrix A, whose rank is r; the singular value decomposition factors A into the product of three matrices:

$$A = U \Sigma V^{T}$$

where $U = (u_1, u_2, \ldots, u_r)$ is a $t \times r$ matrix, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_r)$ is an $r \times r$ diagonal matrix, $V = (v_1, \ldots, v_r)$ is a $d \times r$ matrix, and the $\sigma_i$ are the singular values of A, with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r$;

(2.3) Only the l largest singular values of the matrix are retained and the other singular values are discarded, so that the matrix A of rank r is approximated by the matrix $A_l$ of rank l:

$$A_l = U_l \Sigma_l V_l^{T}$$

where $U_l = (u_1, u_2, \ldots, u_l)$, $\Sigma_l = \mathrm{diag}(\sigma_1, \ldots, \sigma_l)$ and $V_l = (v_1, \ldots, v_l)$; the rows of $V_l \Sigma_l$ are the semantic vectors of the high-dimensional information.
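The rank-l approximation in steps (2.2) and (2.3) can be sketched with NumPy's singular value decomposition. The function name `lsi_semantic_vectors` and the toy matrix in the usage note are illustrative assumptions; the claim does not prescribe a particular numerical library.

```python
import numpy as np


def lsi_semantic_vectors(A, l):
    """Return (A_l, semantic), where A_l is the rank-l approximation of A
    and the rows of semantic are the rows of V_l * Sigma_l."""
    # Thin SVD: U is t x k, s holds the singular values in descending order,
    # Vt is k x d, with k = min(t, d).
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_l, s_l, Vt_l = U[:, :l], s[:l], Vt[:l, :]
    A_l = U_l @ np.diag(s_l) @ Vt_l
    # Scaling column k of V_l by sigma_k gives V_l * Sigma_l, shape (d, l).
    semantic = Vt_l.T * s_l
    return A_l, semantic
```

For example, with a 4 x 3 attribute-by-item matrix of rank 2, keeping l = 2 reproduces the matrix, while l = 1 yields a rank-1 approximation; items with identical attribute columns receive identical semantic vectors.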

4. The semantic intelligent information publishing and subscribing method based on P2P technology according to claim 1, characterized in that the specific steps of step (3) are:

(3.1) Data resources in the information network are abstracted as points or vectors in the high-dimensional attribute space; on the principle that data resources that are close to each other in the attribute space have similar semantics, the super node SN clusters and partitions the high-dimensional data distributed in the multi-dimensional attribute space, dividing the data into multiple mutually disjoint data aggregation areas and assigning each area to a different node;

(3.2) In addition to its own routing table, the super node SN maintains an index tree built from the resource identifier range information of all nodes on the ring, called the global index GI;

(3.2.1) Quickly distributing event information to nodes: each node in the ring is responsible for one data aggregation area, and all data information in that area is assigned to that node; when an item of event information requests to join, GI is queried to determine which data aggregation area the event information belongs to, which determines its host node; then, using the identifier of the host node as the lookup key, the event information is delivered to the host node by the Chord routing protocol;

(3.2.2) Determining the nodes that a subscription request needs to visit: when a subscription request is submitted at some node, the request is first sent to SN, which searches GI to determine which nodes are responsible for data aggregation areas that intersect the semantic space of the subscription request, and returns the identifiers of those nodes; the returned nodes are called agent nodes; then, using the identifiers of the agent nodes as lookup keys, the subscription request is routed to those nodes for further semantic matching.
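The two uses of the global index in steps (3.2.1) and (3.2.2) can be sketched over resource identifier ranges. The sketch makes simplifying assumptions: a sorted list searched with `bisect` stands in for the index tree, each node is taken to own the identifier interval between its predecessor's identifier (exclusive) and its own (inclusive), a subscription is reduced to one identifier range [lo, hi], and the names `GlobalIndex`, `host_node` and `agent_nodes` are hypothetical.

```python
import bisect


class GlobalIndex:
    """Stand-in for the global index GI held by the super node."""

    def __init__(self, node_ids):
        self.ids = sorted(node_ids)

    def host_node(self, key):
        """Node whose identifier range contains key (first id >= key, wrapping)."""
        i = bisect.bisect_left(self.ids, key)
        return self.ids[i % len(self.ids)]

    def agent_nodes(self, lo, hi):
        """Nodes whose identifier ranges intersect [lo, hi], assuming lo <= hi."""
        first = bisect.bisect_left(self.ids, lo)
        last = bisect.bisect_left(self.ids, hi)
        hits = [self.ids[i % len(self.ids)] for i in range(first, last + 1)]
        # Remove a duplicate caused by wrap-around while keeping order.
        return list(dict.fromkeys(hits))
```

With nodes 10, 20 and 30, an event with identifier 15 is hosted by node 20, and a subscription covering identifiers 15 to 25 must visit nodes 20 and 30.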

5. The semantic intelligent information publishing and subscribing method based on P2P technology according to claim 1, characterized in that the specific steps of step (4) are:

(4.1) Each data aggregation area is organized into a d-dimensional hypercube in the high-dimensional data space, and the hypercube is normalized so that the side length in every dimension is 1 and the center of the hypercube is (0.5, 0.5, ..., 0.5); then, taking the hypercube center as the apex and the (d-1)-dimensional hyperplanes bounding the data aggregation area as bases, each d-dimensional data aggregation area is divided into 2d pyramids;

(4.2) Each pyramid is divided into several slices parallel to its base, each slice corresponding to one B+ tree data page; the high-dimensional data in the data aggregation area are then mapped to one-dimensional values according to the distance from each data point to the pyramid base, and the reduced data are organized and managed by a B+ tree; the dimensionality reduction formula is:

$$y_v = i \cdot c + (j + h_v) = i \cdot c + (j + |0.5 - h_v'|)$$

where i is the number of the hypercube containing the high-dimensional data object v, j is the pyramid number, $h_v'$ is the distance from v to the base plane of its pyramid, and $h_v = |0.5 - h_v'|$ is the distance from v to the plane containing the pyramid apex; after dimensionality reduction, the one-dimensional values of the data objects in each hypercube are confined to the interval $[i \cdot c, (i+1) \cdot c]$, where c is a sufficiently large constant that guarantees the data objects in each data partition have index keys different from those of other partitions.

6. The semantic intelligent information publishing and subscribing method based on P2P technology according to claim 1, characterized in that the specific steps of step (5) are:

(5.1) Suppose the size of the Chord resource identifier space is $2^m$; an order-preserving (isotone) function h maps the one-dimensional values obtained by dimensionality reduction, in order, into the interval $[0, 2^m)$; that is, for a high-dimensional data point v in the attribute space, the i-Chord resource identifier of v is:

$$\mathrm{key}_v = \mathrm{ichord}(v) = h(y_v) = h(i \cdot c + (j + h_v)) \in [0, 2^m)$$

(5.2) each node N on the Chord ring _iBe responsible for the high dimensional data information of a data accumulation area of storage administration, establish N _iNode identifier Nkey _i, in this accumulation area, the resource identifier of high dimensional data information is distributed in interval (Nkey _I-1, Nkey _i] in;

(5.3) Each node $N_i$ maintains a routing table, that is, a pointer table that points to other nodes on the ring; the routing table has m entries, of which the k-th entry ($1 \le k \le m$) is the first node on the Chord ring whose identifier is equal to or greater than $(\mathrm{Nkey}_i + 2^{k-1}) \bmod 2^m$, i.e. $\mathrm{successor}((\mathrm{Nkey}_i + 2^{k-1}) \bmod 2^m)$; when any node receives a request whose keyword is key, it first checks whether its own identifier equals key and, if so, returns directly; otherwise the node searches its routing table, finds the node in the table with the largest identifier that does not exceed key, and forwards the query request to that node; this process is repeated until the request reaches a node $N_k$ such that key lies between $N_k$ and its successor $N_{k+1}$, at which point $N_k$ reports its successor $N_{k+1}$ as the reply to the request;

(5.4) When the high-dimensional data in a data aggregation area become too dense for the node to manage effectively, the area is split into two or more new areas, and idle nodes on the Chord ring are selected accordingly to share the load of the original node; when the data in adjacent data aggregation areas are so sparse that node resources are wasted, the areas are merged: one of the two nodes is selected as the node for the new data aggregation area, and the other is set to be an idle node.
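Steps (5.1) to (5.3) can be sketched as a small single-process simulation. The sketch makes several assumptions: the isotone function is taken to be a linear scaling (`ichord_key`, with a hypothetical upper bound `y_max` on the one-dimensional values), the lookup follows the standard Chord rule of forwarding to the closest preceding routing-table entry, and all nodes are visible in one sorted list instead of being reached over a network.

```python
import bisect


def ichord_key(y, y_max, m):
    """Hypothetical isotone map: scale y in [0, y_max) linearly into [0, 2**m)."""
    return min(int(y / y_max * 2**m), 2**m - 1)


def successor(ids, ident):
    """First node identifier >= ident in the sorted list ids, wrapping around."""
    i = bisect.bisect_left(ids, ident)
    return ids[i % len(ids)]


def finger_table(ids, n, m):
    """Entry k (1 <= k <= m) is successor((n + 2**(k-1)) mod 2**m)."""
    return [successor(ids, (n + 2**(k - 1)) % 2**m) for k in range(1, m + 1)]


def in_open(x, a, b):
    """True if x lies in the ring interval (a, b)."""
    return a < x < b if a < b else (x > a or x < b)


def in_half_open(x, a, b):
    """True if x lies in the ring interval (a, b]."""
    return a < x <= b if a < b else (x > a or x <= b)


def lookup(ids, start, key, m):
    """Return (responsible node, list of nodes visited) for key."""
    n, hops = start, [start]
    while True:
        if key == n:
            return n, hops
        succ = successor(ids, (n + 1) % 2**m)
        if in_half_open(key, n, succ):
            return succ, hops
        # Forward to the farthest routing-table entry that still precedes key.
        fingers = reversed(finger_table(ids, n, m))
        n = next((f for f in fingers if in_open(f, n, key)), succ)
        hops.append(n)
```

With m = 3 and nodes 0, 1 and 3, node 0 has the routing table [1, 3, 0], and a lookup for key 6 started at node 0 is forwarded to node 3, which reports node 0 as the responsible node.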

7. The semantic intelligent information publishing and subscribing method based on P2P technology according to claim 1, characterized in that the specific steps of step (6) are:

(6.1) Each node maintains a subscription table that records all subscription information related to that node; when event information arrives at a node, the node's subscription table is queried, and if the event information semantically matches a subscription in the table, the event is actively pushed to the user;

(6.2) When a subscriber submits subscription request information to a node $N_i$ on the ring, that node searches the global search tree to determine the agent nodes semantically related to the subscription request and routes the request to them, and the similarity algorithm based on subscription semantics is executed on the agent nodes; if event information whose semantics match the subscription request is found, it is passed back to $N_i$; at the same time, each agent node stores the subscription request, the subscription conditions, the routing path and the search range information in its subscription table;

(6.3) When a publisher publishes event information to a node $N_j$ on the ring, that node searches the global search tree to determine the host node semantically related to the event information, and the event information is routed to and stored on the host node according to the Chord routing protocol; at the same time, the host node semantically matches the event attribute information against the subscription requests in its subscription table and, if a match succeeds, pushes the event information directly to the subscriber;

(6.4) When a subscriber issues an unsubscribe message to a node on the ring, that node searches the global search tree to determine the agent nodes semantically related to the subscription information and routes the request to the corresponding agent nodes, where the corresponding subscription information is found in the subscription table and deleted.