KR20060090810A

KR20060090810A - Group-to-group communication over a single connection and fault tolerant symmetric multi-computing system

Info

Publication number: KR20060090810A
Application number: KR1020067005585A
Authority: KR
Inventors: 안니쿠마 도미닉
Original assignee: 트란세임 테크놀로지스
Priority date: 2003-09-22
Filing date: 2004-09-21
Publication date: 2006-08-16
Also published as: EP1668527A1; CA2538084A1; JP2007520093A; AU2004277204A1; EP1668527A4; WO2005031588A1

Abstract

A system is described that enables reliable, ordered data communication between two sets of nodes with atomic multi-point delivery and multi-point transmission, for example by extending TCP/IP. When data must be delivered to multiple nodes, the delivery is performed atomically. A system for fault-tolerant symmetric multi-computing using a group of nodes is also described. A symmetric group of nodes, networked using a reliable, ordered, and atomic group-to-group TCP communication system, provides fault tolerance and a single system image to client applications. Communication between the client and the group is standards based in that any standard TCP/IP endpoint can communicate seamlessly with the group. The processing load is shared among the group of nodes by transparently distributing tasks to application segments. The system is fault tolerant in that, if a node fails, any remaining replicas continue service without disrupting the service or the connection.

Description

GROUP-TO-GROUP COMMUNICATION OVER A SINGLE CONNECTION AND FAULT TOLERANT SYMMETRIC MULTI-COMPUTING SYSTEM

The present invention relates to network communication between n-to-n points in a network, where n is any integer value.

For flexible, optimal resource utilization and reduced management cost, industry demands solutions based on a "utility computing" model, in which processing power and storage capacity can be added as needed and resources are provisioned dynamically to meet changing demand. Conventional mainframe solutions are out of reach for ordinary enterprises because of their high cost. Many high-performance, low-cost "blade servers" and networking technologies are available on the market. However, no solution yet exists that aggregates these resources efficiently and flexibly and that works across a wide range of applications to meet utility computing needs.

The client-server paradigm is widespread in industry because of its simplicity: a client makes a request and a server responds. The communication protocol commonly used between clients and servers in a communication network to enable this paradigm is Transmission Control Protocol/Internet Protocol, or simply "TCP/IP". In a communication network, a client (or client system or machine) views a server (or server system or machine) as a single logical host or entity. A single physical server often cannot effectively serve the needs of a large number of clients. Moreover, a failed server leaves its clients unable to operate.

To address the shortcomings of a single physical server, cluster configurations were developed in which many servers, operating in parallel or as a grid behind a load balancer, provide service to clients. Such configurations offer potential benefits such as fault tolerance, low cost, efficiency, and flexibility comparable to a mainframe. However, these and other benefits have not been widely realized because of inherent limitations and the lack of a standard platform on which most applications can be built.

In addition to physical clustering, conventional software systems have attempted to introduce clustering at the operating-system level and at the application level. However, a drawback of such software configurations, including cases where clustering is built into the application, is that they restrict how those applications can be used. Similarly, operating-system-level clustering is attractive, but existing efforts in this area have not yet succeeded because of the many abstractions that must be virtualized.

Compared with clustering of physical servers, software applications, and operating systems, network-level clustering does not suffer from these problems and offers several attractive benefits. For example, the ability to address a cluster of server nodes as a single virtual entity is a useful requirement in client-server programming. In addition, the ability to easily create virtual clusters from a pool of nodes adds better utilization and mainframe-class flexibility.

A network-level clustering platform must be generic and usable by a wide range of applications, from web servers, storage servers, and database servers to scientific grid computing. Such a cluster must aggregate the compute power and capacity of its nodes so that applications scale smoothly. Existing applications must be able to run with minimal or no changes. However, conventional network-level clusters have had only limited success.

Whatever success the Symmetric Multi-Processor (SMP) architecture has had can be attributed to the simplicity of the bus, which makes processor and memory location transparent to applications. In clustering, too, the simplicity of a virtual bus connecting the server nodes can provide node-location transparency and node-identity transparency. However, conventional systems do not allow client applications to tap the bus directly for efficiency. Likewise, buses based on User Datagram Protocol ("UDP") packet broadcast and multicast lack delivery guarantees, which pushes clustering up to the application level.

The most widely used protocol in industry that offers delivery guarantees is TCP/IP. TCP's data delivery guarantee, ordered delivery guarantee, and ubiquity make it particularly desirable to virtualize. However, TCP's support for only two endpoints per connection limits its potential. An asymmetric organization of processing elements or nodes with pre-assigned tasks, such as distributing incoming requests across a cluster, is inherently inflexible and difficult to manage and load balance. Asymmetric nodes are often single points of failure and bottlenecks. For multi-computing (MC) to succeed, a symmetric organization is required, as opposed to an asymmetric node organization.

Another problem with asymmetry in a client-server environment is latency. Switches and routers employ specialized hardware to reduce the latency of data passing through them. When data must pass through a node's UDP/TCP/IP stack, copying and processing add significant latency. Therefore, to achieve optimal performance, a system must avoid passing data through intervening nodes in an asymmetric organization. At the same time, if a server node's CPUs must handle a large amount of network traffic, application throughput and processing suffer. Conventional systems therefore have to use hardware accelerators, such as specialized integrated circuit chips or adapter cards, to reduce latency at the endpoints and improve application performance. This increases the cost and complexity of the system.

Low-cost fault tolerance is in high demand in many industries. Solutions that use a fixed number of redundant hardware components suffer from a lack of flexibility, an inability to be repaired easily, and high cost due to complexity. Today's solutions offer high availability by quickly switching service to a standby server after a fault occurs. Because the standby system is passive, its resources go unused, which raises cost. In a simple yet powerful form of fault tolerance, replication, service over a connection continues without disruption when a node fails.

In a traditional cluster, an active node performs the tasks and passive nodes are updated with the changes later. In many cases there are fewer updates than other tasks such as queries. Machines are best utilized when the load is shared among all replicas while updates are reflected on every replica. Replica updates must be synchronous and must be applied in the same order for consistency. With atomic delivery, data is guaranteed to have been delivered to all target endpoints before the client receives the TCP ACK indicating receipt of the data. If a replica fails, the remaining replicas can continue service, avoiding the connection disruption that would otherwise undermine fault tolerance. Non-atomic replication is of limited use. Specifically, when replicas of a service receive a client request, each produces a response. Because the client views the server as a single entity, it must be guaranteed that only one instance of the response is sent back to the client. Similarly, when multiple client replicas attempt to send the same request, it must be guaranteed that only one instance is sent to the server. Conventional systems often fail to provide this atomicity, and so they lack both usefulness and fault tolerance that avoids connection disruption.

Another problem with conventional clustering systems is load balancing. As with any system, the ability to balance load evenly among nodes is necessary for optimal application performance. However, conventional clustering systems provide only limited support for standard load-balancing schemes such as round-robin, content hashed, and weighted priority. Moreover, many conventional clustering systems cannot support the implementation of application-specific load-balancing schemes.

Many services have load levels in a cluster that vary significantly with time. Running processes may need to be migrated so that an active server can be retired. Conventional cluster systems lack support for adding nodes to or removing nodes from the cluster easily and without disrupting service.

There have been many attempts to address network-level virtualization, but each still has significant shortcomings. For example, one conventional solution that is popular in the industry is a device for balancing load across a cluster of web servers. This load-balancing device, which is also disclosed in U.S. Patent Nos. 6,006,264 and 6,449,647, switches incoming client TCP connections to one server in a pool of servers. A conventional server for this process is Microsoft's Network Load Balancer software, which has a switch or router broadcast or multicast client packets to all nodes. However, once a connection is mapped, the same server handles all client requests for the life of the TCP connection, in the conventional one-to-one relationship.

The problem with conventional systems such as these is that, when a service consists of different types of tasks running on different nodes, they cannot provide a complete solution: any mapped server that does not run all of the services a client requests over a connection causes a service failure. This limits the use of such systems to web-page serving, in which the single task of serving pages is replicated across many nodes. In addition, a mapping device implemented external to the servers introduces a bottleneck and a single point of failure. Further, because a connection has only two endpoints, replication is not supported. Updates made over such a conventional TCP connection are therefore not reflected on the replicas, which considerably limits usability.

To address the shortcomings of the conventional systems above, other conventional systems have attempted to distribute client requests arriving over a connection to nodes serving different tasks. Ravi Kokku et al. disclosed one such system in their paper "Half Pipe Anchoring." Half pipe anchoring is based on backend forwarding. In this scheme, when a client request reaches the cluster of servers, a designated server accepts the request and, after examining the data, forwards it to an optimal server. The optimal server, given the connection state information, then responds directly to the client after replacing the address with the original target address. Here a single TCP endpoint is dynamically mapped to nodes in order to distribute requests. This scheme is an example of an "asymmetric" approach, in which an intervening node intercepts the data and distributes it based on its content.

Another conventional system built on an asymmetric organization is described in two white papers from EMIC Networks. In this system, a designated node intercepts and captures incoming data and then reliably delivers it to multiple nodes using proprietary protocols. At times only one node is permitted to transmit data; data is sent first to a designated server, which then transmits it to the client. Here, too, a single endpoint is dynamically mapped, and the TCP connection terminates at the intervening node where replication is initiated. This scheme is another example of an "asymmetric" approach, in which an intervening node intercepts the data and replicates it.

Both schemes described above retain TCP's definition of two endpoints, even though the endpoints may be mapped to different nodes. Replication in these conventional schemes is performed at the application level using proprietary protocols. Further, these schemes employ an asymmetric node organization in which selected nodes act as application-level routers that distribute requests. Such asymmetry limits scalability, as noted by Aaron et al. in "Scalable Content Aware Request Distribution in Cluster Based Network Servers". The limitations include a single point of failure, data throughput bottlenecks, suboptimal performance due to higher latency, and a lack of location transparency.

Therefore, there is a need for a symmetric system and a method that use TCP's current definition of two endpoints to provide m-to-n connections, where m and n are any integers and may be equal or different.

The above-mentioned and other requirements are met by extending TCP's current scope of host-to-host communication to group-to-group communication, and more specifically by extending the current definition of the two endpoints of a connection to two groups of endpoints spanning symmetrically organized nodes. Each endpoint is entitled to receive and transmit independently and in parallel while TCP's ordered delivery is maintained. Data is delivered to the entire group or to a subset of it, depending on the configuration. Only the necessary target endpoints are required to have received the data before TCP's ACK is sent to the peer group.
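The acknowledgment rule in the preceding paragraph can be illustrated with a minimal sketch. It is not the patented implementation: the class `GroupEndpoint`, the function `deliver_then_ack`, and the `alive` flag are hypothetical names introduced here, and sequence-number wraparound is ignored.

```python
from dataclasses import dataclass, field

@dataclass
class GroupEndpoint:
    """Hypothetical stand-in for the receive side of one group node."""
    name: str
    received: dict = field(default_factory=dict)  # seq -> payload
    alive: bool = True

    def deliver(self, seq: int, payload: bytes) -> bool:
        # A failed node cannot confirm receipt.
        if not self.alive:
            return False
        # Keyed by sequence number, so a retransmission is idempotent.
        self.received[seq] = payload
        return True

def deliver_then_ack(seq, payload, required):
    """Return the ACK number only if every required endpoint has the data."""
    confirmed = [ep.deliver(seq, payload) for ep in required]
    if all(confirmed):
        return seq + len(payload)  # cumulative ACK: next expected byte
    return None                    # withhold the ACK so the peer retransmits
```

Passing only a subset of the group as `required` corresponds to the configuration in which data is delivered to a subset rather than to the entire group.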

In one embodiment, the present invention allows a cluster of nodes to be addressed as a single virtual entity with one or more IP addresses. Communication between the client group and the server group is strictly standards based in that any standard TCP/IP endpoint can communicate seamlessly with the group. Data is delivered atomically to the endpoints, which terminate at the symmetrically organized nodes of the group.

Filters installed at the connection endpoints filter out arriving data that is irrelevant to the local application segments. Data delivery to application segments is controlled dynamically by filters that are installed and configured as needed. The filters can optionally place incoming data directly into the target application's memory without intervening copies.
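As a rough illustration of endpoint filtering, the sketch below installs one predicate per application segment and passes a request only to the segments whose filter accepts it. The prefix-matching rule and the names `make_prefix_filter`, `Endpoint`, `install`, and `route` are assumptions made for this example; the source does not specify what a filter matches on.

```python
def make_prefix_filter(prefixes):
    """Build a filter accepting requests that start with any of `prefixes`."""
    accepted = tuple(prefixes)

    def accept(request: bytes) -> bool:
        return request.startswith(accepted)

    return accept

class Endpoint:
    """Hypothetical connection endpoint on one group node."""

    def __init__(self):
        self.filters = {}  # segment name -> predicate

    def install(self, segment, predicate):
        # Installing again under the same name reconfigures the filter.
        self.filters[segment] = predicate

    def route(self, request):
        """Return the local segments that should see this request."""
        return [seg for seg, ok in self.filters.items() if ok(request)]
```

Reinstalling a filter at run time changes which requests a segment receives without touching the segment's code, which is the sense in which delivery is controlled dynamically.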

Input and output over the connection are decoupled: nodes receive and transmit independently of each other and in parallel. All transmissions are sequenced according to the TCP specification, and transmission control among nodes is ordered by round robin among the group nodes, by round robin among the transmit requesters, or by an application-specific scheme. Nodes retransmit in parallel and need no additional synchronization among themselves to do so.
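One way to picture round robin among transmit requesters is an arbiter that rotates through the nodes, skips those with nothing to send, and hands each granted node a contiguous range of sequence numbers. This is a sketch under stated assumptions: `SendArbiter` and its methods are hypothetical, and 32-bit sequence wraparound is left out.

```python
from collections import deque

class SendArbiter:
    """Grants the right to transmit to one requesting node at a time."""

    def __init__(self, nodes, start_seq=0):
        self.ring = deque(nodes)                    # fixed rotation order
        self.pending = {n: deque() for n in nodes}  # queued request sizes
        self.next_seq = start_seq

    def request(self, node, nbytes):
        self.pending[node].append(nbytes)

    def grant(self):
        """Return (node, first_seq, end_seq) for the next requester, or None."""
        for _ in range(len(self.ring)):
            node = self.ring[0]
            self.ring.rotate(-1)  # the inspected node moves to the back
            if self.pending[node]:
                nbytes = self.pending[node].popleft()
                first = self.next_seq
                self.next_seq += nbytes
                return node, first, self.next_seq
        return None
```

Because each grant covers a disjoint sequence range, the combined output of all nodes still forms a single ordered byte stream from the peer's point of view.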

For scalability and load sharing, application functions are distributed among the group nodes. To achieve this, an application is logically divided into segments, each of which runs a subset of the application's functions. Incoming requests arriving on a TCP connection are delivered to the segments residing across the group, which effectively distributes the load. Because only a certain set of requests is delivered to each application instance, logical segmentation can be achieved without changing application code. Application segments can migrate dynamically from node to node without interrupting the connection.

Further, a node can communicate with the other nodes of a group by creating a connection to the virtual entity that the group represents. This provides all of the above features for communication among group nodes.

The system is fault tolerant in that, if a node running an application fails, the remaining application replicas in the group continue service without disrupting the connection or the service. Nodes are added to or retired from the group dynamically, in a manner transparent to applications, to maintain a given quality of service. To balance load among nodes or to retire a node, the system transparently migrates active services and redistributes tasks within the group.

An application running on the nodes of a group can view and operate on the rest of the group as a single virtual entity, which simplifies client/server application programming and resource management. One embodiment of the present invention allows an application to be divided into one or more segments that run independently across the group nodes, in a manner transparent to the application and without requiring code changes.

The system shares the processing load among a group of nodes by dynamically and transparently distributing tasks arriving over a connection to the various application segments. A single request arriving over a connection can be serviced by multiple segments working together, which allows finer distribution of processing or computation among the nodes. The system allows multiple instances of a segment to run in parallel. Requests are delivered to instances selected by schemes such as round robin, least loaded node, affinity based, and content hashing.
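Three of the selection schemes named above can be sketched as interchangeable functions with a common signature. The signature `(instances, request, load)` and the use of SHA-256 for content hashing are choices made for this example, not details from the source.

```python
import hashlib
from itertools import count

def round_robin():
    """Return a picker that cycles through the instances in order."""
    counter = count()

    def pick(instances, request, load):
        return instances[next(counter) % len(instances)]

    return pick

def least_loaded(instances, request, load):
    """Pick the instance with the smallest reported load (first on ties)."""
    return min(instances, key=lambda inst: load[inst])

def content_hash(instances, request, load):
    """Pick an instance from a stable hash of the request bytes."""
    digest = hashlib.sha256(request).digest()
    return instances[int.from_bytes(digest[:8], "big") % len(instances)]
```

Content hashing sends identical requests to the same instance as long as the instance list is unchanged, which is useful when an instance caches state for a given key.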

For fault tolerance, incoming requests are delivered atomically to multiple segment instances. The results are optionally compared, and a single instance of the result is output. When a segment or node fails, the remaining segment instances continue service without disrupting the connection.
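The step of reducing several replica results to one output can be sketched as follows. The source only says that results are optionally compared; the strict-majority rule used here, and the function name `single_response`, are assumptions for illustration.

```python
from collections import Counter

def single_response(responses, compare=True):
    """Collapse per-replica responses into the one instance that is sent.

    `responses` maps replica name -> response bytes, or None if the
    replica failed before producing a response.
    """
    live = [r for r in responses.values() if r is not None]
    if not live:
        raise RuntimeError("no surviving replica produced a response")
    if not compare:
        return live[0]
    # Assumed comparison policy: require a strict majority of live replicas.
    winner, votes = Counter(live).most_common(1)[0]
    if votes <= len(live) // 2:
        raise ValueError("replicas disagree; no majority response")
    return winner
```

A failed replica simply drops out of the vote, so the surviving instances still yield exactly one response to the client.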

The system allows flexible external management by distributing tasks at a fine grain through control and configuration of the filters at the connection endpoints. When a node is retired, its load responsibilities are migrated to other nodes selected by a scheme such as least loaded, round robin, or an application-specific scheme. The system automatically and dynamically adds resources from a pool to the group to meet changing demand. Likewise, nodes are retired and provisioned automatically and dynamically. By adding and retiring resources automatically, the system maintains a given quality of service.
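Retiring a node under a least-loaded policy might look like the sketch below. Load is approximated by the number of segments a node hosts, which is an assumption of this example, as are the name `retire_node` and the dictionary layout.

```python
def retire_node(assignments, node):
    """Move `node`'s segments to the least-loaded remaining nodes.

    `assignments` maps node name -> list of segment names and is updated
    in place. Returns the list of (segment, new_node) moves.
    """
    remaining = [n for n in assignments if n != node]
    if not remaining and assignments[node]:
        raise RuntimeError("cannot retire the last node hosting segments")
    moves = []
    for seg in assignments.pop(node):
        # Assumed load metric: number of segments currently hosted.
        target = min(remaining, key=lambda n: len(assignments[n]))
        assignments[target].append(seg)
        moves.append((seg, target))
    return moves
```

In a fuller system each move would be accompanied by reconfiguring the endpoint filters on the target node so that it starts receiving the migrated segment's requests.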

The features and advantages described in this specification are not all-inclusive; in particular, many additional features and advantages will be apparent to one of ordinary skill in the art in view of the drawings, specification, and claims. Moreover, the language used in the specification has been selected principally for readability and instructional purposes, and not to limit the inventive subject matter.

Other advantages and features of the present invention will become more readily apparent from the following detailed description of the invention and the appended claims, taken together with the drawings.

도 1a 는 본 발명의 일실시예에 따라 구축된 통신시스템의 일반화된 다이어그램이다.1A is a generalized diagram of a communication system constructed in accordance with one embodiment of the present invention.

도 1b는 본 발명의 일실시예에 따른 통신시스템의 블록 다이어그램을 나타낸다.1B illustrates a block diagram of a communication system according to an embodiment of the present invention.

도 1c는 본 발명의 일실시예에 따른 통신시스템의 구현을 위한 상위 레벨 컴포넌트의 구성을 나타낸 블록 다이어그램을 나타낸다.Figure 1c shows a block diagram showing the configuration of a high-level component for the implementation of the communication system according to an embodiment of the present invention.

도 1d는 본 발명의 일실시예에 따른 통신시스템의 최적 성능을 위한 하위 레벨 구성 컴포넌트 구현의 블록 다이어그램을 나타낸다.1D illustrates a block diagram of a low level configuration component implementation for optimal performance of a communication system in accordance with an embodiment of the present invention.

도 2는 본 발명의 일실시예에 따른 통신 시스템의 구현을 위한 상위 레벨 컴포넌트의 하드웨어 구성의 블록 다이어그램을 나타낸다.2 shows a block diagram of a hardware configuration of a high level component for implementation of a communication system according to an embodiment of the present invention.

도 3a는 본 발명의 일실시예에 따른 접속시 입력 데이터의 처리 경로에 대한 순서도를 나타낸다.3A is a flowchart of a processing path of input data when connected according to an embodiment of the present invention.

도 3b는 본 발명의 일실시예에 따른 접속시 입력데이터의 처리 경로에 대한 도 3a의 순서도의 나머지를 나타낸다.FIG. 3B illustrates the remainder of the flowchart of FIG. 3A for a processing path of input data upon connection according to an embodiment of the present invention.

도 3c는 본 발명의 일실시예에 따른 접속시 들어오는 데이터의 필터링에 대한 순서도를 나타낸다.3C is a flowchart illustrating filtering of incoming data upon connection according to an embodiment of the present invention.

도 4는 본 발명의 일실시예에 따른 노드 간의 공평을 위해 일회 당 최대 전송 크기를 제한하면서 접속을 통해 데이터를 전송하는 순서도를 나타낸다.4 is a flowchart illustrating data transmission through a connection while limiting a maximum transmission size per one time for fairness between nodes according to an embodiment of the present invention.

도 5a는 본 발명의 일실시예에 따른 액티브 발송헤드 할당을 위한 요청/그랜트 스키마의 블록 다이어그램을 나타낸다.5A illustrates a block diagram of a request / grant schema for active dispatch head allocation in accordance with an embodiment of the present invention.

도 5b는 본 발명의 일실시예에 따른 발송헤드에 대한 요청을 처리하기 위한 순서도를 나타낸다.5B illustrates a flowchart for processing a request for a dispatch head according to an embodiment of the present invention.

도 6은 본 발명의 일실시예에 따른 피어 그룹 윈도우 광고에 대한 가상 윈도우 스키마에 대해 기술한 블록 다이어그램을 나타낸다. 6 is a block diagram illustrating a virtual window schema for a peer group window advertisement according to an embodiment of the present invention.

도 7a는 본 발명의 일실시예에 따른 통신시스템에 대한 컴퓨팅 시스템의 블록 다이어그램을 나타낸다.7A illustrates a block diagram of a computing system for a communication system in accordance with an embodiment of the present invention.

도 7b는 본 발명의 일실시예에 따른 메인 프로세서의 부하의 경감을 제공하는 통신시스템에 대한 컴퓨팅 시스템의 불록 다이어그램을 나타낸다.7B illustrates a block diagram of a computing system for a communication system that provides lightening of the load of the main processor in accordance with one embodiment of the present invention.

도 7c는 본 발명의 일실시예에 따른 전용 하드웨어/가속기 칩에 대한 메인 프로세서의 부하의 경감을 제공하는 통신시스템에 대한 컴퓨팅 시스템의 블록 다이어그램을 나타낸다.7C illustrates a block diagram of a computing system for a communication system that provides lightening of the load of the main processor on dedicated hardware / accelerator chips in accordance with one embodiment of the present invention.

도 8은 본 발명의 일실시예에 따른 통신 시스템의 다른 일반화된 다이어그램 을 나타낸다.8 shows another generalized diagram of a communication system according to an embodiment of the present invention.

도 9는 본 발명의 일실시예에 따른 클라이언트 그룹과 서버 그룹간의 데이터 전달과 확인 응답 스키마를 나타낸다.9 illustrates a data transfer and acknowledgment scheme between a client group and a server group according to an embodiment of the present invention.

FIG. 10 illustrates a logical view of an implementation according to one embodiment of the present invention.

FIG. 11 is a generalized diagram of a symmetric multi-computer system with fault tolerance, load distribution, load sharing, and a single system image, according to one embodiment of the present invention.

The present invention includes a system that enables reliable and ordered data communication between two sets of nodes with atomic multi-point delivery and multi-point transmission, for example by extending TCP/IP. The invention extends TCP's notion of reliable host-to-host communication to include symmetric group-to-group communication while preserving the TCP specification for data traffic between the groups. Furthermore, the invention extends the definition of the two endpoints of a TCP connection to include at least two groups of endpoints that communicate over a single connection.

In the present invention, the endpoints of a connection terminate at group nodes. When data must be delivered to multiple nodes, the delivery is performed atomically. Optionally, for data originating from multiple nodes, a single instance of the data is transmitted. As described below, each endpoint consists of a receiveHead and a sendHead that operate independently.

In one embodiment of the invention, a node includes a connection on a network, for example a data processing device such as a general-purpose computer, or another device having a microprocessor or software that provides data processing capability for network-related communication. A group refers to a collection of one or more symmetrically configured nodes. An application segment refers to an application, or a segment of an application, that serves in conjunction with other application segments running on various group nodes. An application consists of one or more application segments, and an application segment consists of one or more processes.

A sendHead refers to the transmitting end of a TCP connection; it controls data transmission and maintains the transmission state at the node. A receiveHead refers to the receiving end of a TCP connection; it controls data reception on the connection and maintains the data reception state at the node. The active sendHead refers to the sendHead designated as holding the most recent transmission state information, for example the sequence number of the data and the sequence number of the last acknowledgment.

A bus controller refers to a node that controls and/or coordinates the connection establishment and termination processes with the peer group. A signal refers to a message exchanged within a node group over a logical bus. If the source and the target of a signal are within the same node, the signal is not sent, although the effect is the same as if the node had received it internally. An endpoint of a connection refers to a stack, such as TCP, that exchanges data with its peer in sequential order based on a pair of pre-agreed sequence numbers. An endpoint of a connection has an output data stream origination point and an input data stream termination point. A request refers to a select segment of the incoming data stream, for example a client request for service.

System Overview

FIG. 1A illustrates a communication system according to one embodiment of the present invention. The communication system includes a TCP connection 130 that couples a first group 120 and a second group 160. For example, the first group 120 has first, second, and third member nodes 100a, 100b, 100c, and the second group 160 has first and second member nodes 150x, 150y. The member nodes of both groups are configured symmetrically in that each node has equal access to the TCP connection and operates independently and in parallel. A first data stream 110 and a second data stream 111 can flow between the first group 120 and the second group 160 of the communication system.

A first application segment 135 and a second application segment 136 make up the server application on group 120. The first application segment 135 has a set of replicas 135x and 135y, and the second application segment 136 likewise has a set of replicas 136x and 136y. The application segment replicas 135x and 135y run on nodes 100a and 100b respectively, while replicas 136y and 136x run on nodes 100b and 100c respectively. The client application on group 160 consists of an application segment 151 with replicas 151a and 151b.

The application segments 135 and 136 of the first group 120 communicate with the segment 151 of the second group 160 over the connection 130. The two data streams 110 and 111 of the connection 130 follow the TCP protocol. The connection 130 has three distinct connection endpoints 130a, 130b, 130c at the first group 120 and, for the same connection, two distinct connection endpoints 130x, 130y at the group 160.

Each group 120, 160 is assigned a respective group Internet Protocol ("IP") address 121, 161. Although each group is made up of multiple nodes, the groups view each other as a single entity. Communication between the two groups 120, 160 is addressed through the group IP addresses 121, 161. When a request from segment 151 arrives at the first group 120, it appears as data coming from group IP address 161. Likewise, the second group 160 sends data targeted at group address 121.

The endpoints 130a, 130b, 130c at the first group 120 may be configured so that an incoming request is delivered to one or more of the application segment replicas 135a, 135b, 136a, 136b. Examples of the different policies by which data is delivered to application segments are: all replicas, one replica, all application segments, select application segments, a target determined by the request content, round-robin request distribution, a hashing scheme that maps a request to a specific node, and weighted priority. In one such scheme, modification ("write") requests are delivered to all replicas of an application segment, whereas "read" requests are delivered to only one selected replica.
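The delivery policies listed above can be pictured with a small selection function. The following is a minimal sketch under stated assumptions, not the implementation described in this document: the function name `select_targets`, the policy strings, the request classification into "read" and "write", and the use of SHA-256 for the hashing scheme are all hypothetical choices made for illustration.

```python
import hashlib
from itertools import count

# Replica labels of one application segment (taken from FIG. 1A for illustration).
REPLICAS = ["135x", "135y"]
_rr = count()  # shared round-robin counter (illustrative only)


def _hash_pick(key, replicas):
    """Map a request key to one replica; the same key always maps to the same replica."""
    digest = hashlib.sha256(key).digest()
    return replicas[digest[0] % len(replicas)]


def select_targets(kind, key, replicas=REPLICAS, policy="write_all_read_one"):
    """Return the replicas that should be delivered a request.

    kind   -- "write" or "read" (an assumed request classification)
    key    -- bytes used by the hashing policy, e.g. a client identifier
    policy -- hypothetical policy name
    """
    if policy == "all":
        return list(replicas)
    if policy == "round_robin":
        return [replicas[next(_rr) % len(replicas)]]
    if policy == "hash":
        return [_hash_pick(key, replicas)]
    if policy == "write_all_read_one":
        # Writes go to every replica; reads go to a single selected replica.
        if kind == "write":
            return list(replicas)
        return [_hash_pick(key, replicas)]
    raise ValueError(f"unknown policy: {policy}")
```

With the default policy, a write fans out to both replicas so they stay consistent, while a read is served by a single replica, which is where the load sharing comes from.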

Either of the endpoints 130x or 130y at the second group 160 may send a request to the server group 120. One or more receiveHeads at the endpoints 130a, 130b, 130c of the first group 120 receive the data, depending on the settings. The endpoints 130a, 130b, 130c at the first group 120 may in turn send response data, which is received by the endpoints 130x, 130y of the second group 160. Application processes that want to receive certain or all incoming data are guaranteed to have received it before the client is sent an acknowledgment of receipt of the data. To maintain TCP's sequential ordering of transmitted data, TCP sequence numbers are assigned in consecutive order before data transmission begins.
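The requirement that sequence numbers be assigned in consecutive order before transmission can be illustrated with a tiny allocator. This is only a sketch: the class name `SequenceAllocator` and its `reserve` method are hypothetical, and the sketch does not model how the allocation state is shared among group nodes.

```python
class SequenceAllocator:
    """Hands out consecutive ranges of 32-bit TCP sequence space."""

    MOD = 2 ** 32  # TCP sequence numbers wrap modulo 2**32 (RFC 793)

    def __init__(self, initial_seq):
        self.next_seq = initial_seq % self.MOD

    def reserve(self, nbytes):
        """Reserve nbytes of sequence space and return the first sequence number."""
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        start = self.next_seq
        self.next_seq = (start + nbytes) % self.MOD
        return start


alloc = SequenceAllocator(initial_seq=2 ** 32 - 100)
first = alloc.reserve(60)   # e.g. output from one endpoint
second = alloc.reserve(80)  # e.g. output from another endpoint; crosses the wrap point
```

Because each sender reserves its range before transmitting, the ranges never overlap and the peer sees one contiguous byte stream regardless of which endpoint produced each part.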

Optionally, duplicate data output by the replicas 151a, 151b at the second group 160 is reduced by the communication system to a single instance to be delivered to the first group 120. Similarly, the output of the replicas of application segments 135, 136 at the first group 120 may also be reduced to one. The replicas 135a, 135b, 136a, 136b do not necessarily all produce output, because in many cases a request is delivered to only one replica, depending on the settings.

The communication system according to the present invention advantageously provides atomic client/server requests and responses. That is, each request or response is sent or received as a contiguous sequence of bytes, which enables multiple processes across the two groups to send and receive data over a single connection.

The protocol between groups 120 and 160 is TCP, and data is guaranteed to be delivered in the sequential order in which it was sent, as in conventional TCP. When data targets multiple endpoints, it is guaranteed to be delivered to all target endpoints before the client is sent a TCP ACK segment indicating receipt of the data. Optionally, when replica output must be reduced to the transmission of a single copy, the output is guaranteed to be atomic in that the data is transmitted only if all nodes output the same data. If the results do not match, however, the application may optionally choose the output to transmit based on majority agreement, a correct result, a successful result, and so on.
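One way to picture the atomic delivery guarantee is that the group's cumulative acknowledgment can never run ahead of its slowest target endpoint. The sketch below assumes each target endpoint reports the next sequence number it expects; the function name `group_ack_number` is hypothetical, and sequence-number wraparound is deliberately ignored.

```python
def group_ack_number(next_expected):
    """Return the highest cumulative ACK the group may send to the peer.

    next_expected maps each target endpoint to the sequence number of the
    next byte it expects, i.e. every byte below that number was delivered.
    Sequence-number wraparound is ignored in this sketch.
    """
    if not next_expected:
        raise ValueError("no target endpoints")
    return min(next_expected.values())


state = {"130a": 1000, "130b": 1000, "130c": 1000}
state["130a"] = 1500  # 130a has received bytes 1000..1499
state["130b"] = 1500
ack_before = group_ack_number(state)  # 130c still lags, so the ACK stays put
state["130c"] = 1500
ack_after = group_ack_number(state)   # every target now holds the data
```

If any target endpoint misses the data, the peer receives no acknowledgment for it and retransmits under ordinary TCP rules, which gives the lagging endpoint another chance to receive it.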

With application segmentation, an application process is typically delivered only a select portion of the incoming data stream for processing. For example, requests arriving on the second data stream 111 may be delivered to select application segments. The order in which data is delivered to the application processes must be guaranteed to be the order in which it was sent, as specified by RFC 793. That is, before particular data is delivered to an application segment, all preceding data that arrived on the stream must have been successfully delivered to its target endpoints.

Referring to FIG. 1B, the first group 120 consists of first, second, and third nodes 100a, 100b, 100c. The connection 130 between the first group 120 and the second group 160 has an outgoing and an incoming data stream 110, 111. Each node 100a-c has a respective group-to-group communication stack 130a-c. Data delivery to all nodes passes through a switch 141a-c coupled with each node 100a-c. No assumption is made about delivery guarantees to the switches 141a-c by the underlying hardware, because popular hardware technologies such as Ethernet are unreliable. The underlying hardware may deliver data to every node 100a-c, to only a subset of them, or to none at all.

Incoming data is switched by the switches 141a-c to either the standard TCP/IP stack 140a-c or the group-to-group communication stack 130a-c, based on IP address and/or port. An application process 142 on a node 100 communicates using the standard TCP stack 140. The application segments 135x, 135y, 136x, 136y each communicate with the group communication stack 130a-c. A bus 105 carries control signals that control and coordinate the operation of the group 131. The scope of the signals sent over the control bus 105 is limited to the first group 120. A virtual bus 143 consists of the first and second data streams 110, 111 and the control signals of bus 105 spanning the group 120. This bus is directly tapped into by the peer group TCP connection 130.
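The switching decision described above amounts to a lookup on the destination address and port. A minimal sketch follows; the address `192.0.2.10` (a documentation-range address standing in for group address 121), the port set, and the function name `route_packet` are all assumptions made for illustration.

```python
import ipaddress

# Hypothetical group IP address and service ports handled by the group stack.
GROUP_IP = ipaddress.ip_address("192.0.2.10")
GROUP_PORTS = {80}


def route_packet(dst_ip, dst_port):
    """Return the stack that should process an incoming packet."""
    if ipaddress.ip_address(dst_ip) == GROUP_IP and dst_port in GROUP_PORTS:
        return "group_stack"     # group-to-group communication stack (130a-c)
    return "standard_stack"      # standard TCP/IP stack (140a-c)
```

Traffic that does not match the group address and port continues to use the node's ordinary stack, so conventional applications on the same node are unaffected.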

An alternative to the virtual bus 143 is point-to-point communication between nodes, which has the advantage of better bandwidth usage. However, it requires each node in the communication system to keep track of the other nodes and their addresses and roles. In one embodiment, the logical bus model is preferred over control messaging because of its location and identity transparency.

FIG. 1C illustrates the connection endpoint 130a according to one embodiment of the present invention. In general, the switch 141 directs data to either the standard TCP stack or the Internet Protocol ("IP") input 171 of the group-to-group communication stack. For fragmented IP packets, 170 performs reassembly before passing them to 171. If an input packet is not fragmented, it is passed directly to the input content filter 171 after a few basic consistency checks. The input content filter 171 examines the input data content and/or the packet header to determine whether the packet contains data to be delivered to the application segment (for example 135x, 135y, or 136x).

If the communication system decides not to pass the packet on, the packet is discarded and its memory is freed with no further action. Otherwise, the input content filter 171 marks the segments of the packet that are to be passed to the application. The packet is then passed to the IP input processing layer 172 to complete validation, including checksum computation and other consistency checks. Any invalid packets are discarded without further processing. The resulting packets are passed to the group-to-group TCP layer 173. The group-to-group TCP layer 173 coordinates with the group nodes (for example 120, 160) and controls data receipt to meet the requirements of the TCP specification, such as acknowledgments to the peer group. The group-to-group TCP layer 173 maintains the input TCP state of the connection and passes data to the socket through data path 137. Data path 138 represents the transmit data path from the socket interface into the stack.

The user socket sends out data, which invokes the output content filter 174. In one embodiment, no output content filter 174 is installed, and so no operation is performed. A filter for fault tolerance synchronously compares the data to be sent with the output of the other replica segments and transmits a single output instance. The choice of the output instance transmitted to the peer group depends on the policy set in the filter, such as identical output, majority agreement, a correct result, or successful operation output. If the transmitting segment instance fails, a replica takes over and continues the transmission without disrupting the connection. Once the peer group has successfully received the output instance, the replicas discard the data and free the memory. The output content filter 174 passes the data to the group TCP output layer 175 for transmission. The group TCP output layer 175 controls data transmission and maintains the transmission state in conjunction with the group nodes. It works with its group nodes to transmit data to the peer group in the sequential order specified by TCP, and passes the data to the IP output layer 176 for transmission. The IP output layer 176 then performs standard IP functions on the data and passes it to the device driver 177 for transmission.

If the output comparison by the output content filter 174 indicates that the outputs produced by the nodes differ, a subset of the replicas is considered faulty and is excluded from further service on the connection, while the remaining endpoints continue service without disrupting the connection. In one embodiment, the exclusion is based on a scheme in which a majority of the endpoints agree on a result and exclude the others. Alternatively, an endpoint may be excluded when its operation fails. Exclusion of an endpoint may follow any application-specific scheme that can be programmed into the filter. If the endpoint transmitting data fails, a replica endpoint completes the transmission without disrupting the connection.
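The comparison and exclusion behavior of the output content filter can be sketched as a reduction over the replica outputs. This is an illustrative sketch only: the function name `reduce_outputs`, the policy strings, and the replica labels `r1` to `r3` are hypothetical, and only the "identical output" and "majority agreement" policies are modeled.

```python
from collections import Counter


def reduce_outputs(outputs, policy="majority"):
    """Reduce replica outputs to a single instance.

    outputs -- dict mapping a replica name to the bytes it produced
    Returns (chosen_output, excluded_replicas).
    """
    if not outputs:
        raise ValueError("no replica outputs")
    counts = Counter(outputs.values())
    if policy == "identical":
        # Transmit only if every replica produced the same bytes.
        if len(counts) != 1:
            raise ValueError("replica outputs differ")
        chosen = next(iter(counts))
    elif policy == "majority":
        chosen, votes = counts.most_common(1)[0]
        # Require a strict majority; a tie is treated as a failure.
        if votes * 2 <= len(outputs):
            raise ValueError("no majority agreement")
    else:
        raise ValueError(f"unknown policy: {policy}")
    # Replicas that disagree with the chosen output are considered faulty.
    excluded = sorted(name for name, out in outputs.items() if out != chosen)
    return chosen, excluded


chosen, excluded = reduce_outputs({"r1": b"ok", "r2": b"ok", "r3": b"bad"})
```

Here the two agreeing replicas determine the transmitted instance and `r3` is reported for exclusion, while the connection itself is not torn down.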

FIG. 1D illustrates the connection endpoint 130 of the group-to-group communication stack, in which a content processor examines incoming and outgoing data, according to one embodiment of the present invention. Content filtering is a function of the content processor. The content processor determines where in application memory the data must be placed, the order of the data, when to notify the application (for example, when a full request has been received), and so on. Working in cooperation with the network interface device driver 177, data is copied between the network interface 193 and the application memory 190 by the direct memory access controller 196.

While examining new incoming request data, the content processor allocates memory 192 in the application space. The allocation size is application-specific, typically the size of the complete request data from the peer. Subsequent data of the request requires no allocation if memory for the complete request has already been allocated. Output data 193 is allocated by the application itself. In addition, there are copies of segments of the request/response data 194, 195. In this scheme, application data is copied directly between the network interface input/output buffers and the application memory, with no intermediate memory copies.

Referring to FIG. 2, the first group 120 is a set of servers consisting of nodes 100 (100a, 100b, 100c). The second group 160 may consist of a set of client nodes 150 (150x, 150y). The nodes 100, 150 within each group 120, 160 are connected to one another through a connectivity device 180. The connectivity device 180 may be a broadcast/multicast device, for example an Ethernet bus or a layer 2/3 switch. The network 189 through which the two node groups are connected may be any conventional network, for example a local area network (LAN) or the Internet. The network 189 is unnecessary if both peer groups are connected directly through the connectivity device 180.

In one embodiment, the communication system of FIG. 2 includes one or more network interface ports 185a, 185b, 185c at the server nodes 100a, 100b, 100c. Communication links 187a, 187b, 187c and 188a, 188b, 188c connect the nodes 100 to the device 180. Input data arriving through the connection endpoint 130 is replicated to links 188a, 188b, 188c by the connectivity device 180 using its layer 2 or layer 3 multicast or broadcast capability. The arriving data is delivered to ports 185a, 185b, 185c. Delivery of data by device 180, or by the associated hardware ports or links, is not guaranteed. Data sent by the first group 120 over links 187a, 187b, 187c are independent of one another and can therefore proceed in parallel. Data sent to the peer group over 187a, 187b, 187c need not be visible to group 120. Signals sent over the logical bus 105 are replicated onto links 188a, 188b, 188c by device 180, in the same way as data arriving over connection 130. Data sent on the logical bus 105 of FIG. 1B is visible to the server nodes 100a, 100b, 100c.

Signals

In one embodiment of the present invention, a signal may carry connection identification information common to the group. A signal may also carry source and target identifiers. The target identifier may designate one or more nodes or the entire group.

In one embodiment of the present invention, the signals in the communication system include an IACK signal. This is an input acknowledgment signal that acknowledges, on behalf of the group, input data that arrived from the peer. An IACK contains the acknowledged sequence number, the remaining bytes of data expected from the peer group, the window update sequence number, the latest and best timestamp, and a PUSH flag that indicates whether the receiving active sendHead must send a TCP ACK to the peer group. A REQSH signal contains a request for the latest sendHead assignment, asking to be designated the active sendHead. Its target address may be the entire group.

A GRANTSH signal contains a message carrying the active sendHead state information, the bus time, the list of nodes whose REQSH requests are being acknowledged, and the most recent known IACK information. The target of this signal assumes the active sendHead role after updating its state information. An IACKSEG signal contains an input data acknowledgment sent on behalf of a segment. It may carry the same or similar information as an IACK signal. A REQJOIN signal is sent to application segments to request that they join in serving a connection. A LEAVE signal is sent by an application segment to request permission to leave the service of a connection.

An ACKLEAVE signal grants an application segment permission to leave the service of a connection. A RESET signal is sent to request that a connection be reset. A CLOSE signal is sent when an application segment requests that the output stream of a connection be closed. An ACKCLOSE signal acknowledges receipt of a CLOSE request.
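The signal vocabulary described in this section could be modeled with a few small types. The sketch below is an assumption-laden illustration: the signal names come from the text, but the class names, field names, and field types (`Signal`, `IackPayload`, `needs_bus`, and so on) are hypothetical, and no wire format is implied.

```python
from dataclasses import dataclass
from enum import Enum
from typing import FrozenSet, Optional


class SignalType(Enum):
    IACK = "IACK"          # input acknowledgment on behalf of the group
    IACKSEG = "IACKSEG"    # input acknowledgment on behalf of a segment
    REQSH = "REQSH"        # request to become the active sendHead
    GRANTSH = "GRANTSH"    # grant of the active sendHead role
    REQJOIN = "REQJOIN"    # request that segments join a connection
    LEAVE = "LEAVE"        # request permission to leave a connection
    ACKLEAVE = "ACKLEAVE"  # permission to leave granted
    RESET = "RESET"        # request to reset the connection
    CLOSE = "CLOSE"        # request to close the output stream
    ACKCLOSE = "ACKCLOSE"  # acknowledgment of a CLOSE request


@dataclass(frozen=True)
class Signal:
    type: SignalType
    connection_id: str                        # identification common to the group
    source: str                               # sending node
    targets: Optional[FrozenSet[str]] = None  # None means the entire group


@dataclass(frozen=True)
class IackPayload:
    acked_seq: int          # acknowledged sequence number
    remaining_bytes: int    # bytes still expected from the peer group
    window_update_seq: int  # window update sequence number
    timestamp: int          # latest timestamp
    push: bool              # whether the active sendHead must send a TCP ACK


def needs_bus(signal):
    """Return False when the only target is the source node itself.

    In that case the signal is handled locally instead of being sent.
    """
    return signal.targets is None or signal.targets != frozenset({signal.source})
```

The `needs_bus` helper reflects the earlier definition that a signal whose source and target are the same node is not sent on the bus, even though the node behaves as if it had received it.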

Connection Establishment and Termination

Conventional TCP state diagrams are well known. Flowcharts describing these state diagrams are shown and described in Richard Stevens's books titled "TCP/IP Illustrated Volume I and Volume II," the contents of which are incorporated herein by reference. In addition, the TCP/IP protocols and specifications are discussed in RFC 793 and RFC 1323, the relevant portions of which are incorporated herein by reference.

During connection establishment and termination, a node is selected to act as the bus controller. The bus controller coordinates and controls the process and communicates with the peer group on behalf of the group. By default a static bus controller is chosen, although an application program may optionally select the bus controller as needed. To distribute the load of bus control across the group member nodes, the bus controller function may be assigned to nodes in round-robin fashion. Alternatively, the bus controller may be selected dynamically based on a hash of the incoming sequence number or of the source IP address/port combination. In a scheme where the segment with the lowest ID assumes the bus controller role, the bus controller responsibility is assigned among the segment's replicas in round-robin fashion when replicas are available.
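The bus controller selection schemes above can be sketched as a pure function that every node evaluates identically. The following is a minimal sketch under assumptions: the function name `select_bus_controller` and the policy strings are hypothetical, the "static" policy is assumed to pick the lowest node ID, and CRC-32 stands in for whatever hash an implementation would use. A deterministic hash is used deliberately so that all nodes arrive at the same answer without exchanging messages.

```python
import zlib


def select_bus_controller(nodes, src_ip, src_port, policy="hash", turn=0):
    """Pick the node that acts as bus controller for one connection.

    nodes  -- iterable of node IDs in the group
    turn   -- counter used by the round-robin policy (assumed to be shared)
    """
    ordered = sorted(nodes)  # every node must see the same ordering
    if not ordered:
        raise ValueError("empty group")
    if policy == "static":
        return ordered[0]  # assumption: the lowest ID is the static default
    if policy == "round_robin":
        return ordered[turn % len(ordered)]
    if policy == "hash":
        key = f"{src_ip}:{src_port}".encode()
        return ordered[zlib.crc32(key) % len(ordered)]
    raise ValueError(f"unknown policy: {policy}")


nodes = ["100a", "100b", "100c"]
picked = select_bus_controller(nodes, "198.51.100.7", 40000)
```

Python's built-in `hash()` would be a poor choice here because it is randomized per process for strings, so different nodes could disagree on the controller.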

In general, TCP has four types of connection operation, each following a different set of state transitions. When the group initiates a connection with a peer group, this is called active initiation, whereas a connection process initiated by the peer group is called passive initiation. Similarly, when connection termination is initiated by the group it is called active termination, and when termination is initiated by the peer group it is called passive termination.

Passive Connection Establishment

In passive initiation, when a synchronization ("SYN") request arrives from the peer group, the bus controller sends a REQJOIN signal requesting application segments to join in serving the connection. The bus controller responds to the peer group with a SYN request carrying an ACK for the received SYN. When the peer group acknowledges the SYN request sent on behalf of the group, the group nodes running the application segments respond with IACKSEG. Once all required group nodes have joined the connection service with an IACKSEG signal, the connection is considered established and data transfer can begin.
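The passive establishment sequence above can be traced with a small state tracker for the bus controller. This is a sketch, not the described implementation: the class `PassiveOpen`, its method names, and the state names `LISTEN` and `SYN_RCVD` (borrowed from the standard TCP state diagram) are assumptions, and timers, retransmission, and the peer's ACK handling at each node are not modeled.

```python
class PassiveOpen:
    """Tracks a passive connection establishment from the bus controller's view."""

    def __init__(self, required_nodes):
        self.required = set(required_nodes)  # nodes that must join the connection
        self.joined = set()
        self.state = "LISTEN"
        self.sent = []  # messages the controller has emitted, in order

    def on_peer_syn(self):
        """A SYN arrived from the peer group."""
        self.sent.append("REQJOIN -> group")
        self.sent.append("SYN+ACK -> peer")
        self.state = "SYN_RCVD"

    def on_iackseg(self, node):
        """A group node reports it saw the peer's ACK of the group's SYN."""
        if self.state != "SYN_RCVD":
            raise RuntimeError("IACKSEG received in state " + self.state)
        if node in self.required:
            self.joined.add(node)
        if self.joined == self.required:
            self.state = "ESTABLISHED"


conn = PassiveOpen(["100a", "100b"])
conn.on_peer_syn()
conn.on_iackseg("100a")
state_after_one = conn.state  # still waiting for 100b
conn.on_iackseg("100b")
```

The connection only becomes established once every required node has joined, which mirrors the atomic delivery guarantee applied to connection setup.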

Active connection establishment

In active initiation, where the connection is initiated from the group, the bus controller sends a REQJOIN signal addressed to the group nodes. It then starts the connection process with the peer group by sending a SYN request on behalf of the group. Upon receiving from the peer group a SYN request carrying an ACK for the bus controller's SYN, the group nodes send an IACKSEG indicating receipt of a valid ACK from the peer group. When IACKSEG has been received from the required nodes, the bus controller sends an ACK for the peer group's SYN request and the connection is considered established.

Passive connection termination

In passive termination, upon receiving a FIN segment from the peer group, a node sends an IACKSEG signal indicating receipt of the FIN. When IACKSEG has been received from all required segments, the bus controller responds to the peer group with an ACK for the received FIN. When a node has finished sending data, it sends a LEAVE signal indicating that it wishes to leave the connection. Once LEAVE requests have arrived from all group nodes after receipt of the FIN, the bus controller sends a FIN request to the peer group. The bus controller also sends an ACKLEAVE signal, and upon receiving it the node targeted by the signal enters the CLOSED state. When the ACK for the transmitted FIN request arrives, the bus controller enters the CLOSED state.
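A minimal sketch of the bus controller's side of passive termination follows. The class and method names are hypothetical, and the point at which ACKLEAVE is emitted (together with the FIN, once every node has left) is an assumption, since the text does not fix its exact timing.

```python
class PassiveClose:
    """Bus controller bookkeeping for a close started by the peer group."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.fin_seen = set()   # nodes that sent IACKSEG for the peer's FIN
        self.left = set()       # nodes that sent LEAVE
        self.sent = []          # what the bus controller emitted, in order
        self.state = "ESTABLISHED"

    def on_iackseg(self, node):
        self.fin_seen.add(node)
        if self.fin_seen == self.nodes and "ACK" not in self.sent:
            self.sent.append("ACK")        # ACK for the peer's FIN
            self._maybe_fin()

    def on_leave(self, node):
        self.left.add(node)
        self._maybe_fin()

    def _maybe_fin(self):
        ready = "ACK" in self.sent and self.left == self.nodes
        if ready and "FIN" not in self.sent:
            # Assumption: ACKLEAVE goes out alongside the group's FIN.
            self.sent += ["FIN", "ACKLEAVE"]

    def on_peer_ack(self):
        if "FIN" in self.sent:             # ACK for the group's FIN
            self.state = "CLOSED"
```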

Active connection termination

In active termination, when the application segments have finished transferring data and wish to close the connection, they send a CLOSE signal. Upon receiving CLOSE requests from all group nodes, the bus controller sends a FIN request to the peer group. Upon receiving a FIN request from the peer group, the nodes send LEAVE requests. When the LEAVE signals from the group nodes and the ACK for the transmitted FIN have been received, the bus controller enters the TIME_WAIT state.

Data input over a connection

As shown in FIG. 3a, when a data packet arrives at a node, it is checked whether the packet is targeted to the group address (311). If so, and the packet is a fragment of a TCP segment, fragment reassembly is performed to produce the complete TCP segment upon arrival of the last fragment (314). In most cases TCP segments are not fragmented, so this operation is not needed.

If the TCP segment is not targeted to the group, it is handed over to the standard TCP/IP stack for further processing (312).

As shown in FIG. 3b, upon receipt of data the group's receivehead performs filtering (315) and verifies whether there is data targeted to the application segment on the node (316); data irrelevant to the application is discarded. After filtering, the resulting data packet is checked for timestamp validity, checksum validity, and validity of other TCP/IP parameters (317). All invalid packets are discarded immediately. The receivehead updates its state to reflect every valid packet it has examined. Performing checksum computation and other detailed packet verification after filtering avoids the computational overhead of verifying packets that would be discarded anyway.
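The benefit of filtering before verification can be shown with a short sketch. It assumes a hypothetical packet representation of `(target, payload, checksum)` tuples and uses `zlib.crc32` purely as a stand-in for the TCP checksum; neither comes from the specification.

```python
import zlib


def receive(packets, wanted_targets):
    """Filter first, then verify, so unwanted packets cost no checksum work.

    Returns the accepted payloads and the number of checksums computed.
    """
    accepted, checks = [], 0
    for target, payload, checksum in packets:
        if target not in wanted_targets:
            continue                      # discarded by the filter, never verified
        checks += 1
        if zlib.crc32(payload) != checksum:
            continue                      # invalid packet, dropped immediately
        accepted.append(payload)
    return accepted, checks
```

For three packets of which one is addressed to another segment, only two checksums are computed, which is the saving the paragraph above describes.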

Next, it is verified whether all data preceding the received data has been delivered to the appropriate application segment (318). If so, the data is delivered to the application segment immediately (320), and a TCP ACK segment is sent to the peer group if required by the specification. If, however, there is preceding data whose acknowledgment is still pending, the data is queued awaiting that acknowledgment (319).

If a segment instance fails during receipt of data, the remaining instances continue reception and acknowledgment control. This allows the application to continue service without interruption when a node fails.

Data filtering

Referring to FIG. 3c, the receivehead maintains the state of the input data: whether subsequent data is needed to determine the target, whether a new request is starting, whether the request is being delivered to the application, whether the request is being ignored, and so on. Because a packet may contain more than one request or only partial request data, it is verified whether the packet has remaining data to be processed (330). If no data remains, the filtering process is complete.

When the packet has data remaining to be filtered, the current state is checked (331). If the current state indicates that the request data is to be discarded, as much of the request data as the packet contains is scheduled for discard (332) and the check for remaining data is repeated (330). Similarly, if the request data is to be accepted and delivered to the application segment, the remaining portion of the request data in the packet is scheduled for delivery. All data to be delivered must have its checksum computed and its timestamp and packet header verified once (333); invalid packets are discarded immediately (336).

When the current state indicates the start of a new request, the application-specific filter is invoked to determine the data target (334), and the current state is updated to reflect the result. If the filtering code cannot determine the request target for lack of sufficient data, the data is combined with any immediately following data from the reassembly queue, which holds data that arrived out of order. If there is still not enough data, the remaining data is placed in the reassembly queue and the check is repeated when sufficient data arrives. If sufficient data is found, step 330 is repeated to filter the data.
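The filtering loop of FIG. 3c can be sketched as a small state machine. The framing used here (one tag byte, one length byte, then the request body) is a hypothetical stand-in for whatever an application-specific filter would parse, and the class and attribute names are likewise assumptions.

```python
class RequestFilter:
    """Tracks, across packets, whether the current request is kept or dropped."""

    def __init__(self, wanted_tags):
        self.wanted = set(wanted_tags)
        self.mode = "NEW"       # NEW, DELIVER or DISCARD
        self.remaining = 0      # body bytes left in the current request
        self.pending = b""      # partial header held until more data arrives
        self.delivered = []

    def feed(self, packet):
        data = self.pending + packet
        self.pending = b""
        while data:                                   # step 330: data left?
            if self.mode == "NEW":                    # step 334: find the target
                if len(data) < 2:
                    self.pending = data               # not enough data yet
                    return
                tag, self.remaining = data[0], data[1]
                self.mode = "DELIVER" if tag in self.wanted else "DISCARD"
                data = data[2:]
            else:
                chunk, data = data[:self.remaining], data[self.remaining:]
                self.remaining -= len(chunk)
                if self.mode == "DELIVER":
                    self.delivered.append(chunk)      # scheduled for delivery
            if self.mode != "NEW" and self.remaining == 0:
                self.mode = "NEW"                     # request finished
```

A request may span several packets and a packet may hold several requests; the saved `mode` and `remaining` fields are what let the filter resume in the middle of a request.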

Data input acknowledgment

If atomic delivery of data to multiple endpoints is required, an acknowledgment for received data is sent only when all endpoints have positively received the data. A target endpoint that has received the data sends an IACK signal over the bus indicating receipt of the data in TCP order. After verifying that all required nodes have received the particular data, the active sendhead sends a TCP ACK segment to the peer group if the TCP specification calls for one.
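One way to picture the atomic acknowledgment rule: the group may acknowledge to the peer only up to the lowest in-order sequence that every required node has reported. The helper below is a sketch under that reading; the function name and the use of plain integers as connection-relative sequence offsets are assumptions.

```python
def group_ack_point(iack_seq_by_node, required_nodes):
    """Highest sequence offset that every required node has IACKed.

    A node that has not reported yet counts as having received nothing.
    """
    return min(iack_seq_by_node.get(node, 0) for node in required_nodes)
```

For example, if node `a` has reported 300 and node `b` 200, the sendhead could acknowledge at most 200 to the peer group.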

Data output over a connection

Multiple endpoints of the group may transmit data in TCP order. It is therefore necessary to assign consecutive sequence numbers to the segments of data to be transmitted. It is also necessary to keep the transmitted data consistent so that distinct requests/responses from different endpoints are not interleaved. To this end, each complete request/response is treated as a record by the transmitting node.

Referring to FIG. 4, when the application process writes data (385), a new transmit record is created (386). When the data of more than one write request must be sent without any intervening data being sent to the peer, the MORETOCOME flag is set until the last write arrives. If the transmitting node is not the active sendhead (387), a request for the active sendhead is sent with a REQSH signal (388), unless an earlier request has already been acknowledged or is pending. When the active sendhead status arrives with a GRANTSH signal directed to the node (389), the node assumes the active sendhead after updating itself with the latest information from GRANTSH, and the check for whether it is the active sendhead is repeated (387).

After becoming the active sendhead, the node with data to send assigns new transmit sequence numbers to its records in order and transmission begins (390). If no further data is expected as a continuation of the transmitted writes and no more records are waiting to be assigned transmit sequence numbers (392), or if the maximum transmission limit has been exceeded (393), the active sendhead is granted to the next requesting node waiting for the sendhead.

The next node to be granted the sendhead is chosen from the list of REQSH requesters by selecting the node whose node-id is numerically closest in the clockwise direction, the sendhead requester with the highest priority, the next requester in round-robin order, or a node given by an application-specific scheme. However, if more transmit records are waiting for sequence number assignment, step 387 is repeated to send the remaining data.
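The "numerically closest in the clockwise direction" policy can be sketched as below. It assumes node ids are integers in the range `0..ring_size-1` arranged on a ring; the function name and the `ring_size` parameter are hypothetical.

```python
def next_sendhead(current_id, requester_ids, ring_size):
    """Pick the requester reached first when walking clockwise from current_id."""
    if not requester_ids:
        return None
    # Distance 0 is the immediate clockwise neighbour; the current node is last.
    return min(requester_ids, key=lambda r: (r - current_id - 1) % ring_size)
```

With the current sendhead at node 2 on a ring of four, requesters 0, 1 and 3 are served in the order 3, 0, 1, so every requester is eventually reached.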

Active sendhead assignment

FIG. 5a illustrates a scheme for active sendhead assignment. Node 100a sends a REQSH signal (105r) requesting the active sendhead role, and the active sendhead 100c grants the role to the requester with a GRANTSH signal (105t) carrying the necessary state information. Node 100b, not being the sendhead, ignores the REQSH. Node 100c, the active sendhead at the time of the request, responds to the request from 100a with a GRANTSH signal because the sendhead is free to be granted.

Upon receiving the GRANTSH signal, the requesting node 100a assumes the active sendhead. The GRANTSH signal contains the list of pending requesters maintained by the group. On examining the GRANTSH signal (105t), node 100b checks the list of pending requesters in the signal to determine whether its own request for the active sendhead has been acknowledged. Once it has been acknowledged, retransmission of the REQSH signal is turned off.

When a node grants the active sendhead to another node while it still has outstanding data to transmit, it adds itself to the list of requesters to avoid an additional request signal. An alternative to sending signals such as REQSH to all nodes is to send them directly to the target, for example the active sendhead node. The advantage of this approach is more economical bandwidth use, at the cost of location transparency.

Referring to FIG. 5b, when a REQSH signal arrives at the active sendhead node (551) and the sendhead cannot be granted (553), the requester's id is entered in the requester list (554). If the sendhead can be granted, a GRANTSH signal is sent to the requester (555). Because GRANTSH carries the list of pending requesters, it acts as an acknowledgment for all outstanding REQSH signals. To acknowledge receipt of a REQSH without granting the sendhead to another node, the sendhead grants it to itself. When GRANTSH arrives at the target node, the list of requesters it carries is merged into the node's local requester list. GRANT signals are ordered by assigning a monotonically increasing sequence identification number to each instance of a sendhead grant, excluding retransmissions.
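The REQSH handling of FIG. 5b might be modelled as follows. The class, the `busy` flag, and the dictionary layout of the GRANTSH signal are hypothetical simplifications chosen for illustration.

```python
class Sendhead:
    """State kept by the node that currently holds the active sendhead."""

    def __init__(self):
        self.requesters = []   # pending requester ids, oldest first
        self.grant_seq = 0     # increases with every new grant instance
        self.busy = False      # True while the sendhead cannot be granted

    def on_reqsh(self, requester):
        if self.busy:                                  # step 553: cannot grant
            if requester not in self.requesters:
                self.requesters.append(requester)      # step 554
            return None
        self.grant_seq += 1                            # step 555: grant
        return {"target": requester, "seq": self.grant_seq,
                "requesters": list(self.requesters)}


def merge_requesters(local_list, grantsh, self_id):
    """Merge the requester list carried by GRANTSH into the local list."""
    merged = list(local_list)
    for requester in grantsh["requesters"]:
        if requester != self_id and requester not in merged:
            merged.append(requester)
    return merged
```

Carrying the pending list inside the grant is what lets a single GRANTSH double as the acknowledgment for every outstanding REQSH.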

TCP timestamp and round-trip time (RTT) calculation within a group

Most TCP implementations follow RFC 1323, which specifies a method of measuring round-trip time using timestamps. Round-trip time is typically measured by subtracting the timestamp of the data being acknowledged from the host's real time. To identify invalid packets caused by wrapped sequence numbers, the specification requires monotonically increasing timestamps.

It is necessary to meet this specification across multiple nodes of various types with different hardware timers. The ideal solution would be for the nodes to have perfectly synchronized time, but this is very difficult. In one embodiment, the requirement for monotonically increasing timestamps is met by synchronizing the time on the node sending data with the time on the node that last sent data. This synchronization guarantees that data is always sent with a timestamp value equal to or greater than the timestamp of the previous TCP data segment.

One implementation of the scheme is as follows. Each node maintains real time (often called local time), implemented with a hardware clock that increments its value at fixed intervals. A group-wide real-time clock called bus time is implemented for each TCP connection the node participates in. The bus time at any node is calculated as follows:

Bus time = Local time - Base time

The base time may be an arbitrary value initially chosen and calculated by the bus controller. Whenever the active sendhead is granted to a node, the grantor's bus time is sent with the GRANTSH signal. Upon receiving the GRANTSH signal, the node granted the active sendhead adjusts its bus time as follows.

If the node's bus time is less than the grantor's bus time received with the sendhead, then:

Bus time = Grantor bus time (that is, the bus time from the GRANTSH signal)

Although the node's bus time will not match the formula above exactly because of the GRANTSH propagation delay, it satisfies the requirement for monotonically increasing timestamp values. Choosing a finer granularity for bus time than for the transmitted timestamp reduces errors caused by timestamp collisions during simultaneous retransmission by nodes. For example, when the timestamp has a granularity of 10 milliseconds and bus time has a granularity of 1 microsecond, the error factor is reduced from 1 to 1/10000. For accurate round-trip calculation, the base time at the time of transmission is entered in the transmit record by the sendhead. To account for the minimum latency of the signal, a fixed time value is added to the grantor's bus time at the GRANTSH target node. With bus time used as the timestamp, the round-trip time of a packet is calculated as follows:

Round-trip time = Bus time - Timestamp
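The bus time formulas above can be exercised with a short sketch. The text does not say how a node raises its bus time; moving the base time backwards is one way to do it and is an assumption here, as are the class name, the method names, and the `latency` parameter standing for the fixed minimum signal latency.

```python
class BusClock:
    """Per-connection bus time derived from a node's local clock."""

    def __init__(self, base_time, latency=0):
        self.base_time = base_time
        self.latency = latency       # fixed allowance for GRANTSH delivery

    def bus_time(self, local_time):
        return local_time - self.base_time

    def on_grantsh(self, local_time, grantor_bus_time):
        # Never let this node's bus time fall behind the grantor's.
        floor = grantor_bus_time + self.latency
        if self.bus_time(local_time) < floor:
            # Assumption: the adjustment is made by shifting the base time.
            self.base_time = local_time - floor


def round_trip_time(bus_time_now, echoed_timestamp):
    """Round-trip time when bus time is used as the TCP timestamp."""
    return bus_time_now - echoed_timestamp
```

Because the new sendhead's bus time is never lower than the grantor's, timestamps seen by the peer never decrease when the sendhead moves between nodes.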

TCP window update within a group

The window is the amount of data an endpoint can accept into memory. In conventional TCP, with only two endpoints to agree on this information, optimal performance is easily achieved. With many endpoints involved, each having a different memory size and with unpredictable data targets, achieving optimal utilization and performance is an important concern.

This section describes a scheme in which a single group-wide virtual window size is used for effective window management within the group. The sendhead is responsible for updating the peer group with window information on behalf of the group. The group nodes are initially assigned the virtual window size. A node updates the window held by the sendhead by sending the input sequence number of the data that, once delivered, has been read by the application. The active sendhead updates the peer group with the window obtained by subtracting the amount of outstanding data yet to be delivered to the application from the group-wide virtual window size.
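The window arithmetic can be sketched as below. The text only states that outstanding data is subtracted from the virtual window; treating "outstanding" as the gap between the highest received sequence and the slowest node's read sequence is an assumption, as are the function and parameter names.

```python
def advertised_window(virtual_window, received_seq, read_seq_by_node):
    """Window the sendhead could advertise to the peer group.

    received_seq:      highest in-order input sequence received by the group
    read_seq_by_node:  per node, the input sequence its application has read
    """
    slowest_read = min(read_seq_by_node.values(), default=received_seq)
    undelivered = max(received_seq - slowest_read, 0)
    return max(virtual_window - undelivered, 0)
```

With a virtual window of 65536, data received up to 10000, and the slowest application at 6000, the advertised window shrinks by the 4000 unread bytes.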

Window updates are piggybacked on IACK signals to reduce the number of window update signals. To further reduce the number of window update signals and TCP segments, a designated window size is maintained in addition to the virtual window. At any time, the amount of outstanding data acknowledged by the group may reach the total of these windows. When a node sends an IACK acknowledging receipt of data no larger than the designated window, and much of the in-progress data has been read by the application, an updated window equal to the IACK sequence is used, as though that data had already been read by the application. Because the window update travels with the IACK, the additional window update signal that would otherwise be required is avoided. This technique can optionally be enabled or disabled.

Referring to FIG. 6, 610 denotes the unacknowledged input sequence and 620 denotes the maximum data sequence number expected to be advertised to the peer group. 619 denotes the maximum designated window sequence up to which a window update can be sent. 611, 612, 613 and 614 show the window sequences of data received by nodes 100a, 100b and 100c. 615 is the amount of data that can be sent by the peer group. 617 and 618 show the window updates sent by nodes 100a and 100c, which are sent with the IACKs associated with 611 and 612. The maximum advertised window is shown as 621 and the maximum designated window as 622.

Protection against wrapped-around sequences using group TCP

On high-speed networks such as 10 Gigabit Ethernet, TCP's current 32-bit sequence number wraps around within a short time. A delayed signal such as an IACK carrying a wrapped-around sequence number may mistakenly be considered valid when the sequence number is used to acknowledge data input. The scheme used here maps the 32-bit TCP sequence number to a 64-bit value that accounts for the number of times the 32-bit sequence number has wrapped around since the start of the connection. The 64-bit sequence values used within the group are mapped back to 32 bits for use with the peer.

To map the 32-bit sequence, the 64-bit sequence is split into two 32-bit values: the lower 32 bits represent the TCP sequence actually used with the peer, and the upper 32 bits count the number of times the sequence has wrapped around since the connection was initiated. To map a 64-bit value to a 32-bit sequence number, the lower 32 bits are used. In an alternative scheme the IACKs themselves are sequenced, although the overhead is the same.
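A sketch of the 32-bit to 64-bit mapping follows. The text specifies only that the upper 32 bits count wrap-arounds; the rule used here to choose the wrap count (pick the 64-bit value nearest to the last known one) is an assumption, as are the function names.

```python
MOD32 = 1 << 32
HALF = 1 << 31


def to_seq64(seq32, last_seq64):
    """Extend a 32-bit TCP sequence to 64 bits near the last known 64-bit value."""
    candidate = ((last_seq64 >> 32) << 32) | seq32
    if candidate + HALF < last_seq64:
        candidate += MOD32            # the 32-bit counter wrapped forward
    elif candidate > last_seq64 + HALF and candidate >= MOD32:
        candidate -= MOD32            # a delayed value from before the wrap
    return candidate


def to_seq32(seq64):
    """Map back to the 32-bit sequence exchanged with the peer."""
    return seq64 & 0xFFFFFFFF
```

A delayed IACK carrying a pre-wrap value therefore maps to a smaller 64-bit number than current data, so a plain integer comparison is enough to reject it.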

Application segment instances and replication

Multiple instances of a segment allow further distribution of load among nodes. Among replica segments, load can be shared by having the filter deliver requests to segments using round-robin, least-loaded, hashed, or affinity-based schemes.

Segment replicas enable failover. If a replica fails during input, the remaining replicas, if any, continue service without interruption. This is achieved by the replicas maintaining a consistent view of the input. The segment controller ensures consistency among replicas through atomic input delivery. A new segment controller may need to be elected after a failure.

If a replica fails while transmitting data, the remaining replicas can continue service without disrupting the connection. The replicas agree on each instance of output, and the sendhead state is shared before transmission begins. Each replica frees the memory for data that has been acknowledged by the peer group.

Each application segment is free to choose the number of replicas it maintains. A node dynamically selected as the segment controller coordinates the replica IACKs to form the segment IACK. The choice of segment controller may be based on round-robin, least-loaded, hashed, or a static node. The operations involved, such as connection establishment, termination and window management, have all been described above together with their corresponding schemes. When the segment replicas agree on receipt of a particular input segment, the controller responds on behalf of the replicas. Unlike with replicas, when load is balanced among segment instances there is no need to involve a controller.

When a replica receives data, it sends an IACK indicating receipt of the input data. When the segment controller, which monitors the IACKs from the replicas, determines that all replicas have received a particular input in the order sent, it sends an IACK on behalf of the segment and initiates the client ACK. This IACK serves as the acknowledgment for the replicas to deliver the data to the application socket or to take the necessary action atomically. Election of the segment controller is either round-robin per request or static to a selected replica, such as the one with the lowest replica ID.

Node-based group-to-group communication

FIG. 7a is a block diagram of a general-purpose computer whose elements are suitable for implementing elements of the invention. The group-to-group communication stack is executed by the system's processor.

Group-to-group communication offloading the main CPU

FIG. 7b is a block diagram of a computer whose elements are suitable for implementing elements of the invention while offloading the main processor from processing certain elements. The group-to-group communication stack is offloaded to an adapter card that has its own processor.

Group-to-group communication on an integrated circuit

FIG. 7c is a block diagram of a computer whose elements are suitable for implementing elements of the invention while offloading the main processor from processing certain elements of the invention onto a dedicated hardware/accelerator integrated chip. Implementing the group-to-group communication stack fully or partially on the chip offloads most of the processing that would otherwise be required of the main CPU.

FIG. 8 shows another embodiment of the invention. In this embodiment a hardware device replicates a single TCP connection endpoint into multiple endpoints. The node group is denoted 802. Connection 801 has an input stream 826 and an output stream 825. The device 820 is external to the nodes 800a, 800b and 800c. Each server node holds a connection endpoint (801a, 801b, 801c) of the same connection 801. Device 820 replicates the single connection 801 into the three endpoints 801a, 801b and 801c on the nodes 800. Device 820 has four ports, 816, 817, 818 and 819, of which port 819 is linked to the peer device for the connection. The device is a potential single point of failure and adds an extra network hop.

FIG. 9 illustrates the atomic data delivery and acknowledgment scheme between a client group and a server group, in which two data segments 910 and 911 must be delivered to two nodes 902 and 904. 901 is the client group, and 902, 903, 904, 905, and 906 are the server group nodes. 910 and 911 are the TCP data segments sent by the client group 901. Although 910 and 911 are potentially available to all server group nodes, in this instance they are delivered only to nodes 902 and 904, as determined by the programmable input filtering system. Reference 912 denotes the TCP ACK segment sent from the server group sendhead 903 to the client group. No TCP ACK segment is sent when data segment 910 arrives; when the second data segment arrives, the server group sends an ACK segment to the client, in accordance with the TCP specification that every other segment must be acknowledged. The plex controller 902 sends an IACK signal 914, indicating atomic delivery, only on receipt of a PIACK (Plex IACK) signal 913 indicating that the required plex/replica 904 has acknowledged the same data segments. 902 itself does not send a PIACK, because the controller is responsible for sending the IACK that indicates atomic delivery of the data. On receiving the IACK signal 914, node 903, which holds the sendhead, sends the TCP ACK segment 912 to the client group. In addition to the client ACK segment sent when every other TCP segment arrives, an ACK segment may optionally be sent at the end of every complete request input. A client ACK segment is also sent when an exception condition is detected, such as out-of-order segment arrival or a segment wait timeout, so that the client and server groups sync up and lost TCP segments are retransmitted promptly. If a server node sends a PIACK and fails to receive an IACK for it, the server node retransmits the PIACK, and the receiving head responds with another IACK indicating the latest sequence of the node's input data.
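The acknowledgment rule described for FIG. 9 can be sketched as follows. This is a minimal illustration, not the patented implementation: the class name `PlexController`, its methods, and the use of plain integers as sequence numbers are assumptions made for the example. It shows only the rule that an IACK is announced once every required replica has sent a PIACK, and that a retransmitted PIACK is answered with the latest atomically delivered sequence.

```python
from dataclasses import dataclass, field


@dataclass
class PlexController:
    """Hypothetical controller: decides when an IACK may be sent."""
    required: set                                # replicas that must send a PIACK
    piacks: dict = field(default_factory=dict)   # replica id -> highest sequence acknowledged
    local_seq: int = 0                           # highest sequence received by the controller

    def on_local_data(self, seq):
        # The controller receives the data itself and never sends a PIACK.
        self.local_seq = max(self.local_seq, seq)

    def on_piack(self, replica, seq):
        """Record a PIACK; return the sequence to announce in an IACK, or None."""
        if replica not in self.required:
            return None
        self.piacks[replica] = max(self.piacks.get(replica, 0), seq)
        return self.iack_seq()

    def iack_seq(self):
        # Atomic delivery: every required replica must have acknowledged.
        if any(r not in self.piacks for r in self.required):
            return None
        seq = min([self.local_seq, *self.piacks.values()])
        return seq if seq > 0 else None
```

In this sketch, a node holding the sendhead would send the TCP ACK to the client only after `on_piack` returns a sequence number; a repeated PIACK simply returns the same latest sequence again.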

FIG. 10 shows a logical view of an implementation in which input data is shared, as on a bus, while output data is switched. 1000 is the input data stream from the peer group. 1010 is a logical half-bus in which only the input is shared, using multicast or a shared medium such as Ethernet. 1020, 1021, and 1022 represent the bus input endpoints corresponding to nodes 1030, 1031, and 1032, respectively. 1040, 1041, and 1042 are the output endpoints, which feed into a layer 2 or layer 3 IP switching device 1050. 1060 represents the aggregated output produced by nodes 1030, 1031, and 1032 for input 1000. 1000 and 1060 form the input and output, respectively, of a single connection.

Load Sharing and Load Balancing

FIG. 11 illustrates a symmetric multi-computer system according to one embodiment of the present invention. The server group 1112 consists of a plurality of nodes 1100a, b, c, d, e, f. The input stream 1110 of a TCP connection 1109 has multiple endpoints 1110a, b, c, d, e, f that span the group nodes. Likewise, the output stream 1111 of the same connection consists of endpoints 1111a, b, c, d, e, f.

The application consists of three segments 1113, 1114, 1115 running across the entire group, with two instances of each application segment (1113a,b, 1114a,b, 1115a,b). By programming the communication system, each segment is delivered specific tasks based on criteria such as the data it manages and the operations it performs. Application segmentation is achieved by configuring data delivery so that a specific subset of the requests for a service is delivered to a specific instance of the application; in many cases this requires no code changes to existing applications. An application can be partitioned in various ways: for example, by the kind or type of request a segment can handle, or by a hashing algorithm based on data content or on connection information such as the sequence number. An application may also be divided into segments by programming the segments differently.
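As an illustration of the hash-based partitioning mentioned above, the following sketch maps a request key to one of the three application segments. The function name `segment_for`, the choice of SHA-256, and the use of the segment reference numerals as identifiers are assumptions for the example; the text does not specify a particular hashing algorithm.

```python
import hashlib

# Segment identifiers borrowed from the reference numerals in FIG. 11.
SEGMENTS = ["1113", "1114", "1115"]


def segment_for(key: bytes, segments=SEGMENTS) -> str:
    """Pick the segment responsible for a request key (hypothetical scheme)."""
    digest = hashlib.sha256(key).digest()
    # Use the first 8 bytes of the digest as an unsigned integer.
    index = int.from_bytes(digest[:8], "big") % len(segments)
    return segments[index]
```

Because the mapping depends only on the key, every input filter that evaluates it reaches the same decision without coordination, which is one reason a content hash is a natural fit for this kind of delivery filter.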

The group nodes are paired as replicas 1100a,b, 1100c,d, and 1100e,f, and each pair runs the two instances of application segments 1113, 1114, and 1115, respectively. If the segment instance on 1100a fails, its pair 1100b continues the service without interruption. If instance 1100a fails while transmitting, the other instance 1100b delivers the remainder of the response to the peer without disruption of service. Similarly, new application segment instances can be added to the group to increase fault tolerance, since the added instances are available to continue service in the event of a failure. This can be done, for example, by creating a new process that runs the application segment instance, adding it to the group, and distributing requests accordingly.

In one mode of operation, requests are delivered to a non-empty subset of the group in a specific order, such as round robin or weighted priority, so that requests are distributed across that subset and the load is balanced among the nodes.
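The two orderings named above can be sketched as follows. Both helpers are hypothetical and assume integer weights; the weighted variant simply repeats each node in proportion to its weight, which is one simple way to realize a weighted priority order, not necessarily the one intended by the text.

```python
import itertools


def round_robin(nodes):
    """Yield nodes in a fixed rotating order."""
    return itertools.cycle(list(nodes))


def weighted_round_robin(weights):
    """Yield nodes in proportion to integer weights (dict: node -> weight)."""
    expanded = [node for node, w in weights.items() for _ in range(w)]
    return itertools.cycle(expanded)
```

Each incoming request would be delivered to the next node produced by the chosen iterator.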

In one mode of operation, one or more replicas are delivered a task, and after the task completes, the result from a replica instance is sent over the connection without regard to the other replicas. In another mode of replica operation, more than one replica is delivered the same task. The replicas involved then perform the operation in parallel and produce results. An output filter installed on the output stream of the group-to-group communication system compares the results, and a single instance of the result is sent to the peer group, so that the group appears as a single entity to the peer group. The selection of the output instance sent to the peer group depends on the policy set in the filter, such as identical outputs, majority agreement, a correct result, or a successful operation output. The choice of policy depends on the application. If a segment instance fails while transmitting, a replica takes over and continues the transmission without disrupting the connection.

When the comparison performed by the output content filter indicates that the outputs produced by the nodes differ, a subset of the replicas is considered faulty and is excluded from further service over the connection, while the remaining endpoints continue service without disruption of the connection. In embodiments in which endpoints are excluded, the exclusion is based on a scheme in which a majority of the endpoints agree to exclude the others. Alternatively, endpoints are excluded when an operation fails. Endpoints may also be excluded according to an application-specific scheme that is programmable through the filter.
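A minimal sketch of such an output filter follows, covering only two of the policies named in the text (identical outputs and majority agreement). The function name `filter_outputs`, the policy strings, and the representation of outputs as a dictionary from replica identifier to bytes are assumptions for the example.

```python
from collections import Counter


def filter_outputs(outputs, policy="majority"):
    """Compare replica outputs; return (chosen output, replicas to exclude).

    outputs: dict mapping a replica id to the bytes it produced (hypothetical shape).
    """
    if policy not in ("identical", "majority"):
        raise ValueError(f"unknown policy: {policy}")
    if not outputs:
        raise ValueError("no replica outputs to compare")
    counts = Counter(outputs.values())
    winner, votes = counts.most_common(1)[0]
    if policy == "identical" and votes != len(outputs):
        raise ValueError("replica outputs differ")
    if policy == "majority" and votes * 2 <= len(outputs):
        raise ValueError("no majority agreement")
    # Replicas whose output differs from the chosen one are candidates for exclusion.
    excluded = sorted(r for r, out in outputs.items() if out != winner)
    return winner, excluded
```

Only the chosen output would be forwarded to the peer group, so the peer sees a single response regardless of how many replicas computed it.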

In another mode of operation, all replicas are delivered the operations that result in state changes, such as modified data in memory or storage. In this way the replicas maintain a consistent state. When an operation arrives that does not affect consistency between replicas, such as a read operation, the task is delivered to only one of the replica instances. This enables load balancing among the replicas.
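The delivery rule in this mode can be sketched as a small router: state-changing operations go to every replica, while read-only operations go to a single replica chosen in rotation. The class `ReplicaRouter` and the boolean `changes_state` flag are hypothetical; how an operation is classified as state-changing is left to the application.

```python
import itertools


class ReplicaRouter:
    """Hypothetical filter deciding which replicas receive a task."""

    def __init__(self, replicas):
        self.replicas = list(replicas)
        self._next_reader = itertools.cycle(self.replicas)

    def targets(self, changes_state: bool):
        if changes_state:
            # Writes reach all replicas so that their state stays consistent.
            return list(self.replicas)
        # Reads are spread over the replicas one at a time.
        return [next(self._next_reader)]
```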

Adding and Removing Nodes

Filters at the connection endpoints of the TCP group-to-group communication system enable fine-grained control of data delivery to application segments. By dynamically configuring the filters, specific tasks are delivered to specific nodes, which enables external control over the delivery of task requests to nodes. The flow of requests to the application segments is thus controlled like a switch, for fine-grained task distribution among the nodes.

Nodes can be added to a group at any time. A newly added node can share the load of existing connections as well as new ones. For an existing connection, the node joins the service and begins accepting tasks as they arrive. Where necessary, the load among the nodes is balanced by migrating tasks.

To remove a node, the node's load responsibility is migrated to another node selected using a scheme such as lowest load, round robin, or an application-specific scheme. While it is being removed, the node accepts no new tasks and is freed up completely by waiting for small tasks to finish. When long-running tasks are involved, task migration, such as system-level process migration, is used. With process migration, the entire context of the application process, such as the stack, data, and open files, is migrated transparently to another node. Nodes communicate with other nodes in the group by creating a connection to the address of the virtual entity represented by the group. This provides all of the features described above for communication among the group nodes.
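The lowest-load selection scheme mentioned for node removal can be sketched as below. The function `pick_successor` and the representation of load as a number per node are assumptions for the example; ties are broken by node name only to keep the result deterministic.

```python
def pick_successor(loads, leaving):
    """Choose the least-loaded remaining node to take over from `leaving`.

    loads: dict mapping a node id to its current load (hypothetical metric).
    """
    candidates = {n: load for n, load in loads.items() if n != leaving}
    if not candidates:
        raise ValueError("no remaining node can take over the load")
    return min(candidates, key=lambda n: (candidates[n], n))
```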

Automatic Provisioning

The system automatically and dynamically adds resources from a pool to the group to meet changing needs. Similarly, nodes are removed and provisioned dynamically and automatically. The system monitors the quality of service delivered to clients and maintains a specified quality of service by adding and removing resources. The operation can be performed from outside the system and is potentially transparent to the peer group.
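One way such a provisioning loop might decide when to add or remove a node is sketched below. Everything here is an assumption for illustration: the text does not say which quality-of-service metric is monitored, so the example uses response latency against a target, with a hypothetical `slack` factor to avoid removing nodes too eagerly.

```python
def provision(group, pool, latency_ms, target_ms, slack=0.5):
    """Move one node between pool and group based on observed latency.

    Returns "added", "removed", or "unchanged". Both lists are modified in place.
    """
    if latency_ms > target_ms and pool:
        # Service quality is below target: bring in a spare node.
        group.append(pool.pop())
        return "added"
    if latency_ms < target_ms * slack and len(group) > 1:
        # Comfortably within target: return a node to the pool.
        pool.append(group.pop())
        return "removed"
    return "unchanged"
```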

Those skilled in the art will appreciate that, in accordance with the principles of the present invention, still further alternative structural and functional designs for group-to-group communication over a single connection are possible. Thus, while particular embodiments and applications of the present invention have been illustrated and described, it is to be understood that the invention is not limited to the precise construction and components disclosed herein. It will be apparent to those skilled in the art that various modifications, changes, and variations may be made in the arrangement, operation, and details of the method and apparatus disclosed herein without departing from the spirit and scope of the invention as defined in the claims.

Claims

A method of communication between nodes for atomic delivery of data, the method comprising:

Determining a node to transmit data;

Sending a single instance of data; and

Serializing transmission control for data transmission.

An apparatus for communication between groups in a network, the apparatus comprising:

a. A first uniquely addressed group and a second uniquely addressed group, wherein the first group includes a plurality of endpoints and each group comprises at least one node;

b. A communication protocol having a single logical connection enabled for communication between each group;

c. The plurality of endpoints being enabled to send data to the second uniquely addressed group and to admit data to the nodes of the first uniquely addressed group, so as to communicate effectively between non-empty subsets of each group.

The communication device according to claim 2, wherein data communication between each group does not pass through each node of each group.

3. The communications device of claim 2 wherein the endpoint comprises a protocol stack.

3. The communications apparatus of claim 2 wherein the node comprises one of a group of processes or devices.

3. The communications device of claim 2, wherein each group is enabled to address another group by a unique address.

The communication device of claim 2, wherein the communication protocol comprises a Transmission Control Protocol (TCP).

9. A communications device according to claim 8, wherein data communication between each group does not pass through each node of each group.

9. A communication device as claimed in claim 8, wherein the endpoint comprises a protocol stack.

9. A communications device according to claim 8, wherein said protocol guarantees data transmission in the order in which they were sent by said group.

10. The communications apparatus of claim 8 wherein the endpoint resides in a node.

9. The communications device of claim 8 wherein the endpoint resides outside of the node.

9. The communications device of claim 8, wherein the membership of the two groups varies over the duration of the connection.