KR100915399B1

KR100915399B1 - Data partitioning and critical section reduction for bayesian network structure learning

Info

Publication number: KR100915399B1
Application number: KR1020077017659A
Authority: KR
Inventors: 춘롱 라이; 웨이 후
Original assignee: 인텔 코포레이션
Priority date: 2004-12-31
Filing date: 2004-12-31
Publication date: 2009-09-03
Also published as: KR20070100779A

Abstract

In a parallel system, multiple threads operate in parallel to perform network structure learning. A global score cache is partitioned into multiple split score caches, which in one embodiment includes associating a score cache with a node of the structure to be learned. Learning can be performed with the split score cache in a split neighbor scoring loop, where a first loop operates on separate score cache partitions and warms the score cache partitions for a second loop.

Description

STRUCTURE LEARNING METHOD, COMPUTER-READABLE STORAGE MEDIA, DEVICES AND SYSTEMS {DATA PARTITIONING AND CRITICAL SECTION REDUCTION FOR BAYESIAN NETWORK STRUCTURE LEARNING}

Embodiments of the invention relate to network structure learning, and more particularly to data partitioning for Bayesian network structure learning.

Large amounts of information, especially related information, may be organized into network structures. A Bayesian network is a common example of such a network structure. The use of Bayesian networks is increasing in bioinformatics, pattern recognition, statistical computing, and other fields. Learning a Bayesian network structure is very computationally intensive, and finding a true "optimal" structure may be NP-complete. Even though learning Bayesian network structures is very computationally intensive, networks with much larger data sets are being explored, which may increase the computational intensity, potentially exponentially. Heuristic approaches often focus on improving the performance efficiency of structure learning, for example by reducing execution time. Performance efficiency is becoming increasingly important in providing acceptable practical solutions for modern networks.

Parallel learning approaches have been considered that bring the resources of multiple computation machines and/or processing cores to bear on executing a structure learning algorithm. The parallel nature of these approaches attempts to distribute work among multiple resources to reduce the time any one system spends finding a solution. Traditional parallel learning distributes computation tasks in a basic, naive manner, which typically considers only the number of tasks assigned to each parallel computing resource.

For example, in neighbor score computation, a main or master thread may distribute neighbor computations among the other available threads. In computing a neighbor, a thread checks the score cache to determine whether a family score for the structure is known. The score cache is traditionally a data structure in which known network family scores are stored; it is shared among the threads and accessed as a hash table. If the score is known (resulting in a cache hit), the computing resource can simply load the score and use it to compute the neighbor score (the score of the directed acyclic graph (DAG), or structure, of interest). If the score is not known (resulting in a cache miss), the computing resource may be required to compute the family score before computing the neighbor score. Note that because the data structure is available to multiple threads, score cache access may need to occur within a critical section to reduce data overrunning. Thus, there may be periods in which a computing resource uses its resources inefficiently and/or sits idle waiting for another computing resource to release the score cache. Current or traditional multiprocessing/hyper-threading approaches to structure learning may therefore fail to provide the desired performance for structure learning of networks of increasing size and complexity.

The following description includes discussion of various figures having illustrations given by way of example only, and not by way of limitation. The figures may be briefly described as follows.

FIG. 1 is a block diagram of an embodiment of a computing device having a partitioned score cache.

FIG. 2 is a block diagram of an embodiment of a multi-threaded computing device having a partitioned score cache.

FIG. 3 is a flow diagram of an embodiment of splitting a structure learning loop into an inner loop and an outer loop.

FIG. 4 is a flow diagram of an embodiment of structure learning with a split score cache.

In a very general sense, structure learning as applied to Bayesian networks is a method of discovering the probabilistic relationships between variables in a network of nodes, where each node represents a condition/state and/or information related to a cause and/or effect of a condition/state. Structure learning may represent the network by constructing a network structure representation based on the probabilistic relationships between individual nodes. Hill-climbing is an algorithm often used for learning static and/or dynamic Bayesian networks, and may include a score cache, which is a multi-dimensional sparse matrix typically implemented as a hash table. Each element of the matrix stores the score of a node family, or family of nodes (referred to herein simply as a "family"). A family includes a current node, or target node, of interest (the child node) and the parent nodes of the current node (referred to herein simply as "parents"). Parent and target nodes may be related by a probabilistic relationship. The score cache may be used to store the score of a family after the score has been computed.

In structure learning, a learning algorithm generally first loads training data (known relationships) and determines a structure by computing scores based on the training data and a specific scoring metric. The algorithm computes the score of a starting point, or initial current structure, which may be an initial user-defined Bayesian network structure from which structure learning begins. Neighbors of the starting point (structures that differ from the current structure by one edge) may be generated, and the score of each neighbor computed. A traditional approach to scoring each neighbor involves looking up the score cache to determine whether the score of a family corresponding to the neighbor is known, that is, already computed. A family score can be assumed not to change, so once it has been computed for one calculation it can be stored and reused in another. If the family score is available, it can be loaded directly and the neighbor score computed. If the family score is unavailable, the score of the entire structure (including the family) is generally computed and the score cache is updated. The process is typically repeated for all neighbors, and the algorithm chooses the neighbor with the maximum score as the new current structure from which the next iteration of learning begins. Optimally, the process repeats until no neighbor scores higher than the current structure. Practical applications often use heuristic approaches, because an optimal solution, as determined in one implementation, may be infeasible or unattainable.
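
The hill-climbing procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any embodiment: `TARGET` and `family_score` are a hypothetical toy metric standing in for a real scoring metric computed from training data, and the names `neighbors`, `score`, and `hill_climb` are introduced here only for illustration. The structure score is taken to be the sum of the family scores, so each family score is computed once and then reused from the cache.

```python
from itertools import permutations

TARGET = {(0, 1), (1, 2)}  # hypothetical "true" edges, used only by the toy metric

def family_score(child, parents):
    # Toy metric (assumption): +1 per parent that matches TARGET, -1 otherwise.
    return sum(1 if (p, child) in TARGET else -1 for p in parents)

def is_acyclic(n, edges):
    # Repeatedly strip nodes that have no incoming edge; a cycle leaves none to strip.
    remaining, nodes = set(edges), set(range(n))
    while nodes:
        free = {v for v in nodes if not any(c == v for _, c in remaining)}
        if not free:
            return False
        nodes -= free
        remaining = {(p, c) for p, c in remaining if p not in free}
    return True

def neighbors(n, edges):
    # Structures that differ from `edges` by one added, deleted, or reversed edge.
    for p, c in permutations(range(n), 2):
        if (p, c) in edges:
            yield edges - {(p, c)}                 # delete
            yield (edges - {(p, c)}) | {(c, p)}    # reverse
        elif (c, p) not in edges:
            yield edges | {(p, c)}                 # add

def score(n, edges, cache):
    total = 0
    for child in range(n):
        key = (child, frozenset(p for p, c in edges if c == child))
        if key not in cache:                       # cache miss: compute and store
            cache[key] = family_score(*key)
        total += cache[key]                        # cache hit: reuse stored score
    return total

def hill_climb(n, start=frozenset()):
    cache, current = {}, frozenset(start)
    best = score(n, current, cache)
    while True:
        candidates = [e for e in neighbors(n, current) if is_acyclic(n, e)]
        top = max(candidates, key=lambda e: score(n, e, cache), default=None)
        if top is None or score(n, top, cache) <= best:
            return current, best                   # no neighbor improves the score
        current, best = top, score(n, top, cache)

structure, best = hill_climb(3)
```

With the toy metric, the search started from the empty graph on three nodes stops at the edges in `TARGET`.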

A common problem with structure learning is its computational intensity and the associated long execution times. Parallelization can speed up the structure learning process by distributing neighbor scoring tasks/computations among multiple threads (e.g., parallel cores, arithmetic logic units (ALUs), processing cores, processors, central processing units (CPUs)). In a naive parallelization approach, each thread may receive one or more neighbors to compute, with the neighbors distributed equally among the threads. The content of the score cache and/or the process involved in accessing it can affect thread performance, create race conditions, cause delays, and so on. Traditionally the score cache is a single data structure accessible to every thread, which can produce the conditions just mentioned. Current parallelization approaches include creating a critical section to restrict cache access to one thread at a time. Another approach stores multiple versions of the cache, one per thread, but this can incur an unacceptable overhead penalty.
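
For contrast with the split score cache discussed below, the conventional arrangement (one shared hash table guarded by a critical section) might look like the following sketch. The class name `GlobalScoreCache`, the callback `toy_score`, and the choice to compute the missing score while holding the lock are assumptions made for illustration; the text does not prescribe them.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

class GlobalScoreCache:
    """Single hash table shared by every thread (the conventional layout)."""

    def __init__(self, compute):
        self._scores = {}
        self._lock = threading.Lock()
        self._compute = compute      # hypothetical family-scoring callback
        self.misses = 0

    def get(self, child, parents):
        key = (child, frozenset(parents))
        with self._lock:             # critical section: one thread at a time
            if key not in self._scores:
                self.misses += 1     # cache miss: compute, then store
                self._scores[key] = self._compute(child, key[1])
            return self._scores[key]

def toy_score(child, parents):
    # Placeholder metric (assumption); the text does not fix a scoring function.
    return -len(parents)

cache = GlobalScoreCache(toy_score)
families = [(1, {0}), (2, {0, 1}), (1, {0}), (2, {1, 0})] * 50

with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(lambda f: cache.get(*f), families))
```

Every lookup, hit or miss, passes through the same lock, so threads that need unrelated families still wait on one another. That serialization is the cost the partitioned design aims to reduce.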

System efficiency can be improved by providing improved locality optimization and load balancing among computation tasks. In addition, the scoring algorithm can be restructured to execute in a way that allows more efficient use of and/or access to the score cache. In one embodiment the score cache is partitioned into multiple sub-parts. For example, the score cache may be an array of individually addressable memory structures, where each element of the array is the score cache for one node. In a complex DAG, for example one containing thousands of nodes, the penalty caused by the irregular access pattern of a hash-index function can be reduced by providing a score cache array. The hash-indexing penalty is replaced by the penalty associated with the higher overhead of managing the score cache array. In a large, complex DAG, the penalty for irregular/random access can be far higher than the overhead penalty.

For example, in a gene network application a DAG may consist of thousands of nodes, which means the score cache must store millions of elements. A hash-index function may incur a considerably higher penalty than managing a score cache array that has one array element for each of the thousands of nodes. For example, random cache access for a gene network with 2400 variables (corresponding to a DAG with 2400 nodes) may incur a higher penalty than accessing a score cache built as an addressable array of 2400 elements. In one embodiment each element represents a small score cache that stores only the scores of the corresponding node in the gene network.
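
One way to picture the score cache array is sketched below: a list with one small dictionary per node, indexed directly by the child node. The class and method names (`SplitScoreCache`, `lookup`, `store`) and the stored value are hypothetical; only the array-of-per-node-caches layout and the 2400-node figure come from the text.

```python
class SplitScoreCache:
    """Array of per-node score caches; element i holds only node i's family scores."""

    def __init__(self, num_nodes):
        self.parts = [{} for _ in range(num_nodes)]

    def lookup(self, child, parents):
        # Direct array index selects the partition; the parent set is the key within it.
        return self.parts[child].get(frozenset(parents))

    def store(self, child, parents, score):
        self.parts[child][frozenset(parents)] = score

cache = SplitScoreCache(2400)        # one partition per node of a 2400-node DAG
cache.store(7, {3, 5}, -12.5)        # illustrative score value, not from the source
hit = cache.lookup(7, [5, 3])        # same family, different parent order
miss = cache.lookup(8, {3, 5})       # different child node, different partition
```

Each partition stays small, so a lookup touches only the memory belonging to one node instead of probing a single table that holds every family in the network.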

In addition, the split score cache approach provides an opportunity to reorganize the neighbors of the current DAG within scoring. When evaluating neighbors that add or delete an edge to the same node, the scoring algorithm (and the thread executing it) can obtain all the information needed to complete the neighbor computation by accessing a single score cache array element, the one corresponding to that node (referred to herein as a split score cache). Thus, managing how neighbors are distributed for scoring, so that the neighbors related to a node are grouped together, can improve the locality of score cache accesses. A neighbor that reverses an edge involves two operations: deleting the original edge and adding an edge in the opposite direction. Each such reversal neighbor accesses one split score cache for the child node and one for the parent node. Reversal neighbors can be grouped with the neighbors that add or delete edges to the same child node, which controls the access pattern of the scoring/computation function and thereby improves access locality. Improved locality results in faster execution. Note that a split score cache with optimized access provides both temporal and spatial locality improvements.
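
The grouping described above can be sketched as follows. Neighbors are represented here as hypothetical `(kind, parent, child)` tuples, and a reversal is filed under the child node of the original edge; that choice is an assumption, since the text does not say which of the two nodes a reversal is grouped under.

```python
from collections import defaultdict

def partitions_touched(op):
    # Add/delete touches only the child's split cache; a reversal touches both ends.
    kind, parent, child = op
    return {parent, child} if kind == "reverse" else {child}

def group_by_child(ops):
    # Bucket neighbor operations so that all work on one node's cache is contiguous.
    groups = defaultdict(list)
    for op in ops:
        groups[op[2]].append(op)
    return dict(groups)

ops = [("add", 0, 2), ("delete", 1, 3), ("reverse", 1, 2), ("add", 4, 3)]
groups = group_by_child(ops)
```

Handing each group to a single thread means that the thread works through one split score cache at a time, which is the locality effect the paragraph describes.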

Partitioning the data with a split score cache can also reduce the likelihood of data races, or race conditions. When different end nodes are distributed to different threads, the possibility of data interdependence between threads is reduced. A critical section in which the two score caches corresponding to the start and end nodes are accessed may still be used for computing reversed edges, but otherwise the use of critical sections is greatly reduced.

Thus, partitioning the score cache can improve data access locality and allow finer-grained management of the scoring algorithm, as well as eliminate several critical sections in a multi-threaded implementation of the scoring algorithm. In one embodiment the main scoring loop is split into two loops. In the first loop, a thread may be permitted to access only a certain part of the score cache, which may be a different part/sub-section from those available to other threads. A thread in the first loop may also attempt to "warm" the score cache for other threads. Warming the score cache may include loading into it the family scores of one or more families associated with a node. If the score caches have been pre-loaded by the first loop for use in the second loop, the second loop may not need a critical section, because each cache is only read and never written. Loading the score cache in the first loop ensures that no cache miss occurs in the second loop despite the absence of a critical section, and reduces or eliminates the possibility of data races. As a result, multi-threaded scalability increases greatly with this approach. For example, experiments showed that implementing these techniques can provide a 1.95x speedup on a dual-processor (DP) simultaneous multi-processor (SMP) machine/system. Because of the data partitioning described, nearly linear speedup is also potentially attainable on SMP, on-chip multi-processor (CMP), and/or non-uniform memory access (NUMA) parallel processing machines with more processors or central processing units (CPUs). Note that a performance degradation of approximately 3-11% may be observed when scoring networks with a small number of nodes, due to the additional memory and execution-time overhead of managing the split score cache.
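
The two-loop arrangement can be sketched as follows, as one possible reading of the paragraph above rather than the claimed implementation. The first loop warms one partition per task, so each partition has exactly one writer; the second loop scores neighbors with read-only lookups and no lock. The toy metric `family_score`, the `(kind, parent, child)` tuples, and the helper names are assumptions. Python threads are used only to show the access discipline; because of the interpreter lock they would not reproduce the speedups reported in the text.

```python
from concurrent.futures import ThreadPoolExecutor

def family_score(child, parents):
    # Toy metric (assumption): reward node i-1 as a parent of node i, penalize size.
    return (2 if child - 1 in parents else 0) - len(parents)

def changed_families(parents, op):
    # Families whose score differs between the current structure and this neighbor.
    kind, p, c = op
    if kind == "add":
        return [(c, parents[c] | {p})]
    if kind == "delete":
        return [(c, parents[c] - {p})]
    return [(c, parents[c] - {p}), (p, parents[p] | {c})]   # reverse

def score_neighbors(parents, ops, workers=2):
    n = len(parents)
    cache = [{} for _ in range(n)]                 # one partition per node
    needed = [{parents[v]} for v in range(n)]      # parent sets to score, per child
    for op in ops:
        for child, pset in changed_families(parents, op):
            needed[child].add(pset)

    def warm(node):                                # loop 1: sole writer of cache[node]
        for pset in needed[node]:
            cache[node][pset] = family_score(node, pset)

    def delta(op):                                 # loop 2: read-only, no lock, no miss
        return sum(cache[ch][ps] - cache[ch][parents[ch]]
                   for ch, ps in changed_families(parents, op))

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(warm, range(n)))             # all warming finishes before loop 2
        return list(pool.map(delta, ops))

parents = [frozenset(), frozenset({0}), frozenset()]   # current DAG: 0 -> 1
ops = [("add", 1, 2), ("delete", 0, 1), ("reverse", 0, 1)]
deltas = score_neighbors(parents, ops)
```

Each entry of `deltas` is the change in total score relative to the current structure. Consuming the first `pool.map` completely before starting the second acts as the barrier between the two loops, which is what makes the unlocked reads in the second loop safe.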

Various references herein to an "embodiment" are to be understood as describing a particular feature, structure, or characteristic included in at least one embodiment of the invention. Thus, phrases such as "in one embodiment" or "in another embodiment" describe various embodiments of the invention, and do not necessarily all refer to the same embodiment.

FIG. 1 is a block diagram of an embodiment of a computing device having a partitioned score cache. Computing device 100 represents a computer, server, workstation, or other computing device/apparatus/machine. Computing device 100 may be multi-threaded to allow simultaneous/parallel handling of different processes. Processor 110 represents one or more processing units and/or computing cores. Processor 110 may include a central processing unit, a microcontroller, a digital signal processor (DSP), an ALU, etc. Processor 120 likewise represents one or more processing units and/or computing cores, and may include a central processing unit, a microcontroller, a DSP, an ALU, etc. Processors 110 and 120 may operate in parallel. In one embodiment processors 110 and 120 represent parallel processing cores of computing device 100. In one embodiment computing device 100 does not include processor 120. Computing device 100 may represent a simultaneous multi-processor (SMP) system, an on-chip multi-processor (CMP) system, and/or another parallel processing non-uniform memory access (NUMA) system.

Memory 112 may provide storage for temporary variables and/or instructions for execution by processor 110. Memory 112 may represent on-chip memory, for example a cache layer on processor 110, volatile storage on a system bus of computing device 100, system random access memory (RAM), etc. Memory 112 may be accessible directly by processor 110, accessible over a system bus, and/or a combination of these. Memory 122 may be similarly described with respect to processor 120. In one embodiment, a memory/cache is commonly accessible to both processors 110 and 120.

In one embodiment computing device 100 includes input/output (I/O) interface 130, which represents one or more mechanisms/devices through which computing device 100 may receive input from and/or provide output to an external source. An external source may include another computing system, a user, etc., and may include display devices, cursor controls, alphanumeric input devices, audio input and/or output devices, visual displays (e.g., light emitting diodes (LEDs)), etc. I/O interface 130 may also include drivers for I/O devices. Information/data/instructions received through I/O interface 130 may be stored in memory 112 and/or memory 122 and/or mass storage 140. Mass storage 140 represents one or more of various storage mechanisms, including removable storage 142 (e.g., disk drives, memory sticks/cards/slots, universal serial bus (USB)-connected devices, etc.) and non-volatile storage 144 (e.g., disk drives, memory sticks/cards, hard disk drives, etc.). Mass storage 140 may store programs/applications and/or instructions for loading into memory 112 and/or 122 for execution on processor 110 and/or 120, and/or data relating to or associated with a program or instruction.

Data, instructions, and/or program information may be provided via an article of manufacture by a machine/electronic device/hardware and executed by/on computing device 100. An article of manufacture may include a machine accessible/readable medium having content to provide instructions, data, etc. The content may cause computing device 100 to perform the various operations and executions described. A machine accessible medium includes any mechanism that provides (i.e., stores and/or transmits) information/content in a form accessible by a machine (e.g., a computing device, electronic device, electronic system/subsystem, etc.). For example, a machine accessible medium includes recordable/non-recordable media (e.g., read only memory (ROM), random access memory (RAM), magnetic disk storage media, optical storage media, flash memory devices, etc.), as well as electrical, optical, acoustical, or other forms of propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). The machine accessible medium may further include a computing system having code loaded on it that the computing system is able to execute when it is in operation. Thus, delivering a computing system with such code may be understood as providing the article of manufacture with such content described above. Furthermore, storing code on a database or other memory location and offering the code for download over a communication medium via a propagated signal may be understood as providing the article of manufacture with such content described above.

In one embodiment computing device 100 may include network interface 150, which may include a wired or wireless interface and/or both wired and wireless interfaces. Network interface 150 may represent a network card/circuit through which computing device 100 may interface with a parallel computing device over a network.

In one embodiment computing device 100 includes score cache 160, which may represent one or more components that provide storage of values for structure learning of a data network. Score cache 160 may include multiple partitions/sub-sections/elements 161, which may be individually addressable memory locations/sections. Each location may be addressed/accessed by structuring score cache 160 as an array of elements, with each element of the array being an element 161. Score cache 160 may represent hardware and/or software components that manage storage hardware. In one embodiment score cache 160 represents a data structure accessible to processor 110 and processor 120. The storage of data in score cache 160 and access to that data may be controlled to reduce data races, improve locality, and eliminate one or more critical sections in the structure learning process. Score cache 160 is shown generally in computing device 100, and may be understood as data stored on mass storage 140, executed on one or both of processors 110 and 120, and/or residing in one or both of memories 112 and 122.

FIG. 2 is a block diagram of an embodiment of a multi-threaded computing device having a partitioned score cache. Computing device 200 represents a device, system, apparatus, and/or computer that executes/controls execution of a program. Computing device 200 may execute part or all of a network structure learning function or program. Computing device 200 includes multiple threads, thread 1 through thread N, where N represents an integer. For example, threads 1 through N may represent multiple parallel computing resources on an SMP, CMP, or NUMA machine. In one embodiment computing device 200 also includes score cache 210, which is partitioned into multiple sub-parts, sub 1 through sub M, where M represents an integer that may or may not be equal to N. For example, computing device 200 may include two threads and partition score cache 210 into 1000 elements. In another example, the computing device may include sixteen threads and partition the score cache into twenty elements. The description of the techniques herein is not limited to these or any other particular implementation. The approach scales to many parallel computing resources (threads), and scales from small network learning structures to large, complex learning structures.

Each partition, sub 1 through sub M, may contain multiple entries, and the number of entries may or may not be the same for every partition. The representation in FIG. 2 is for illustration only and is not intended to indicate the number of entries that may be associated with any particular partition. In one embodiment, each partition is associated with a different node of the network structure to be learned. The entries may represent values for the different families associated with the node, or may represent different node states (for example, relationships with other nodes). Once a value has been computed and stored, for example in sub 1, it can be treated as a constant. The value is therefore scored once, stored in an entry of sub 1, and individually addressed on later accesses for use in computing subsequent neighbor scores.
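One way to picture the partitioned cache is as an array with one partition per node, where each partition maps a family (a parent set of that node) to its score. The following is a minimal sketch under that assumption; the names `make_split_cache` and `family_score`, and the use of a dictionary keyed by parent set, are illustrative choices and are not taken from the specification.

```python
from typing import Callable, Dict, FrozenSet, List

# Assumed layout: one partition ("sub i") per node. Each entry maps a parent
# set (one family of that node) to its cached score.
Partition = Dict[FrozenSet[int], float]


def make_split_cache(num_nodes: int) -> List[Partition]:
    """Create one independently addressable partition per node."""
    return [{} for _ in range(num_nodes)]


def family_score(cache: List[Partition], node: int, parents,
                 compute: Callable[[int, FrozenSet[int]], float]) -> float:
    """Return the cached family score, computing and storing it on a miss."""
    key = frozenset(parents)
    partition = cache[node]      # index directly to the node's partition
    if key not in partition:     # miss: score once, then treat as constant
        partition[key] = compute(node, key)
    return partition[key]
```

Because a stored score never changes, a later lookup of the same family is a plain read of the node's partition.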

Each of sub 1 through sub M may be accessed by any of thread 1 through thread N. In one embodiment, access by the various threads to the various partitions is controlled. Access can be controlled through the organization of the computation tasks that the threads execute. For example, thread N may be assigned a series of tasks that compute each entry of sub M. If thread 1 accesses sub M in a subsequent computation task, all of the values will already have been computed, and the access by thread 1 will not incur a cache miss. Because the values of the entries will no longer change, thread 1 and thread N can both access sub M without requiring a critical section.

In one embodiment, communication network 220 may represent a wide area network (WAN), such as the Internet, and may also represent a local area network (LAN) or other local interconnection between computers. Communication network 220 may represent a combination of a LAN and a WAN, and may include wired and/or wireless connections or links. Communication network 220, which represents an interconnection of computing devices, should not be confused with a data network or network structure (for example, a Bayesian network), which refers to a logical representation of information and of the internal relationships within that information.

In one embodiment, database (DB) 240 is coupled to computing device 200, either by a direct link or through a communication interface, for example over communication network 220. In this example, communication network 220 may represent one or more intermediate connections between computing device 200 and database 240. Database 240 may represent any number of database hardware devices, servers, and the like that can store information or data. In one embodiment, database 240 contains information related to the network structure to be learned. For example, database 240 may contain the evidence (for example, known data or training data) from which one or more families or other network structures are derived. Database 240 may thus include the data from which the structure is learned, the data from which neighbors and the families associated with those neighbors are determined, and so on. Database 240 may also be considered to store the network structure itself. A handle to the network structure may be created by database 240 and/or by one of the threads of computing device 200. The information in database 240 may be used to compute the family scores that are loaded into score cache 210.

FIG. 3 is a flow diagram illustrating one embodiment of splitting the structure learning loop into an inner loop and an outer loop. To begin structure learning, a structure may be selected as a starting point and its neighbors determined. Determining the neighbors and selecting one or more starting structures may be performed by a main thread. The main thread may distribute the neighbor computation tasks to multiple parallel computing threads, which may include the main thread itself. An initial score is computed (302) in either the main thread or a parallel thread. This computation, and any other computation, operation, or execution derived from the elements of FIG. 3, may be performed by multiple threads simultaneously or nearly simultaneously, and the threads may be said to operate in parallel. The starting point may be selected based on given criteria or parameters of the structure learning program, selected at random, or selected on some other basis. The thread selects a node of the DAG as the end node (304) and selects some node as the start node (306). The end node is the child node to which arcs will point. The start node is the parent node, and the parent-child relationship between the two nodes is represented by an arc, or edge. The start and end nodes may likewise be selected arbitrarily, for example as the first in a list of nodes, or selected according to some other criterion. In one embodiment, this selection is not significant.

The thread determines whether an edge exists from the start node to the end node (310). For example, the thread accesses the score cache to determine whether a value for the edge has been computed. With a split score cache, the access is made to the score cache, or to the segment or partition of the score cache associated with the node. If no edge exists, the thread determines whether adding the edge would produce a valid neighbor (320). If so, the new score of the end node is computed with the new edge added (322). The thread can determine the neighbor score based on the new edge, determine whether that score is better than the current best neighbor, and update the best-scoring neighbor if the computed score is better (324).

If an edge exists, the thread may determine whether reversing the edge would produce a valid neighbor (330). If so, new scores of the start node and the end node can be computed (332). The thread determines the neighbor score based on the new edge, determines whether that score is better than the current best neighbor, and updates the best-scoring neighbor if the computed score is better (324). After the best-scoring neighbor has been updated, or if neither the edge from the start node to the end node nor the reversed edge produces a valid neighbor, the thread determines whether the start node just scored is the last start node for the selected end node (340). This determination allows a split, or partitioned, score cache to be used to provide better locality and therefore faster execution of the scoring algorithm. For example, the split score cache may have a single addressable memory location for the end node, with each entry at that location representing the score associated with the end node for a given start node. Note that the roles of the end node and the start node can be interchanged with similar results.

If the start node is not the last start node, the next start node is selected (342), and the process repeats until all start nodes have been processed. Once all start nodes have been processed, the thread determines whether all end nodes have been processed (350). This structure provides a first loop and a second loop, in which the first loop processes the score cache values for a node. In this way the first loop loads the cache that the second loop will use, warming the score cache. If the last end node has not been processed, the next end node is selected (352), and processing begins for that end node. If the last end node has been processed, the thread determines whether learning is complete (360). If learning is not complete, the current DAG is updated and scored (362), and the process repeats until no neighbor scores higher than the current structure. Once the best-scoring neighbor has been found, the last learned DAG is output (364).
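The nested loops of FIG. 3 can be sketched sequentially as follows. This is a simplified illustration that assumes a decomposable score, where a neighbor's score differs from the current structure's only in the families of the nodes whose parents changed. The helper names `has_path` and `best_neighbor` and the `family_score(node, parents)` callback are hypothetical, and only the add-edge and reverse-edge branches of the flowchart are shown.

```python
def has_path(edges, src, dst):
    """Depth-first search: is dst reachable from src along directed edges?"""
    stack, seen = [src], set()
    while stack:
        u = stack.pop()
        if u == dst:
            return True
        if u in seen:
            continue
        seen.add(u)
        stack.extend(b for (a, b) in edges if a == u)
    return False


def best_neighbor(nodes, edges, family_score):
    """Outer loop over end (child) nodes, inner loop over start (parent) nodes.

    edges is a frozenset of (start, end) pairs. Returns (gain, new_edges) for
    the best valid neighbor, or (0.0, edges) if none improves the score.
    """
    def parents(es, n):
        return frozenset(a for (a, b) in es if b == n)

    best = (0.0, edges)
    for end in nodes:                      # steps 304 / 352
        for start in nodes:                # steps 306 / 342
            if start == end:
                continue
            if (start, end) not in edges:  # step 310: no edge, try adding it
                if has_path(edges, end, start):
                    continue               # adding would create a cycle
                cand, changed = edges | {(start, end)}, [end]
            else:                          # edge exists, try reversing it
                rest = edges - {(start, end)}
                if has_path(rest, start, end):
                    continue               # reversing would create a cycle
                cand, changed = rest | {(end, start)}, [start, end]
            gain = sum(family_score(n, parents(cand, n))
                       - family_score(n, parents(edges, n)) for n in changed)
            if gain > best[0]:             # step 324: keep the best neighbor
                best = (gain, cand)
    return best
```

In a cached implementation, each `family_score` call for the end node would go to that node's score cache partition, which is what gives the inner loop its locality.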

FIG. 4 is a flow diagram illustrating an embodiment of structure learning with a split score cache. A main, or master, thread computes an initial score for the structure to be learned (402). The initial score provides a starting point for learning. The master thread prepares thread work dispatching for the parallel threads based on the split score cache architecture (404). In one embodiment, the main score cache data structure is divided into multiple smaller, or split, score caches, which allows one split score cache to be allocated to each node of the network to be learned. The score computations can be distributed (dispatched) to the parallel threads for processing. The split nature of the score cache permits certain optimizations in work distribution that reduce the amount of work executed sequentially (as opposed to in parallel) in the system.

The master thread distributes computation tasks to each available thread, thread 1 through thread N. A computation task may include some or all of the operations for computing the score of a neighbor. Each thread computes the scores of valid neighbors (412, 422). Because the score cache is split, the threads can be assigned score computations that do not conflict in the first loop of the neighbor computation. Traditionally, parallelization consisted of generating all neighbors and distributing them evenly among the threads for computation. Each thread then processed its assigned neighbors. However, a thread first searches the score cache to check whether the score needed for a neighbor computation has already been calculated. Because only a global score cache was traditionally used, score cache accesses had to be guarded by critical sections to avoid data races (race conditions) and to ensure thread safety. This is because, when a query misses, the thread computes the score and updates the score cache. In that case, different threads may attempt to write to the same entry of the global score cache.
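For contrast, the traditional arrangement described above might look like the following sketch, in which a single shared cache is guarded by one lock. The class name `GlobalScoreCache` and the `compute` callback are illustrative assumptions, not part of the specification.

```python
import threading


class GlobalScoreCache:
    """A single shared cache; every lookup passes through one critical section."""

    def __init__(self, compute):
        self._scores = {}
        self._lock = threading.Lock()
        self._compute = compute   # hypothetical family scoring function

    def get(self, node, parents):
        key = (node, frozenset(parents))
        with self._lock:                  # threads are serialized here
            if key not in self._scores:   # miss: compute and write back
                self._scores[key] = self._compute(node, key[1])
            return self._scores[key]
```

Holding the lock across the check and the write prevents two threads from writing the same entry, but it also forces every query, including hits, to run one thread at a time.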

Critical sections, however, reduce the benefit of parallelization by forcing part of a multi-threaded program to execute sequentially, which greatly reduces scalability. For example, when running a gene network, up to 60% of the execution time may be sequential because of critical sections. According to Amdahl's law, when 60% of the execution time is sequential, a dual-processor system is limited to a speedup of at most 1/((1-0.60)/2+0.60) = 1.25, and a quad-processor system to at most 1/((1-0.60)/4+0.60) = 1.43. These correspond to only 0.625 and about 0.36, respectively, of the ideal linear speedup. Increasing the number of processors widens the gap between linear speedup and actual speedup.
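The arithmetic above can be checked with a few lines of code. The function name `amdahl_speedup` is an illustrative choice; the formula is the standard statement of Amdahl's law with serial fraction s and p processors.

```python
def amdahl_speedup(serial_fraction: float, processors: int) -> float:
    """Upper bound on speedup: 1 / ((1 - s) / p + s)."""
    return 1.0 / ((1.0 - serial_fraction) / processors + serial_fraction)


if __name__ == "__main__":
    for p in (2, 4, 16):
        s = amdahl_speedup(0.60, p)
        # Efficiency is the fraction of ideal linear speedup actually achieved.
        print(f"{p} processors: speedup {s:.2f}, efficiency {s / p:.3f}")
```

With a 60% serial fraction the bound can never exceed 1/0.60, however many processors are added, which is why removing the critical section matters more than adding hardware.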

With a split score cache, the main neighbor scoring loop of each learning iteration can be divided into two smaller loops. The first loop processes the add-edge and delete-edge neighbors and warms the split score cache for the corresponding reverse-edge neighbors. In one embodiment, this can be done without a critical section, because all data and operations are fully partitioned so that threads do not collide on score cache data. After the computed scores have been written to the split score cache, the threads can be synchronized, and every subsequent query to the score cache results in a score cache hit, eliminating the need to write to the score cache again. Because all of the necessary computations were performed in the first loop, the second (later) loop can process the reverse-edge neighbors by reading the split score cache without a critical section.
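The two-loop structure can be sketched as two phases separated by a synchronization point. The sketch below is an assumption-laden illustration: the names `warm_partition` and `run_two_phase`, the `families_of` and `compute` callbacks, and the use of a thread pool are all hypothetical, and the set of families to warm is simply passed in rather than derived from add, delete, and reverse moves.

```python
from concurrent.futures import ThreadPoolExecutor


def warm_partition(cache, node, families, compute):
    """First loop: the task handling `node` is the only writer of cache[node]."""
    for family in families:
        cache[node][family] = compute(node, family)


def run_two_phase(num_nodes, families_of, compute, queries, workers=4):
    """Warm per-node partitions in parallel, then answer queries lock-free."""
    cache = [{} for _ in range(num_nodes)]   # one partition per node
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # One task per node, so no two tasks write to the same partition.
        # Consuming the iterator blocks until every task has finished, which
        # serves as the synchronization point between the two loops.
        list(pool.map(
            lambda n: warm_partition(cache, n, families_of(n), compute),
            range(num_nodes)))
        # Second loop: read-only lookups in any partition; every query hits.
        return list(pool.map(lambda q: cache[q[0]][q[1]], queries))
```

No lock appears anywhere: the first phase partitions the writes by node, and the second phase performs only reads of values that no longer change.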

Thus, intra-thread accesses to the split score cache (414, 424) are performed by each thread on score cache partitions (for example, elements of a score cache array) that are processed only by that particular thread. The thread can score the values for the node associated with a score cache partition and load the score cache for subsequent inter-thread accesses (416, 426), which may occur in the second loop described above. Each thread selects its neighbor with the best score (418, 428). In a hill-climbing implementation this is typically the highest score, although an embodiment could be designed so that the lowest score, or the score closest to a particular value, counts as the best.

The processing performed by the parallel threads is synchronized (430), typically by the master thread, which selects the best-scoring neighbor processed so far. The second loop can then be processed to determine whether a better-scoring neighbor exists. The work is again dispatched to the parallel threads, and each thread computes the scores of valid neighbors using inter-thread accesses to the split score cache (442, 452). Note that because the split score cache has already been warmed, that is, loaded with computed scores, the threads do not need to perform any score computation. These operations can therefore be executed without a critical section, even though the threads are no longer operating on data that is fully partitioned or isolated from access by other threads. Each thread again selects its best-scoring neighbor (444, 454), the selections of the individual threads are synchronized, and the best neighbor in the system is updated to the one with the best score (462). The current best neighbor may be controlled by the master thread, which determines whether any of the neighbors scored by the threads is better than the current best neighbor.
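The synchronization step, in which each thread reports its own best neighbor and the master thread keeps the overall best, amounts to a reduction. A minimal sketch follows; the names `local_best` and `global_best`, and the representation of a scored neighbor as a `(score, neighbor)` pair, are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def local_best(scored_neighbors):
    """Per-thread selection: the highest-scoring (score, neighbor) pair."""
    return max(scored_neighbors, key=lambda pair: pair[0])


def global_best(chunks, current_best):
    """Master-side reduction over per-thread winners and the current best.

    chunks is a list of non-empty lists of (score, neighbor) pairs, one list
    per worker thread.
    """
    with ThreadPoolExecutor(max_workers=len(chunks)) as pool:
        candidates = list(pool.map(local_best, chunks))
    return max(candidates + [current_best], key=lambda pair: pair[0])
```

Only the final comparison runs on the master thread, so the sequential part of each iteration stays small.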

The master thread can then apply the neighbor to the DAG and update the score (464). If learning is complete (470), the final learned DAG is output (472). If learning is not complete, learning iterates (474) to determine a neighbor with a better score. Completion of learning can be based on a number of criteria, depending on the implementation. For example, the criteria may be that a neighbor has remained the best neighbor for a threshold number of iterations, that a threshold number of iterations has been reached, or that a time limit for the computation has been reached.
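The outer learning iteration and its stopping criteria can be sketched as a generic hill-climbing loop. This is an illustrative outline only: the function `hill_climb` and its `score`, `neighbors`, `max_iters`, and `time_limit_s` parameters are hypothetical, and the stopping rules shown (no improving neighbor, iteration cap, time limit) are one possible reading of the criteria listed above.

```python
import time


def hill_climb(initial, score, neighbors, max_iters=1000, time_limit_s=60.0):
    """Repeatedly move to the best-scoring neighbor until a criterion is met."""
    current, current_score = initial, score(initial)
    deadline = time.monotonic() + time_limit_s
    for _ in range(max_iters):                 # iteration threshold
        if time.monotonic() > deadline:        # time limit on the computation
            break
        best = max(neighbors(current), key=score, default=None)
        if best is None or score(best) <= current_score:
            break                              # no neighbor beats the current
        current, current_score = best, score(best)   # apply the neighbor
    return current, current_score
```

A usage example with integers standing in for structures: `hill_climb(0, lambda x: -(x - 3) ** 2, lambda x: [x - 1, x + 1])` climbs from 0 toward the maximum of the toy score at 3.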

In addition to what is described herein, various modifications may be made to the embodiments of the invention without departing from its scope. The descriptions and examples herein should therefore be regarded as illustrative and not restrictive. The scope of the invention should be determined solely by reference to the claims that follow.

Claims

A structure learning method, comprising:

assigning a score calculation task of a structure learning scoring process to a first thread, the first thread accessing only a first partition of a global score cache for the score calculation task and computing a first value for use in the score calculation task; and

assigning, in parallel, an additional score calculation task to a second thread, the second thread accessing a second partition of the global score cache for the additional score calculation task, not accessing the first partition of the global score cache, and computing a second value for use in the additional score calculation task,

wherein the first and second threads store the first value and the second value in the first partition of the global score cache and the second partition of the global score cache, respectively.

The method of claim 1,

wherein assigning the score calculation task and the additional score calculation task to the first thread and the second thread comprises assigning the tasks to first and second processors of a concurrent multi-processor system.

The method of claim 1,

wherein assigning the score calculation task and the additional score calculation task to the first thread and the second thread comprises assigning the tasks to first and second processing cores of an on-chip multi-processor system.

The method of claim 1,

wherein the first thread accessing the first partition comprises the first thread accessing the first partition of the shared global score cache without a critical section.

The method of claim 1,

wherein the first and second partitions comprise addressable elements of a partition array.

The method of claim 5,

wherein the addressable elements of the partition array comprise elements associated with nodes of a Bayesian network,

each of the nodes being associated with an element of the partition array.

The method of claim 1,

wherein the structure of the additional score calculation task includes a node, and the score calculation task comprises neighbor scoring of a neighbor having a node different from the node of the structure of the additional score calculation task.

A computer-readable storage medium having content that provides instructions for causing a machine to perform operations, the operations comprising:

associating a directly addressable cache partition with each node of a Bayesian network;

distributing a neighbor calculation associated with one of the nodes of the Bayesian network to a first parallel thread, the first parallel thread accessing the directly addressable cache partition associated with the one node to obtain information for the neighbor calculation; and

distributing a neighbor calculation associated with an additional one of the nodes of the Bayesian network to a second parallel thread, the second parallel thread accessing the directly addressable cache partition associated with the additional node to obtain information for the neighbor calculation.

The computer-readable storage medium of claim 8,

wherein the content that provides instructions for causing the machine to associate the directly addressable cache partition with each node comprises content that provides instructions for causing the machine to provide a unique direct address that indexes a memory location within an addressed data structure.

The computer-readable storage medium of claim 8,

wherein the content that provides instructions for causing the machine to distribute the neighbor calculations comprises content that provides instructions for causing the machine to distribute calculations for separate neighbors of nodes that are not interdependent to the first and second parallel threads.

The computer-readable storage medium of claim 8,

wherein the first and second parallel threads comprise first and second computational resources of a system having a plurality of parallel computational resources that share access to memory data structures.

The computer-readable storage medium of claim 8,

wherein the directly addressable cache partition associated with the one node includes an entry for a family score of each family associated with the one node.

The computer-readable storage medium of claim 8,

the operations further comprising the first and second parallel threads storing the family scores calculated for the one node and the additional node in the neighbor calculations in the cache partition associated with the one node and the cache partition associated with the additional node, respectively.

The computer-readable storage medium of claim 13,

wherein the content that provides instructions for causing the machine to distribute the neighbor calculations comprises content that provides instructions for causing the machine to dispatch neighbor calculations for a first neighbor scoring sub-loop of a neighbor scoring loop of a hill-climbing algorithm,

and further comprising content that provides instructions for causing the machine to dispatch neighbor calculations for a second neighbor scoring sub-loop of the neighbor scoring loop of the hill-climbing algorithm,

wherein the first and second parallel threads access the family scores that were stored in the cache partitions during the first neighbor scoring sub-loop.

A device, comprising:

a memory having data that defines operations for performing iterations of network structure learning; and

a processor coupled to the memory to execute the operations defined in the memory,

the operations comprising:

distributing, in an iteration of the structure learning, neighbor scoring functions to parallel threads for a first loop of neighbor scoring, such that a first thread accesses a first element of a score cache array and a second thread accesses a second element of the score cache array; and

distributing additional neighbor scoring functions to the parallel threads for a second loop of neighbor scoring in the iteration of the structure learning, such that the first thread accesses the first element of the score cache array, the second element of the score cache array, or another element of the score cache array.

The device of claim 15,

wherein each element of the score cache array corresponds to a single node of the network structure to be learned.

The device of claim 16,

wherein the memory further comprises data defining an operation in which the first and second threads, within the first loop, load the first and second elements of the score cache array, respectively, with family scores for the families of the single nodes corresponding to the first and second elements, for later use in the second loop.

A system, comprising:

a memory having data that defines operations for scoring neighbors of a Bayesian network, the operations comprising:

scoring neighbors in a first loop, including accessing a first element of a score cache array with a first processing core and accessing a second element of the score cache array with a second processing core; and

scoring additional neighbors in a second loop, including accessing the first element of the score cache array or the second element of the score cache array with the first processing core;

the first and second processing cores, coupled to the memory, to execute the operations defined in the memory; and

a database, coupled to the memory, to store data from which the Bayesian network structure is learned and to provide the data to the memory.

The system of claim 18,

wherein the first and second processing cores comprise a first processor and a second processor of a plurality of processors in a multi-processor machine.

The system of claim 18,

wherein the first and second processing cores comprise a first processing core and a second processing core of a plurality of processor cores of a multi-processor integrated circuit chip.