KR20170085396A

KR20170085396A - Feature Vector Clustering and Database Generating Method for Scanning Books Identification

Info

Publication number: KR20170085396A
Application number: KR1020160004983A
Authority: KR
Inventors: 이상훈
Original assignee: 연세대학교 산학협력단
Priority date: 2016-01-14
Filing date: 2016-01-14
Publication date: 2017-07-24

Abstract

A feature vector clustering and database generation method for identifying scanned books is disclosed. The disclosed method includes: extracting feature vectors from scanned books; obtaining Hamming distances between the feature vectors; and forming clusters of the feature vectors based on the obtained Hamming distances, wherein the step of obtaining the Hamming distances is performed by parallel distributed processing in which the feature vectors are divided into a plurality of blocks. The disclosed method has the advantage that the distance matrix needed for feature vector clustering and database generation can be obtained faster than with the prior art.

Description

{Feature Vector Clustering and Database Generating Method for Scanning Book Identification}

The present invention relates to a feature vector clustering and database generation method, and more particularly to a feature vector clustering and database generation method for identifying scanned books.

Scanned books are identified by generating a feature vector for each image of each book and comparing the feature vectors. When clustering is performed to make these comparisons easier, the distance between every pair of feature vectors must be obtained. Because the number of feature vectors is very large, computing these distances requires a considerable amount of computation.

To solve the above problem of the prior art, the present invention provides a feature vector clustering and database generation method for identifying scanned books that can obtain the distance matrix faster than the prior art.

According to a preferred embodiment of the present invention for achieving the above object, there is provided a feature vector clustering method comprising: extracting feature vectors from scanned books; obtaining Hamming distances between the feature vectors; and forming clusters of the feature vectors based on the obtained Hamming distances, wherein the obtaining of the Hamming distances is performed by parallel distributed processing in which the feature vectors are divided into a plurality of blocks.

The present invention has the advantage that the distance matrix needed for feature vector clustering and database generation can be obtained faster than with the prior art.

FIG. 1 shows an overview of a feature vector clustering and database generation method according to an embodiment of the present invention.
FIG. 2 shows a feature vector table according to an embodiment of the present invention.
FIG. 3 shows distributed computation of the distance matrix according to an embodiment of the present invention.
FIG. 4 shows distributed, parallel computation of the distance matrix using a MapReduce process according to an embodiment of the present invention.
FIG. 5 shows pseudocode for the split function according to an embodiment of the present invention.
FIG. 6 shows pseudocode for the map function according to an embodiment of the present invention.
FIG. 7 shows pseudocode for the reduce function according to an embodiment of the present invention.
FIG. 8 is a table showing feature vector sequences and clustering results according to an embodiment of the present invention.
FIG. 9 is a table showing a feature vector database according to an embodiment of the present invention.

The present invention may be modified in various ways and may have several embodiments; specific embodiments are illustrated in the drawings and described in detail in the detailed description. This is not intended to limit the present invention to particular embodiments, and the invention should be understood to include all modifications, equivalents, and alternatives falling within its spirit and technical scope. Like reference numerals are used for like elements in describing each drawing.

Terms such as first and second may be used to describe various components, but the components should not be limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be referred to as a second component, and similarly a second component may be referred to as a first component. Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 shows an overview of a feature vector clustering and database generation method according to an embodiment of the present invention.

Referring to FIG. 1, a scanned book consists of a series of images. In the present invention, one feature vector is extracted per image for scanned book identification. Each feature vector is a binary vector of 0s and 1s with a fixed length. One book identifier, used to identify scanned books, is assigned to each original book, and the feature vector table manages one such identifier for every original book it covers.

FIG. 2 shows a feature vector table according to an embodiment of the present invention.

When a fixed number of pages is extracted from each book for identification, the feature vector table can be constructed as shown in FIG. 2. Here, a feature vector sequence consists of a fixed number of consecutive feature vectors, and each feature vector sequence can be expressed by the following equation.

[Equation 1]

As shown in Equation 1, the feature vector table is clustered for efficient identification of scanned books, and a feature vector database is built from the result.

Clustering uses the K-medoids method, and the clustering result is used to build the feature vector database. Like other common clustering methods, K-medoids groups feature vectors that are close to one another, based on the distances between them. The distance between two feature vectors is computed as a Hamming distance, as in the following equation.

[Equation 2]

Equation 2 is written in terms of the individual bits of each feature vector, indexed by bit position. Consequently, feature vector clustering requires the Hamming distance between every pair of feature vectors in the table to be computed.
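The Hamming distance between two binary feature vectors can be sketched as follows. This is a minimal illustration only: the function name `hamming_distance` and the 8-bit sample vectors are assumptions made for the example and do not come from the patent.

```python
def hamming_distance(u, v):
    """Count the bit positions at which two equal-length binary vectors differ."""
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Hypothetical 8-bit feature vectors, for illustration only.
f1 = [1, 0, 1, 1, 0, 0, 1, 0]
f2 = [1, 1, 1, 0, 0, 0, 1, 1]
print(hamming_distance(f1, f2))  # 3 (positions 1, 3 and 7 differ)
```

In practice the vectors would more likely be packed into integers or byte arrays so that the comparison becomes an XOR followed by a population count, but a list of bits keeps the definition visible.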

In the present invention, the Hamming distances between all pairs of feature vectors, arranged in matrix form, are defined as the distance matrix. This can be expressed by the following equation.

[Equation 3]
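A direct, single-machine construction of such a distance matrix might look like the sketch below. The name `distance_matrix` and the sample vectors are hypothetical; the loop fills only the upper triangle and mirrors each value, which is valid because the Hamming distance between two vectors does not depend on their order.

```python
def distance_matrix(vectors):
    """Return the full matrix of pairwise Hamming distances."""
    n = len(vectors)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = sum(a != b for a, b in zip(vectors[i], vectors[j]))
            dist[i][j] = d
            dist[j][i] = d  # mirror the value: the distance is symmetric
    return dist


# Hypothetical 4-bit feature vectors, for illustration only.
vectors = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1]]
for row in distance_matrix(vectors):
    print(row)
# [0, 2, 4]
# [2, 0, 2]
# [4, 2, 0]
```

The number of distance evaluations grows quadratically with the number of feature vectors, which is the cost the block-wise scheme described later is meant to spread across machines.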

Because the Hamming distance is symmetric, the distance matrix is symmetric as well. Once the distance matrix is complete, the first K medoids are determined using the similarity criterion of the following equation.

[Equation 4]

The similarity criterion of Equation 4 decreases as the average distance between a feature vector and all the other feature vectors increases, so a smaller similarity indicates that the feature vector lies farther from the others. Therefore, selecting the K feature vectors with the smallest similarity values and assigning them as the initial medoids yields an initial clustering result efficiently.
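The initial medoid selection can be sketched as below. The exact similarity formula is not reproduced here, so this sketch assumes only what the text states: the similarity falls as the mean distance to the other vectors rises. Under that assumption, choosing the K smallest similarities is the same as choosing the K largest mean distances. The function name `initial_medoids` and the sample matrix are hypothetical.

```python
def initial_medoids(dist, k):
    """Pick k initial medoids from a precomputed distance matrix.

    Assumption: similarity is a decreasing function of the mean distance to
    the other vectors, so "smallest similarity" == "largest mean distance".
    """
    n = len(dist)
    mean_dist = [sum(row) / (n - 1) for row in dist]  # diagonal entries are 0
    # Sort by descending mean distance; break ties by lower index.
    order = sorted(range(n), key=lambda i: (-mean_dist[i], i))
    return sorted(order[:k])


# Hypothetical symmetric distance matrix for four feature vectors.
dist = [
    [0, 2, 4, 5],
    [2, 0, 2, 3],
    [4, 2, 0, 1],
    [5, 3, 1, 0],
]
print(initial_medoids(dist, 2))  # [0, 3]
```

If the actual criterion is not a monotone function of the mean distance, the ranking line would need to be replaced with that criterion.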

Next, each unselected feature vector is assigned to its nearest medoid by comparing its Hamming distances to the K medoids. After all feature vectors have been assigned, the medoids are updated. Clustering is complete when this process has been repeated until the updated medoids no longer change.
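The assign-then-update loop can be sketched as follows, working from a precomputed distance matrix. The patent does not spell out the medoid update rule, so this sketch assumes the usual K-medoids choice: within each cluster, the new medoid is the member with the smallest total distance to the other members. All names and the sample data are hypothetical.

```python
def assign(dist, medoids):
    """Label every vector with the position of its nearest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]


def update(dist, medoids, labels):
    """Assumed update rule: the member with the least total in-cluster distance."""
    new_medoids = []
    for c, old in enumerate(medoids):
        members = [i for i, lab in enumerate(labels) if lab == c]
        if not members:            # keep the old medoid if a cluster is empty
            new_medoids.append(old)
            continue
        new_medoids.append(
            min(members, key=lambda m: sum(dist[m][o] for o in members)))
    return new_medoids


def k_medoids(dist, medoids, max_iter=100):
    """Repeat assignment and update until the medoids stop changing."""
    medoids = list(medoids)
    for _ in range(max_iter):
        new_medoids = update(dist, medoids, assign(dist, medoids))
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)


# Hypothetical distance matrix and initial medoids, for illustration only.
dist = [
    [0, 2, 4, 5],
    [2, 0, 2, 3],
    [4, 2, 0, 1],
    [5, 3, 1, 0],
]
print(k_medoids(dist, [0, 3]))  # ([0, 2], [0, 0, 1, 1])
```

Every step of the loop only reads entries of the distance matrix, which is why the cost of building that matrix dominates the overall running time.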

In the K-medoids method, computing the distance matrix is the most time-consuming step. The time can therefore be reduced effectively by splitting the distance matrix into sub-distance matrices and computing them in a parallel, distributed manner. In the present invention, a sub-distance matrix holds the distance information for a fixed number of feature vectors and is a square matrix whose numbers of rows and columns are equal.

FIG. 3 shows distributed computation of the distance matrix according to an embodiment of the present invention.

To compute the sub-distance matrices, the feature vectors are first grouped into feature blocks, as shown in FIG. 3(a) and (b), according to the following equation.

[Equation 5]

In Equation 5, each feature block is identified by its index and consists of a fixed number of consecutive feature vectors. The total number of feature blocks is given by an expression that accompanies Equation 5.

After the feature blocks are obtained, the distance vectors between the feature blocks are computed, and the sub-distance matrices are computed as shown in FIG. 3(c). For example, the sub-distance matrix between two feature blocks can be computed by the following equation.

[Equation 6]

Equation 6 is written in terms of the elements at a given row and column of the distance matrix and of the sub-distance matrix. As shown in FIG. 3(d), each sub-distance matrix occupies one block-row and block-column position in the distance matrix. Because the distance matrix is symmetric, the mirrored position, with the block-row and block-column indices swapped, holds the transpose of that sub-distance matrix.
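The block-wise construction described above can be sketched on a single machine as follows. The function names, the block size, and the sample vectors are assumptions for illustration. The sketch computes one sub-distance matrix per pair of feature blocks in the upper block triangle and writes its transpose into the mirrored position.

```python
def split_blocks(vectors, block_size):
    """Group consecutive feature vectors into feature blocks."""
    return [vectors[s:s + block_size]
            for s in range(0, len(vectors), block_size)]


def sub_distance_matrix(block_a, block_b):
    """Hamming distances between every vector of block_a and of block_b."""
    return [[sum(x != y for x, y in zip(u, v)) for v in block_b]
            for u in block_a]


def transpose(matrix):
    return [list(col) for col in zip(*matrix)]


def place(dist, sub, row0, col0):
    """Copy a sub-distance matrix into the full matrix at (row0, col0)."""
    for r, row in enumerate(sub):
        for c, value in enumerate(row):
            dist[row0 + r][col0 + c] = value


def assemble(vectors, block_size):
    blocks = split_blocks(vectors, block_size)
    n = len(vectors)
    dist = [[0] * n for _ in range(n)]
    for p in range(len(blocks)):
        for q in range(p, len(blocks)):      # upper block triangle only
            sub = sub_distance_matrix(blocks[p], blocks[q])
            place(dist, sub, p * block_size, q * block_size)
            if p != q:                       # mirrored block is the transpose
                place(dist, transpose(sub), q * block_size, p * block_size)
    return dist


# Hypothetical 4-bit feature vectors, split into blocks of two.
vectors = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
for row in assemble(vectors, 2):
    print(row)
# [0, 2, 4, 2]
# [2, 0, 2, 4]
# [4, 2, 0, 2]
# [2, 4, 2, 0]
```

Each call to `sub_distance_matrix` depends only on its two input blocks, so the calls can be run in any order or on different workers.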

Because different sub-distance matrices can be computed independently, parallel programming techniques can be applied, and the present invention uses the Hadoop MapReduce framework for this purpose. Hadoop is a software platform that enables large volumes of data to be processed in a distributed manner across many servers. In Hadoop, a unit of data is represented as a <Key, Value> pair: the Value carries the information needed for a computation, and the Key corresponds to the address of the place where that Value will be sent and processed.

FIG. 4 shows distributed, parallel computation of the distance matrix using a MapReduce process according to an embodiment of the present invention.

Referring to FIG. 4, a typical MapReduce process consists of three stages: split, map, and reduce. FIG. 4 shows the sequence in which the input feature vectors are split into feature blocks, the computations are assigned by the map stage, and the sub-distance matrices are computed by the reduce stage. The three functions are defined as follows.

The split function takes all feature vectors as input and generates the feature blocks. A feature block is a fixed number of consecutive feature vectors. For each feature block, the function outputs one pair associated with that block. The two quantities in this output are determined using [...].

FIG. 5 shows pseudocode for the split function according to an embodiment of the present invention.

The map function takes one <Key, Value> pair and outputs multiple <Key, Value> pairs. The input <Key, Value> pair consists of a feature block index and the corresponding feature block. In each output <Key, Value> pair, the Value is the same feature block that was received as input, and the Key is a sub-distance matrix index.

FIG. 6 shows pseudocode for the map function according to an embodiment of the present invention.

The reduce function takes as input two <Key, Value> pairs that share the same Key, computes one sub-distance matrix from them, and outputs it.

FIG. 7 shows pseudocode for the reduce function according to an embodiment of the present invention.
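The split, map, and reduce roles described above can be imitated in a single Python process, as sketched below. This is not Hadoop code and does not reproduce the pseudocode of FIGS. 5 to 7; the function names, the tuple-shaped keys, and the decision to tag each value with its block index are all assumptions. Diagonal sub-matrices are handled here by emitting the block twice so that the reduce step always receives two values, which is also an assumption.

```python
from collections import defaultdict


def split(vectors, block_size):
    """Yield <block index, feature block> pairs."""
    for m, start in enumerate(range(0, len(vectors), block_size)):
        yield m, vectors[start:start + block_size]


def map_fn(m, block, num_blocks):
    """Emit the block once for every sub-distance matrix it takes part in."""
    for other in range(num_blocks):
        key = (min(m, other), max(m, other))   # sub-distance matrix index
        yield key, (m, block)
        if other == m:                         # diagonal: pair block with itself
            yield key, (m, block)


def reduce_fn(key, values):
    """Compute one sub-distance matrix from the two blocks sharing a key."""
    (_, block_p), (_, block_q) = sorted(values, key=lambda v: v[0])
    sub = [[sum(a != b for a, b in zip(u, v)) for v in block_q]
           for u in block_p]
    return key, sub


def run(vectors, block_size):
    blocks = list(split(vectors, block_size))
    grouped = defaultdict(list)                # stands in for the shuffle phase
    for m, block in blocks:
        for key, value in map_fn(m, block, len(blocks)):
            grouped[key].append(value)
    return dict(reduce_fn(key, values) for key, values in grouped.items())


# Hypothetical 4-bit feature vectors, split into blocks of two.
vectors = [[0, 0, 0, 0], [0, 0, 1, 1], [1, 1, 1, 1], [1, 1, 0, 0]]
result = run(vectors, 2)
print(result[(0, 1)])  # [[4, 2], [2, 4]]
```

Tagging each value with its block index lets the reduce step order the two blocks consistently, since a MapReduce framework generally does not guarantee the order in which values for one key arrive.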

A scanned book consists of a series of independent images, so the feature vectors extracted from it also form a series. A scanned book can therefore be expressed as a sequence of pairs, each consisting of a cluster index and a feature vector. For example, a given scanned book can be expressed by the following equation.

[Equation 7]

Equation 7 refers to the K cluster medoids. On this basis, the feature vector database can be organized as follows. In the proposed structure, the index of the feature vector database is the cluster medoid, and each data entry is a pair consisting of a book identifier and a feature vector.

FIG. 8 is a table showing feature vector sequences and clustering results according to an embodiment of the present invention.

For example, when a feature vector database is built for three scanned books, extracting the feature vectors of a fixed number of pages per book and clustering them yields the result shown in FIG. 8.

The number of clusters in this example is the one shown in FIG. 8.

FIG. 9 is a table showing a feature vector database according to an embodiment of the present invention.

In this case, the feature vector database can be constructed from the clustering result as shown in FIG. 9.

For example, a single index of the feature vector database gives access to both the third image of the first scanned book and the sixth image of the second scanned book.
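The database layout described above, indexed by cluster medoid with (book identifier, feature vector) pairs as data, can be sketched with a dictionary. The record format, the identifiers such as `"book-1"`, and the 4-bit vectors are hypothetical, and medoids are stored as tuples only so that they can serve as dictionary keys.

```python
from collections import defaultdict


def build_database(records):
    """Group (book_id, feature_vector) pairs under their cluster medoid.

    records: iterable of (book_id, feature_vector, medoid) triples, where the
    medoid is the representative vector of the cluster the feature vector
    was assigned to.
    """
    database = defaultdict(list)
    for book_id, feature_vector, medoid in records:
        database[medoid].append((book_id, feature_vector))
    return dict(database)


# Hypothetical clustering output for pages of two scanned books.
records = [
    ("book-1", (0, 0, 0, 1), (0, 0, 0, 0)),
    ("book-2", (0, 0, 1, 0), (0, 0, 0, 0)),
    ("book-1", (1, 1, 1, 0), (1, 1, 1, 1)),
]
db = build_database(records)
print(db[(0, 0, 0, 0)])
# [('book-1', (0, 0, 0, 1)), ('book-2', (0, 0, 1, 0))]
```

A lookup for a query image would first find the nearest medoid and then compare the query only against the entries stored under that medoid, rather than against every feature vector in the table.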

As described above, the feature vector clustering and database generation method according to an embodiment of the present invention has the advantage that the distance matrix needed for feature vector clustering and database generation can be obtained faster than with the prior art.

The present invention has been described above with reference to specific details, such as particular components, and to limited embodiments and drawings. These are provided only to aid overall understanding of the invention, which is not limited to the above embodiments; those of ordinary skill in the art to which the invention pertains will understand that various modifications and variations can be made from this description. Accordingly, the spirit of the present invention should not be limited to the described embodiments, and the claims below, together with everything equal or equivalent to them, fall within the scope of the invention.

Claims

A feature vector clustering method comprising:
extracting feature vectors from scanned books;
obtaining Hamming distances between the feature vectors; and
forming clusters of the feature vectors based on the obtained Hamming distances,
wherein the step of obtaining the Hamming distances is performed by parallel distributed processing in which the feature vectors are divided into a plurality of blocks.