KR102045533B1

KR102045533B1 - System for recognizing music symbol using deep network and method therefor

Info

Publication number: KR102045533B1
Application number: KR1020180012157A
Authority: KR
Inventors: 양형정; 도루녹
Original assignee: 전남대학교산학협력단
Priority date: 2018-01-31
Filing date: 2018-01-31
Publication date: 2019-11-18
Also published as: KR20190098812A

Abstract

The present invention relates to a music score recognition system using a deep network and a method therefor. The system comprises: a photographing unit that captures an image containing a music score; a feature map extractor that configures the image as a sliding layer and extracts a feature map of the entire image; a candidate region detector that detects, from the feature map, candidate regions containing music score symbols; and a feature vector extractor that extracts feature vectors from the candidate regions.

Description

Music score recognition system using deep network and its method {SYSTEM FOR RECOGNIZING MUSIC SYMBOL USING DEEP NETWORK AND METHOD THEREFOR}

The present invention relates to a music score recognition system using a deep network and a method therefor, and more particularly to a technique for automatically converting, reconstructing, and recognizing music symbols in a machine-readable format so that the music can be played.

Object recognition based on convolutional neural networks (CNNs) achieves accuracy beyond that of earlier object recognition techniques, and research is under way on using CNNs to recognize digitized music scores.

In general, optical music recognition (OMR) systems are used for this purpose: they automatically recognize and reconstruct the data read from a score and convert it into a machine-readable format such as XML so that the music can be played.

However, techniques that capture a score and automatically recognize its music symbols struggle to achieve accurate recognition because of wide variation in musical style and symbol notation and because of other distortions. In particular, changes in lighting introduce noise, so symbols are frequently recognized as something other than the intended symbol.

In addition, the notation style and shape of the music symbols written on a score differ from genre to genre. Given this, there is a need for a technique that recognizes music symbols reliably despite changes in lighting and in symbol shape.

Korean Patent Laid-Open Publication No. 10-2017-0028591

An object of the present invention is to recognize music symbols accurately from a score image even when the ambient lighting changes.

Another object of the present invention is to extract a feature map of the entire score image using a convolutional neural network (CNN).

A further object of the present invention is to detect candidate regions containing music score symbols using region proposal networks (RPNs) that include a single pooling layer, a softmax classifier, and a box regressor.

To achieve these objects, the present invention provides a music score recognition system using a deep network, comprising: a photographing unit that captures an image containing a music score; a feature map extractor that configures the image as a sliding layer and extracts a feature map of the entire image; a candidate region detector that detects, from the feature map, candidate regions containing music score symbols; and a feature vector extractor that extracts feature vectors from the candidate regions.

Preferably, the candidate region detector feeds each sliding window, mapped to a 256-dimensional feature vector, into a convolutional layer and a pooling layer of the fully connected layers for box regression and box classification, and constructs a reference set comprising preset scales (x1, x2, x3) and aspect ratios (1:1, 1:2, 1:3).

The system further comprises a training processor that initializes feature vector detection in the candidate regions, controls the candidate region detector to set new, adjusted candidate regions, and controls the feature vector extractor to extract feature vectors from the new candidate regions.

In another aspect, the present invention provides a music score recognition method using a deep network, comprising: (a) receiving, by a photographing unit, a captured image containing a music score; (b) extracting, by a feature map extractor, a feature map of the image; (c) detecting, by a candidate region detector, candidate regions containing music score symbols from the feature map; and (d) extracting, by a feature vector extractor, feature vectors from the candidate regions.

Preferably, step (c) comprises: (c-1) constructing, by the candidate region detector, each sliding window mapped to a 256-dimensional feature vector; (c-2) feeding, by the candidate region detector, the sliding window into a convolutional layer and a pooling layer of the fully connected layers; (c-3) constructing, by the candidate region detector, a reference set comprising preset scales (x1, x2, x3) and aspect ratios (1:1, 1:2, 1:3); and (c-4) detecting, by the candidate region detector, a plurality of candidate regions simultaneously at each sliding window position.

Step (d) comprises: (d-1) branching, by the feature vector extractor, through a pooling layer to the last two output layers in the candidate region; (d-2) recognizing, by the feature vector extractor in the first layer, the music symbol classes and the background class from the feature vector using a softmax probability function; and (d-3) computing, by the feature vector extractor in the second layer, the position of each music symbol from the feature vector.

According to the present invention as described above, music symbols are recognized accurately from a score image even when the ambient lighting changes, which makes the invention applicable to mobile devices.

According to the present invention, a feature map of the entire score image can be extracted using a convolutional neural network (CNN), and candidate regions containing music score symbols can be detected using region proposal networks (RPNs).

FIG. 1 is a block diagram of a music score recognition system using a deep network according to an embodiment of the present invention.
FIG. 2 is an exemplary view of an input image of the music score recognition system using a deep network according to an embodiment of the present invention.
FIG. 3 is an exemplary view of the sliding layer configuration of the music score recognition system using a deep network according to an embodiment of the present invention.
FIG. 4 is an exemplary view of feature vectors derived by the music score recognition system using a deep network according to an embodiment of the present invention.
FIG. 5 is an exemplary view of music symbol positions derived by the music score recognition system using a deep network according to an embodiment of the present invention.
FIG. 6 is a flowchart of a music score recognition method using a deep network according to an embodiment of the present invention.
FIG. 7 is a flowchart of the detailed process of step S30 of the music score recognition method using a deep network according to an embodiment of the present invention.
FIG. 8 is a flowchart of the detailed process of step S40 of the music score recognition method using a deep network according to an embodiment of the present invention.

The specific features and advantages of the present invention will become more apparent from the following detailed description with reference to the accompanying drawings. Terms and words used in this specification and the claims should be interpreted with the meanings and concepts that accord with the technical idea of the present invention, on the principle that an inventor may define terms appropriately in order to describe the invention in the best way. Detailed descriptions of well-known functions and configurations related to the present invention are omitted where they would unnecessarily obscure its gist.

FIG. 1 is a block diagram of a music score recognition system (100) using a deep network according to an embodiment of the present invention.

As shown in FIG. 1, the music score recognition system (100) using a deep network according to an embodiment of the present invention comprises a photographing unit (101), a feature map extractor (102), a candidate region detector (103), a feature vector extractor (104), and a training processor (105).

The photographing unit (101) receives a captured image containing a music score. The photographing unit (101) is provided in either a camera or a mobile terminal equipped with a camera, and the input image may be a score image such as the one shown in FIG. 2.

The feature map extractor (102) configures the image received from the photographing unit (101) as a 3*3 sliding layer and extracts a feature map of the entire image. The feature map is mapped to 256-dimensional feature vectors, as shown in FIG. 3.
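
As a rough illustration of the 3*3 sliding step, the sketch below slides a 3*3 window over a small single-channel feature map and projects each window onto a 256-dimensional vector. The names `sliding_windows` and `project`, the zero padding at the border, the single input channel, and the random weights are all assumptions made for the example; the patent does not specify them.

```python
import random

def sliding_windows(feature_map, k=3):
    """Yield (row, col, patch) for every k*k window, zero-padding the border (assumed)."""
    h, w = len(feature_map), len(feature_map[0])
    r = k // 2
    for i in range(h):
        for j in range(w):
            # Flatten the k*k neighbourhood around (i, j) row by row.
            patch = [
                feature_map[y][x] if 0 <= y < h and 0 <= x < w else 0.0
                for y in range(i - r, i + r + 1)
                for x in range(j - r, j + r + 1)
            ]
            yield i, j, patch

def project(patch, weights):
    """Map a flattened patch to a vector: one dot product per output dimension."""
    return [sum(p * w for p, w in zip(patch, row)) for row in weights]

DIM = 256                      # dimensionality stated in the text
rng = random.Random(0)         # illustrative weights, not trained values
weights = [[rng.uniform(-1, 1) for _ in range(9)] for _ in range(DIM)]

# Toy 4x5 feature map standing in for the CNN output.
fmap = [[float(r * 5 + c) for c in range(5)] for r in range(4)]
vectors = {(i, j): project(patch, weights) for i, j, patch in sliding_windows(fmap)}
```

Each of the 20 positions in the toy map ends up with its own 256-dimensional vector, which is what the candidate region detector consumes next.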

The candidate region detector (103) detects candidate regions containing music score symbols from the feature map. To do so, the candidate region detector (103) feeds each sliding window, mapped to a 256-dimensional feature vector, into a convolutional layer and a pooling layer of the fully connected layers for box regression and box classification, and constructs a reference set comprising preset scales (x1, x2, x3) and aspect ratios (1:1, 1:2, 1:3).

The candidate region detector (103) is also configured to derive a plurality of candidate regions simultaneously at each sliding window position; the maximum possible number of candidates at each position corresponds to the number of reference boxes.
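
The reference set can be pictured with a short sketch: three scales combined with three aspect ratios give nine reference boxes per sliding window position, which matches the statement that the maximum number of candidates per position equals the number of reference boxes. The base size of 16 pixels, the reading of each ratio as width:height, and the function name `reference_boxes` are assumptions for illustration only.

```python
from itertools import product

SCALES = (1, 2, 3)                          # "x1, x2, x3" from the text
ASPECT_RATIOS = ((1, 1), (1, 2), (1, 3))    # read as width:height (assumed orientation)

def reference_boxes(cx, cy, base=16.0):
    """Return (x_min, y_min, x_max, y_max) boxes centred on (cx, cy).

    `base` is a hypothetical side length for the x1 square box.
    """
    boxes = []
    for scale, (rw, rh) in product(SCALES, ASPECT_RATIOS):
        area = (base * scale) ** 2
        # Pick w and h so that w * h == area and w / h == rw / rh.
        w = (area * rw / rh) ** 0.5
        h = area / w
        boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

boxes = reference_boxes(50.0, 50.0)
for b in boxes:
    print(tuple(round(v, 1) for v in b))
```

Tall, narrow ratios such as 1:2 and 1:3 plausibly suit symbols like stemmed notes and clefs, though the patent does not state the rationale for these particular values.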

The feature vector extractor (104) extracts feature vectors from the candidate regions. The feature vector extractor (104) branches through a pooling layer to the last two output layers in the candidate region: the first layer recognizes the music symbol classes and the background class from the feature vector using a softmax probability function, and the second layer computes the position of each music symbol from the feature vector.

The feature vector extractor (104) derives feature vectors as shown in FIG. 4, and the music symbol positions are derived as shown in FIG. 5.
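
The two output branches described for the feature vector extractor can be sketched as a pair of linear heads on the same feature vector: one followed by softmax for classification, one producing four box coordinates. This is a minimal sketch under stated assumptions; the class labels, the tiny three-element feature vector, the hand-picked weights, and the helper names `linear` and `heads` are hypothetical and not taken from the patent.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def linear(vec, weights, bias):
    """Plain fully connected layer: one dot product plus bias per output."""
    return [sum(v * w for v, w in zip(vec, row)) + b for row, b in zip(weights, bias)]

CLASSES = ["background", "note", "sharp", "flat"]   # hypothetical label set

def heads(feature, cls_w, cls_b, box_w, box_b):
    """First branch: class probabilities. Second branch: box position."""
    probs = softmax(linear(feature, cls_w, cls_b))
    box = linear(feature, box_w, box_b)
    label = CLASSES[max(range(len(probs)), key=probs.__getitem__)]
    return label, probs, box

# Toy inputs; a real feature vector would be much longer.
feature = [1.0, 0.0, 2.0]
cls_w = [[0, 0, 0], [1, 0, 1], [0, 1, 0], [0, 0, 0.5]]
cls_b = [0, 0, 0, 0]
box_w = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 0, 1]]
box_b = [0, 0, 0, 0]

label, probs, box = heads(feature, cls_w, cls_b, box_w, box_b)
print(label, [round(p, 3) for p in probs], box)
```

Including an explicit background class lets the classifier reject candidate regions that contain no symbol, which is how the text describes the first branch.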

The training processor (105) initializes feature vector detection in the candidate regions, controls the candidate region detector (103) to set new, adjusted candidate regions, and controls the feature vector extractor (104) to extract feature vectors from the new candidate regions.

The music symbols contained in the feature vector include any one of a note (black note/white note), a note flag, a flat, a sharp, a natural, a quarter rest, a semiquaver rest, a whole note, a whole rest and half rest (full rest and half rest), a beam, or a clef.

The music symbols contained in the feature vector may also include any one of the following: accidental, align, appoggiatura, bar line, bass, beat, brace (the bracket joining two or more staves), coda, cue, chord, C clef (alto clef), F clef (bass clef), G clef (treble clef), dot, duet, duration, ending, fermata, flag, flat, grace note, key signature, measure, natural sign, musical note, half note, quarter note (crotchet), eighth note, sixteenth note, nudge, octave, percussion, pitch, plain (a note without a dot or similar mark), quartet (quartette), quintet, quintuplet, rest, rhythm, time (duple, triple, or three-four time), root (of a chord), score (including vocal score, short score, and film score), septuplet, sextet (sextette), sextuplet, sharp, F sharp, slur, solo, stem, staff (plural staves) or system, tablature, tempo, tie, time signature, trio, triplet, or voice.

A music score recognition method using a deep network according to an embodiment of the present invention, based on the system described above, is described below with reference to FIG. 6.

First, the photographing unit receives a captured image containing a music score (S10).

Next, the feature map extractor configures the image as a 3*3 sliding layer and extracts a feature map of the entire image (S20).

The candidate region detector then detects candidate regions containing music score symbols from the feature map (S30).

Next, the feature vector extractor extracts fixed-length feature vectors from the candidate regions (S40).

The training processor then initializes feature vector detection in the candidate regions (S50).

Next, the candidate region detector sets new, adjusted candidate regions (S60).

The procedure then returns to step S30 so that the feature vector extractor extracts feature vectors from the new candidate regions.

Preferably, step S30 comprises steps S31 to S34, as shown in FIG. 7.

After step S20, the candidate region detector constructs each sliding window mapped to a 256-dimensional feature vector (S31).

Next, the candidate region detector feeds the sliding window into a convolutional layer and a pooling layer of the fully connected layers for box regression and box classification (S32).

The candidate region detector then constructs a reference set comprising preset scales (x1, x2, x3) and aspect ratios (1:1, 1:2, 1:3) (S33).

Finally, the candidate region detector detects a plurality of candidate regions simultaneously at each sliding window position (S34).

Preferably, step S40 comprises steps S41 to S43, as shown in FIG. 8.

After step S30, the feature vector extractor branches through a pooling layer to the last two output layers in the candidate region (S41).

Next, in the first layer, the feature vector extractor recognizes the music symbol classes and the background class from the feature vector using a softmax probability function (S42).

Finally, in the second layer, the feature vector extractor computes the position of each music symbol from the feature vector (S43).

The present invention has been described and illustrated above with reference to a preferred embodiment that exemplifies its technical idea. The invention is not, however, limited to the configuration and operation exactly as illustrated and described, and those skilled in the art will appreciate that many changes and modifications can be made without departing from the scope of the technical idea. All such appropriate changes, modifications, and equivalents should therefore be regarded as falling within the scope of the present invention.

100: music score recognition system using a deep network
101: photographing unit
102: feature map extractor
103: candidate region detector
104: feature vector extractor
105: training processor

Claims

A music score recognition system using a deep network, comprising:
a photographing unit that captures an image containing a music score;
a feature map extractor that configures the image as a sliding layer and extracts a feature map of the entire image;
a candidate region detector that detects, from the feature map, candidate regions containing music score symbols;
a feature vector extractor that extracts feature vectors from the candidate regions; and
a training processor that initializes feature vector detection in the candidate regions, controls the candidate region detector to set new, adjusted candidate regions, and controls the feature vector extractor to extract feature vectors from the new candidate regions,
wherein the feature vector extractor
branches through a pooling layer to the last two output layers in the candidate region, recognizes, in the first layer, the music symbol classes and the background class from the feature vector using a softmax probability function, and computes, in the second layer, the position of each music symbol from the feature vector.

The system of claim 1,
wherein the candidate region detector
feeds each sliding window, mapped to a 256-dimensional feature vector, into a convolutional layer and a pooling layer of the fully connected layers for box regression and box classification, and constructs a reference set comprising preset scales (x1, x2, x3) and aspect ratios (1:1, 1:2, 1:3).

(Deleted)

A music score recognition method using a deep network, comprising:
(a) receiving, by a photographing unit, a captured image containing a music score;
(b) extracting, by a feature map extractor, a feature map of the image;
(c) detecting, by a candidate region detector, candidate regions containing music score symbols from the feature map;
(d) extracting, by a feature vector extractor, feature vectors from the candidate regions;
(e) initializing, by a training processor, feature vector detection in the candidate regions;
(f) controlling, by the training processor, the candidate region detector to set new, adjusted candidate regions; and
(g) controlling, by the training processor, the feature vector extractor to extract feature vectors from the new candidate regions,
wherein step (c) comprises:
(c-1) constructing, by the candidate region detector, each sliding window mapped to a 256-dimensional feature vector;
(c-2) feeding, by the candidate region detector, the sliding window into a convolutional layer and a pooling layer of the fully connected layers;
(c-3) constructing, by the candidate region detector, a reference set comprising preset scales (x1, x2, x3) and aspect ratios (1:1, 1:2, 1:3); and
(c-4) detecting, by the candidate region detector, a plurality of candidate regions simultaneously at each sliding window position,
and wherein step (d) comprises:
(d-1) branching, by the feature vector extractor, through a pooling layer to the last two output layers in the candidate region;
(d-2) recognizing, by the feature vector extractor in the first layer, the music symbol classes and the background class from the feature vector using a softmax probability function; and
(d-3) computing, by the feature vector extractor in the second layer, the position of each music symbol from the feature vector.

(Deleted)