KR102259299B1

KR102259299B1 - Book sound classification method using machine learning model of a book handling sounds

Info

Publication number: KR102259299B1
Application number: KR1020190176172A
Authority: KR
Inventors: 김혜주; 김승찬
Original assignee: 한림대학교 산학협력단
Priority date: 2019-12-27
Filing date: 2019-12-27
Publication date: 2021-06-01

Abstract

The present invention relates to book sound classification using a convolutional neural network (CNN) and, more specifically, to a book sound classification method that uses a machine learning model of book handling sounds to quantify how a user handles a book. To this end, the method comprises: a step (S100) of acquiring and preparing book handling sound data; a data augmentation step (S110) of modifying the data to increase the number of samples and thereby improve the robustness of the machine learning; a step (S120) of converting the book handling sound data and the augmented data into a format for deep learning; a step (S130) of generating a CNN model and performing deep learning-based machine learning using the converted data; a step (S150) of completing the CNN model after the deep learning is completed; a step (S160) of inputting book sound data into the completed CNN model; and a step (S170) in which the CNN model computes on the book sound data to classify the book sound.

Description

Book sound classification method using machine learning model of a book handling sounds

The present invention relates to a method for classifying book sounds using a convolutional neural network (CNN) and, more particularly, to a machine learning method that uses CNN-based deep learning to classify the subtle sounds made when a book is handled, in order to understand the context of the user's interaction with the book.

When buying books, consumers skim through several books and look more closely at those that interest them. Consumer interest in a book is mostly gauged through surveys. If, instead of relying on subjective surveys, it were possible to determine quantitatively how consumers read a book, it would be easy to identify which books attracted their interest.

Books are one of the most widespread media and are used to obtain information or for enjoyment. Although technological advances have recently stimulated the e-book market, demand for paper books remains high because of their physical advantages. In particular, people often buy a paper copy of a book they want to keep, and in the course of buying it they perform various actions such as skimming it or reading it carefully. People buy books according to various criteria; the reputation of the book, the fame of the author, and bestseller status all influence the purchase decision. Books without such fame therefore rely on cover design and cover copy to attract attention. However, it is difficult to know whether these elements really drew the buyer's eye and led to an actual purchase. Surveys are used to find out, but it is difficult to gauge buyers' interest across a large number of books.

1. Jeong Jae-Yeol and Shin Dong-Hee, "A Study on the Book Purchase Intention and Propensity of E-Book and Paper Book Readers: Focusing on the Situation of Purchasing at Bookstores," Korean HCI Society Conference, pp. 1135-1137, 2014.
2. "Convolutional neural network to classify urban sounds," in TENCON 2017 - 2017 IEEE Region 10 Conference, IEEE, 2017, pp. 3089-3092.
3. J. Salamon, C. Jacoby, and J. P. Bello, "A Dataset and Taxonomy for Urban Sound Research," in Proceedings of the 22nd ACM International Conference on Multimedia, Orlando, Florida, USA, 2014.
4. Dae-Seo Park, Jun-Il Bang, Hwa-Jong Kim, and Young-Jun Ko, "Study on gender and age classification technology for voice data using CNN," Journal of the Korean Society for Information Technology, vol. 16, no. 11, pp. 11-21, 2018.
5. Se-Young Kim, Hyun-Woong Kim, Chan-Ho Park, and Mok-Dong Jeong, "Deep Learning-based Sound Direction and Type Identification System for the Hearing Impaired," Proceedings of the Korean Society for Information Science and Technology, pp. 1896-1898, 2017.
6. B. McFee et al., "librosa: Audio and music signal analysis in python," in Proceedings of the 14th Python in Science Conference, 2015, vol. 8.
7. L. Muda, M. Begam, and I. Elamvazuthi, "Voice recognition algorithms using mel frequency cepstral coefficient (MFCC) and dynamic time warping (DTW) techniques," arXiv preprint arXiv:1003.4083, 2010.

Accordingly, the present invention has been devised to solve the problems described above, and an object of the present invention is to provide a book sound classification method using a machine learning model of book handling sounds that can quantify how a user handles a book based on the subtle sounds generated while the book is handled.

Another object of the present invention is to provide a book sound classification method using a machine learning model of book handling sounds that can classify and predict a user's book reading actions from subtle sounds using a convolutional neural network (CNN).

However, the technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention belongs from the description below.

To achieve the above technical objects, there is provided a book sound classification method using a machine learning model of book handling sounds, comprising: a step (S100) of acquiring and preparing book handling sound data; a data augmentation step (S110) of modifying the data to increase the number of samples in order to improve the robustness of the machine learning system; a step (S120) of converting the original data and the augmented data into a format for deep learning; a step (S130) of generating a convolutional neural network model and performing deep learning-based machine learning using the converted data; a step (S150) of completing the convolutional neural network model after the deep learning is completed; a step (S160) of inputting book sound data into the trained convolutional neural network model; and an inference step (S170) in which the convolutional neural network model computes on the book sound data to classify the book sound.

In addition, the book handling sounds include at least one of: the sound of turning pages one at a time from the bottom of the book (C0); the sound of flipping through the pages all at once (C1); the sound of pressing firmly along the center line of the book (C2); the sound of letting the pages pass one at a time (C3); the sound of closing the book (C4); the sound of turning the book over and fluttering it (C5); the sound of moving a finger across a page (C6); and the sound of underlining text with a finger (C7).

In addition, the data augmentation step (S110) may include at least one of: adding background music to the book handling sound data; adding white noise; changing the pitch of the sound source; and shifting the sound source along the time axis.

In addition, the conversion step (S120) converts the book handling sound data into mp3 files with a length in the range of 0.5 to 2 seconds.

In addition, the deep learning step (S130) of the neural network model includes: a first convolution step (S131) using the converted data; a step (S132) of computing with a first activation function; a second convolution step (S133) using the computed data; a step (S134) of computing with a second activation function; a step (S135) of pooling the computed data; and a step (S136) of applying dropout to the neural network model.

In addition, the first convolution step (S131) through the dropout step (S136) may be repeated a plurality of times.

In addition, the method further includes: a fully connected (FC) layer step (S137) of flattening the neural network model; and a softmax step (S138) of determining the classification of the book sound.

In addition, between the deep learning-based machine learning step (S130) and the step (S150) of completing the convolutional neural network model, the method further includes a step (S140) of inputting test data to check the performance of the completed convolutional neural network model.

According to an embodiment of the present invention, eight actions performed with a book are analyzed, so that the user's action can be predicted precisely from sound alone.

In addition, the reader's degree of interest in a book is analyzed using sound data and a convolutional neural network rather than a conventional survey, which enables a highly objective and systematic analysis.

In addition, analyzing the various sounds produced while a book is being examined makes it easier and more reliable to identify which books attracted people's interest. That is, a reader who is interested in a book will turn the pages slowly one at a time or read while underlining, whereas a reader who is not interested will flip through quickly or close the book soon. Determining how the reader is looking at the book from the sounds made during these actions can be used to assess the level of interest quantitatively.

In addition, the convolutional neural network structure presented here is described in its most general form, and improved performance can be obtained by using various convolutional neural network techniques such as shortcut (skip) connections and cross-layer connections.

However, the effects obtainable from the present invention are not limited to those mentioned above, and other effects not mentioned will be clearly understood by those of ordinary skill in the art to which the present invention belongs from the description below.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description below, serve to aid understanding of the technical idea of the present invention; the present invention should therefore not be construed as limited to what is shown in the drawings.
FIG. 1 shows example spectra illustrating the data augmentation used in the present invention;
FIG. 2 is an example of the eight actions that can be performed with a book, as classified in the present invention;
FIG. 3 is an example of the spectrum for each class in the present invention;
FIG. 4a is a flowchart of a book sound classification method using a machine learning model of book handling sounds according to an embodiment of the present invention;
FIG. 4b is a detailed flowchart of the step (S130) of building the convolutional neural network model and performing deep learning in FIG. 4a;
FIG. 5 is a confusion matrix graph showing the performance of a model according to an embodiment of the present invention; and
FIG. 6 is a confusion matrix graph showing the performance of a model using the conventional MFCC algorithm, for comparison with the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings so that those of ordinary skill in the art to which the present invention belongs can readily practice them. However, because the description of the present invention is merely an embodiment for structural or functional explanation, the scope of the present invention should not be construed as limited by the embodiments described herein. That is, because the embodiments can be modified in various ways and can take various forms, the scope of the present invention should be understood to include equivalents capable of realizing its technical idea. In addition, the objects or effects presented in the present invention do not mean that a specific embodiment must include all of them or only those effects, so the scope of the present invention should not be understood as limited by them.

The meaning of the terms used in the present invention should be understood as follows.

Terms such as "first" and "second" are used to distinguish one component from another, and the scope of rights should not be limited by these terms. For example, a first component may be named a second component, and similarly a second component may be named a first component. When a component is said to be "connected" to another component, it may be directly connected to that other component, but it should be understood that another component may exist in between. In contrast, when a component is said to be "directly connected" to another component, it should be understood that no other component exists in between. Other expressions describing the relationship between components, such as "between" and "directly between," or "adjacent to" and "directly adjacent to," should be interpreted in the same way.

A singular expression should be understood to include the plural unless the context clearly indicates otherwise. Terms such as "include" or "have" are intended to specify the presence of the stated features, numbers, steps, operations, components, parts, or combinations thereof, and should be understood not to exclude in advance the presence or possible addition of one or more other features, numbers, steps, operations, components, parts, or combinations thereof.

Unless otherwise defined, all terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and should not be interpreted as having an ideal or excessively formal meaning unless explicitly so defined in the present invention.

Hereinafter, the configuration of a preferred embodiment is described in detail with reference to the accompanying drawings. FIG. 4a is a flowchart of a book sound classification method using a machine learning model of book handling sounds according to an embodiment of the present invention. As shown in FIG. 4a, book handling sound data is first acquired and prepared (S100). About 50,000 collected book sound files, each 15 seconds long, were used as the book handling sounds. The sounds were recorded with built-in apps using an ordinary mobile phone recorder or a PC microphone.

The book handling sounds were defined and classified into eight categories: the sound of turning pages one at a time from the bottom of the book (C0); the sound of flipping through the pages all at once (C1); the sound of pressing firmly along the center line of the book (C2); the sound of letting the pages pass one at a time (C3); the sound of closing the book (C4); the sound of turning the book over and fluttering it (C5); the sound of moving a finger across a page (C6); and the sound of underlining text with a finger (C7).

FIG. 2 is an example of the eight actions that can be performed with a book, as classified in the present invention, and FIG. 3 is an example of the spectrum for each class.

Next, a data augmentation step is performed in which the acquired book handling sound data is modified to increase the number of samples (S110). FIG. 1 shows example spectra illustrating the data augmentation used in the present invention. As shown in FIG. 1, the augmentation applied the following operations: adding background music to the book handling sound data, adding white noise, changing the pitch of the sound source, and shifting the sound source along the time axis. [Table 1] lists the classes of book handling sounds together with the number of original samples and the number of augmented samples.

[Table 1]
Action                                            Name        Class   Data     Augmentation
Turn the bottom of the book one page at a time    Turn        C0      2,262    11,310
Flip through the pages all at once                Flip        C1      1,630    8,150
Press firmly along the center line of the book    Rub         C2      1,602    8,010
Let the pages pass one at a time                  One Paper   C3      1,013    5,061
Close the book                                    Close       C4      909      4,545
Turn the book over and flutter it                 Flutter     C5      874      4,370
Move a finger                                     Finger      C6      930      4,650
Underline with a finger                           Underline   C7      870      4,350
Sum                                                                   10,090   50,446
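The four augmentation operations described for step S110 can be sketched as follows. This is a minimal illustration using only the Python standard library, not the implementation used in the experiments: the function names, the noise level, the mixing gain, and the resampling-based pitch change (which also shortens the clip) are all assumptions made for the example.

```python
import math
import random


def add_white_noise(samples, noise_std=0.005, seed=0):
    """Add zero-mean Gaussian noise to every sample (noise_std is an assumed level)."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def add_background(samples, background, gain=0.3):
    """Mix a background track into the clip, looping the track if it is shorter."""
    return [s + gain * background[i % len(background)] for i, s in enumerate(samples)]


def shift_time(samples, offset):
    """Circularly shift the clip by `offset` samples along the time axis."""
    if not samples:
        return []
    offset %= len(samples)
    return samples[-offset:] + samples[:-offset] if offset else list(samples)


def change_pitch(samples, factor):
    """Naive pitch change by resampling with linear interpolation.

    factor > 1 raises the pitch and shortens the clip; a real pipeline would
    more likely use a duration-preserving pitch shifter.
    """
    n_out = max(1, int(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# Hypothetical stand-ins for a recorded clip and a music track (8 kHz sample rate).
clip = [math.sin(2 * math.pi * 440 * t / 8000) for t in range(8000)]
music = [0.5 * math.sin(2 * math.pi * 220 * t / 8000) for t in range(4000)]
augmented = [
    add_background(clip, music),
    add_white_noise(clip),
    change_pitch(clip, 1.2),
    shift_time(clip, 800),
]
```

Each original clip yields several modified copies in this way, which is how augmentation increases the number of training samples.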

Next, the original data and the augmented data are converted into a predetermined format for deep learning-based machine learning (S120). Each sample is a Wave-format clip of a specific length, which may be chosen in the range of 0.5 to 2 seconds. If the clip is too short, it is difficult to capture the characteristics of the sound in the recording and for the artificial neural network to extract them; if it is too long, unnecessary noise may be recorded along with it or training may take longer. Various file formats such as m4a, mp3, and wav may be used. Using a programming library (e.g., LibROSA for Python), the one-dimensional sound data was converted into two-dimensional spectrogram data containing time, frequency, and intensity information, which was used as training data.
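The conversion in step S120 can be illustrated with a small standard-library sketch that cuts a recording into fixed-length segments and computes a magnitude spectrogram with a naive short-time Fourier transform. The source names LibROSA for this step; the code below is only a stand-in to show the idea, and the function names, frame size, hop size, and sample rate are assumptions.

```python
import cmath
import math


def split_clip(samples, sample_rate, segment_seconds=1.0):
    """Cut a recording into non-overlapping fixed-length segments."""
    size = int(sample_rate * segment_seconds)
    return [samples[i:i + size] for i in range(0, len(samples) - size + 1, size)]


def spectrogram(samples, frame_size=64, hop=32):
    """Return a list of frames, each a list of frequency-bin magnitudes."""
    # Hann window to reduce spectral leakage at the frame edges.
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / (frame_size - 1))
              for n in range(frame_size)]
    frames = []
    for start in range(0, len(samples) - frame_size + 1, hop):
        frame = [samples[start + n] * window[n] for n in range(frame_size)]
        bins = []
        for k in range(frame_size // 2 + 1):
            acc = sum(frame[n] * cmath.exp(-2j * math.pi * k * n / frame_size)
                      for n in range(frame_size))
            bins.append(abs(acc))
        frames.append(bins)
    return frames


# Hypothetical input: two seconds of a 200 Hz tone sampled at 1600 Hz.
sample_rate = 1600
tone = [math.sin(2 * math.pi * 200 * t / sample_rate) for t in range(2 * sample_rate)]
segments = split_clip(tone, sample_rate, segment_seconds=1.0)
spec = spectrogram(segments[0])  # time on the first axis, frequency on the second
```

The result is a two-dimensional grid (time by frequency) whose values carry the intensity, which is the kind of input a two-dimensional CNN expects.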

Next, a convolutional neural network model is generated, and deep learning-based machine learning is performed using the converted data (S130). The model consists of six convolutional layers, each of which includes pooling and dropout. An activation function converts the sum of the input signals arriving at an individual neuron of the network into an output signal; ReLU (rectified linear unit) was used as the activation function. Various optimization techniques may be used; in this example, the network was optimized with the Adam optimizer. FIG. 4b is a detailed flowchart of the step (S130) of building the convolutional neural network model and performing deep learning in FIG. 4a. As shown in FIG. 4b, a first convolution is performed using the converted data (S131). A first activation function is then applied (S132); ReLU was used as the first activation function.

Next, a second convolution is performed using the computed data (S133). A second activation function is then applied (S134); ReLU was also used as the second activation function.
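The convolution and activation steps S131 to S134 can be traced on a tiny input. The sketch below is illustrative only: the 4x4 patch and the two 2x2 kernels are made-up values, and a trained network would learn its kernels and use many channels.

```python
def conv2d(image, kernel):
    """Valid (no padding), stride-1 2D cross-correlation, as used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def relu(feature_map):
    """Rectified linear unit: negative values become 0, others pass through."""
    return [[max(0.0, v) for v in row] for row in feature_map]


# Hypothetical spectrogram patch and kernels (not values from the source).
patch = [
    [1.0, 2.0, 0.0, 1.0],
    [0.0, 1.0, 3.0, 1.0],
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0, 2.0],
]
k1 = [[1.0, -1.0], [-1.0, 1.0]]
k2 = [[0.5, 0.5], [0.5, 0.5]]

hidden = relu(conv2d(patch, k1))      # S131 then S132
features = relu(conv2d(hidden, k2))   # S133 then S134
print(features)  # [[2.0, 2.5], [1.0, 2.0]]
```

Each valid convolution with a 2x2 kernel shrinks the map by one row and one column, so the 4x4 patch becomes 3x3 and then 2x2.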

Next, the computed data is pooled (S135). Pooling reduces the amount of data by selecting one representative value within each window and discarding the rest. Either max pooling or average pooling may be applied.
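As a minimal sketch of the pooling step S135, the function below applies non-overlapping max or average pooling to a small feature map. The window size of 2 and the example values are assumptions chosen for readability.

```python
def pool2d(feature_map, size=2, mode="max"):
    """Non-overlapping pooling; rows/columns that do not fill a window are dropped."""
    reduce = max if mode == "max" else (lambda w: sum(w) / len(w))
    out = []
    for r in range(0, len(feature_map) - size + 1, size):
        row = []
        for c in range(0, len(feature_map[0]) - size + 1, size):
            window = [feature_map[r + i][c + j]
                      for i in range(size) for j in range(size)]
            row.append(reduce(window))
        out.append(row)
    return out


# Hypothetical 4x4 feature map.
fmap = [
    [1.0, 3.0, 2.0, 0.0],
    [4.0, 2.0, 1.0, 1.0],
    [0.0, 1.0, 5.0, 6.0],
    [2.0, 1.0, 7.0, 8.0],
]
max_pooled = pool2d(fmap, mode="max")       # [[4.0, 2.0], [2.0, 8.0]]
avg_pooled = pool2d(fmap, mode="average")   # [[2.5, 1.0], [1.0, 6.5]]
```

Both variants reduce the 4x4 map to 2x2; max pooling keeps the strongest response in each window, while average pooling keeps the mean.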

Next, dropout is applied to the neural network model (S136). Dropout is a technique that switches off nodes in the network structure that do not take part in learning or that contribute little, together with their connections, and it is more effective than learning by weights.
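For illustration, dropout as it is commonly implemented switches units off at random during training and leaves the network unchanged at inference time. The sketch below follows that common "inverted dropout" form; the random selection, the rate of 0.25, and the rescaling of surviving units are assumptions of this example and are not details given in the source.

```python
import random


def dropout(activations, rate=0.25, training=True, seed=None):
    """Inverted dropout over a flat list of activations.

    During training each unit is zeroed with probability `rate`, and the
    survivors are scaled by 1 / (1 - rate) so the expected sum is unchanged.
    At inference time the activations pass through untouched.
    """
    if not training or rate == 0.0:
        return list(activations)
    rng = random.Random(seed)
    keep = 1.0 - rate
    return [a / keep if rng.random() < keep else 0.0 for a in activations]


train_out = dropout([1.0] * 1000, rate=0.25, seed=1)   # some units zeroed
infer_out = dropout([1.0, 2.0], training=False)        # unchanged
```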

The first convolution step (S131) through the dropout step (S136) described above are repeated a plurality of times (e.g., three to five times).

Next, a fully connected (FC) layer step of flattening the neural network model is executed (S137). The FC step "flattens" the output of the previous layer into a single vector that can serve as the input to the next stage.

Next, a step of computing the softmax function, which determines the classification of the book sound, is executed (S138). Each output of the softmax function is a real number between 0 and 1.0, so the outputs can be interpreted as class probabilities.
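The flattening, fully connected, and softmax steps (S137 and S138) can be sketched as below. The weights, biases, and the three-class output are hypothetical values chosen to keep the example short; the model in the source distinguishes eight classes.

```python
import math


def flatten(feature_maps):
    """Concatenate a list of 2D feature maps into one flat vector."""
    return [v for fmap in feature_maps for row in fmap for v in row]


def dense(vector, weights, biases):
    """Fully connected layer: one weighted sum plus bias per output unit."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    """Numerically stable softmax; outputs lie in [0, 1] and sum to 1."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical single 2x2 feature map and a 3-class output layer.
feature_maps = [[[1.0, 0.0], [0.0, 2.0]]]
weights = [
    [0.5, 0.0, 0.0, 0.5],
    [0.0, 1.0, 1.0, 0.0],
    [-0.5, 0.0, 0.0, -0.5],
]
biases = [0.0, 0.0, 0.0]

flat = flatten(feature_maps)            # [1.0, 0.0, 0.0, 2.0]
logits = dense(flat, weights, biases)   # [1.5, 0.0, -1.5]
probs = softmax(logits)
```

Subtracting the largest logit before exponentiating does not change the result but avoids overflow for large values.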

Next, after the deep learning is completed, the convolutional neural network model is completed (S150).

Next, test data is input to check the performance of the completed convolutional neural network model (S140). The test data is an appropriate number of samples selected from the initial book handling sound data.

Next, actual book sound data is input into the trained convolutional neural network model (S160). The model then performs an inference computation on the input data and outputs which of the eight book handling sounds it corresponds to (e.g., the sound of turning pages one at a time from the bottom of the book (C0)). Accordingly, running a portable device equipped with such a model in a library, bookstore, or similar setting makes it possible to grasp the context in which a book is being used and, as a representative example, to evaluate the reader's interest in the book (S170).
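The final mapping from the model's output to one of the eight classes can be sketched as follows. The class identifiers and names come from [Table 1]; the function name `decode_prediction` and the example probability vector are hypothetical.

```python
# Class identifiers and short names as listed in [Table 1].
CLASS_NAMES = {
    "C0": "Turn", "C1": "Flip", "C2": "Rub", "C3": "One Paper",
    "C4": "Close", "C5": "Flutter", "C6": "Finger", "C7": "Underline",
}


def decode_prediction(probabilities):
    """Return (class id, class name, probability) for the most likely class."""
    if len(probabilities) != len(CLASS_NAMES):
        raise ValueError("expected one probability per class")
    index = max(range(len(probabilities)), key=probabilities.__getitem__)
    class_id = f"C{index}"
    return class_id, CLASS_NAMES[class_id], probabilities[index]


# Hypothetical softmax output for one input clip.
example = [0.70, 0.05, 0.02, 0.10, 0.03, 0.04, 0.03, 0.03]
print(decode_prediction(example))  # ('C0', 'Turn', 0.7)
```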

Experimental results

FIG. 5 is a confusion matrix graph showing the performance of the model according to an embodiment of the present invention. As shown in FIG. 5, training resulted in an accuracy of about 91%, confirming that the eight actions were predicted correctly. Precision and recall were used as performance metrics for the model, and the results are shown in [Table 2]. The average precision and recall were 0.92 and 0.91, respectively.

[Table 2]
Action      Precision   Recall   F1-score
Turn        0.84        1.00     0.92
Flip        0.96        0.84     0.90
Rub         0.97        0.97     0.97
One Paper   0.85        0.84     0.84
Close       0.96        0.94     0.95
Flutter     0.98        0.88     0.93
Finger      0.90        0.95     0.92
Underline   0.93        0.74     0.83
Average     0.92        0.91     0.91
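Per-class precision, recall, and F1-score of the kind reported in [Table 2] are conventionally derived from a confusion matrix such as the one in FIG. 5. The sketch below shows that calculation on a made-up three-class matrix; the numbers are not taken from the source.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j.

    Returns one (precision, recall, f1) tuple per class.
    """
    n = len(confusion)
    results = []
    for k in range(n):
        tp = confusion[k][k]
        predicted = sum(confusion[i][k] for i in range(n))  # column total
        actual = sum(confusion[k])                          # row total
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        results.append((precision, recall, f1))
    return results


def accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[k][k] for k in range(len(confusion))) / total


# Hypothetical confusion matrix for three classes, ten samples each.
toy = [
    [8, 2, 0],
    [1, 9, 0],
    [1, 0, 9],
]
metrics = per_class_metrics(toy)
overall = accuracy(toy)
```

Precision is computed down a column (everything predicted as the class) and recall along a row (everything that truly belongs to the class), which is why a class can score high on one and low on the other.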

Comparison with the MFCC algorithm

MFCC (Mel-Frequency Cepstral Coefficients) is an algorithm that reflects the way humans hear sound and is widely used in conventional speech recognition. The performance of the CNN according to the present invention was compared with that of the MFCC algorithm. The same data shown in [Table 1] was applied to the MFCC algorithm. Using an MFCC-based feature extraction technique, the continuous one-dimensional sound data was converted into a set of feature values. The feature values may include zero_crossing_rate, spectral_rolloff, spectral_centroid, spectral_contrast, spectral_bandwidth, and the like obtained from the data; because these are conventional techniques, a detailed description is omitted. A random forest classifier was then used to obtain the validation accuracy. FIG. 6 is a confusion matrix graph showing the performance of the model using the conventional MFCC algorithm, for comparison with the present invention, and [Table 3] shows the experimental results for the MFCC algorithm. As can be seen from FIG. 6 and [Table 3], the MFCC algorithm showed a relatively low validation accuracy of 84%, which shows that the CNN-based model of the present invention has the advantage of being generalizable.

| Action | Precision | Recall | F1-score |
|---|---|---|---|
| Turn | 0.89 | 0.96 | 0.92 |
| Flip | 0.93 | 0.75 | 0.83 |
| Rub | 0.88 | 0.91 | 0.89 |
| One Paper | 0.51 | 0.70 | 0.59 |
| Close | 0.99 | 0.97 | 0.98 |
| Flutter | 0.99 | 0.98 | 0.98 |
| Finger | 0.95 | 0.96 | 0.95 |
| Underline | 0.65 | 0.41 | 0.50 |
| Average | 0.86 | 0.85 | 0.85 |
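For readers who want to see how per-class precision, recall, and F1-score relate to a confusion matrix, the following sketch derives them from raw counts. The 3-class matrix is hypothetical and is not data from this document; the function names are likewise chosen only for this example.

```python
def per_class_metrics(confusion):
    """confusion[i][j] = number of samples of true class i predicted as class j."""
    n = len(confusion)
    metrics = []
    for c in range(n):
        tp = confusion[c][c]
        predicted = sum(confusion[r][c] for r in range(n))  # column sum
        actual = sum(confusion[c])                          # row sum
        precision = tp / predicted if predicted else 0.0
        recall = tp / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics.append((precision, recall, f1))
    return metrics


def accuracy(confusion):
    """Share of all samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Hypothetical counts for three classes (not taken from the patent).
confusion = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 1, 9],
]
metrics = per_class_metrics(confusion)
overall = accuracy(confusion)
```

Because precision is computed over a column and recall over a row, a class can score high on one and low on the other, as the Underline row above shows.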

The detailed description of the preferred embodiments of the present invention disclosed above has been provided to enable those skilled in the art to implement and practice the present invention. Although the invention has been described above with reference to preferred embodiments, those skilled in the art will understand that various modifications and changes can be made without departing from the scope of the present invention. For example, those skilled in the art may use the configurations described in the above embodiments in combination with one another. Accordingly, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The present invention may be embodied in other specific forms without departing from its spirit and essential features. Therefore, the above detailed description should not be construed as restrictive in any respect and should be considered illustrative. The scope of the present invention should be determined by reasonable interpretation of the appended claims, and all changes within the equivalent scope of the present invention are included in its scope. The present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein. In addition, claims that do not explicitly cite one another may be combined to form an embodiment or may be included as a new claim by amendment after filing.

Claims

A book sound classification method using a machine learning model of book handling sounds, the method comprising:
a step (S100) of acquiring and preparing data of book handling sounds;
a data augmentation step (S110) of increasing the number of data by modifying the data in order to increase the robustness of machine learning;
a step (S120) of converting the data of the book handling sounds and the augmented data into a format for deep learning;
a step (S130) of generating a convolutional neural network model and performing deep learning-based machine learning using the converted data;
a step (S150) of completing the convolutional neural network model after the deep learning is completed;
a step (S160) of inputting book sound data into the completed convolutional neural network model; and
a step (S170) in which the convolutional neural network model operates on the book sound data to classify the book sound.

The method of claim 1,
wherein the book handling sounds include at least one of: a sound of turning pages one by one at the bottom of the book (C0); a sound of turning pages all at once (C1); a sound of pressing the center line of the book (C2); a sound of turning a page (C3); a sound of closing the book (C4); a sound of flipping through the book (C5); a sound of moving a finger over a page (C6); and a sound of underlining text with a finger (C7).

The method of claim 1,
wherein the data augmentation step (S110) comprises at least one of adding background music to the book handling sound data, adding white noise, changing the pitch of the sound source, and shifting the sound source along the time axis.
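
The four augmentation operations named in this claim can be sketched on a plain list of samples as follows. This is a minimal sketch under stated assumptions: the noise level, the mixing gain, the circular shift, and the pitch change by naive linear-interpolation resampling (which also changes the clip duration) are illustrative choices, and all function names are hypothetical.

```python
import random


def add_white_noise(samples, noise_std, rng):
    """Add zero-mean Gaussian noise to every sample."""
    return [s + rng.gauss(0.0, noise_std) for s in samples]


def mix_background(samples, background, gain):
    """Overlay a background track, looping it if it is shorter than the clip."""
    return [s + gain * background[i % len(background)] for i, s in enumerate(samples)]


def time_shift(samples, offset):
    """Circularly shift the clip to the right by `offset` samples."""
    offset %= len(samples)
    return samples[-offset:] + samples[:-offset] if offset else list(samples)


def resample_pitch(samples, factor):
    """Crude pitch change: factor > 1 raises the pitch and shortens the clip."""
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = min(int(pos), len(samples) - 1)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append((1 - frac) * samples[lo] + frac * samples[hi])
    return out


# Toy clip and background standing in for real recordings (assumed values).
rng = random.Random(0)
clip = [0.0, 0.5, 1.0, 0.5, 0.0, -0.5, -1.0, -0.5]
augmented = [
    add_white_noise(clip, 0.01, rng),
    mix_background(clip, [0.1, -0.1], 0.5),
    time_shift(clip, 2),
    resample_pitch(clip, 2.0),
]
```

Each original clip then contributes several training examples that share its class label.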

The method of claim 1,
wherein the conversion step (S120) converts the data of the book handling sounds into audio files set to a length in the range of 0.5 seconds to 2 seconds.
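
One way to obtain clips of a fixed length within the stated range is to cut a longer recording into equal, non-overlapping pieces. The sketch below assumes the recording is already a list of samples; the function name `split_into_clips`, the default length of 1.0 second, and the choice to drop a trailing partial clip are assumptions for illustration, not details from the source.

```python
def split_into_clips(samples, sample_rate, clip_seconds=1.0):
    """Cut a recording into non-overlapping clips of equal length.

    The clip length must lie in the 0.5 s to 2 s range; a trailing
    remainder shorter than one clip is dropped.
    """
    if not 0.5 <= clip_seconds <= 2.0:
        raise ValueError("clip_seconds must be between 0.5 and 2.0")
    clip_len = int(round(clip_seconds * sample_rate))
    return [
        samples[start:start + clip_len]
        for start in range(0, len(samples) - clip_len + 1, clip_len)
    ]


# Toy recording: 10 samples at 4 samples per second (assumed values).
recording = list(range(10))
clips = split_into_clips(recording, sample_rate=4, clip_seconds=1.0)
```

Equal-length clips give every training example the same input shape, which a convolutional network with a fully connected output stage requires.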

The method of claim 1,
wherein the deep learning step (S130) of the neural network model comprises:
a step (S131) of performing a first convolution using the converted data;
a step (S132) of computing with a first activation function;
a step (S133) of performing a second convolution using the computed data;
a step (S134) of computing with a second activation function;
a step (S135) of pooling the computed data; and
a step (S136) of applying dropout to the neural network model.

The method of claim 5,
wherein the steps from the first convolution step (S131) to the dropout step (S136) are repeatedly executed a plurality of times.

The method of claim 5,
further comprising:
a fully connected (FC) layer step (S137) of flattening the neural network model; and
a softmax step (S138) of determining the classification of the book sound.
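
The order of operations in steps S131 to S138 can be traced with a deliberately tiny forward pass written in plain Python. This is only a sketch under heavy assumptions: a single channel, one-dimensional input, random untrained weights, ReLU as both activation functions, kernel size 3, pool size 2, dropout rate 0.25, and two repeated blocks are all illustrative choices that the claims do not specify.

```python
import math
import random


def conv1d(signal, kernel):
    """Valid 1-D cross-correlation, the 'convolution' used in deep learning."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def relu(values):
    return [max(0.0, v) for v in values]


def max_pool(values, size):
    return [max(values[i:i + size]) for i in range(0, len(values) - size + 1, size)]


def dropout(values, rate, rng, training):
    """Inverted dropout: active during training, identity at inference."""
    if not training:
        return list(values)
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


def dense(values, weights, biases):
    return [sum(w * v for w, v in zip(row, values)) + b
            for row, b in zip(weights, biases)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def extract(signal, blocks, rng, training=False):
    x = signal
    for k1, k2 in blocks:                     # block repeated a plurality of times
        x = relu(conv1d(x, k1))               # S131, S132
        x = relu(conv1d(x, k2))               # S133, S134
        x = max_pool(x, 2)                    # S135
        x = dropout(x, 0.25, rng, training)   # S136
    return x                                  # already flat in this 1-D toy


def classify(signal, blocks, fc_w, fc_b, rng, training=False):
    x = extract(signal, blocks, rng, training)
    return softmax(dense(x, fc_w, fc_b))      # S137, S138


rng = random.Random(0)
n_classes = 8  # C0 to C7
blocks = [([rng.uniform(-1, 1) for _ in range(3)],
           [rng.uniform(-1, 1) for _ in range(3)]) for _ in range(2)]
signal = [math.sin(0.3 * t) for t in range(64)]  # stand-in for converted data
flat_len = len(extract(signal, blocks, rng))     # 64 -> 62 -> 60 -> 30 -> 28 -> 26 -> 13
fc_w = [[rng.uniform(-0.1, 0.1) for _ in range(flat_len)] for _ in range(n_classes)]
fc_b = [0.0] * n_classes
probs = classify(signal, blocks, fc_w, fc_b, rng)
```

The index of the largest entry in `probs` would be taken as the predicted class; a practical model would use many filters per layer, two-dimensional inputs such as spectrograms, and weights learned by training.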

The method of claim 1,
further comprising, between the deep learning-based machine learning step (S130) and the convolutional neural network model completion step (S150), a step (S140) of checking the performance of the completed convolutional neural network model by inputting test data.