KR20230018677A

KR20230018677A - System for recommending implementation of unborrowed book

Info

Publication number: KR20230018677A
Application number: KR1020210100501A
Authority: KR
Inventors: 김건욱; 진민하
Original assignee: 김건욱; 진민하
Priority date: 2021-07-30
Filing date: 2021-07-30
Publication date: 2023-02-07

Abstract

The present invention relates to an unborrowed book recommending system, and more specifically, to an unborrowed book recommending system, which considers diversity of books, rather than recommending books based on popular books, to provide customized services. The unborrowed book recommending system includes: a data storage unit for storing collection list information, loan history information, member information, and detailed book information of a book loan system in which a plurality of books and members are registered; a data set building unit which builds an integrated data set for the information; a text mining unit; a topic clustering unit which classifies the plurality of books into a plurality of clusters and assigns a unique topic to each cluster; a popular category extraction unit; a TF-IDF vectorization unit; a cosine similarity calculating unit; and an unborrowed book recommendation unit which recommends the unborrowed books based on an average value of a cosine similarity value.

Description

Unloaned book recommendation system {SYSTEM FOR RECOMMENDING IMPLEMENTATION OF UNBORROWED BOOK}

The present invention relates to an unborrowed book recommendation system, and more particularly to one that provides personalized recommendations that take the diversity of books into account rather than focusing on popular books.

While the roles and functions of public libraries in Korea are diversifying, biased book lending is causing various problems internally. In addition, public libraries have recently introduced recommendation systems that focus on popular books, which limits the diversity of books that users encounter.

Furthermore, with the advance of IT in the Fourth Industrial Revolution and the growing variety of demands from public library users, the roles and functions of public libraries are expanding. Beyond the traditional function of lending books, libraries now serve as cultural centers and host exhibitions, film screenings, and local community activities, and have thus become closely tied to the lives of residents.

Although public libraries are becoming more important as they grow in number and take on more diverse roles, problems such as biased book lending persist internally. As a result, loans concentrate on a few popular books, library use is less active, and collections become saturated. To address this, public libraries regularly recommend books through curators and set usage promotion policies at the discretion of the staff in charge, but these approaches have many limitations.

To address this, various prior studies have sought to implement personalized book recommendation systems, and most of them used collaborative filtering algorithms, content-based filtering algorithms, or hybrid algorithms. These approaches have the advantage of building a machine learning-based recommendation system from individual loan history data, but most of them recommend books that are already borrowed frequently, so they are limited in their ability to recommend a diverse range of books.

The present invention relates to an unborrowed book recommendation system and is intended to provide a system that recommends unborrowed books on a personalized basis while considering the diversity of books, rather than recommending mainly popular books.

The technical problems to be solved by the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood from the description below by those of ordinary skill in the art to which the present invention pertains.

The unborrowed book recommendation system of the present invention may include:

a data storage unit that stores collection list information, loan history information, member information, and detailed book information of a book rental system in which a plurality of books and members are registered;

a data set building unit that builds an integrated data set from the collection list information, the loan history information, the member information, and the detailed book information;

a text mining unit that performs text mining on text data in the integrated data set;

a topic clustering unit that takes the data processed by the text mining unit as input, classifies the plurality of books into a plurality of clusters by spherical K-means clustering and latent Dirichlet allocation (LDA)-based topic modeling, and assigns a unique topic to each cluster;

a popular category extraction unit that classifies the plurality of books into a borrowed book list and an unborrowed book list based on the integrated data set, and extracts the top-ranked unique topics of the borrowed book list as popular categories;

a TF-IDF vectorization unit that calculates term frequency-inverse document frequency (TF-IDF) values for each of the plurality of books for TF-IDF-based feature vectorization;

a cosine similarity calculation unit that calculates cosine similarity values from the TF-IDF values of books in the unborrowed book list and the TF-IDF values of books in the borrowed book list; and

an unborrowed book recommendation unit that recommends unborrowed books based on, for each book in the unborrowed book list, the average of its cosine similarity values with respect to a plurality of books in the borrowed book list.

In the unborrowed book recommendation system of the present invention, the collection list information may include at least one of book title, author, publisher, publication year, International Standard Book Number (ISBN), set ISBN, additional code, subject classification number, number of volumes, number of loans, and book registration date; the loan history information may include, for each loan, at least one of user number, book number, Korean Decimal Classification (KDC), book title, author, publisher, loan date, return date, and media type; the member information may include at least one of user number, member registration date, date of birth, gender, postal code, number of books borrowed, and number of overdue returns; and the detailed book information may include at least one of book title, KDC, publication date, author, publisher, book image URL, and book introduction.

In the unborrowed book recommendation system of the present invention, the data set building unit may include: a book introduction organizing unit that merges the book introductions of books with the same title in the detailed book information and deletes the detailed book information of books whose introduction is blank; a member information organizing unit that joins the loan history information and the member information on the user number (User key) and generates an age group variable, which is categorical data, from the date of birth, which is discrete data; a loan information data set generating unit that merges the collection list information, the loan history information, and the member information to generate a loan information data set; and an integrated data set building unit that combines the loan information data set with the detailed book information to build the integrated data set.

In the unborrowed book recommendation system of the present invention, the text mining unit may include: a text cleansing unit that removes numbers and special characters from the text data in the detailed book information; a noun extraction unit that extracts nouns from the text data in the detailed book information; a stopword removal unit that removes stopwords from the text data in the detailed book information; and a tokenization unit that splits the Korean text processed by the text cleansing unit, the noun extraction unit, and the stopword removal unit into morphemes (tokenization).

In the unborrowed book recommendation system of the present invention, the text data in the detailed book information may be the book title or the book introduction.

In the unborrowed book recommendation system of the present invention, the topic clustering unit may perform spherical K-means clustering using the nouns extracted by word tokenization from the titles and introductions of the plurality of books.

In the unborrowed book recommendation system of the present invention, the popular category extraction unit may classify the integrated data set into a plurality of member types based on the member information and extract popular categories independently for each of the member types.

In the unborrowed book recommendation system of the present invention, the plurality of member types may be defined by gender or by age group.
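The per-member-type extraction of popular categories can be illustrated with a small sketch. It assumes each loan record has already been reduced to a (member type, topic) pair, where the topic comes from the clustering step; the function name `popular_categories`, the `top_n` parameter, and the sample data are hypothetical and not taken from the source.

```python
from collections import Counter, defaultdict


def popular_categories(loans, top_n=2):
    """Return the top_n most-borrowed topics for each member type.

    `loans` is an iterable of (member_type, topic) pairs, one per loan.
    Both the record layout and top_n are illustrative assumptions.
    """
    counts = defaultdict(Counter)
    for member_type, topic in loans:
        counts[member_type][topic] += 1
    # Rank topics independently inside each member type.
    return {
        member_type: [topic for topic, _ in counter.most_common(top_n)]
        for member_type, counter in counts.items()
    }


# Hypothetical loans keyed by (gender, age group).
loans = [
    (("F", "20s"), "travel"),
    (("F", "20s"), "travel"),
    (("F", "20s"), "economics"),
    (("M", "30s"), "history"),
    (("M", "30s"), "history"),
    (("M", "30s"), "history"),
    (("M", "30s"), "cooking"),
    (("M", "30s"), "cooking"),
    (("M", "30s"), "travel"),
]
result = popular_categories(loans, top_n=2)
```

Because the counts are kept separately per member type, a topic that is popular with one group does not influence the categories extracted for another.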

In the unborrowed book recommendation system of the present invention, the popular category extraction unit may extract a plurality of popular categories, the TF-IDF values may be calculated with each popular category as the calculation unit, and the TF-IDF vectorization unit may calculate the TF-IDF values by Equation 1 below.

[Equation 1]

$$W_{i,j} = tf_{i,j} \times \log\left(\frac{N}{df_i}\right)$$

Here, $W_{i,j}$ is the TF-IDF value of word $i$ in book $j$, $tf_{i,j}$ is the frequency of word $i$ in the designated analysis field of book $j$, $df_i$ is the number of books containing word $i$, and $N$ is the number of books belonging to the calculation unit.
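As a worked illustration of the TF-IDF weighting defined above, the following sketch assumes the common form W = tf × log(N / df) with a natural logarithm; the log base, the function name `tf_idf`, and the sample tokens are assumptions made for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Compute TF-IDF weights for a list of token lists.

    Each inner list holds the tokens of one book in the calculation unit.
    Assumes W = tf * ln(N / df); the log base is an illustrative choice.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each word once per book
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({w: tf[w] * math.log(n_docs / df[w]) for w in tf})
    return weights


# Hypothetical tokenized books in one popular category.
docs = [
    ["travel", "europe", "travel"],
    ["travel", "food"],
    ["economics", "market"],
]
weights = tf_idf(docs)
```

A word that occurs in every book of the calculation unit gets weight zero under this form, which is why restricting the calculation unit to one popular category changes which words are treated as distinctive.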

In the unborrowed book recommendation system of the present invention, the cosine similarity calculation unit may calculate the cosine similarity by Equation 2 below.

[Equation 2]

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2}\,\sqrt{\sum_{i=1}^{n} B_i^2}}$$

Here, $\cos(\theta)$ is the cosine similarity between a book in the unborrowed book list and a book in the borrowed book list, $A_i$ is the TF-IDF value of word $i$ for the book in the unborrowed book list, $B_i$ is the TF-IDF value of word $i$ for the book in the borrowed book list, and $n$ is the number of distinct words $i$.
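The cosine similarity between two TF-IDF vectors can be sketched as below. Vectors are represented as sparse word-to-weight dictionaries, which is an implementation choice for the example; the function name and sample values are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors given as word -> weight dicts."""
    # Only words present in both vectors contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for empty or all-zero vectors
    return dot / (norm_a * norm_b)


# Hypothetical TF-IDF vectors of one unborrowed and one borrowed book.
unborrowed = {"travel": 1.0, "europe": 2.0}
borrowed = {"travel": 2.0, "food": 1.0}
sim = cosine_similarity(unborrowed, borrowed)
```

Since TF-IDF weights are non-negative, the result lies between 0 (no shared words) and 1 (same direction).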

The unborrowed book recommendation system of the present invention extracts popular categories by gender and age group from public library loan history using spherical K-means clustering and LDA-based topic modeling, and then recommends unborrowed books whose characteristics are similar to the popular books in those categories through one-to-many matching in content-based filtering. The system thereby helps users learn diverse knowledge and ideas, and can contribute to mitigating biased book lending and to promoting library use.

The unborrowed book recommendation system of the present invention implements a content-based filtering recommendation system that recommends unborrowed books by reflecting the preferences of user groups, based on the loan history and member information of a public library. Collaborative filtering suffers from the cold start problem, in which no item can be recommended to a new user for whom no data has yet accumulated. The system of the present invention can address this problem by extracting the popular books in each user group's popular categories through cluster analysis and LDA topic modeling, and then recommending similar unborrowed books through cosine similarity calculation.

FIG. 1 is a block diagram showing the unborrowed book recommendation system of the present invention.
FIG. 2 is a block diagram showing the data set building unit.
FIG. 3 shows an example of a UI screen output from the unborrowed book recommendation system of the present invention.

Hereinafter, embodiments of the present invention are described in detail with reference to the accompanying drawings. The sizes and shapes of the components shown in the drawings may be exaggerated for clarity and convenience of explanation. In addition, terms specially defined in consideration of the configuration and operation of the present invention may vary according to the intention or practice of users and operators. Such terms should be defined based on the content of this specification as a whole.

In the following description, a "unit" (whether it renders the Korean term 유니트 or 부) may be implemented as a combination of hardware and software or by hardware alone.

Hereinafter, the unborrowed book recommendation system 100 of the present invention is described in detail with reference to FIGS. 1 to 3.

As shown in FIG. 1, the unborrowed book recommendation system 100 of the present invention may include:

a data storage unit 110 that stores collection list information, loan history information, member information, and detailed book information of a book rental system 11 in which a plurality of books and members are registered;

a data set building unit 120 that builds an integrated data set from the collection list information, the loan history information, the member information, and the detailed book information;

a text mining unit 130 that performs text mining on text data in the integrated data set;

a topic clustering unit 140 that takes the data processed by the text mining unit 130 as input, classifies the plurality of books into a plurality of clusters by spherical K-means clustering and latent Dirichlet allocation (LDA)-based topic modeling, and assigns a unique topic to each cluster;

a popular category extraction unit 150 that classifies the plurality of books into a borrowed book list and an unborrowed book list based on the integrated data set, and extracts the top-ranked unique topics of the borrowed book list as popular categories;

a TF-IDF vectorization unit 160 that calculates term frequency-inverse document frequency (TF-IDF) values for each of the plurality of books for TF-IDF-based feature vectorization;

a cosine similarity calculation unit 170 that calculates cosine similarity values from the TF-IDF values of books in the unborrowed book list and the TF-IDF values of books in the borrowed book list; and

an unborrowed book recommendation unit 180 that recommends unborrowed books based on, for each book in the unborrowed book list, the average of its cosine similarity values with respect to a plurality of books in the borrowed book list.
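The final recommendation step, ranking unborrowed books by their average cosine similarity to the borrowed books, can be sketched as follows. The sketch assumes TF-IDF vectors are already available as word-to-weight dictionaries; the names `recommend` and `top_k`, and the sample vectors, are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two sparse word -> weight vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def recommend(unborrowed, borrowed, top_k=1):
    """Rank unborrowed titles by mean similarity to all borrowed books."""
    if not borrowed:
        return []  # nothing to compare against
    scores = {
        title: sum(cosine(vec, b) for b in borrowed.values()) / len(borrowed)
        for title, vec in unborrowed.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical TF-IDF vectors within one popular category.
borrowed = {
    "B1": {"travel": 1.0, "europe": 1.0},
    "B2": {"travel": 1.0, "food": 1.0},
}
unborrowed = {
    "U1": {"travel": 1.0, "asia": 1.0},
    "U2": {"economics": 1.0, "market": 1.0},
}
ranking = recommend(unborrowed, borrowed, top_k=2)
```

Averaging over many borrowed books is what makes the matching one-to-many: an unborrowed book ranks highly only if it resembles the borrowed books of the category as a group, not a single title.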

The book rental system 11, in which a plurality of books and members are registered, may be a public library. Alternatively, the book rental system 11 may be an e-book service system operated over a network.

The data storage unit 110 may be a data storage medium that serves as a database.

The collection list information may be information about the books held by the book rental system 11, such as a public library. The collection list information may include at least one of book title, author, publisher, publication year, International Standard Book Number (ISBN), set ISBN, additional code, subject classification number, number of volumes, number of loans, and book registration date. The ISBN may be the book number issued so that a book can be distributed to bookstores. A set ISBN is an ISBN assigned to books published as a set, and all books belonging to the same set may share the same set ISBN. The additional code may be a code designated to classify a book by target readership, publication form, content, and the like.

The loan history information may be information about the loans that occurred in the book rental system 11, such as a public library. The loan history information may include, for each loan, at least one of user number, book number, Korean Decimal Classification (KDC), book title, author, publisher, loan date, return date, and media type. The KDC may be a book classification key according to the Korean Decimal Classification.

The member information may be personal attribute information, such as age and address, of the members who use the book rental system 11, such as a public library. More specifically, the member information may include at least one of user number, member registration date, date of birth, gender, postal code, number of books borrowed, and number of overdue returns.

The detailed book information may be information about attributes, such as title, book introduction, and publication year, of the books stored or registered in the book rental system 11, such as a public library. Specifically, the detailed book information may include at least one of book title, KDC, publication date, author, publisher, book image URL, and book introduction. The detailed book information may be data collected through the Open API provided by Library Information Naru (도서관 정보나루).

The detailed book information, collected as semi-structured data in XML or JSON form, may be converted into structured data using Python and then stored in the data storage unit 110.
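A minimal sketch of such a conversion is shown below, using only the Python standard library. The JSON layout (a top-level `books` array) and the field names are invented for illustration and do not reflect the actual response schema of any Open API.

```python
import csv
import io
import json

# Hypothetical column names for the structured (tabular) form.
FIELDS = ["title", "kdc", "publication_date", "author",
          "publisher", "image_url", "description"]


def to_rows(raw_json):
    """Flatten a semi-structured JSON payload into uniform row dicts."""
    payload = json.loads(raw_json)
    # Missing keys become empty strings so every row has the same columns.
    return [{f: item.get(f, "") for f in FIELDS} for item in payload["books"]]


def to_csv(rows):
    """Serialize the rows as CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


raw = ('{"books": ['
       '{"title": "Book A", "kdc": "813", "author": "Kim", "description": "Intro A"},'
       '{"title": "Book B", "publisher": "Pub"}]}')
rows = to_rows(raw)
csv_text = to_csv(rows)
```

Filling absent fields with empty strings keeps the table rectangular, so later steps can detect and drop books whose introduction is blank.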

Table 1 below lists the detailed fields of the collection list information, loan history information, member information, and detailed book information.

Information data | Details
Collection list information | Title, Author, Publisher, Year of publication, ISBN, Set ISBN, Additional code, Subject classification number, Number of books, Number of loans, Registration date, etc.
Loan history information | User key, Book key, KDC, Title, Author, Publisher, Loan date, Return date, Media type, etc.
Member information | User key, Registration date, Date of birth, Gender, Zipcode, Number of loans, Number of overdue, etc.
Detailed book information | Title, KDC, Date of publication, Author, Publisher, Book Image URL, Book Outline, etc.

As shown in FIG. 2, the data set building unit 120 may include:

a book introduction organizing unit 122 that merges the book introductions of books with the same title in the detailed book information and deletes the detailed book information of books whose introduction is blank;

a member information organizing unit 121 that joins the loan history information and the member information on the user number (User key) and generates an age group variable, which is categorical data, from the date of birth, which is discrete data;

a loan information data set generating unit 123 that merges the collection list information, the loan history information, and the member information to generate a loan information data set; and

an integrated data set building unit that combines the loan information data set with the detailed book information to build the integrated data set.

When the detailed book information contains books whose introduction is blank or books that share the same title, the book introduction organizing unit 122 may use R to remove the rows with blank introductions and merge the introductions of books with the same title. R is a programming language specialized for statistical computing and data analysis.
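The source performs this step in R; the following is a Python sketch of the same idea, given here only to make the logic concrete. The record layout (title, introduction pairs), the function name, and the choice to join merged introductions with a space are assumptions.

```python
def tidy_introductions(records):
    """Drop blank introductions and merge introductions that share a title.

    `records` is a list of (title, introduction) pairs; the introduction
    may be None or an empty string.
    """
    merged = {}
    for title, intro in records:
        intro = (intro or "").strip()
        if not intro:
            continue  # blank introduction: the row is removed
        parts = merged.setdefault(title, [])
        if intro not in parts:  # avoid repeating identical text
            parts.append(intro)
    return {title: " ".join(parts) for title, parts in merged.items()}


records = [
    ("Book A", "Intro one."),
    ("Book A", "Intro two."),
    ("Book B", ""),
    ("Book C", None),
    ("Book D", "Only intro."),
]
tidy = tidy_introductions(records)
```

Titles whose every record has a blank introduction disappear from the result, which mirrors the deletion of detailed book information for such books.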

The member information organizing unit 121 may join the loan history information and the member information on the user number (User key) in order to recommend books based on popular categories by gender or age group. The member information organizing unit 121 may remove all data other than the user number, registration date, date of birth, and gender, which are needed for book recommendation. The user number (User key) may be a unique identification code or ID assigned to each user.

To remove outliers and missing values, the member information organizing unit 121 may remove member information whose date of birth is later than a set date of birth. For example, with the set date of birth of December 31, 2019, member information for members born in 2020 or later and member records with missing values may be removed in a batch.

The member information organizing unit 121 may convert the discrete date of birth data into categorical data to generate the age group variable. For example, the age group variable may take categorical values such as minor (19 years old or younger), 20s, 30s, and so on.
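The cutoff filter and the age group conversion can be sketched together as below. The reference date used to compute age, the dictionary keys, and the label strings are assumptions for illustration; the source only names the categories minor (19 or younger), 20s, 30s, and so on.

```python
from datetime import date


def age_group(birth_date, today):
    """Map a date of birth to a categorical age group label."""
    # Subtract one if this year's birthday has not happened yet.
    age = today.year - birth_date.year - (
        (today.month, today.day) < (birth_date.month, birth_date.day)
    )
    if age <= 19:
        return "minor"
    return f"{age // 10 * 10}s"


def clean_members(members, cutoff=date(2019, 12, 31), today=date(2021, 7, 30)):
    """Drop records with a missing or post-cutoff birth date, add age_group.

    `today` is a hypothetical reference date chosen for the example.
    """
    cleaned = []
    for m in members:
        born = m.get("birth_date")
        if born is None or born > cutoff:
            continue  # missing value or outlier
        cleaned.append({**m, "age_group": age_group(born, today)})
    return cleaned


members = [
    {"user_key": 1, "birth_date": date(1995, 5, 1)},
    {"user_key": 2, "birth_date": date(2020, 3, 1)},
    {"user_key": 3, "birth_date": None},
    {"user_key": 4, "birth_date": date(2005, 1, 1)},
]
cleaned = clean_members(members)
```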

The loan information data set generating unit 123 may merge the collection list information, the loan history information, and the member information on the book title, and add ISBN information to the loan information data set.

The integrated data set building unit may use the ISBN as the key when combining the loan information data set with the detailed book information. When an ISBN in the loan information data set has 10 digits, it may be converted to 13 digits before the data set is merged with the detailed book information on the ISBN.
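The 10-to-13-digit conversion follows the standard ISBN rule: prefix the first nine digits with 978 and recompute the check digit with alternating weights 1 and 3. The source does not spell out the procedure, so the sketch below is one way to do it; the function name is hypothetical.

```python
def isbn10_to_isbn13(isbn10):
    """Convert an ISBN-10 to ISBN-13 using the 978 prefix."""
    digits = isbn10.replace("-", "").strip()
    if len(digits) != 10:
        raise ValueError("expected a 10-character ISBN")
    # Drop the old check character and prepend the 978 prefix.
    core = "978" + digits[:9]
    # ISBN-13 check digit: weights alternate 1, 3, 1, 3, ...
    total = sum(int(ch) * (1 if i % 2 == 0 else 3) for i, ch in enumerate(core))
    check = (10 - total % 10) % 10
    return core + str(check)


converted = isbn10_to_isbn13("0-306-40615-2")
```

Normalizing both sides to 13 digits before the merge avoids losing matches between records that store the same book under different ISBN lengths.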

When an ISBN in the loan information data set is an outlier or is missing, the SequenceMatcher class may be used to calculate the title similarity between the book title in the loan information data set and the book title in the detailed book information, and the loan information data set and the detailed book information may be combined when the title similarity exceeds 90%. SequenceMatcher is part of the Python standard library and quantifies the similarity between two strings. When the title similarity is 90% or less, the data may be deleted.
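The title-based fallback can be sketched with `difflib.SequenceMatcher`, whose `ratio()` method returns a similarity between 0 and 1. The helper names and the strategy of taking the single best-scoring candidate are assumptions; the source specifies only the 90% threshold.

```python
from difflib import SequenceMatcher


def title_similarity(a, b):
    """Similarity of two titles in [0, 1] via SequenceMatcher.ratio()."""
    return SequenceMatcher(None, a, b).ratio()


def match_by_title(loan_title, detail_titles, threshold=0.9):
    """Return the best-matching detail title, or None if none exceeds the threshold."""
    best = max(
        detail_titles,
        key=lambda t: title_similarity(loan_title, t),
        default=None,
    )
    if best is not None and title_similarity(loan_title, best) > threshold:
        return best
    return None  # similarity of 90% or less: the record is dropped


matched = match_by_title(
    "The Old Man and the Sea",
    ["The Old Man and the Sea.", "Moby Dick"],
)
unmatched = match_by_title("Dune", ["Moby Dick"])
```

The strict `>` comparison follows the text, which merges only when similarity exceeds 90%.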

The integrated data set building unit may build the integrated data set by adding the book introduction from the detailed book information to the loan information data set.

The text mining unit 130 may include:

a text cleansing unit that removes numbers and special characters from the text data in the detailed book information;

a noun extraction unit that extracts nouns from the text data in the detailed book information;

a stopword removal unit that removes stopwords from the text data in the detailed book information; and

a tokenization unit that splits the Korean text processed by the text cleansing unit, the noun extraction unit, and the stopword removal unit into morphemes (tokenization).

The tokenization unit may use the KoNLPy library and a supplementary stopword dictionary to split Korean text into morphemes (tokenization), and may extract only the words whose part of speech is a noun.
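The cleansing and stopword removal steps can be sketched with the standard library alone. Noun extraction is left as a pluggable callable because it needs a morphological analyzer; in practice a KoNLPy tagger's `nouns` method could be passed in, but that is not executed here, and the whitespace split used as the default is only a stand-in. All function names are hypothetical.

```python
import re


def cleanse(text):
    """Replace digits and special characters with spaces, keep letters."""
    text = re.sub(r"[\d_]", " ", text)      # digits and underscores
    text = re.sub(r"[^\w\s]", " ", text)    # punctuation and symbols
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text, extract_nouns=str.split, stopwords=frozenset()):
    """Cleanse the text, extract candidate tokens, then drop stopwords.

    `extract_nouns` defaults to whitespace splitting as a placeholder for
    a real noun extractor (an assumption made for this sketch).
    """
    tokens = extract_nouns(cleanse(text))
    return [t for t in tokens if t not in stopwords]


# Hypothetical book introduction and supplementary stopwords.
tokens = preprocess("제2판! 여행 & 경제 (2021)", stopwords={"제", "판"})
```

Python's `\w` matches Unicode letters, so Hangul survives the cleansing step while punctuation and digits are removed.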

The supplementary stopword dictionary is generated by the tokenization unit. It may be provided because applying only the stopword dictionary supplied with the KoNLPy library is not sufficient to extract only meaningful words. Specifically, the supplementary stopword dictionary may be generated as follows.

After the keywords within each topic are identified using LDA-based topic modeling, the keywords that are unnecessary for classifying book categories are selected as stopwords, entered into a text file, and saved. The stopwords in the text file are then removed from the full set of words, topic modeling is applied again to identify the keywords within each topic, and any newly found stopwords are added to the text file. This process is repeated until no further stopwords are identified.
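The repeat-until-stable procedure above can be sketched as a simple loop. This is only a structural sketch: `topic_keywords` is a hypothetical stand-in for the LDA keyword step, `is_unnecessary` stands in for the judgment about which keywords do not help category classification, and saving to a text file is omitted. The frequency-based `top_keywords` helper is a placeholder for demonstration and is not LDA.

```python
from collections import Counter


def build_stopwords(documents, topic_keywords, is_unnecessary, max_rounds=50):
    """Grow a stopword set until a round finds no new stopwords.

    documents: list of token lists.
    topic_keywords: callable(list of token lists) -> iterable of keywords
        (hypothetical; the text uses LDA-based topic modeling here).
    is_unnecessary: callable(word) -> bool (hypothetical review step).
    """
    stopwords = set()
    for _ in range(max_rounds):
        # Remove the stopwords found so far from every document.
        filtered = [[w for w in doc if w not in stopwords] for doc in documents]
        new_words = {w for w in topic_keywords(filtered) if is_unnecessary(w)}
        new_words -= stopwords
        if not new_words:
            break  # no further stopwords identified
        stopwords |= new_words
    return stopwords


def top_keywords(docs, k=2):
    """Placeholder keyword finder: the k most frequent words overall."""
    counts = Counter(word for doc in docs for word in doc)
    return [word for word, _ in counts.most_common(k)]
```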

Table 2 below shows the functions of the data set building unit (120) and the text mining unit (130).

| Component | Function |
| --- | --- |
| Data set building unit | Removal of missing values and outliers, conversion of data types, addition of required data, calculation of similarity between strings, sequential combination of the 4 data sources |
| Text mining unit | Text cleansing, tokenization, noun extraction, stopword removal (286 cases) |

The text data in the detailed book information may be the book name or the book introduction.

The topic clustering unit (140) may perform spherical K-means clustering using the nouns extracted by word tokenization from the book names and book introductions of the plurality of books.

To classify the books in a practical way, the topic clustering unit (140) may classify all collected books into clusters using spherical K-means clustering and LDA-based topic modeling, and may define the topic that each cluster represents.

Spherical K-means clustering is a machine learning technique suited to cluster analysis of high-dimensional data such as document collections. LDA (Latent Dirichlet Allocation) is a probabilistic topic modeling technique that describes which topics are present in each document of a given document collection.

For example, all books may be classified into 20 clusters by performing spherical K-means clustering on the nouns extracted by word tokenization from the book names and book introductions. Because too many topics make interpretation difficult, 20 may be an appropriate number.
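For illustration, spherical K-means can be sketched in plain Python: every vector is scaled to unit length, each point is assigned to the centroid with the largest cosine similarity, and each centroid is the normalized sum of its members. This is a minimal sketch, not the patented implementation. The function names are hypothetical, and initializing the centroids from the first k points is an assumption made to keep the example deterministic. The text uses k = 20 on book vectors; the toy usage below uses k = 2.

```python
import math


def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=20):
    """Cluster vectors by cosine similarity to unit-length centroids.

    Requires len(vectors) >= k. Returns (labels, centroids).
    """
    points = [normalize(v) for v in vectors]
    centroids = [points[i] for i in range(k)]  # naive init (assumption)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: on the unit sphere, the dot product is the cosine.
        labels = [max(range(k), key=lambda c: dot(p, centroids[c])) for p in points]
        # Update step: sum the members, then project back onto the sphere.
        new_centroids = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                summed = [sum(column) for column in zip(*members)]
                new_centroids.append(normalize(summed))
            else:
                new_centroids.append(centroids[c])  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return labels, centroids
```

Example usage: `spherical_kmeans([[1, 0], [0, 1], [2, 0.1], [0.1, 3]], k=2)` groups the first and third vectors together and the second and fourth together, because only direction matters, not length.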

The topic clustering unit (140) may apply LDA-based topic modeling to the plurality of clusters and assign to each cluster the unique topic that it contains.

For example, the unique topics may be literature/travel, social issues, psychology/politics, religion/philosophy, health/cooking, diplomacy/world politics, youth education, hobby/lifestyle/IT, foreign languages, fashion/design, medicine/world novels, self-improvement/startup, novel, world history/history, culture/art, education, prose/poetry, science, comics/entertainment, and Korean essays.

The topic clustering unit (140) may assign the unique topic to each of the plurality of books and add it to the integrated data set.

The popular category extraction unit (150) may extract popular categories based on the member information. From among the unique topics generated by the topic clustering unit (140), the popular category extraction unit (150) may extract popular categories based on the number of book loans in the member information. The popular category extraction unit (150) may extract a plurality of popular categories.

The popular category extraction unit (150) may generate a plurality of member types and extract popular categories separately for each member type. That is, it may classify the integrated data set into a plurality of member types based on the member information and extract popular categories independently for each of them. The member types may be defined by gender or age group. For example, with two gender types (male and female) and, for each gender, six age groups (teens and younger, 20s, 30s, 40s, 50s, and 60s and older), the integrated data set may be classified into a total of 12 member types, and the top 3 popular categories may be extracted for each member type. Table 3 below shows an example of popular categories extracted for each member type.

| Gender | Age group | Popular category (1st) | Popular category (2nd) | Popular category (3rd) |
| --- | --- | --- | --- | --- |
| Male | 0-19 | Youth education | Comics/Entertainment | Science |
| Male | 20-29 | Self-improvement/Startup | Literature/Travel | Hobby/Lifestyle/IT |
| Male | 30-39 | Self-improvement/Startup | Literature/Travel | Hobby/Lifestyle/IT |
| Male | 40-49 | Self-improvement/Startup | Youth education | Comics/Entertainment |
| Male | 50-59 | Literature/Travel | Self-improvement/Startup | Hobby/Lifestyle/IT |
| Male | 60+ | Literature/Travel | Self-improvement/Startup | World history/History |
| Female | 0-19 | Youth education | Comics/Entertainment | Science |
| Female | 20-29 | Literature/Travel | Self-improvement/Startup | Novel |
| Female | 30-39 | Youth education | Literature/Travel | Comics/Entertainment |
| Female | 40-49 | Literature/Travel | Youth education | Comics/Entertainment |
| Female | 50-59 | Literature/Travel | Self-improvement/Startup | Novel |
| Female | 60+ | Literature/Travel | Comics/Entertainment | Youth education |
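The per-member-type extraction can be sketched by grouping loan records by gender and age band and counting loans per topic. This is a hedged illustration: the record fields (`gender`, `age`, `topic`) and function names are hypothetical, and ranking by raw loan count is an assumption based on the statement that popular categories are derived from the number of book loans.

```python
from collections import Counter, defaultdict


def age_band(age):
    """Map an age to one of the six age groups used in the example."""
    if age < 20:
        return "0-19"
    if age >= 60:
        return "60+"
    low = age // 10 * 10
    return f"{low}-{low + 9}"


def popular_categories(loans, top_n=3):
    """Return the top_n most-borrowed topics for each (gender, age band).

    loans: iterable of dicts with hypothetical keys "gender", "age", "topic",
    one dict per loan.
    """
    counts = defaultdict(Counter)
    for loan in loans:
        member_type = (loan["gender"], age_band(loan["age"]))
        counts[member_type][loan["topic"]] += 1
    return {
        member_type: [topic for topic, _ in counter.most_common(top_n)]
        for member_type, counter in counts.items()
    }
```

With 2 genders and 6 age bands this yields up to 12 member types, each with its own top 3 list.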

The extracted popular categories may serve as the calculation units for computing the TF-IDF values described below. That is, the TF-IDF values may be calculated with each popular category as a calculation unit. Specifically, the books may be grouped by popular category and the TF-IDF values calculated group by group. For example, if youth education and hobby/lifestyle/IT are extracted as popular categories, with 30 books belonging to youth education and 20 books belonging to hobby/lifestyle/IT, the TF-IDF values for each of the 30 youth education books may be calculated with those 30 books as the calculation unit, and the TF-IDF values for each of the 20 hobby/lifestyle/IT books may be calculated with those 20 books as the calculation unit. When popular categories are extracted per member type, each popular category of each member type may be a calculation unit. For example, in Table 3 above, 3 popular categories were extracted for each of the 12 member types, so the number of calculation units may be 36.

The TF-IDF vectorization unit (160) may calculate the TF-IDF values by Equation 1 below.

[Equation 1]
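The equation body is not reproduced in this text. A form consistent with the symbol definitions given in the claims (W_{i,j}, tf_{i,j}, df_i, and N) is the standard TF-IDF weighting, shown here as an assumption rather than as the exact published equation:

```latex
W_{i,j} = tf_{i,j} \times \log\!\left(\frac{N}{df_i}\right)
```

Here W_{i,j} is the TF-IDF value for word i in book j, tf_{i,j} is the frequency of word i in the analysis area of book j, df_i is the number of books containing word i, and N is the number of books in the calculation unit.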

Word i may be any word appearing in the analysis area of the books belonging to the calculation unit. Word i may be a noun from which stopwords have been removed. For example, if the words a, b, c, and d appear in the analysis area of one book and the words b, c, e, and f appear in the analysis area of another book in the same calculation unit, the words i may include a, b, c, d, e, and f. The analysis area may be the entire book, or may be restricted to a part such as the book name or the book introduction. If there are n distinct words i in a calculation unit, each book in that calculation unit may be represented as a vector with n components, and each component of a book's vector may be a TF-IDF value. In other words, each book in the calculation unit may have n TF-IDF values. Therefore, if one calculation unit (one popular category) contains N books and n distinct words appear in the analysis areas of those N books, N×n TF-IDF values may be calculated for that calculation unit.
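The vector construction above can be sketched for a single calculation unit. This is an illustrative sketch, assuming the unsmoothed weighting tf × log(N / df) with a natural logarithm, which matches the symbol definitions in the claims but is not confirmed as the exact form used. The function name and input layout are hypothetical.

```python
import math
from collections import Counter


def tfidf_vectors(books):
    """Compute TF-IDF vectors for the books of one calculation unit.

    books: dict mapping a book id to its list of tokens (nouns with
    stopwords already removed). Returns (vocab, vectors), where every
    vector has one component per distinct word in the unit.
    """
    vocab = sorted({word for tokens in books.values() for word in tokens})
    n_books = len(books)
    # df[w]: number of books in the unit whose analysis area contains w.
    df = Counter(word for tokens in books.values() for word in set(tokens))
    vectors = {}
    for book_id, tokens in books.items():
        tf = Counter(tokens)
        vectors[book_id] = [tf[w] * math.log(n_books / df[w]) for w in vocab]
    return vocab, vectors
```

With the two books from the example (`a b c d` and `b c e f`), the vocabulary has n = 6 words and each of the N = 2 books gets a 6-component vector, for 12 values in total. Under this unsmoothed form, a word that appears in every book of the unit, such as b or c here, receives a weight of 0.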

Feature vectorization is a way of converting text into numeric features with vector values. Among such methods, TF-IDF (Term Frequency-Inverse Document Frequency) calculates the importance of each word within a specific document in a collection of multiple documents. TF-IDF has the advantage of reflecting the characteristics of each word well: it gives more weight to a word the more frequently it appears in a specific document, and it penalizes words that appear in many documents.

The cosine similarity calculation unit (170) may calculate the cosine similarity by Equation 2 below.

[Equation 2]
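The equation body is not reproduced in this text. Using the symbols defined in the claims (A_i and B_i as the TF-IDF values for word i of an unborrowed book and a borrowed book, and n as the number of distinct words), the usual definition of cosine similarity would read as follows; it is given here as a reconstruction under that assumption:

```latex
\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^{2}}\;\sqrt{\sum_{i=1}^{n} B_i^{2}}}
```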

Cosine similarity is a measure of the similarity between two vectors, given by the cosine of the angle between them. It takes the value 1 when the angle between the two vectors is 0°, 0 when the angle is 90°, and -1 when the angle is 180°.

Because cosine similarity is not restricted by dimensionality, it has the advantage of being well suited to measuring the similarity between vectors in a multidimensional space.

The popular category extraction unit (150) may extract a plurality of popular categories. Specifically, it may extract a plurality of popular categories for each member type.

The unborrowed book recommendation unit (180) may rank the books in the unborrowed book list based on the average cosine similarity value, separately for each of the plurality of popular categories. That is, for each unborrowed book, the average of the cosine similarities between that book and all borrowed books may be calculated, and the averages may be sorted in descending order to determine the recommendation rank of each unborrowed book within each popular category.
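The ranking step can be sketched as follows for one popular category. This is a minimal sketch with hypothetical function names and input layout; treating a zero vector as having similarity 0 is an assumption, since the text does not cover that case.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a zero vector is similar to nothing
    return dot / (norm_a * norm_b)


def rank_unborrowed(unborrowed, borrowed):
    """Rank unborrowed books by mean cosine similarity to all borrowed books.

    unborrowed, borrowed: dicts mapping book id to its TF-IDF vector, all
    computed within the same calculation unit. borrowed must not be empty.
    Returns a list of (book id, mean similarity), highest first.
    """
    scores = {
        book_id: sum(cosine_similarity(vec, b) for b in borrowed.values()) / len(borrowed)
        for book_id, vec in unborrowed.items()
    }
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

Running `rank_unborrowed` once per popular category, and per member type where those are used, gives a separate recommendation list for each calculation unit.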

Table 4 below shows an example of unborrowed book recommendations calculated based on the average cosine similarity value.

As shown in FIG. 3, the unborrowed book recommendation system (100) of the present invention may recommend unborrowed books. The unborrowed book recommendation unit (180) may send unborrowed book recommendation information to the user terminal (13), as in the screen shown in FIG. 3.

Embodiments according to the present invention have been described above, but they are merely illustrative, and those of ordinary skill in the art will understand that various modifications and equivalent embodiments are possible. Therefore, the true scope of technical protection of the present invention should be defined by the following claims.

11 ... book loan system
13 ... user terminal
100 ... unborrowed book recommendation system
110 ... data storage unit
120 ... data set building unit
121 ... member information organizing unit
122 ... book introduction organizing unit
123 ... loan information data set generating unit
124 ... integrated data set generating unit
130 ... text mining unit
140 ... topic clustering unit
150 ... popular category extraction unit
160 ... TF-IDF vectorization unit
170 ... cosine similarity calculation unit
180 ... unborrowed book recommendation unit

Claims

An unborrowed book recommendation system comprising:
a data storage unit that stores collection list information, loan history information, member information, and detailed book information of a book loan system in which a plurality of books and members are registered;
a data set building unit that builds an integrated data set from the collection list information, the loan history information, the member information, and the detailed book information;
a text mining unit that performs text mining on the text data in the integrated data set;
a topic clustering unit that, taking the data text-mined by the text mining unit as input, classifies the plurality of books into a plurality of clusters by spherical K-means clustering and latent Dirichlet allocation (LDA)-based topic modeling, and assigns a unique topic to each of the plurality of clusters;
a popular category extraction unit that classifies the plurality of books into a borrowed book list and an unborrowed book list based on the integrated data set, and extracts the top-ranked unique topic in the borrowed book list as a popular category;
a TF-IDF vectorization unit that calculates TF-IDF values for each of the plurality of books in order to vectorize the plurality of books as features based on term frequency-inverse document frequency (TF-IDF);
a cosine similarity calculation unit that calculates cosine similarity values using the TF-IDF values of the books in the unborrowed book list and the TF-IDF values of the books in the borrowed book list; and
an unborrowed book recommendation unit that recommends unborrowed books based on, for each book in the unborrowed book list, the average of its cosine similarity values with the plurality of books in the borrowed book list.

The unborrowed book recommendation system according to claim 1, wherein:
the collection list information includes at least one of book name, author, publisher, year of publication, international standard book number (ISBN), set ISBN, additional code, subject classification number, number of books, number of loans, and book registration date;
the loan history information includes, for each loan, at least one of user number, book number, KDC (Korean Decimal Classification), book name, author, publisher, loan date, return date, and media type;
the member information includes at least one of user number, member registration date, date of birth, gender, postal code, number of book loans, and number of overdue returns; and
the detailed book information includes at least one of book name, KDC, publication date, author, publisher, book image URL, and book introduction.

The unborrowed book recommendation system according to claim 2, wherein the data set building unit comprises:
a book introduction organizing unit that merges the book introductions of books having the same book name in the detailed book information and deletes the detailed book information of books whose book introduction is blank;
a member information organizing unit that combines the loan history information and the member information on the user number (user key) and generates an age group variable, which is categorical data, from the date of birth, which is discrete data;
a loan information data set generating unit that generates a loan information data set by merging the collection list information, the loan history information, and the member information; and
an integrated data set building unit that builds the integrated data set by integrating the loan information data set and the detailed book information.

The unborrowed book recommendation system according to claim 1, wherein the text mining unit comprises:
a text cleansing unit that removes numbers or special characters from the text data in the detailed book information;
a noun extraction unit that extracts nouns from the text data in the detailed book information;
a stopword removal unit that removes stopwords from the text data in the detailed book information; and
a tokenization unit that tokenizes the text data processed by the text cleansing unit, the noun extraction unit, and the stopword removal unit by splitting the Korean text into morphemes.

The unborrowed book recommendation system according to claim 4, wherein the text data in the detailed book information is the book name or the book introduction.

The unborrowed book recommendation system according to claim 1, wherein the topic clustering unit performs spherical K-means clustering using the nouns extracted by word tokenization from the book names and book introductions of the plurality of books.

The unborrowed book recommendation system according to claim 1, wherein the popular category extraction unit classifies the integrated data set into a plurality of member types based on the member information, and extracts a popular category independently for each of the plurality of member types.

The unborrowed book recommendation system according to claim 7, wherein the plurality of member types are defined by gender or age group.

The unborrowed book recommendation system according to claim 1, wherein the popular category extraction unit extracts a plurality of popular categories, and the TF-IDF values are calculated with each popular category as a calculation unit.

An unborrowed book recommendation system in which the TF-IDF vectorization unit calculates the TF-IDF value by Equation 1 below:
[Equation 1]

where W_{i,j} is the TF-IDF value for word i in book j, tf_{i,j} is the frequency of word i in the analysis area specified for book j, df_i is the number of books containing word i, and N is the number of books belonging to the calculation unit.

The unborrowed book recommendation system according to claim 1, wherein the cosine similarity calculation unit calculates the cosine similarity by Equation 2 below:
[Equation 2]

where cos(θ) is the cosine similarity value between a book in the unborrowed book list and a book in the borrowed book list, A_i is the TF-IDF value for word i of the book in the unborrowed book list, B_i is the TF-IDF value for word i of the book in the borrowed book list, and n is the number of distinct words i.