KR20220075490A

KR20220075490A - Learning content recommendation method

Info

Publication number: KR20220075490A
Application number: KR1020200163605A
Authority: KR
Inventors: 허현범
Original assignee: 허현범
Priority date: 2020-11-30
Filing date: 2020-11-30
Publication date: 2022-06-08
Also published as: KR102623256B1; KR20230169907A

Abstract

A learning content recommendation method is disclosed. The method includes: collecting digital content online; applying the collected digital content to a classification model to generate a category list, and classifying each item of digital content according to the generated category list; analyzing the digital content to extract words and generate a basic word dictionary, generating a dictionary for each learning level from the basic word dictionary, and determining a learning order for the digital content by deriving the coverage ratio of each learning-level dictionary over the digital content; and recommending learning content in consideration of the user's learning results and the learning order of the digital content.

Description

Learning content recommendation method

The present invention relates to a method for recommending learning content.

Basic foreign language ability is required in many job-related situations in working life, yet many workers, despite excellent job skills, cannot carry out their work properly because they lack basic foreign language skills. The vocational competencies required by the National Competency Standards (NCS) comprise job performance competency and basic vocational competency. Basic foreign language ability, one of the basic vocational competencies, is a core ability that all workers should have in common and is needed to perform duties successfully in most occupations.

As foreign language ability has become more important, a variety of content for learning foreign languages has been provided. However, most conventional learning content takes the form of worksheets and vocabulary lists that learners are expected to study on their own.

When learners encounter words that are difficult to memorize, their interest drops, and the continuing difficulty of memorization lowers their achievement as well, which ultimately reduces their motivation to learn.

Korean Patent Laid-Open Publication No. 10-2018-0000444

An object of the present invention is to provide a learning content recommendation method that can process digital content available online into learning content and provide it.

Another object of the present invention is to provide a learning content recommendation method that can recommend content suited to the user's learning level and content studied by similar learners.

Another object of the present invention is to provide a learning content recommendation method that can help improve learning ability by providing content close to the user's learning level, separated into difficult content and easy content.

According to one aspect of the present invention, there is provided a learning content recommendation method that can process digital content available online into learning content and provide it.

By providing the learning content recommendation method according to an embodiment of the present invention, digital content available online can be processed into learning content and provided.

In addition, the present invention can recommend content based on the user's learning level, the user's learning goals or preferences, and content studied by similar learners.

In addition, the present invention can help improve learning ability by providing content close to the user's learning level, separated into difficult content and easy content.

FIG. 1 is a diagram schematically showing the configuration of a content recommendation system for language learning according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of constructing a learning content database according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating sample data for category classification according to an embodiment of the present invention.
FIG. 4 is a flowchart illustrating a method of recommending learning content according to an embodiment of the present invention.
FIG. 5 is a diagram illustrating a diagnostic test set according to an embodiment of the present invention.
FIG. 6 is a diagram illustrating recommended content for each category according to an embodiment of the present invention.
FIG. 7 is a diagram illustrating recommended content for all documents according to an embodiment of the present invention.
FIG. 8 is a diagram illustrating the result of sorting the recommendation order by reflecting weights according to an embodiment of the present invention.
FIG. 9 is a flowchart illustrating a content recommendation method according to another embodiment of the present invention.
FIG. 10 is a diagram illustrating data for selecting similar learners according to an embodiment of the present invention.
FIG. 11 is a diagram illustrating the result of selecting similar learners according to an embodiment of the present invention.
FIG. 12 is a diagram illustrating a list of content studied by the selected learners according to an embodiment of the present invention.
FIG. 13 is a diagram illustrating the result of sorting by reflecting weights according to an embodiment of the present invention.
FIG. 14 is a flowchart illustrating a content recommendation method according to still another embodiment of the present invention.
FIG. 15 is a diagram illustrating a similar content recommendation result according to still another embodiment of the present invention.
FIG. 16 is a diagram illustrating the result of classifying easy learning content reflecting coverage according to still another embodiment of the present invention.
FIG. 17 is a diagram illustrating the result of classifying difficult learning content reflecting coverage according to still another embodiment of the present invention.
FIG. 18 is a flowchart illustrating a content recommendation method according to an embodiment of the present invention.

Since the present invention may be modified in various ways and may have various embodiments, specific embodiments are illustrated in the drawings and described in detail in the detailed description. However, this is not intended to limit the present invention to specific embodiments, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope.

In describing the present invention, a detailed description of related known technology is omitted where it is judged that it may unnecessarily obscure the gist of the invention. In addition, ordinal terms used in this specification (for example, first, second, and so on) are merely identifiers for distinguishing one component from another.

Throughout the specification, when one component is referred to as being "connected" or "coupled" to another component, the component may be directly connected or coupled to the other component, but, unless specifically stated otherwise, it may also be connected or coupled via a further intervening component. Throughout the specification, when a part is said to "include" a component, this means that it may further include other components rather than excluding them, unless specifically stated otherwise. In addition, terms such as "unit" and "module" used in the specification refer to a unit that processes at least one function or operation, which may be implemented in hardware, in software, or in a combination of hardware and software.

Hereinafter, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a diagram schematically illustrating the configuration of a content recommendation system for language learning according to an embodiment of the present invention.

Referring to FIG. 1, a content recommendation system 100 for language learning according to an embodiment of the present invention includes a user terminal 110 and a server 120.

The user terminal 110 is a device carried by the user that receives content for language learning. The user terminal 110 may be, for example, an electronic device with a communication function, such as a mobile communication terminal, a tablet PC, or a notebook computer.

The user terminal 110 receives content for language learning from the server 120 and uses it for study and review.

In addition, the user terminal 110 may receive various learning content from the server 120 based on the user's learning preferences and learning information.

The server 120 is a device that processes various digital content on the Internet and recommends and provides learning content based on the user's learning preferences and learning information.

According to an embodiment of the present invention, the server 120 may collect digital content from various fields online and then analyze and process it to generate learning content. Then, in response to access by the user terminal 110, the server 120 may recommend and provide learning content, selected from the generated learning content, based on the user's learning preferences and learning information.

This will be understood more clearly from the description below.

FIG. 2 is a flowchart illustrating a method of constructing a learning content database according to an embodiment of the present invention, and FIG. 3 is a diagram illustrating sample data for category classification according to an embodiment of the present invention.

In step 210, the server 120 collects digital content online.

The server 120 collects various digital content published online. Here, the digital content may be any type of content posted online, such as news articles, papers, and blog posts.

In step 215, the server 120 extracts metadata from the digital content. Here, the metadata may include the title, source, creation date, total length, number of sentences, average sentence length, number of words, core word set, non-core word set, category, content summary, content level, level coverage, and the like.

In step 220, the server 120 classifies the categories of the digital content.

This will be described in more detail.

For example, the server 120 may classify the category of digital content using a machine learning technique. More specifically, the category may be classified using classification, machine learning, or deep learning techniques such as RoBERTa, GPT-3, random forest, Bayesian classification, SVM, LSTM, RNN, CNN, Transformer, and BERT.

Hereinafter, for ease of understanding and explanation, a method of classifying the category of digital content using a Bayesian classification model (Bayesian classifier) will be described.

The Bayesian classification model is a text classification model that uses Bayes' theorem to compute conditional probabilities. Bayes' theorem can be summarized briefly as follows.

P(A) denotes the probability that A occurs, P(B) denotes the probability that B occurs, P(B|A) denotes the probability that B occurs given that A has occurred, and P(A|B) denotes the probability that A occurs given that B has occurred.

P(A|B) can be expressed as Equation 1 below.

\[ P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} \quad \text{(Equation 1)} \]

Applying Bayes' theorem to the category classification of digital content gives the following.

For example, P(science | input document) may be defined as the probability that the input document is classified as science.

If the document is "This is science news", then by Bayes' theorem P(science | document) = P(document | science) P(science) / P(document).
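As a minimal sketch of how Bayes' theorem can drive category classification, the Python below trains a multinomial naive Bayes model on a toy corpus. The word-independence (naive) assumption, the add-one smoothing, the training sentences, and the function names `train` and `classify` are illustrative assumptions and are not taken from the specification.

```python
import math
from collections import Counter, defaultdict

def train(samples):
    """samples: list of (category, text). Returns document counts per category,
    word counts per category, and the vocabulary."""
    doc_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for category, text in samples:
        doc_counts[category] += 1
        for word in text.lower().split():
            word_counts[category][word] += 1
            vocab.add(word)
    return doc_counts, word_counts, vocab

def classify(text, doc_counts, word_counts, vocab):
    """Return the category maximizing log P(category) + sum of log P(word | category)."""
    total_docs = sum(doc_counts.values())
    best_category, best_score = None, -math.inf
    for category in doc_counts:
        score = math.log(doc_counts[category] / total_docs)
        total_words = sum(word_counts[category].values())
        for word in text.lower().split():
            # Add-one smoothing (an assumption) avoids zero probability for unseen words.
            numerator = word_counts[category][word] + 1
            score += math.log(numerator / (total_words + len(vocab)))
        if score > best_score:
            best_category, best_score = category, score
    return best_category

# Hypothetical toy training data; a real system would use the labeled sample set.
samples = [
    ("science", "researchers publish new physics experiment"),
    ("science", "science team studies distant planet"),
    ("sports", "team wins the final match"),
    ("sports", "player scores in the match"),
]
model = train(samples)
print(classify("science news about a new experiment", *model))
```

The denominator P(document) is the same for every category, so the sketch omits it and compares only the numerators in log space.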

For example, a category list can be derived from a sample data set of about 22,000 items, each labeled with a category. An example follows.

For example, the category list may be: business, technology, science, art, sports, world, politics, life, education, environment.

The category list above is only an example, and the category list may of course differ. The sample data set may also be used as a test set, by inferring categories for its items.

The sample data set is as shown in FIG. 3. Even when a data set with pre-assigned categories, such as the sample data set, is used, the final category classification produced by the classification model described above may differ.

In step 225, the server 120 classifies each item of digital content based on the category classification result. That is, after the category list is constructed from the analysis of the collected digital content, each item of digital content may be classified according to that category list.

To this end, the server 120 may vectorize each item of digital content (for example, a document). For example, a document may be vectorized using Word2Vec, count vectorization, TF-IDF, or the like. This embodiment is described mainly in terms of count vectorization. However, the invention is not limited to count vectorization, and other known techniques for vectorizing data may of course be applied.

The description below assumes a sentence such as "They refuse to permit us to obtain the refuse permit".

The server 120 may extract each word from the sentence using part-of-speech tagging. Applying part-of-speech tagging to "They refuse to permit us to obtain the refuse permit" yields ('They', 'PRP'), ('refuse', 'VBP'), ('to', 'TO'), ('permit', 'VB'), ('us', 'PRP'), ('to', 'TO'), ('obtain', 'VB'), ('the', 'DT'), ('refuse', 'NN'), ('permit', 'NN').

After the words are separated by part-of-speech tagging in this way, extracting only the words gives the following.

they, refuse, to, permit, us, to, obtain, the, refuse, permit
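The word-extraction step above can be sketched as follows. The tagged pairs are hard-coded from the example in the text; in practice they would come from a part-of-speech tagger, and the function name `extract_words` is a hypothetical helper.

```python
# (token, tag) pairs copied from the example sentence in the text.
tagged = [
    ("They", "PRP"), ("refuse", "VBP"), ("to", "TO"), ("permit", "VB"),
    ("us", "PRP"), ("to", "TO"), ("obtain", "VB"), ("the", "DT"),
    ("refuse", "NN"), ("permit", "NN"),
]

def extract_words(tagged_tokens):
    """Drop the tags and lowercase each token, keeping order and repeats."""
    return [token.lower() for token, _tag in tagged_tokens]

words = extract_words(tagged)
print(", ".join(words))
```

Repeats such as "refuse" and "permit" are kept on purpose, because the later steps count how often each word occurs.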

Once the words have been extracted, the server 120 may construct a feature data set for the document using the occurrence frequency of the extracted words. That is, the server 120 may derive the occurrence frequency of each extracted word and then exclude from the feature data set any word whose frequency is below a reference value.

In other words, the feature data set may consist only of words whose occurrence frequency is at or above the reference value. This embodiment assumes that words occurring fewer than 10 times are excluded, but the reference value may of course be adjusted in an implementation.

For example, assume that the words and their occurrence frequencies are as follows.

{'hundreds': 10, 'thousands': 9, 'people': 128, 'marched': 2, 'central': 14, 'london': 15, 'calling': 6, 'referendum': 11, 'mps': 19, 'search': 2, 'way': 45, 'brexit': 37, ...}

After excluding the words that occur fewer than 10 times, the server 120 may construct the feature data set as ['hundreds', 'people', 'central', 'london', 'referendum', 'mps', 'way', 'brexit', 'campaign', 'say'].
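A minimal sketch of this frequency filter is shown below. It keeps words whose frequency is at least the threshold, which matches the example where 'hundreds' (10 occurrences) is retained and 'thousands' (9) is dropped. Only the frequencies actually listed in the text are used, so 'campaign' and 'say', which come from the elided part of the example, do not appear. The names `MIN_FREQUENCY` and `build_feature_set` are hypothetical.

```python
MIN_FREQUENCY = 10  # reference value from the example; adjustable in practice

def build_feature_set(frequencies, min_frequency=MIN_FREQUENCY):
    """Keep words whose occurrence frequency is at least min_frequency."""
    return [word for word, count in frequencies.items() if count >= min_frequency]

frequencies = {
    "hundreds": 10, "thousands": 9, "people": 128, "marched": 2,
    "central": 14, "london": 15, "calling": 6, "referendum": 11,
    "mps": 19, "search": 2, "way": 45, "brexit": 37,
}
features = build_feature_set(frequencies)
print(features)
```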

Next, the server 120 constructs a feature matrix for the feature data set. The feature matrix may consist of the occurrence frequencies, in a given document, of the words in the feature data set.

For example, assume that the feature data set is [hundred, of, thousand, for, is] and that the labeled sample is ('politics', 'Hundreds of thousands of people have marched for another E').

In this case, the feature matrix may be constructed as [1, 2, 1, 1, 0]: with plurals reduced to the singular, "hundred", "thousand", and "for" each occur once, "of" occurs twice, and "is" does not occur.
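The count-vectorization example above can be reproduced with the short sketch below. The crude singularization in `normalize` (stripping a trailing "s" from words longer than three letters) is an assumption made only so that "Hundreds" matches the feature "hundred"; the specification does not state which normalization it uses.

```python
import re
from collections import Counter

def normalize(token):
    """Lowercase and crudely singularize a token (illustrative assumption)."""
    token = token.lower()
    if len(token) > 3 and token.endswith("s"):
        token = token[:-1]
    return token

def feature_vector(sentence, feature_set):
    """Count how often each feature word occurs in the sentence."""
    tokens = [normalize(t) for t in re.findall(r"[A-Za-z]+", sentence)]
    counts = Counter(tokens)
    return [counts[feature] for feature in feature_set]

feature_set = ["hundred", "of", "thousand", "for", "is"]
sentence = "Hundreds of thousands of people have marched for another E"
print(feature_vector(sentence, feature_set))  # [1, 2, 1, 1, 0]
```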

The server 120 may classify the document into a category using the feature matrix.

In step 230, the server 120 constructs a basic dictionary.

The basic dictionary may be constructed per field (per category) or independently of category. In an embodiment of the present invention, the basic dictionary may be built by applying the formula of the n-gram model.

For ease of understanding and explanation, the n-gram model is described briefly.

P(next word | current word) denotes the probability that the next word appears after the current word. It may be computed as the number of times the next word appears immediately after the current word divided by the number of times the current word appears.
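The bigram probability described above can be sketched as follows, using the earlier example sentence as the corpus. The function name `bigram_probability` is a hypothetical helper, and no smoothing is applied.

```python
from collections import Counter

def bigram_probability(tokens, current, following):
    """P(following | current) = count(current, following) / count(current)."""
    unigram_counts = Counter(tokens)
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[current] == 0:
        return 0.0
    return bigram_counts[(current, following)] / unigram_counts[current]

tokens = "they refuse to permit us to obtain the refuse permit".split()
print(bigram_probability(tokens, "to", "permit"))  # 0.5
```

Here "to" occurs twice and is followed by "permit" once, so the estimate is 1/2.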

A method of constructing the basic dictionary is described in more detail below.

Digital content collected online may be used to build the basic dictionary. The digital content may be collected by entering keywords into a search engine or based on information in an RSS feed.

The server 120 may convert the digital content collected online to extract its text. For example, PDF provides its own text conversion function, which the server 120 can use to extract the full text of a PDF. For an image file, text may be extracted through OCR. For subtitle file types such as SMI, and for HTML, text may be extracted using regular expressions. For a Word file, text may be extracted using a conversion program.

For markup-language-based data such as SMI or HTML, the text excluding the tags may be extracted. For example, assume that the original source is "<div class="lh s"><b>Reward quality writing.</b> When you spend time reading a story, a portion of your membership fee will go directly to its author.</div>".

After removing the tags, the server 120 may extract only the text: "Reward quality writing. When you spend time reading a story, a portion of your membership fee will go directly to its author."
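A minimal sketch of regular-expression tag removal, applied to the example source above, might look like this. The pattern and the function name `strip_tags` are illustrative; a simple pattern like this does not handle comments, scripts, or character entities.

```python
import re

def strip_tags(markup):
    """Remove anything that looks like a tag, then collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", markup)
    return re.sub(r"\s+", " ", text).strip()

source = (
    '<div class="lh s"><b>Reward quality writing.</b> When you spend time '
    "reading a story, a portion of your membership fee will go directly "
    "to its author.</div>"
)
print(strip_tags(source))
```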

Next, the server 120 may extract text after removing unnecessary special characters and the like, taking the characteristics of each language into account. Take English as an example. English text consists of the letters a to z and A to Z, so all other characters may be removed.
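For English, the character filter described above can be sketched as a single substitution. Replacing each run of non-letters with one space (rather than deleting it outright) is an assumption that keeps adjacent words from being joined; `keep_english_letters` is a hypothetical name.

```python
import re

def keep_english_letters(text):
    """Replace every run of characters outside a-z and A-Z with one space."""
    return re.sub(r"[^A-Za-z]+", " ", text).strip()

sample = "Reward quality writing. When you spend time reading a story, a portion"
print(keep_english_letters(sample))
```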

After the English text has been extracted, the server 120 may extract words in a way that reflects the characteristics of English: proper nouns, comparatives, and auxiliary verbs are removed, past tense is converted to present tense, adverbs are converted to adjectives, verbs are converted to their base form, and plurals are converted to singular.

For example, if the extracted text is "Reward quality writing. When you spend time reading a story, a portion of your membership fee will go directly to its author.", the words may be extracted, taking the characteristics of the language into account, as "reward, direct, story, spend, membership, quality, to, of, a, reward, when, go, it, writing, reading, fee, portion, author, time, you".

The server 120 may organize the extracted words by recording, for each word, its occurrence frequency and the number of documents in which it appears, for example: (1) circumvallate, 335, 91; (2) circumvallate, 261, 91.

In addition, the server 120 may derive and append a ratio value for each word from its occurrence frequency and the number of documents in which it appears. For example, the ratio value may be calculated by dividing the word's occurrence frequency by the number of documents in which it appears. The entries then become: (1) circumvallate, 335, 91, 3.6813186813; (2) circumvallate, 261, 91, 2.8681318681.
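The per-word statistics above can be sketched as follows. The tiny tokenized corpus and the function name `word_statistics` are hypothetical; the last line checks the ratio from the text, 335 / 91.

```python
from collections import Counter

def word_statistics(documents):
    """Return {word: (occurrence frequency, document count, ratio)}."""
    frequency = Counter()
    document_count = Counter()
    for tokens in documents:
        frequency.update(tokens)            # every occurrence counts
        document_count.update(set(tokens))  # each document counts once per word
    return {
        word: (frequency[word], document_count[word],
               frequency[word] / document_count[word])
        for word in frequency
    }

# Hypothetical tokenized documents.
documents = [["the", "cat", "the", "hat"], ["the", "dog"], ["cat", "cat"]]
stats = word_statistics(documents)
print(stats["the"])        # (3, 2, 1.5)
print(round(335 / 91, 10)) # 3.6813186813
```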

In addition, the server 120 may calculate reference weights and then use them to calculate a final weight for each word.

The method of calculating the reference weights is described first.

For example, assume that the data set is as follows.

(1) circumvallate, 335, 91, 3.6813186813

(2) circumvallate, 261, 91, 2.8681318681

The server 120 may calculate each reference weight by dividing the sum of the target values by the number of entries, that is, as their mean. For the two entries above, the reference weight of the word occurrence frequency is (335+261)/2, the reference weight of the number of documents is (91+91)/2, and the reference weight of the ratio value is (3.6813186813+2.8681318681)/2. Accordingly, the reference weights for the word occurrence frequency, the number of documents, and the ratio value are 298, 91, and 3.2747252747, respectively.
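The reference-weight arithmetic above amounts to a column-wise mean, as in this short sketch. The tuple layout and the function name `reference_weights` are assumptions made for illustration.

```python
def reference_weights(entries):
    """entries: list of (frequency, document_count, ratio) tuples.
    Returns the mean of each column."""
    count = len(entries)
    return tuple(sum(entry[i] for entry in entries) / count for i in range(3))

entries = [(335, 91, 335 / 91), (261, 91, 261 / 91)]
ref_frequency, ref_documents, ref_ratio = reference_weights(entries)
print(ref_frequency, ref_documents, round(ref_ratio, 10))
```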

서버(120)는 기준가중치를 이용하여 최종 가중치를 계산할 수 있다. The server 120 may calculate the final weight by using the reference weight.

예를 들어, 단어출현빈도의 가중치는 (단어출현빈도/단어출현빈도의 기준가중치 + 단어 상수)로 계산될 수 있다. 또한, 등장한 문서수의 가중치는 (등장한 문서수/등장한 문서수의 기준가중치 + 문서 상수)와 같이 계산될 수 있다. For example, the weight of the word appearance frequency may be calculated as (word appearance frequency/reference weight of word appearance frequency + word constant). Also, the weight of the number of appeared documents may be calculated as (the number of appeared documents/the reference weight of the number of appeared documents + the document constant).

최종적으로 가중치는 ((단어출현빈도가중치+등장한 문서수 가중치) x 정밀도 상수)로 계산될 수 있다. Finally, the weight can be calculated as ((word appearance frequency weight + number of appeared documents weight) x precision constant).

여기서, 단어 상수, 문서 상수 및 정밀도 상수는 구현에 따라 바뀔 수 있음은 당연하다. Here, it goes without saying that the word constant, the document constant, and the precision constant may change depending on implementation.

데이터 셋을 이용하여 가중치를 계산하는 과정에 대해 설명하기로 한다. A process of calculating weights using a data set will be described.

The final weight for (1) is calculated as ((335 / 298 + 31) + (91 / 91 + 36)) * 0.001, and the final weight for (2) is calculated as ((261 / 298 + 31) + (91 / 91 + 36)) * 0.001.

That is, the final weight may be reflected in each word as follows:

(1) circumvallate, 335, 91, 3.6813186813, 0.06912416107

(2) circumvallate, 261, 91, 2.8681318681, 0.06887583893
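The weight calculation described above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function names `reference_weight` and `final_weight` are hypothetical, and the constants 31, 36, and 0.001 are simply the values used in the worked example.

```python
# Constants from the worked example above; the description notes that they
# may vary by implementation.
WORD_CONSTANT = 31
DOC_CONSTANT = 36
PRECISION_CONSTANT = 0.001


def reference_weight(values):
    """Average of the observed values, e.g. (335 + 261) / 2 = 298."""
    return sum(values) / len(values)


def final_weight(freq, docs, ref_freq, ref_docs):
    """((freq / ref_freq + word constant) + (docs / ref_docs + document constant)) * precision constant."""
    freq_weight = freq / ref_freq + WORD_CONSTANT
    doc_weight = docs / ref_docs + DOC_CONSTANT
    return (freq_weight + doc_weight) * PRECISION_CONSTANT


# (word, word occurrence frequency, number of documents) for entries (1) and (2).
entries = [("circumvallate", 335, 91), ("circumvallate", 261, 91)]
ref_freq = reference_weight([freq for _, freq, _ in entries])
ref_docs = reference_weight([docs for _, _, docs in entries])
weights = [final_weight(freq, docs, ref_freq, ref_docs) for _, freq, docs in entries]

for (word, freq, docs), weight in zip(entries, weights):
    print(word, freq, docs, freq / docs, weight)
```

Up to rounding, the two computed weights agree with the values listed for entries (1) and (2).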

The server 120 may remove duplicate words using the final weights and sort the basic word dictionary according to priority.

The priority may be set in the order of final weight, ratio value, number of documents, and word occurrence frequency, and the basic dictionary may be built by sorting the words in descending order of priority.

For example, the basic dictionary may be built as "the, of, and, to, a, in, is, for, was, with, ...".
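The deduplication and priority sort can be sketched as below. The description does not say which duplicate entry survives, so this sketch assumes the entry with the highest priority tuple is kept; the `Entry` type, the function names, and the third sample entry ("the") are hypothetical and only serve the illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    word: str
    freq: int      # word occurrence frequency
    docs: int      # number of documents in which the word appears
    ratio: float   # freq / docs
    weight: float  # final weight


def priority(entry):
    """Priority order from the description: final weight, ratio, documents, frequency."""
    return (entry.weight, entry.ratio, entry.docs, entry.freq)


def build_basic_dictionary(entries):
    """Keep one entry per word (assumed: the highest priority one), then sort descending."""
    best = {}
    for entry in entries:
        current = best.get(entry.word)
        if current is None or priority(entry) > priority(current):
            best[entry.word] = entry
    ranked = sorted(best.values(), key=priority, reverse=True)
    return [entry.word for entry in ranked]


sample = [
    Entry("circumvallate", 335, 91, 3.6813186813, 0.06912416107),
    Entry("circumvallate", 261, 91, 2.8681318681, 0.06887583893),
    Entry("the", 1000, 100, 10.0, 0.08),  # made-up values for illustration only
]
basic_dictionary = build_basic_dictionary(sample)
```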

In step 235, the server 120 builds a learning sequence dictionary (learning level dictionary) using the basic dictionary.

The learning sequence dictionary (learning level dictionary) may be built separately for each field, or independently of any field.

The server 120 may extract the top n (for example, 15,000) words from the basic dictionary, split them into word sets of the per-level size, and assign levels in order. For example, assuming a per-level word set size of 150, the top n words are split into groups of 150, producing 100 per-level word sets.
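The level split can be sketched as follows, assuming the basic dictionary is already a list of words sorted by priority. The function name and the placeholder words are hypothetical; the defaults of 15,000 and 150 are the values from the example above.

```python
def build_level_dictionaries(basic_dictionary, n=15000, level_size=150):
    """Split the top n words into consecutive word sets and number them from level 1."""
    top_words = basic_dictionary[:n]
    return {
        level: top_words[start:start + level_size]
        for level, start in enumerate(range(0, len(top_words), level_size), start=1)
    }


# Placeholder vocabulary standing in for a real, priority-sorted basic dictionary.
vocabulary = [f"word{i}" for i in range(20000)]
levels = build_level_dictionaries(vocabulary)
```

With these defaults the 15,000 top words yield 100 levels of 150 words each, and level 1 holds the highest-priority words.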

In step 240, the server 120 determines the learning sequence (content level) of the content using the learning sequence dictionary.

This is described in more detail below.

The learning sequence (content level) of a piece of content may be determined by deriving, for each level, a coverage ratio indicating to what extent the words in the learning sequence dictionary (learning level dictionary) are included in the words extracted from that content.

For example, assume that the first-rank dictionary (level) is "reward, direct, story, spend, membership", the second-rank dictionary (level) is "portion, author, time, you, is, the", and the third-rank dictionary (level) is "fee, nice, attack, stuck, computer".

Assume further that the per-level coverage ratios of a particular piece of content are derived as 100% for the first-rank dictionary (level), 80% for the second-rank dictionary (level), and 20% for the third-rank dictionary (level). In this case, the level of the content may be set to the first rank.

However, if the highest coverage ratio (matching rate) is 85%, the server 120 may instead set the level of the content one step lower, to the second rank.
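The per-level coverage ratio can be sketched as below, using the three example dictionaries. The content word set is made up for illustration, so its ratios differ from the 100% / 80% / 20% figures in the text. The sketch simply selects the level with the highest coverage ratio (the lower level on ties, which is an assumption) and does not model the rule that lowers the level when the best ratio is only 85%.

```python
def level_coverage(level_words, content_words):
    """Fraction of a level dictionary's words that appear among the content's words."""
    level_set = set(level_words)
    return len(level_set & set(content_words)) / len(level_set)


def content_level(level_dicts, content_words):
    """Return (level with the highest coverage ratio, all ratios)."""
    ratios = {
        level: level_coverage(words, content_words)
        for level, words in level_dicts.items()
    }
    # Highest ratio wins; on a tie the lower level number is chosen (assumption).
    best = max(ratios, key=lambda level: (ratios[level], -level))
    return best, ratios


level_dicts = {
    1: ["reward", "direct", "story", "spend", "membership"],
    2: ["portion", "author", "time", "you", "is", "the"],
    3: ["fee", "nice", "attack", "stuck", "computer"],
}
# Hypothetical words extracted from one piece of content.
content_words = {
    "reward", "direct", "story", "spend", "membership",
    "author", "time", "you", "is", "the", "fee",
}
level, ratios = content_level(level_dicts, content_words)
```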

In step 245, the server 120 stores the content for which the learning sequence has been set as learning content.

FIG. 4 is a flowchart illustrating a method of recommending learning content according to an embodiment of the present invention, FIG. 5 is a diagram illustrating a diagnostic test set according to an embodiment of the present invention, FIG. 6 is a diagram illustrating recommended content by category according to an embodiment of the present invention, FIG. 7 is a diagram illustrating recommended content for all documents according to an embodiment of the present invention, and FIG. 8 is a diagram showing the result of sorting the recommendation order by reflecting weights according to an embodiment of the present invention. A method of recommending learning content for a newly registered user is described below.

In step 410, the server 120 constructs a diagnostic test set.

The diagnostic test set is a data set for determining the learning level of a newly registered user in order to recommend learning content, and may be constructed by extracting words from each per-level dictionary. FIG. 5 shows an example of a diagnostic test set.

Accordingly, the diagnostic test set may include a plurality of words ranging from the lowest level to the highest level.

In addition, a diagnostic test set for a specific category may be constructed by reflecting user information obtained from the user in advance.

Here, the user information may include, for example, the user's fields of interest, keywords, learning purpose, status, and gender.

In step 415, the server 120 provides a diagnostic quiz to the user terminal 110 using the constructed diagnostic test set.

Next, in step 420, the server 120 obtains the diagnostic quiz result from the user terminal 110.

In step 425, the server 120 analyzes the diagnostic quiz result and recommends learning content to the user terminal 110.

For example, the server 120 may analyze the diagnostic quiz result and determine, as the user's learning level, the lowest level among the quiz items that the user answered incorrectly. The server 120 may then recommend learning content corresponding to the determined learning level to the user terminal 110.
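The level decision from the diagnostic quiz can be sketched as follows. The result format (a list of level and correctness pairs) and the function name are hypothetical, and the description does not say what happens when every item is answered correctly, so the `fallback_level` parameter is an assumption.

```python
def user_learning_level(results, fallback_level):
    """Lowest level among incorrectly answered quiz items.

    results: iterable of (level, answered_correctly) pairs.
    fallback_level: returned when nothing was answered incorrectly (assumption).
    """
    wrong_levels = [level for level, correct in results if not correct]
    return min(wrong_levels) if wrong_levels else fallback_level


quiz_results = [(1, True), (2, True), (3, False), (5, False)]
learning_level = user_learning_level(quiz_results, fallback_level=100)
```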

According to an embodiment of the present invention, when recommending learning content, Doc2Vec may be used to derive feature values of documents, which are embedded in a vector space, and similar content may then be derived and recommended.

Besides Doc2Vec, TF-IDF, cosine similarity, or the Euclidean method may also be used.
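As a rough illustration of similarity-based retrieval, the sketch below ranks documents by cosine similarity over TF-IDF vectors using only the standard library. It does not use Doc2Vec; plain TF-IDF stands in for the embedding step. The smoothed IDF formula, the function names, and the sample documents are assumptions made for this example.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Build one sparse TF-IDF vector (term -> weight) per tokenized document."""
    n = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Term frequency times a smoothed inverse document frequency (assumed form).
            term: (count / len(doc)) * (math.log((1 + n) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank_similar(query_index, docs):
    """Indices of the other documents, most similar to docs[query_index] first."""
    vectors = tfidf_vectors(docs)
    scored = [
        (cosine_similarity(vectors[query_index], vec), index)
        for index, vec in enumerate(vectors)
        if index != query_index
    ]
    return [index for _, index in sorted(scored, key=lambda pair: pair[0], reverse=True)]


docs = [
    ["stock", "market", "price"],
    ["stock", "price", "index"],
    ["soccer", "match", "goal"],
]
ranking = rank_similar(0, docs)
```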

According to an embodiment of the present invention, core data may be extracted from each piece of content based on a topic modeling technique. For example, the topic modeling technique may be MG-LDA (Multi-Grain Latent Dirichlet Allocation).

MG-LDA may divide the document from which topics are to be extracted into sentences and then extract the nouns of each sentence. MG-LDA extracts global topics and local topics simultaneously, and can estimate topics by estimating, as a joint probability, the probability that a word belongs to a specific topic and the probability that a specific topic is present in a document.

Accordingly, in an embodiment of the present invention, it should be understood that core data (key words) have been extracted and mapped for each piece of content, even where this is not separately stated.

In recommending content that matches the determined learning level, the server 120 may recommend content by category by reflecting the user information, or may recommend content drawn from all documents.

The server 120 may also collect the recommended content, calculate a weight for each item, sort the items in order of weight, and recommend the content to the user terminal 110 in that order.

The weight of the collected content may be calculated using the following equation.

Here, r denotes read_cnt, the number of users who viewed the document; l denotes like_cnt, the number of users who liked the document; s denotes share_cnt, the number of users who shared the document with others; rt denotes the total read_cnt; lt denotes the total like_cnt; st denotes the total share_cnt; and similarity denotes the document similarity. The document similarity may be derived using, for example, TF-IDF, cosine similarity, or the Euclidean method.

FIG. 6 illustrates content extracted by category, and FIG. 7 illustrates content extracted from all documents. When weights are applied to the content extracted as in FIGS. 6 and 7, the content may be sorted as shown in FIG. 8. That is, the final order of the content recommended to the user among the collected content may be determined using the calculated weights, as shown in FIG. 8.

FIG. 9 is a flowchart illustrating a content recommendation method according to another embodiment of the present invention, FIG. 10 is a diagram illustrating data for selecting similar learners according to an embodiment of the present invention, FIG. 11 is a diagram illustrating the result of selecting similar learners according to an embodiment of the present invention, FIG. 12 is a diagram illustrating a list of content learned by the selected learners according to an embodiment of the present invention, and FIG. 13 is a diagram illustrating the result of sorting by reflecting weights according to an embodiment of the present invention.

In step 910, the server 120 obtains learning state information from the user terminal 110 and uses it to aggregate learning state statistics.

Here, the learning statistics may be aggregated as weekly growth statistics, monthly learning statistics, weekly word learning statistics, daily learning statistics, and the like. Among the learning content provided to the user terminal 110, the server 120 may mark a word that has passed the tests of every review round as a known word. The server 120 may mark as a word being learned any word for which review has started and at least one stage of testing has been passed, or which has been answered incorrectly even once during learning. The server 120 may mark a word for which learning has not started as an unknown word. The server 120 may receive the learning results for the learning content provided at each level from the user terminal 110 and use them to aggregate the weekly growth statistics, monthly learning statistics, weekly word learning statistics, and daily learning statistics.
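The three word states can be sketched as a small classifier. The parameter names and the `WordState` type are hypothetical. The sketch assumes that passing every review round takes precedence over an earlier wrong answer, and that any word that has been started but is not yet known counts as being learned.

```python
from enum import Enum


class WordState(Enum):
    KNOWN = "known"
    LEARNING = "learning"
    UNKNOWN = "unknown"


def classify_word(started, rounds_passed, total_rounds):
    """Classify one word from its review history.

    started: whether learning of the word has begun.
    rounds_passed: number of review rounds whose test was passed.
    total_rounds: number of review rounds in the schedule.
    """
    if not started:
        return WordState.UNKNOWN
    if total_rounds > 0 and rounds_passed >= total_rounds:
        return WordState.KNOWN
    # Started but not yet through every round, including words answered wrongly.
    return WordState.LEARNING


states = [
    classify_word(started=False, rounds_passed=0, total_rounds=5),
    classify_word(started=True, rounds_passed=1, total_rounds=5),
    classify_word(started=True, rounds_passed=5, total_rounds=5),
]
```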

In step 915, the server 120 may extract similar learners by reflecting the aggregated learning state statistics.

For example, the server 120 may use Doc2Vec to derive feature values of learners from their ID, preferred categories, preferred keywords, learning purpose, occupation, gender, list of learned content, list of content being learned, and list of learned words, and embed these values in a word space.

Next, the server 120 may select similar learners using TF-IDF, cosine similarity, the Euclidean method, or the like.

FIG. 10 illustrates data for selecting similar learners. As shown in FIG. 10, feature values of learners may be derived from the ID, preferred categories, preferred keywords, learning purpose, occupation, gender, list of learned content, list of content being learned, and list of learned words, and similar learners may be selected using these feature values.

In step 920, the server 120 extracts the learning content learned by the selected similar learners. Here, the learning content may be content by category, or may be extracted from all documents.

In step 925, the server 120 calculates a weight for the extracted content. The weight calculation is the same as described for FIG. 4 with reference to Equation 2, so a redundant description is omitted.

In step 930, the server 120 recommends the content, sorted in priority order according to weight, to the user terminal 110 in sequence.

In this way, content learned by similar learners may be provided to the user terminal 110.

FIG. 11 illustrates the result of selecting similar learners, and FIG. 12 illustrates the list of content learned by the selected learners. After the content learned by similar learners is extracted as in FIG. 12, weights are calculated and the content is sorted by weight to determine the recommendation order shown in FIG. 13, which may be provided to the user terminal 110.

FIG. 14 is a flowchart illustrating a content recommendation method according to yet another embodiment of the present invention, FIG. 15 is a diagram illustrating a similar content recommendation result according to yet another embodiment of the present invention, FIG. 16 is a diagram illustrating the result of classifying easy learning content by reflecting coverage according to yet another embodiment of the present invention, and FIG. 17 is a diagram illustrating the result of classifying difficult learning content by reflecting coverage according to yet another embodiment of the present invention.

In step 1410, the server 120 extracts content similar to the learning content that the user is currently learning.

As described above, similar content may be extracted by using Doc2Vec to derive feature values of documents, embedding them in a vector space, and then applying TF-IDF, cosine similarity, or the Euclidean method.

In step 1415, the server 120 calculates the coverage of each extracted similar content item.

For example, the coverage may be calculated using Equation 3.

Here, m denotes the number of content words that match words the learner knows, and t denotes the total number of words in the content.

That is, the learning coverage is the number of words the user knows relative to the total number of words in the content.
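Based on the definitions of m and t above, the coverage can be sketched as the ratio m / t. Whether words are counted as distinct words or as running tokens, and whether the ratio is scaled to a percentage, is not stated here; this sketch assumes distinct words and a plain fraction, and the function name is hypothetical.

```python
def coverage(known_words, content_words):
    """Return m / t for one piece of content.

    m: number of distinct content words that the learner knows.
    t: number of distinct words in the content (assumption: distinct, not tokens).
    """
    distinct = set(content_words)
    if not distinct:
        return 0.0
    m = len(distinct & set(known_words))
    t = len(distinct)
    return m / t


known = {"the", "time", "story"}
content = ["the", "story", "of", "the", "attack"]
content_coverage = coverage(known, content)
```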

In step 1420, the server 120 classifies the extracted similar content into easy content and difficult content using the coverage of each item.

User coverage may be used when classifying the similar content into content that is easy or difficult for the user.

The user coverage is the coverage of the words the user knows, derived over the test content. Accordingly, if the user coverage is smaller than the coverage of an extracted similar content item, that item may be classified as easy content, and if the user coverage is larger than the coverage of the item, it may be classified as difficult content.

That is, if user coverage < coverage of the similar content, the similar content may be classified as easy content, and if user coverage > coverage of the similar content, it may be classified as difficult content.
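The inequality above can be sketched as a simple comparison. The function names and the sample coverage values are hypothetical. The case where the two coverages are equal is not specified, so the sketch reports it separately as "equal" rather than guessing.

```python
def classify_content(user_coverage, content_coverage):
    """Compare the user coverage with the coverage of one similar content item."""
    if user_coverage < content_coverage:
        return "easy"
    if user_coverage > content_coverage:
        return "difficult"
    return "equal"  # not specified in the description


def split_by_difficulty(user_coverage, coverages):
    """Group content ids by the label returned from classify_content."""
    groups = {"easy": [], "difficult": [], "equal": []}
    for content_id, content_coverage in coverages.items():
        groups[classify_content(user_coverage, content_coverage)].append(content_id)
    return groups


# Made-up coverage values for three similar content items.
groups = split_by_difficulty(0.6, {"doc_a": 0.8, "doc_b": 0.4, "doc_c": 0.6})
```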

In step 1425, the server 120 may recommend easy content or difficult content to the user terminal 110 using the classification result.

In doing so, when recommending a plurality of easy and difficult content items to the user terminal 110, the server 120 may calculate weights with reference to Equation 2 as described for FIG. 4, sort the items by those weights to determine the recommendation order, and recommend them in that order.

FIG. 18 is a flowchart illustrating a content recommendation method according to an embodiment of the present invention.

In step 1810, the server 120 builds a learning content database.

This is the same as described with reference to FIG. 2, so a redundant description is omitted.

In step 1815, the server 120 provides a diagnostic quiz to the user terminal 110.

In step 1820, the user terminal 110 takes the diagnostic quiz provided by the server 120 and transmits the result to the server 120.

In step 1825, the server 120 recommends learning content in consideration of the result and provides it to the user terminal 110.

This is the same as described with reference to FIG. 4, so a redundant description is omitted.

In step 1830, the user terminal 110 studies the learning content recommended by the server 120.

After selecting learning content, the user terminal 110 may study it through reading the original text, review, test, repetition, or completion.

In step 1835, the user terminal 110 transmits learning activity information to the server 120.

In step 1840, the server 120 aggregates learning state statistics using the learning activity information received from the user terminal 110.

Next, in step 1845, the server 120 recommends learning content by reflecting the aggregated learning state statistics and provides it to the user terminal 110. This is the same as described with reference to FIGS. 9 and 14, so a redundant description is omitted.

In step 1850, the user terminal 110 studies the learning content recommended by the server 120. That is, after selecting the recommended learning content, the user terminal 110 may study it through reading the original text, review, test, repetition, or completion.

In step 1855, the server 120 manages the learning schedule, including forgetting state updates.

For example, the server 120 may demote the review round if no test result for the learning content is recorded within the period specified for the user's next review round.

For example, if the first round of learning is completed on January 1 and the second round falls three days later, on January 4, the review round may be demoted to the first round if no test result is recorded by January 5. Even if the test was not completed, the server 120 may grant a grace period when the test was attempted and then decide again whether to demote.

When a demotion occurs following a forgetting state update, the word learning states of that round may be rolled back. However, the server 120 may exclude from the rollback any word whose learning state has been updated through other content.
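The demotion rule in the January example can be sketched as follows. The function name, the one-day window (inferred from the January 4 to January 5 example), and the year used in the dates are assumptions. The grace period for attempted tests and the rollback of word states are not modeled here.

```python
from datetime import date, timedelta


def updated_review_round(current_round, scheduled, last_test, today,
                         window=timedelta(days=1)):
    """Return the review round after a forgetting state update.

    current_round: round that is due (2 in the example above).
    scheduled: date on which that round is due.
    last_test: date of the most recent recorded test result, or None.
    today: date on which the update runs.
    window: time allowed after the scheduled date (assumed to be one day).
    """
    deadline = scheduled + window
    tested_in_time = last_test is not None and scheduled <= last_test <= deadline
    if tested_in_time or today <= deadline:
        return current_round
    # Deadline has passed without a recorded result: demote, but not below round 1.
    return max(1, current_round - 1)


# Round 2 is due on January 4; no result is recorded by January 5.
demoted = updated_review_round(2, date(2021, 1, 4), None, date(2021, 1, 6))
kept = updated_review_round(2, date(2021, 1, 4), date(2021, 1, 5), date(2021, 1, 6))
```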

In step 1860, the user terminal 110 transmits to the server 120 a request to check learning statistics and to review.

In step 1865, in response to that request, the server 120 provides learning statistics information and review content to the user terminal 110. The user terminal 110 can thus check the learning statistics and perform the review.

The process of checking learning statistics may be performed frequently while learning content is being studied, and may be performed by the user terminal 110 in parallel with, and independently of, the operation of the server 120 after step 1820.

In addition, the user terminal 110 may separately perform the process of reviewing learning content, and if the review is not performed within the review schedule specified by the server 120, a demotion may be applied when the forgetting state is updated.

Although the present invention has been described above with reference to embodiments, those of ordinary skill in the art will readily understand that the present invention can be variously modified and changed without departing from the spirit and scope of the present invention as set forth in the claims below.

Claims

A method of recommending learning content, the method comprising:
collecting digital content online;
generating a category list by applying the collected digital content to a classification model, and classifying each item of digital content according to the generated category list;
analyzing the digital content to extract words, generating a basic word dictionary, generating a dictionary for each learning level using the basic word dictionary, and deriving a coverage ratio of the dictionary for each learning level with respect to the digital content to determine a learning sequence of the digital content; and
recommending learning content in consideration of a user learning result and the learning sequence of the digital content.

The method of claim 1,
wherein generating the basic word dictionary comprises:
analyzing the digital content and deriving, for each extracted word, a frequency of occurrence, a number of documents in which the word appears, and a ratio value, the ratio value being calculated by dividing the frequency of occurrence by the number of documents in which the word appears;
calculating a weight of each word using the frequency of occurrence, the number of documents in which the word appears, and the ratio value; and
generating the basic word dictionary by sorting the words in descending order of weight and extracting the top n words.

The method of claim 1, further comprising:
obtaining learning activity information on the recommended learning content from a user terminal;
aggregating learning state statistics using the learning activity information and selecting similar learners using the aggregated result, wherein the similar learners are selected by deriving feature values of learners from their ID, preferred category, preferred keyword, learning purpose, occupation, gender, list of learned content, list of content being learned, and list of learned words, and embedding the feature values in a word space; and
extracting and recommending the learning content learned by the similar learners.

The method of claim 1,
wherein recommending the learning content comprises:
calculating a weight for each item of extracted learning content;
sorting the learning content using the weights; and
recommending the learning content sequentially in the sorted order.

The method of claim 1, further comprising:
selecting content similar to the learning content being studied by the user;
calculating a coverage for each item of similar content;
classifying the similar content into difficult learning content and easy learning content using the coverage; and
providing the classified difficult learning content and easy learning content to the user terminal in order,
wherein the coverage is calculated in consideration of whether words known by the user match words included in the similar content.

The method of claim 1, further comprising:
updating a forgetting state, wherein, when the user does not take a test on the recommended learning content within a specified review period, the review round is demoted or the learning state of the corresponding level is rolled back.