KR20230063441A

KR20230063441A - System for recommending document using dynamic change of time-series pattern information

Info

Publication number: KR20230063441A
Application number: KR1020210148521A
Authority: KR
Inventors: 유상열; 임경은; 안경준
Original assignee: 주식회사 엠클라우독
Priority date: 2021-11-02
Filing date: 2021-11-02
Publication date: 2023-05-09
Also published as: KR102659788B1

Abstract

A system for customized document recommendation using dynamic change of time-series pattern information according to the present invention comprises: a document object DB unit that stores document objects, which are documents collected through document centralization; a history information generation unit that, when a document object is used by a user, stores history information in association with that document object, the history information including information on the user who used the document object and usage time information indicating when it was used; a request information input unit that receives a request signal including requester information and at least one piece of request period information selected from a plurality of different set periods; a first selection unit that uses the history information to select, from the document objects, used document objects, which are document objects used by the user corresponding to the requester information; a recommended document selection unit that selects, from the used document objects, recommended documents, which are one or more document objects whose usage time information is relatively close to the reference time point of the request period information; and a list providing unit that generates recommendation list information corresponding to the recommended documents and provides it to the requester. According to the present invention, a list of documents the user needs can be recommended by analyzing and tracking, from multiple angles, the user's various patterns that change dynamically over time.

Description

Customized document recommendation system using dynamic change of time-series pattern information {SYSTEM FOR RECOMMENDING DOCUMENT USING DYNAMIC CHANGE OF TIME-SERIES PATTERN INFORMATION}

The present invention relates to a system for recommending documents to a user, and more specifically to a customized document recommendation system using dynamic change of time-series pattern information. By combining analysis of time-series changes in the patterns with which a user views or accesses documents with statistical use of those patterns, the system can greatly reduce the time spent searching for documents and help prevent documents from being overlooked, thereby maximizing the convenience of document access and use.

In the traditional approach, in which documents are written on personal PCs within a company and stored on those PCs, document leakage through private misappropriation cannot be fundamentally prevented. Virus infection, hacking, and physical or logical damage to storage media such as hard disks occur individually, repeatedly, and continuously, so adaptive responses are quite difficult and fundamental prevention is hard. Such an environment is therefore constantly exposed to these problems.

To address these problems, one could strengthen security solutions or policies on each personal PC, but this approach fundamentally depends on people, so its effectiveness is inherently limited.

Recently, cloud-based document centralization systems have been disclosed to address these problems more fundamentally. A centralization system refers to an environment or system in which the various work products of an organization such as a company are stored and used on a central server rather than on personal PCs.

Document centralization can effectively resolve the problems described above because prevention of information leakage, virus intrusion, and hacking can be implemented in a unified manner with a consistent, strengthened solution. It is also more effective for distributed processing used to back up information and data.

In addition, a document centralization system offers a significant advantage in that related work and tasks can be performed in a ubiquitous environment without constraints of time and place.

While a document centralization system has the fundamental advantage of effectively concentrating or centralizing document resources, the vast amount of aggregated data makes efficient document management difficult for administrators and makes it difficult for general users to search for and find the desired document accurately and quickly.

To address this, methods have recently been disclosed that classify the vast body of centralized documents using artificial-intelligence-based supervised learning or unsupervised learning, or that group multiple objects (documents, etc.) based on keywords, as SNS companies do.

However, because these methods categorize documents using techniques that process data mechanically, they have difficulty reflecting the actual content of the documents. They also focus on performing the grouping itself, so users cannot intuitively and easily recognize the characteristics of each group, and searching for and checking documents remains difficult.

Meanwhile, because a system that implements document centralization aggregates a vast number and volume of document objects, its users must be able to easily and accurately access the documents they are writing, the documents they need to check, and the documents that need to be circulated.

However, a document centralization system merely sorts its vast number of documents into directories such as a tree structure, so users must find and access each document one by one, relying on their own memory and know-how. As a result, document access can take a long time, and documents that must be checked or written periodically can be overlooked or missed at any time.

In this regard, some conventional document centralization systems apply methods such as recommending a number of highly preferred documents based simply on the total number of times each document has been accessed (including reads and writes), recommending documents that match user preferences identified through surveys, or recommending documents based on the degree of match between the content of a document and the user's personal information.

However, although these methods may be of some help in a statistical sense or for reference use of documents, they are not based on pattern analysis of each individual user and therefore cannot be optimized for document use from the standpoint of each user.

In addition, because a system that implements document centralization stores and uses a vast volume of documents or document objects, it is difficult to quickly search for and find a desired document or a document to use as a reference.

Keyword-based search tools are available, but they use the search method applied uniformly by the OS and the like, in which the search relies only on file names or on some of the words contained in the document. It therefore remains difficult to provide the documents a user wants quickly and accurately.

To improve the efficiency of search and navigation, policies sometimes encourage users to attach various tag information to a document when they use it (create, modify, edit, etc.). However, this approach depends on the document author or other people, so human error can occur at any time, and because the added information rests on subjective and arbitrary judgment, its accuracy and reliability are quite low.

Because a document centralization system holds a vast database of document objects from many users, there is a growing need for a system that accurately recommends or provides documents that are substantially similar to the documents a user is working with, in order to increase the synergy in document creation, reference, and use.

The present invention was devised to solve the problems described above. Its object is to provide a customized document recommendation system using dynamic change of time-series pattern information that analyzes and tracks, from multiple angles, the user's various patterns that change dynamically over time and recommends a list of documents the user needs on that basis, thereby preserving the inherent advantages of a document centralization system built on a large-capacity database while extending convenience functions for personalized use.

Other objects and advantages of the present invention can be understood from the following description and will become more apparent from the embodiments of the present invention. The objects and advantages of the present invention can also be realized by the configurations recited in the claims and combinations thereof.

To achieve the above object, the customized document recommendation system using dynamic change of time-series pattern information of the present invention may comprise: a document object DB unit that stores document objects, which are documents collected through document centralization; a history information generation unit that, when a document object is used by a user, stores history information in association with that document object, the history information including information on the user who used the document object and usage time information indicating when it was used; a request information input unit that receives a request signal including requester information and at least one piece of request period information selected from a plurality of different set periods; a first selection unit that uses the history information to select, from the document objects, used document objects, which are document objects used by the user corresponding to the requester information; a recommended document selection unit that selects, from the used document objects, recommended documents, which are one or more document objects whose usage time information is relatively close to the reference time point of the request period information; and a list providing unit that generates recommendation list information corresponding to the recommended documents and provides the generated recommendation list information to the requester.
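The history information and the first selection step can be illustrated with a minimal sketch. The record layout (`HistoryRecord`) and the function name `used_document_objects` are hypothetical and chosen only for illustration; the source states only that history information links a document object to the user who used it and to the time of use.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HistoryRecord:
    """One use of a document object (hypothetical layout)."""
    doc_id: str
    user: str
    used_at: datetime


def used_document_objects(history, requester):
    """First selection: collect the requester's usage times per document."""
    used = {}
    for record in history:
        if record.user == requester:
            used.setdefault(record.doc_id, []).append(record.used_at)
    return used
```

The returned mapping (document id to list of usage times) is the kind of input a later ranking step would need.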

Here, the recommended document selection unit of the present invention may comprise: a periodic weight processing unit that assigns first rank information to each used document object, assigning relatively higher first rank information as the gap between the usage time information of the used document object and the reference time point of the request period information becomes smaller; and a selection processing unit that selects the recommended documents from the used document objects based on the magnitude of the first rank information.
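As a rough illustration of the periodic weighting, the sketch below scores each used document by how close one of its usage times lies to a reference point. Several things here are assumptions and not taken from the source: the reference point is taken to be the request time minus the requested period, the period table `SET_PERIODS` is hypothetical, and the scoring form `1 / (1 + gap / period)` is just one monotonically decreasing choice.

```python
from datetime import datetime, timedelta

# Hypothetical set periods; the source only states that several different periods exist.
SET_PERIODS = {
    "day": timedelta(days=1),
    "week": timedelta(weeks=1),
    "month": timedelta(days=30),
}


def first_rank(usage_times, requested_at, period):
    """Higher score when some usage time is close to the reference point."""
    span = SET_PERIODS[period]
    reference = requested_at - span  # assumed definition of the reference point
    gap = min(abs(t - reference) for t in usage_times)
    return 1.0 / (1.0 + gap / span)  # assumed scoring form, equals 1.0 at zero gap


def select_recommended(used_docs, requested_at, period, top_n):
    """Return the top_n document ids ordered by first rank, highest first."""
    ranked = sorted(
        used_docs,
        key=lambda doc_id: first_rank(used_docs[doc_id], requested_at, period),
        reverse=True,
    )
    return ranked[:top_n]
```

With a weekly period, a document last used at about the same time one week earlier outranks one used more recently but off-cycle, which matches the stated goal of surfacing documents that are handled periodically.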

Preferably, the recommended document selection unit of the present invention may further comprise a continuity weight processing unit that assigns second rank information to each used document object, assigning relatively higher second rank information as the period of continuous use becomes longer or the number of uses becomes greater. In this case, the selection processing unit of the present invention may be configured to select the recommended documents from the used document objects based on a value computed from the first and second rank information.

Depending on the embodiment, the recommended document selection unit of the present invention may further comprise a proximity weight processing unit that extracts the last usage time information from the usage time information included in the history information of the used document object and assigns relatively higher third rank information as the last usage time information becomes closer to the current time. In this case, the selection processing unit of the present invention may be configured to select the recommended documents from the used document objects based on a value computed from the first to third rank information.

In addition, the recommended document selection unit of the present invention may further comprise: a text selection unit that performs morpheme parsing on the file name information of the used document objects to select a plurality of texts corresponding to nouns and, from the selected texts, selects the top m (m is a natural number of 2 or more) main texts by frequency of use; and a frequency weight processing unit that assigns fourth rank information to each used document object, assigning relatively higher fourth rank information as the number of the top m main texts contained in the file name of that used document object becomes greater.
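The file-name keyword step might look like the sketch below. It deliberately swaps in a simple separator-based tokenizer for the morpheme parser named in the source (which would normally require a dedicated Korean morphological analyzer), so every token is treated as a noun candidate. The function names `tokens`, `main_texts`, and `fourth_rank` are hypothetical.

```python
import re
from collections import Counter


def tokens(filename):
    """Stand-in for morpheme parsing: split the file stem on non-word separators."""
    stem = filename.rsplit(".", 1)[0]
    return [t.lower() for t in re.split(r"[^0-9A-Za-z가-힣]+", stem) if t]


def main_texts(filenames, m):
    """Select the top m tokens by how often they occur across all file names."""
    counts = Counter(t for name in filenames for t in tokens(name))
    return [text for text, _ in counts.most_common(m)]


def fourth_rank(filename, keywords):
    """Number of main texts that appear in this document's file name."""
    return len(set(tokens(filename)) & set(keywords))
```

A document whose file name contains more of the user's frequently recurring terms receives a higher fourth rank value under this sketch.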

When the above embodiment is implemented, the selection processing unit of the present invention may be configured to select the recommended documents from the used document objects based on a value computed from the first to fourth rank information.
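The source does not specify how the first to fourth rank information is combined. One simple possibility, shown here purely as an assumption, is a weighted sum; the weights and the names `combined_score` and `select_by_ranks` are hypothetical.

```python
def combined_score(ranks, weights=(1.0, 1.0, 1.0, 1.0)):
    """Weighted sum of (first, second, third, fourth) rank values (assumed form)."""
    return sum(w * r for w, r in zip(weights, ranks))


def select_by_ranks(rank_table, top_n, weights=(1.0, 1.0, 1.0, 1.0)):
    """rank_table maps a document id to its tuple of four rank values."""
    ordered = sorted(
        rank_table,
        key=lambda doc_id: combined_score(rank_table[doc_id], weights),
        reverse=True,
    )
    return ordered[:top_n]
```

Adjusting `weights` would let an implementation emphasize, for example, periodicity over recency; any such tuning is outside what the source describes.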

Furthermore, the recommended document selection unit of the present invention may further comprise a tracking unit that generates tracking information including at least one of modification count information, indicating how many times the content of the used document object has been modified, and modification ratio information, indicating the proportion of the entire content of the used document object that has been modified. In this case, the selection processing unit of the present invention may be configured to select the recommended documents from the used document objects based on a value computed from the first rank information and the tracking information.

To implement a preferred embodiment, the recommended document selection unit of the present invention may further comprise an attention processing unit that assigns an attention index to each used document object, assigning a relatively higher attention index as the number of users, other than the user corresponding to the requester information, who have modified the content of the used document object becomes greater.

When the above embodiment is implemented, the selection processing unit of the present invention may be configured to select the recommended documents from the used document objects based on a value computed from the first rank information and the attention index.
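A minimal sketch of the attention index follows. Counting distinct other users follows the description above; the way the index is combined with the first rank value (an additive term scaled by `attention_weight`) is an assumption, as are all names.

```python
def attention_index(modifier_ids, requester_id):
    """Count distinct users other than the requester who modified the document."""
    return len(set(modifier_ids) - {requester_id})


def select_with_attention(docs, requester_id, top_n, attention_weight=0.1):
    """docs maps a document id to (first_rank, list of modifier ids)."""
    def score(doc_id):
        first_rank, modifiers = docs[doc_id]
        # Assumed combination: first rank plus a scaled attention index.
        return first_rank + attention_weight * attention_index(modifiers, requester_id)

    return sorted(docs, key=score, reverse=True)[:top_n]
```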

To achieve the object according to another aspect, the user-oriented document classification system using dual application of a classification scheme of the present invention may comprise: a first input unit that receives, for each of N (N is a natural number of 2 or more) variably set categories, sample documents, which are a plurality of documents representing that category; a vector generation unit that uses an artificial-intelligence-based embedding model to generate, for each sample document, a multidimensional numeric embedding vector representing the content of that sample document; a representative vector setting unit that performs a statistical operation on the embedding vectors of the sample documents belonging to the same category and sets the result as the representative vector of that category; a target vector generation unit that generates, for m (m is a natural number of 2 or more) target documents subject to document centralization, a target vector, which is the embedding vector of each of the m target documents; and a data processing unit that selects one or more assigned categories to which each of the m target documents belongs, using the similarity between the representative vector of each of the N categories and the target vector.

Here, the data processing unit of the present invention may be configured to perform classification processing in which, by comparing the representative vector of each of the N categories with the target vector, P (P is a natural number from 1 to N) categories having a similarity equal to or greater than a first reference value are selected as the assigned categories of each of the m target documents, and ranking information according to the magnitude of the similarity is assigned to each of the m target documents for each of the P categories.
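The representative-vector classification can be sketched as follows. The source does not name the statistical operation or the similarity measure; this sketch assumes the element-wise mean and cosine similarity, and it takes embedding vectors as given lists of floats instead of calling any embedding model. All function names are hypothetical.

```python
import math


def mean_vector(vectors):
    """Assumed statistical operation: element-wise mean of the sample vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(target_vector, representatives, first_reference_value):
    """Return (category, similarity) pairs at or above the threshold, best first."""
    scored = [(name, cosine(target_vector, rep)) for name, rep in representatives.items()]
    kept = [pair for pair in scored if pair[1] >= first_reference_value]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)
```

An empty result corresponds to a target document for which no category reaches the first reference value.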

Preferably, the present invention may further comprise: a clustering unit that applies a clustering algorithm to e (e is a natural number from 1 to m) target documents, among the m target documents, for which no category has a similarity equal to or greater than the first reference value, to generate k (k is a natural number of 1 or more) groups; a second representative vector setting unit that performs a statistical operation on the target vectors of the target documents belonging to each of the k groups and sets the result as the representative vector of each of the k groups; and a second data processing unit that, by comparing the representative vector of each of the k groups with the target vector of each of the e target documents, selects S (S is a natural number from 1 to k) groups having a similarity equal to or greater than a second reference value as the assigned groups of each of the e target documents, and assigns ranking information according to the magnitude of the similarity to each of the e target documents for each of the S categories.

Furthermore, the present invention more preferably further comprises: a text extraction unit that applies a morpheme parsing algorithm to the file name information of the target documents belonging to each of the k groups to extract texts corresponding to nouns; and a naming unit that selects a category name for each of the k groups using central texts, which are the texts with relatively high frequency among the extracted texts.

Specifically, the naming unit of the present invention may comprise: a selection unit that selects a plurality of central texts based on frequency of use; a calculation processing unit that computes, for each of the plurality of central texts, total count information, which is the number of times the text is used in the target documents of the corresponding group, and duplicate count information, which is the number of times the text is used repeatedly, counted per sentence within the same target document; and a naming selection unit that selects a category name for each of the k groups using the total count information and the duplicate count information.

Here, the naming selection unit of the present invention may be configured to select one of the plurality of central texts as the category name of the corresponding group, using the magnitude of the result of the formula below.

In the above formula, A is the total count information, B is the duplicate count information, a is a real number of 1 or more, and b is a real number from 1 to a.

More preferably, the present invention may further comprise: a monitoring unit that monitors whether the content of a target document is modified by a user; a processing control unit that, when the content of the target document is modified by a user, controls the target vector generation unit to generate a second target vector for the modified target document, and controls the data processing unit to perform the classification processing using the second target vector and to generate time-series ranking information of the modified target document for each time at which a modification was made; and a feedback processing unit that transmits related information to the user when the change in the time-series ranking information is equal to or greater than a reference value.

To achieve the object according to yet another aspect, the user-profile-based similar document recommendation system of the present invention may comprise: a document DB unit in which document objects, which are documents collected through document centralization, are stored in a storage space having a hierarchical tree structure of S (S is a natural number of 2 or more) upper and lower levels; an embedding vector generation unit that uses an artificial-intelligence-based embedding model to generate, for each document object, a multidimensional numeric embedding vector representing the content of that document object; a request input unit that receives a request signal for similar document recommendation, the request signal including information on the requester and information on a subject document; a similar document selection unit that selects k (k is a natural number of 2 or more) similar documents having similarity to the subject document through first similarity processing between the embedding vector of the subject document and the embedding vectors of the document objects; and a main processing unit that provides the requester with similar list information containing information on the k similar documents.

The present invention may further comprise a first ranking information generation unit that generates similarity ranking information for each of the k similar documents. In this case, the main processing unit of the present invention may be configured to provide the similar list information according to the similarity ranking information.

Preferably, the present invention may further comprise: a profile generation unit that stores, in association with each document object, profile information including storage path information indicating where the document object is stored in the storage space, usage time information indicating when the document object was used, information on the user who used the document object, and affiliation information of that user; a learning object generation unit that generates k learning objects composed of elements for the storage path information of the k similar documents, first time information indicating when the request signal was input, the information on the requester, and the affiliation information of the requester; and a learning processing unit that applies a learning algorithm using the profile information to the k learning objects to generate, for each of the k similar documents, weight information representing the probability that profile information corresponding to the learning object exists among the profile information.

In this case, the main processing unit of the present invention may be configured to generate final ranking information by reflecting the weight information in the similarity ranking information and to provide the similar list information according to the final ranking information.
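The similar document selection and the final ranking can be illustrated together. This sketch assumes cosine similarity for the first similarity processing and assumes that the weight information is applied by multiplying it with the similarity; the source states only that the weight is reflected in the similarity ranking. The weights are passed in as given values instead of being produced by a learning algorithm, and all names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_similar(subject_vector, document_vectors, k):
    """Return the k (doc_id, similarity) pairs most similar to the subject document."""
    scored = [(doc_id, cosine(subject_vector, vec)) for doc_id, vec in document_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


def final_ranking(similar, weights):
    """Reorder similar documents by similarity times weight (assumed combination)."""
    rescored = [(doc_id, sim * weights.get(doc_id, 0.0)) for doc_id, sim in similar]
    return [doc_id for doc_id, _ in sorted(rescored, key=lambda pair: pair[1], reverse=True)]
```

Under this sketch a document that is slightly less similar in content can still rank first if its profile-based weight is much higher.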

Furthermore, the learning object generation unit of the present invention may comprise: an individual vector generation unit that performs encoding on each element constituting the storage path information of the k similar documents, the first time information indicating when the request signal was input, the information on the requester, and the affiliation information of the requester, to generate a numeric individual vector representing that element; and an object generation unit that generates a multidimensional final vector by combining the individual vectors and generates the k learning objects using the generated multidimensional final vector.

In this case, the individual vector generation unit of the present invention preferably generates an independent individual vector for each of the storage path information of the k similar documents, the first time information indicating when the request signal was input, the information on the requester, and the affiliation information of the requester. For the storage path information of the k similar documents, it preferably generates an independent individual vector for the element at each of the S levels that form the hierarchical tree structure.
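
The per-element encoding and concatenation described above can be sketched as follows. This is only an illustrative sketch: the text does not name an encoding scheme, so one-hot encoding is an assumption, and the function names `one_hot` and `build_learning_object`, the `vocab` layout, and the choice of a weekday as the time element are all hypothetical.

```python
from typing import Dict, List, Sequence


def one_hot(value: str, vocabulary: Sequence[str]) -> List[float]:
    """Individual vector for one element; all zeros if the value is unknown."""
    return [1.0 if value == known else 0.0 for known in vocabulary]


def build_learning_object(path: str, weekday: str, requester: str,
                          affiliation: str, vocab: Dict[str, list],
                          levels: int) -> List[float]:
    """Concatenate per-element individual vectors into one final vector."""
    # One individual vector per level of the hierarchical storage path
    # (assumption: vocab["path"][i] lists the known names at level i).
    parts = path.strip("/").split("/")[:levels]
    parts += [""] * (levels - len(parts))  # pad paths shallower than `levels`
    vectors = [one_hot(part, vocab["path"][i]) for i, part in enumerate(parts)]
    # One individual vector each for time, requester, and affiliation.
    vectors.append(one_hot(weekday, vocab["weekday"]))
    vectors.append(one_hot(requester, vocab["requester"]))
    vectors.append(one_hot(affiliation, vocab["affiliation"]))
    # The multidimensional final vector is the concatenation of all of them.
    return [x for vector in vectors for x in vector]
```

Calling `build_learning_object` once per similar document would yield the k learning objects; a learned encoder could replace `one_hot` without changing the concatenation step.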

Furthermore, the learning processing unit of the present invention may be configured to generate the weight information by applying, in combination, the number of matching elements and the degree of similarity of the matching elements among the elements for the storage path information of the k similar documents, the first time information indicating when the request signal was input, the information on the requester, and the affiliation information of the requester.
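
One way to picture how the number of matching elements and their degree of similarity might be combined, and how the resulting weight could be reflected in the similarity ranking, is the sketch below. It is a hand-written heuristic standing in for the learning algorithm, which the text does not specify. The blend factor `alpha`, the prefix-based `element_similarity`, and the `similarity * (1 + weight)` re-ranking formula are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def element_similarity(a: str, b: str) -> float:
    """Share of leading '/'-separated levels two values have in common (0..1)."""
    parts_a, parts_b = a.strip("/").split("/"), b.strip("/").split("/")
    shared = 0
    for x, y in zip(parts_a, parts_b):
        if x != y:
            break
        shared += 1
    return shared / max(len(parts_a), len(parts_b))


def weight(learning_object: Dict[str, str], profiles: List[Dict[str, str]],
           alpha: float = 0.5) -> float:
    """Best blend of match count and match similarity over all profiles."""
    best = 0.0
    for profile in profiles:
        sims = [element_similarity(value, profile[key])
                for key, value in learning_object.items()]
        matched = [s for s in sims if s > 0.0]
        if not matched:
            continue
        count_ratio = len(matched) / len(sims)         # how many elements match
        mean_similarity = sum(matched) / len(matched)  # how closely they match
        best = max(best, alpha * count_ratio + (1 - alpha) * mean_similarity)
    return best


def final_ranking(similarities: Dict[str, float],
                  weights: Dict[str, float]) -> List[Tuple[str, float]]:
    """Re-rank similar documents by similarity boosted by their weight."""
    scored = [(doc, sim * (1.0 + weights.get(doc, 0.0)))
              for doc, sim in similarities.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)
```

With this scheme a document whose content similarity is slightly lower can still move ahead of another if its path, time, requester, and affiliation elements agree with an existing profile.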

According to one embodiment of the present invention, the time required for document search and the like can be drastically reduced by effectively combining analysis of the time-series dynamic changes in the individual pattern information of each user of document centralization with statistical use of that analysis.

In addition, according to one embodiment of the present invention, the user can be prompted to check important documents at predetermined, variably set intervals. This effectively prevents important documents from being overlooked or omitted and further optimizes document access and user convenience.

According to another embodiment of the present invention, category classification using an identification scheme such as embedding vectors and category classification using clustering techniques are combined in a complementary manner. This avoids forcing a document into the nearest category by similarity alone and yields a category classification that more precisely reflects the actual content of document objects.

According to another embodiment of the present invention, category names are selected from the content actually contained in individual documents and the results of statistical operations, which can greatly improve information retrieval and information accessibility.

In addition, the present invention can reflect the general rules of experience (UX) that users follow when writing a document or naming its title or file. A category, or the category/group to which a document belongs, can therefore be recognized and confirmed more intuitively, making the environment for using centralized documents more user-oriented.

According to yet another embodiment of the present invention, embedding vectors capable of representing the actual content of document objects, together with processing that determines similarity between those vectors, are applied. Similar documents whose content resembles that of the subject document can therefore be selected and provided accurately and objectively.

In addition, according to yet another embodiment of the present invention, learning objects are generated in the form of multidimensional vectors that comprehensively reflect not only the storage path or access path of a document object but also the time at which the document object was used (month, day of the week, time of day, etc.) and the user information and affiliation information that represent personal relevance. A quantitative indicator of similarity is provided on that basis, so that similar documents that maximize the cross-reference value among documents can be recommended and provided.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description below, serve to aid understanding of the technical idea of the invention. The present invention should therefore not be construed as limited to the matters shown in these drawings.
FIG. 1 is a diagram showing a document classification system according to a preferred embodiment of the present invention and the overall configuration related to it;
FIG. 2 and FIG. 3 are block diagrams showing the detailed configuration of a document classification system according to a preferred embodiment of the present invention;
FIG. 4 is a flowchart showing a category classification process using embedding vectors;
FIG. 5 is a flowchart showing the clustering processing and the naming selection processing;
FIG. 6 is a flowchart showing a category classification process performed on new documents;
FIG. 7 is a block diagram showing the detailed configuration of a document recommendation system according to a second embodiment of the present invention;
FIG. 8 is a block diagram showing the detailed configuration of the recommended document selection unit of FIG. 7;
FIG. 9 is a flowchart showing a document recommendation process according to the second embodiment of the present invention;
FIG. 10 is a flowchart showing the specific process of generating recommended documents;
FIG. 11 is a block diagram showing the detailed configuration of a similar document recommendation system according to a third embodiment of the present invention;
FIG. 12 is a flowchart showing a similar document recommendation process according to the third embodiment of the present invention;
FIG. 13 is a flowchart showing a process of recommending similar documents by comprehensively considering weight information and the like;
FIG. 14 is a diagram showing a document storage structure, according to an embodiment of the present invention, that has a hierarchical tree structure of upper and lower levels;
FIG. 15 is a diagram showing an example of the individual vectors generated for each element of a learning object.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Before doing so, note that the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings. They should instead be interpreted with meanings and concepts consistent with the technical idea of the present invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way.

Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiments of the present invention and do not represent all of its technical ideas. It should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing this application.

FIG. 1 is a diagram showing the document centralization system (100, 500, 1000) according to an embodiment of the present invention and the overall configuration related to it.

As described above, the document centralization system (100) of the present invention is a system that implements cloud-based document centralization and is communicatively connected to one or more user terminals (200-1, 200-n) through a communication network (50).

When a user (such as a company employee) creates, modifies, or edits something through a user terminal (200-1, 200-n), the result (hereinafter referred to as a 'document' or 'document object') is stored and managed in the document centralization system (100, 500, 1000) of the present invention.

Although this may be set in various ways according to company policy, security-related protocols, and the like, the user terminal (200) basically need not have business software installed, and the result data produced by such software need not be stored on it other than temporarily (volatile storage). The user terminal (200) may therefore be configured to serve as an interfacing means in relation to the document centralization system (100, 500, 1000) of the present invention.

The document centralization system (100, 500, 1000) of the present invention may be communicatively connected to the user terminal (200) through a restricted, non-open network such as a company intranet and, depending on the embodiment, through a web server (300) with reinforced authentication and security.

If a method of authenticating the user's credentials with a strengthened technique is interposed, such as exchange and matching of a unique key or authentication by OTP (One Time Password), the document centralization system (100, 500, 1000) of the present invention may also be communicatively connected directly to the user's mobile terminal (400) or a PC-based terminal.

The document centralization system (100, 500, 1000) of the present invention provides a space for storing the documents that individual users write, edit, or modify. It may be configured to search for and provide the data, information, and documents that match a user's request from among the various data stored in the server space, and to provide various additional functions such as automatically recommending similar documents.

The document centralization system (100, 500, 1000) of the present invention is configured to perform parsing on newly created documents and to store each document in a logical space, such as a directory, that may have a hierarchical or tree structure.

In particular, to enable diverse use of the documents stored on the server, the document centralization system (100, 500, 1000) of the present invention is configured to perform classification processing on the plurality of documents subject to document centralization (hereinafter referred to as 'target documents') and to categorize them into a plurality of categories.

The document centralization system (100, 500, 1000) of the present invention is configured to use per-user log information to continuously generate and update history information that includes identification information of the document or document object that was used (viewed, written, read, modified, edited, etc.), information on the user who used the document object, or information on the time (date, time of day, etc.) at which the document was used.
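
As a rough illustration of what such history information could look like as a data structure, the sketch below keeps one record per use of a document object. The class names `HistoryRecord` and `HistoryStore`, the field names, and the in-memory list are hypothetical; the text does not describe a storage format.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class HistoryRecord:
    document_id: str   # identification information of the document object
    user_id: str       # who used it
    action: str        # e.g. "read", "write", "edit"
    used_at: datetime  # when it was used


class HistoryStore:
    """Append-only store that is updated each time a log entry arrives."""

    def __init__(self) -> None:
        self._records: List[HistoryRecord] = []

    def record(self, document_id: str, user_id: str, action: str,
               used_at: datetime) -> None:
        self._records.append(HistoryRecord(document_id, user_id, action, used_at))

    def used_by(self, user_id: str) -> List[str]:
        """Document ids used by one user, most recently used first, no repeats."""
        mine = sorted((r for r in self._records if r.user_id == user_id),
                      key=lambda r: r.used_at, reverse=True)
        seen, ordered = set(), []
        for r in mine:
            if r.document_id not in seen:
                seen.add(r.document_id)
                ordered.append(r.document_id)
        return ordered
```

A lookup such as `used_by` is the kind of query a per-user recommendation list could start from, since it narrows the document objects to those a given requester has actually used.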

The document centralization system (100, 500, 1000) of the present invention is also configured to generate and provide customized recommendation list information specific to each individual user by processing and computing the generated history information based on various parameters, settings, and the like.

In addition, the document centralization system (100, 500, 1000) of the present invention may be configured to select one or more similar documents that resemble a subject document, which is the document the user is currently using or a document specified by the user, and to provide the selected similar documents, or list information about them, through an interface such as a GUI.

Hereinafter, each embodiment of the processing and configuration unique to the present invention will be described in detail with reference to the accompanying drawings.

Meanwhile, in the following description, the document centralization system (100, 500, 1000) of the present invention is referred to by different terms or names depending on the embodiment. This difference in wording is only meant to reflect and emphasize the main function and technical idea of each embodiment; the names refer to substantially the same document centralization system (100, 500, 1000).

As shown in FIG. 2, the user-oriented document classification system using dual application of classification schemes (hereinafter referred to as the 'document classification system') (100), according to a first embodiment implementing the document centralization system of the present invention, may include a first input unit (110), a vector generation unit (120), a representative vector setting unit (130), a target vector generation unit (140), a data processing unit (145), and the like.

Before the detailed description, note that the document centralization system (100, 500, 1000) of the present invention may be implemented in the form of a server that is accessible online. The document centralization system (100, 500, 1000) and each component making up its detailed configuration, as shown in FIG. 2, FIG. 3, FIG. 7, FIG. 8, FIG. 11, and so on, should therefore be understood as logically distinguished components rather than physically distinguished ones.

That is, each component is a logical component for realizing the technical idea of the present invention. Accordingly, even if the components are integrated or separated, they should be construed as within the scope of the present invention as long as the functions performed by the logical configuration can be realized. Likewise, any component performing the same or a similar function should be construed as within the scope of the present invention regardless of whether its name matches.

In the category setting unit (105) of the present invention, information on the N categories into which documents will be classified is set (S400, see FIG. 4). Additionally, information on the administrator who entered that information, the time at which it was entered, and the like may be stored.

As described below, the N categories constitute the category information (for example, planning, finance, research, personnel) that serves as the basis for classifying the target documents subject to document centralization. They are therefore preferably set by a person in charge at or above a reference rank who is familiar with the company's overall business, or by a plurality of administrators. The categories are preferably configured so that they can be set variably to reflect various circumstances, but so that the information on the N categories cannot be changed without a procedure such as authentication. Because a plurality of categories is set, N is a natural number of 2 or more.

The first input unit (110) of the present invention receives, for each of the N categories, sample documents, which are a plurality of documents representing that category (S410).

A sample document is a document entered by a system administrator or the like that clearly belongs to the corresponding category, and it serves as the raw data on which artificial-intelligence-based learning or modeling is based.

To improve the precision of modeling, each category preferably has a large number of sample documents. For computational efficiency, the same number of sample documents may be entered for each category, although the invention is of course not limited to this.

When a plurality of sample documents has been entered for each category in this way, the vector generation unit (120) of the present invention uses an artificial-intelligence-based embedding model to generate, for each sample document, a multidimensional numeric embedding vector representing the content of that sample document (S420).

The scheme, representation, and rules that make up a vector differ slightly from one vector-generating model to another. Basically, however, an embedding vector (vector embedding) is a variable vector used in machine learning, deep learning, and the like, in which the integers produced by integer encoding are expressed as N-dimensional numbers. Because the number in each dimension is constructed to carry characteristics of the corresponding word, words with similar meanings are represented by similar numeric values.

Integer encoding is one of the techniques for converting natural language into a numeric form that a machine can process, and refers to processing that maps each word to a unique integer. Starspace, fastText, word2vec, GloVe, ELMo, BERT, and the like are representative examples of embedding models.
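
The word-to-integer mapping described above can be illustrated with a minimal sketch. The whitespace tokenizer, the lowercasing, and the reserved index 0 for unknown words are assumptions made for the example; the function names `build_vocabulary` and `encode` are hypothetical and do not correspond to any of the embedding models named in the text.

```python
from typing import Dict, Iterable, List


def build_vocabulary(texts: Iterable[str]) -> Dict[str, int]:
    """Assign each distinct word a unique integer, starting at 1."""
    vocabulary: Dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            # Index 0 is reserved for words that were never seen.
            vocabulary.setdefault(word, len(vocabulary) + 1)
    return vocabulary


def encode(text: str, vocabulary: Dict[str, int]) -> List[int]:
    """Replace each word by its integer; unknown words become 0."""
    return [vocabulary.get(word, 0) for word in text.lower().split()]
```

An embedding model would then map each of these integers to a dense multidimensional vector, which is the step that gives similar words similar numeric values.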

If there are 8 categories and 7 sample documents are entered for each category, the vector generation unit (120) of the present invention generates 56 embedding vectors (S420).

Once an embedding vector has been generated for each sample document, the representative vector setting unit (130) of the present invention performs a statistical operation on the embedding vectors of the sample documents belonging to the same category (7 in the example above) and sets the result as the representative vector of that category (S430).

A typical statistical operation is averaging. Depending on the embodiment, it may of course also be a weighted operation that reflects the distance matrix of the embedding vectors generated from the sample documents, their distribution in terms of dispersion or concentration, and the like.
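
For the averaging case, the representative vector of a category is simply the element-wise mean of its sample embedding vectors. The sketch below assumes the embedding vectors are already available as equal-length lists of floats; the names `mean_vector` and `representative_vectors` are hypothetical.

```python
from typing import Dict, List


def mean_vector(vectors: List[List[float]]) -> List[float]:
    """Element-wise average of equally sized vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def representative_vectors(
        samples: Dict[str, List[List[float]]]) -> Dict[str, List[float]]:
    """One representative vector per category, from its sample embeddings."""
    return {category: mean_vector(vectors)
            for category, vectors in samples.items()}
```

A weighted variant would replace `mean_vector` with a weighted sum while leaving the per-category loop unchanged.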

The target vector generation unit (140) of the present invention generates an embedding vector for each of the m target documents (m is a natural number of 2 or more) stored in the document DB unit (103), that is, the m documents or document objects subject to document centralization (S440). In the following description, to distinguish it from the embedding vectors generated for sample documents, an embedding vector generated for a target document (a document subject to classification processing) is referred to as a 'target vector'.

When the target vector of each of the m target documents has been generated, the data processing unit (145) of the present invention performs similarity processing between the target vector of each target document and each of the N categories (S450), and uses the result of this similarity processing to select one or more affiliated categories to which each of the m target documents belongs (S480).

Specifically, the data processing unit (145) of the present invention may compare the representative vector of each of the N categories with a given target vector (S450) and select the P categories whose similarity is equal to or greater than a first reference value (S460) as the affiliated categories of the corresponding target document.

Because these are the categories related to a particular target document, P can logically range from 1 to N. The first reference value, which is the criterion for deciding whether a document belongs to a category, can of course be adjusted variably to reflect various factors and parameters.

In this case, the data processing unit (145) of the present invention is more preferably configured to assign ranking information, which indicates which categories a given target document matches more closely according to the similarity values produced by similarity processing, to the attribute information, tag information, or the like of that target document (S480).

By selecting one or more categories for a target document and attaching ranking information to the document's attribute information in this way, later processes such as user recommendation and search can be performed more efficiently. Furthermore, through direct confirmation by users and reflection of their feedback, the result can serve as the basis for more precise category settings.

Hereinafter, the processing described above is collectively referred to as 'classification processing': the P categories whose similarity is equal to or greater than the first reference value are selected as the affiliated categories of each of the m target documents through comparison between the representative vector of each of the N categories and the target vector, and ranking information is assigned to the m target documents.
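
Classification processing as summarized above can be sketched as a threshold-and-rank step over the category representative vectors. The text does not name a similarity measure, so cosine similarity is an assumption here, and the names `cosine` and `classify` and the tuple layout of the result are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(target_vector: Sequence[float],
             representatives: Dict[str, Sequence[float]],
             first_reference_value: float) -> List[Tuple[str, int, float]]:
    """Return (category, rank, similarity) for every category at or above
    the first reference value, best match first."""
    scored = [(name, cosine(target_vector, rep))
              for name, rep in representatives.items()]
    kept = sorted((item for item in scored if item[1] >= first_reference_value),
                  key=lambda item: item[1], reverse=True)
    return [(name, rank, score)
            for rank, (name, score) in enumerate(kept, start=1)]
```

An empty result corresponds to the case in which no category reaches the first reference value, which is the situation the clustering processing is meant to handle.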

The following describes in detail, with reference to FIG. 5 and other drawings, the clustering processing (S470) of the present invention. It is performed when no category has a similarity equal to or greater than the first reference value (S460), that is, when there are target documents that classification processing has not assigned to any category.

Even after the classification processing described above, some of the m target documents may not belong to any of the N categories. Increasing the type and number of categories set in step S400 in this situation could lead to repeated, unproductive processing cycles, so the present invention instead performs clustering processing only on the remaining target documents left out of classification processing.

Specifically, the clustering unit (150) of the present invention selects, from among the m target documents, the e target documents for which no category has a similarity equal to or greater than the first reference value (S500).

For example, if classification processing assigns one or more categories to 700 of a total of 1,000 (m) target documents, the remaining 300 target documents (e = 1,000 - 700) become the subject of clustering processing. In this respect, e is logically a natural number from 1 to m.

The clustering unit (150) of the present invention applies a clustering algorithm to the e target documents to generate k groups (k is a natural number of 1 or more) (S510).

A representative example of the clustering algorithm is an unsupervised learning technique such as K-means. Hierarchical clustering, the maximum distance algorithm, the ISODATA algorithm, the Robbins-Monro method, and the like may also be applied as the clustering algorithm.

When the k groups have been generated, the second representative vector setting unit (160) of the present invention performs a statistical operation on the target vectors (embedding vectors) of the target documents belonging to each of the k groups and sets the result as the representative vector of each of the k groups (S520). As described above for the representative vector setting unit (130), the statistical operation may be averaging or the like.
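
The grouping step and the per-group representative vectors can be sketched together with a small K-means loop, since the centroid of each group is already the mean of its members. This is a bare-bones illustration, not a production clustering routine: taking the first k vectors as initial centroids and running a fixed number of iterations are simplifying assumptions, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def squared_distance(a: Sequence[float], b: Sequence[float]) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors: List[Sequence[float]]) -> List[float]:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def k_means(vectors: List[Sequence[float]], k: int,
            iterations: int = 20) -> Tuple[List[int], List[List[float]]]:
    """Group vectors into k clusters; return labels and group centroids."""
    # Assumption: the first k vectors serve as the initial centroids.
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its nearest centroid.
        labels = [min(range(k), key=lambda g: squared_distance(v, centroids[g]))
                  for v in vectors]
        # Update step: each centroid becomes the mean of its members,
        # which is also the group's representative vector under averaging.
        for g in range(k):
            members = [v for v, label in zip(vectors, labels) if label == g]
            if members:
                centroids[g] = mean_vector(members)
    return labels, centroids
```

Here `vectors` would hold the target vectors of the e unclassified target documents, and the returned centroids play the role of the k group representative vectors.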

When the representative vector of each of the k groups has been set in this way, the second data processing unit (165) of the present invention performs similarity comparison processing between the k representative vectors and the target vector of each of the e target documents (S530).

Using the result of the similarity comparison processing, the second data processing unit (165) of the present invention selects the S groups (S is a natural number from 1 to k) whose similarity is equal to or greater than a second reference value as the affiliated groups of each of the e target documents, and assigns to each of the e target documents ranking information according to the magnitude of its similarity to each of the S groups (S550).

Because a specific target document belonging to a specific group among the k groups was placed in that group by the clustering algorithm, in the similarity comparison processing (the similarity between the group's representative vector and the document's own target vector) its similarity to the representative vector of its own group may be the highest.

In this regard, because the present invention performs similarity comparison processing between the representative vector of each of the k groups and the target vector of a specific target document, other groups that are highly related to that document can additionally be identified, which further improves the precision of category classification.

In addition, because the ranking information expresses the relevance to a plurality of groups (categories) as numerical information or the like, it provides a basis for diverse subsequent application processing, as described above.

Meanwhile, a clustering technique can generally classify a plurality of objects into a plurality of groups whose members share properties or characteristics, but unless the objects belonging to a group are inspected directly, the user cannot intuitively and readily recognize the characteristics of that group.

To resolve this conventional problem, the present invention may further include a text extraction unit 107 and a naming unit 170.

The text extraction unit 107 of the present invention applies a morpheme parsing algorithm to the file name information of the target documents belonging to each of the k groups and extracts the text corresponding to nouns (S550).

Once the text corresponding to nouns has been extracted, the naming unit 170 of the present invention selects a category name for each of the k groups using the central texts, which are the texts with relatively high frequency among the extracted text (S580).

Specifically, when the selection unit 171, a component of the naming unit 170, selects a plurality of central texts based on frequency of use (S560), the calculation processing unit 173 computes, for each of the plurality of central texts, total count information, which is the number of times the central text is used in the target documents of the corresponding group, and duplicate count information, which is the number of times the central text is used repeatedly within individual sentences of the same target document (S570).
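The two counts can be sketched as below. The sketch assumes that each target document has already been split into sentences and noun tokens (standing in for the morpheme analysis), and it interprets the duplicate count as the number of occurrences beyond the first within a single sentence; that interpretation, the function name, and the sample data are assumptions made for illustration.

```python
from collections import Counter


def count_central_text(documents, central_texts):
    """Return (total counts, duplicate counts) per central text.

    documents: list of documents; each document is a list of sentences;
    each sentence is a list of noun tokens (pre-tokenized, an assumption).
    """
    total = Counter()
    duplicate = Counter()
    for sentences in documents:
        for tokens in sentences:
            per_sentence = Counter(t for t in tokens if t in central_texts)
            for text, n in per_sentence.items():
                total[text] += n          # every occurrence counts toward the total
                duplicate[text] += n - 1  # repeats within the same sentence only
    return total, duplicate


# Hypothetical group of two target documents.
docs = [
    [["contract", "draft", "contract"], ["budget"]],
    [["contract"], ["budget", "budget", "budget"]],
]
total, duplicate = count_central_text(docs, {"contract", "budget"})
```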

The naming unit 170 of the present invention applies the total count information and the duplicate count information of each central text in combination and selects one of the central texts as the category name that can represent the characteristics of the corresponding group (S580). Depending on the embodiment, a central text of the next rank or the like may also be selected as a reserve category name.

In this way, by selecting the central texts from the file names that users themselves chose, the present invention first reflects the subjective purpose or intent of the user. Furthermore, the category name can be selected by comprehensively considering, through morpheme analysis, data analysis, and the like, how frequently these central texts are actually used in the target documents and how often they are repeated within the same sentence of a target document, so that the substance of the document content is reflected as well.

In this case, the naming selection unit 175 of the present invention may be configured to select one of the plurality of central texts as the category name of the corresponding group using the magnitude of the result of Equation 1 below.

In the above equation, A is the total count information, that is, the number of times each of the plurality of central texts is used in the target documents of the corresponding group; B is the duplicate count information, that is, the number of times each of the plurality of central texts is used repeatedly within individual sentences of the same target document belonging to the corresponding group; the constant a is a real number of 1 or more; and the constant b is a real number from 1 to a.

By reflecting the total count information as a value for algebraic processing and the duplicate count information as a value for exponential processing in this way, the total count information and the duplicate count information are considered together, while a larger number of repetitions within the same sentence produces a more heavily weighted result, so that the importance of the central text in the actual target documents is reflected more faithfully.

In addition, by setting the constant a, which weights the total count information (A), and the constant b, which weights the duplicate count information (B), so that they are tied to each other, either one is kept from being reflected excessively relative to the other, and the two parameters are reflected in a more balanced manner.

Depending on the embodiment, the document classification system 100 of the present invention may further include a monitoring unit 180, a processing control unit 190, and a feedback processing unit 195, as shown in FIG. 3.

The monitoring unit 180 of the present invention may be configured to monitor, using a morpheme analysis engine or the like, whether the content of a target document is modified, focusing on nouns.

When the content of a specific target document is modified in this way, the processing control unit 190 of the present invention controls the target vector generation unit 140 to generate a second target vector for the modified target document. The second target vector is the embedding vector generated for the target document after modification.

Once the second target vector has been generated, the processing control unit 190 of the present invention controls the data processing unit 145 to perform the 'classification processing' described above using the second target vector, and further controls it to generate time-series ranking information for the modified target document, indexed by the times at which the target document was modified.

When time-series ranking information is assigned to a specific target document as attribute information or the like, a user or administrator can directly or indirectly check how the content of that target document has changed, which can increase the usefulness of other subsequent application processing.

In addition, when the change in the time-series ranking information is equal to or greater than a reference value, the feedback processing unit 195 of the present invention causes the related information to be shared with the original author of the document, the person who modified it, an administrator, or the like. Misuse or erroneous modification of the document can thus be monitored continuously, so that centralized documents are managed more effectively.

FIG. 6 is a flowchart illustrating the category classification process performed on a new document.

When a new document is input through the first input unit 110 or the like, the target vector generation unit 140 of the present invention generates an embedding vector for the input new document (S600). Depending on the embodiment, an embedding vector corresponding to the new document may of course be input together with the new document, for example through a preprocessing step.

When the embedding vector of the new document has been identified or generated, the data processing unit 145 of the present invention reads (accesses and reads) the representative vector of each of the N categories used for document centralization (S610) and, as described above, performs similarity comparison processing between the embedding vector of the new document and the representative vector of each of the N categories (S620).

If one or more categories have a similarity equal to or greater than a third reference value, the data processing unit 145 of the present invention assigns ranking information according to the magnitude of the similarity and sets the affiliated categories of the new document (S640).

The third reference value, or the fourth reference value described below, may be set to the same value as the first or second reference value described above, but may of course be set to different values depending on the embodiment so that relevance to the categories is reflected in various ways.

If the similarity comparison processing (S630) finds no category with a similarity equal to or greater than the third reference value, the data processing unit 145 of the present invention loads the representative vectors of the k groups used in the clustering processing (S650) and performs processing that compares the similarity between the representative vector of each of the k groups and the target vector (embedding vector) of the new document (S660).

If the similarity comparison processing finds categories (groups) with a similarity equal to or greater than the fourth reference value (S670), ranking information according to the magnitude of the similarity is assigned to each such category (group), as described above, and the affiliated categories (groups) of the new document are set (S640).

If the similarity comparison processing (S670) finds no category to which the new document can belong, the new document has content that cannot be classified into any of the existing N+k categories, so reserve processing, such as storing the new document in a separate storage space for subsequent handling, is performed (S680).
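The two-stage flow for a new document (N categories first, then the k groups, then the separate storage space) can be sketched as follows. Cosine similarity, the threshold values, and all names here are assumptions chosen for illustration only.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity; an assumed choice of similarity measure."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def matches(vector, representatives, threshold):
    """Names and similarities at or above the threshold, best match first."""
    hits = [(name, cosine_similarity(vector, rep))
            for name, rep in representatives.items()]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)


def classify_new_document(vector, category_reps, group_reps,
                          third_threshold, fourth_threshold):
    """Try the N categories, then the k groups, else mark the document pending."""
    hits = matches(vector, category_reps, third_threshold)
    if hits:
        return "category", hits
    hits = matches(vector, group_reps, fourth_threshold)
    if hits:
        return "group", hits
    return "pending", []  # to be kept in the separate storage space


# Hypothetical representative vectors.
category_reps = {"finance": [1.0, 0.0]}
group_reps = {"g0": [0.0, 1.0]}
```

Checking the N categories first means the k group vectors are loaded only for documents that remain unclassified after the first stage.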

When the number of new documents stored in the separate storage space reaches a predetermined reference number, the clustering processing described above may be performed automatically so that grouping and naming are carried out for those new documents alone.

In addition, depending on the embodiment, the system may be configured to notify an administrator terminal or the like when a new document is stored in the separate storage space, so that the administrator is prompted to handle it directly.

The processing of the present invention described above may of course be configured to be applied cyclically as long as a preset termination condition is not satisfied (S690).

It is of course also possible, when classifying a new document, to load all N+k representative vectors and perform similarity comparison processing between them and the target vector (embedding vector) of the new document.

However, because the N categories will generally be set on the basis of the overall content of in-house documents, it may be preferable in terms of computational efficiency to perform the classification processing on the N categories first and to perform the processing using the representative vectors of the k groups only when a target document (new document) remains unclassified in that step.

Hereinafter, with reference to the accompanying drawings, the specific configuration and processing of a customized document recommendation system 500 using dynamic change of time-series pattern information (hereinafter referred to as the 'document recommendation system'), according to a second embodiment implementing the document centralization system of the present invention, will be described in detail.

FIG. 7 is a block diagram showing the detailed configuration of the document recommendation system 500 according to the second embodiment of the present invention, and FIG. 9 is a flowchart showing the document recommendation processing according to the second embodiment of the present invention.

As shown in FIG. 7, the document recommendation system 500 according to the second embodiment of the present invention may include a document object DB unit 510, a history information generation unit 520, a request information input unit 530, a first selection unit 540, a recommended document selection unit 550, and a list providing unit 560.

The document object DB unit 510 of the second embodiment corresponds to the document DB unit 103 described in the preceding embodiment and stores a plurality of document objects, which are the documents collected through document centralization (S3000, see FIG. 9).

When a document object is used by a user (S3100), the history information generation unit 520 of the present invention stores, in association with that document object, history information that includes information on the user who used the document object and usage time information (time, date, etc.) indicating when it was used (S3200).

Although this may vary with factors such as security level, handling authorization criteria, and operating policy, if 100,000 document objects are stored in the document object DB unit 510, the history information can logically be generated for each of those 100,000 document objects.

Because a user connects to the server on which the document recommendation system 500 is implemented by logging in with an ID or through a security or authentication procedure, the user's identification information (ID information, etc.) or a unique identifier of the specific user terminal (IP address, MAC address, etc.) can be identified in that process. The history information generation unit 520 of the present invention generates the history information using this information.

The history information may of course be continuously generated, updated, and stored through cyclic execution of an algorithm implemented inside the system, time stamps, and the like.

The history information is therefore time-series information on how each user has used document objects, and it can serve as the raw data for analyzing each user's patterns, which change dynamically over time.

The request information input unit 530 of the present invention receives a request signal from a user (S3300). The request signal includes requester information identifying who is requesting the information and request period information, which is one or more of a plurality of different set periods such as by day of the week, by week, or by month.

The requester information is information on a requester who wishes to receive periodic reports, the requester being a member (employee) belonging to the organization, company, or the like in which the document recommendation system 500 is operated. As described above, the requester information may of course be identified through various methods, including login or authentication procedures.

When the request signal is input, the first selection unit 540 of the present invention uses both the request signal and the history information stored by the history information generation unit 520 to select the used document objects, which are the document objects used by the user (requester) corresponding to the requester information (S3400).

Here, 'use' of a document object means any of the various work activities involving that document object, such as reading, writing (creating), amending, and editing it.
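The history information and the selection of used document objects can be sketched with a simple record type. The field names, the `action` labels, and the sample records are hypothetical; the specification does not prescribe a storage format.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class HistoryRecord:
    """One use of a document object (field names are illustrative)."""
    document_id: str
    user_id: str
    used_at: datetime
    action: str  # e.g. "read", "write", "amend", "edit"


def used_document_objects(history, requester_id):
    """Map each document the requester used to its sorted usage times."""
    used = {}
    for record in history:
        if record.user_id == requester_id:
            used.setdefault(record.document_id, []).append(record.used_at)
    return {doc: sorted(times) for doc, times in used.items()}


# Hypothetical history information.
history = [
    HistoryRecord("doc-1", "alice", datetime(2021, 10, 7, 9, 0), "read"),
    HistoryRecord("doc-2", "bob", datetime(2021, 10, 7, 10, 0), "edit"),
    HistoryRecord("doc-1", "alice", datetime(2021, 9, 30, 9, 30), "edit"),
]
used = used_document_objects(history, "alice")
```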

Once the used document objects, that is, the document objects distinguished by the requester as a person, have been selected from among the document objects, the recommended document selection unit 550 of the present invention selects, from among the used document objects, recommended documents that meet a specific requirement (S3500).

As one example, the recommended document selection unit 550 of the present invention may select the recommended documents using one or more of the used document objects whose usage time information is relatively close to the reference point of the request period information.

Once the recommended documents have been selected, the list providing unit 560 of the present invention generates the recommended documents themselves or recommendation list information, which is list information corresponding to the recommended documents, and provides it to the user terminal corresponding to the requester information or to an e-mail address, mobile terminal, or the like designated by the requester (S3600).

The processing according to the second embodiment of the present invention described above may of course be applied repeatedly or cyclically (S3700) unless a specific termination condition is satisfied.

FIG. 8 is a block diagram showing the detailed configuration of the recommended document selection unit 550 of FIG. 7, and FIG. 10 is a flowchart showing the specific processing by which the recommended document selection unit 550 produces recommended documents.

As shown in FIG. 8, the recommended document selection unit 550 of the present invention may include a period weight processing unit 551, a continuity weight processing unit 552, a proximity weight processing unit 553, a text selection unit 554, a frequency weight processing unit 555, a tracking unit 556, an attention level processing unit 557, and a selection processing unit 558.

As described above, when a request signal is input that includes request period information, which may consist of information such as by day of the week, by week, by month, by quarter, or by year, the first selection unit 540 of the present invention identifies the request period information (S4000) and retrieves and checks the history information for a lookup period that is set differently depending on the request period information (S4100).

As one implementation example, when the requester asks to receive recommendation list information based on a day of the week (Monday), the history information containing the corresponding usage time information (Monday) is retrieved over a lookup period that is an integer multiple of one week (14 days, 21 days, etc.). When the request period information is weekly, the lookup period may be 30 days, taking four weeks as the basis, or 90 days, an integer multiple thereof.

Depending on the embodiment, a plurality of request period information items may of course be set, in which case the document recommendation system 500 of the present invention provides a specific user with one or more of day-of-week recommendation list information, weekly recommendation list information, and monthly recommendation list information.

When the history information for a specific lookup period has been retrieved and checked in this way, the period weight processing unit 551 assigns first rank information to each used document object, using the history information to assign relatively higher first rank information the smaller the gap between the usage time information of the used document object (the document object used by the requester) and the reference point of the request period information (day of the week, week, month, quarter, etc.) (S4200).

When the request period information is set by day of the week and the reference day is Thursday, the highest first rank information is assigned to those used document objects, among the ones used during the lookup period before the reference day (for example, 21 days), whose history information contains usage time information falling on a Thursday, and the next-highest first rank information is assigned to used document objects whose usage time information falls on a Wednesday or Friday.

Correspondingly, the lowest first rank information is assigned to used document objects whose usage time information falls on a Monday or Sunday.

As an implementation example of the algorithm, a number is assigned to each day of the week (for example, Monday: 1, Tuesday: 2, Wednesday: 3, Thursday: 4, Friday: 5, Saturday: 6, Sunday: 7), the absolute difference between the number of the reference day (Thursday) and the number of the day in the usage time information is computed, and if that result exceeds 3, the result is subtracted from 7 to obtain the final value.
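The weekday-gap computation described above translates directly into a few lines of code. The function and constant names are illustrative; the numbering (Monday: 1 through Sunday: 7) and the wrap-around rule follow the text.

```python
THURSDAY = 4  # Monday: 1, Tuesday: 2, ..., Sunday: 7, as numbered in the text


def weekday_gap(reference_day, used_day):
    """Circular distance in days between two weekdays numbered 1 to 7."""
    diff = abs(reference_day - used_day)
    # A difference above 3 is shorter when measured around the week boundary.
    return 7 - diff if diff > 3 else diff


# Gaps from Thursday to each day, Monday through Sunday.
gaps = [weekday_gap(THURSDAY, day) for day in range(1, 8)]
```

With Thursday as the reference day, Monday and Sunday both end up at the maximum gap of 3, which matches the statement that documents used on those days receive the lowest first rank information.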

Once a numerical value has been obtained for each day of the week in this way, a smaller value (the gap between the reference day and the day in question) should receive a higher score and a larger value a lower score. The first rank information may therefore be computed using a first function that reflects this inverse relationship, such as a hyperbolic tangent function.

Depending on the embodiment, processing such as normalizing the first function so that its range (the first rank information) falls within a specific interval (for example, 0 to 1.5) may of course be applied.
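One possible shape for such a first function is sketched below. The specification does not give the exact formula, so the form `max_score * (1 - tanh(steepness * gap))`, the `steepness` parameter, and the function name are all assumptions; the sketch only shows a tanh-based function that decreases with the gap and stays within 0 to 1.5.

```python
import math


def first_rank(gap, max_score=1.5, steepness=0.5):
    """Assumed first function: decreasing in the gap, bounded by [0, max_score]."""
    if gap < 0:
        raise ValueError("gap must be non-negative")
    # tanh grows from 0 toward 1, so (1 - tanh) falls from 1 toward 0.
    return max_score * (1.0 - math.tanh(steepness * gap))


# Scores for weekday gaps 0 (same day) through 3 (farthest day).
scores = [first_rank(gap) for gap in range(4)]
```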

Once the first rank information has been computed for each used document object, the selection processing unit 558 of the present invention selects one or more recommended documents from among the used document objects in descending order of first rank information (S4800). When the recommended documents have been selected, the list providing unit 560 provides the requester with the selected recommended documents or recommendation list information listing them, as described above (S3600).

When the request period information is set by week, numbers 1 to 4 are similarly assigned according to the order of the week within the month (week 1: 1, week 2: 2, week 3: 3, week 4 (week 5): 4), and the recommended documents are selected using the week of the current time, the week of the usage time information, and a function that takes these as independent variables.

Because documents are recommended on the basis of document objects with periodicity, namely those the requester used on the corresponding day of the week, week, month, or the like, the requester (user) can greatly shorten the time spent searching for highly relevant documents for that day, week, or month, and can also be guided to check, without omission, the document objects (documents) that need to be reviewed or written in that period.

The continuity weight processing unit 552 of the present invention assigns second rank information to each used document object, assigning relatively higher second rank information the longer the period (time) of continuous use or the greater the number of uses (S4300).

For the second rank information, a larger independent variable (period of continuous use, number of uses) should receive a higher score, so the second function that implements the second rank information is preferably a function, such as a log-based function, that expresses a proportional relationship while also having a converging characteristic.
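A log-based second function might look like the sketch below. The choice of `log(1 + n)` and the function name are assumptions made for illustration; the point shown is only that the score rises with the number of uses while each additional use adds less than the previous one.

```python
import math


def second_rank(use_count):
    """Assumed second function: log-based, increasing with diminishing gains."""
    if use_count < 0:
        raise ValueError("use_count must be non-negative")
    return math.log1p(use_count)  # log(1 + n), so zero uses score 0.0


# The step from 1 to 2 uses adds more than the step from 10 to 11 uses.
early_gain = second_rank(2) - second_rank(1)
late_gain = second_rank(11) - second_rank(10)
```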

In this case, the selection processing unit 558 of the present invention selects the recommended document based on at least one of the first rank information and the second rank information and, preferably, in order to reflect the user pattern more multi-dimensionally, may be configured to select the recommended document based on the sum of the first rank information and the second rank information.

The proximity weight processing unit 553 of the present invention may be configured to extract the last use time information from the use time information included in the history information of a used document object and to assign relatively higher third rank information as that last use time information is closer to the current time point (S4400).

Since the third rank information should yield a lower score as the independent variable (the interval from the last use time information) becomes larger, and since it is preferable that relatively high weight be given to independent variables in the near region, a function such as the following, which reflects these properties, may be applied.

In the above formula, y is the third rank information, and x (a real number greater than or equal to 0) is the difference between the last use time information and the current time point.
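
The formula itself is not reproduced in this text. Purely to illustrate the stated properties (the score falls as x grows, and it falls fastest near x = 0), the sketch below uses y = 1 / (1 + x). This particular function and the name `third_rank` are assumptions and should not be read as the patent's formula.

```python
def third_rank(x: float) -> float:
    """Hypothetical stand-in: y = 1 / (1 + x), where x >= 0 is the time since last use."""
    if x < 0:
        raise ValueError("x must be greater than or equal to 0")
    return 1.0 / (1.0 + x)
```

With this stand-in, moving from x = 0 to x = 1 halves the score, while moving from x = 10 to x = 11 changes it only slightly, which matches the emphasis on the near region.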

In this case, the selection processing unit 558 of the present invention selects the recommended document based on at least one of the first to third rank information and, preferably, in order to reflect the user pattern more comprehensively, may be configured to select the recommended document based on the sum of the first to third rank information or an equivalent computed value.

The text selection unit 554 of the present invention performs morpheme parsing on the file name information of the used document objects to select a plurality of texts corresponding to nouns, and from these selects the top m main texts (where m is a natural number of 2 or more) by frequency of use.

Once the main texts have been selected in this way, the frequency weight processing unit 555 of the present invention assigns fourth rank information to each used document object, assigning relatively higher fourth rank information as more of the top m main texts are included in the file name of that used document object (S4500).
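
The two steps above (choosing the top m main texts, then counting how many of them each file name contains) can be sketched as follows. A real system would use a morphological analyzer to extract nouns; here a simple split on non-word characters and underscores stands in for that step, which is an assumption, as are the names `tokens`, `main_texts`, and `fourth_rank`.

```python
import re
from collections import Counter


def tokens(file_name: str) -> list:
    """Crude stand-in for morpheme parsing: split the stem on separators."""
    stem = file_name.rsplit(".", 1)[0]
    return [t.lower() for t in re.split(r"[\W_]+", stem) if t]


def main_texts(file_names: list, m: int) -> set:
    """Return the m tokens that appear in the most file names."""
    counts = Counter(t for name in file_names for t in set(tokens(name)))
    return {t for t, _ in counts.most_common(m)}


def fourth_rank(file_name: str, mains: set) -> int:
    """Number of main texts contained in this file name."""
    return len(set(tokens(file_name)) & mains)
```

For example, if "sales" and "report" are the two most frequent tokens across a user's file names, a file called `weekly_sales_report.pptx` receives a fourth rank of 2 and `memo.txt` receives 0.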

In this case, the selection processing unit 558 of the present invention selects the recommended document based on at least one of the first to fourth rank information and, preferably, in order to reflect the user pattern from more angles, may be configured to select the recommended document based on the sum of a plurality, preferably all, of the first to fourth rank information.
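
Selecting by the summed rank information can be sketched in a few lines. The unweighted sum follows the text; the tie-break by document identifier, the cut-off parameter `n`, and the name `select_recommended` are assumptions added for illustration.

```python
def select_recommended(ranks: dict, n: int) -> list:
    """Return the n document ids with the highest summed rank values.

    `ranks` maps a document id to a sequence of rank values
    (for example first to fourth rank information).
    """
    totals = {doc: sum(values) for doc, values in ranks.items()}
    # Sort by descending total; ties are broken by id for a stable result.
    return sorted(totals, key=lambda doc: (-totals[doc], doc))[:n]
```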

To implement a more preferred embodiment, the recommendation document selection unit 550 of the present invention may further include a tracking unit 556. The tracking unit 556 is configured to generate tracking information that includes at least one of modification count information, indicating how many times the content of a used document object has been modified, and modification ratio information, indicating the proportion of the used document object's entire content that has been modified (S4600).

Once the tracking information has been generated, the selection processing unit 558 of the present invention may select the recommended document based on the magnitude of the tracking information and, depending on the embodiment, may be configured to select the recommended document based on a value computed from the tracking information and at least one of the first to fourth rank information described above.

A used document object that has been modified many times or has a high modification ratio is likely to be a document on which work is in progress, so this configuration effectively reflects how relevant a document is to the requester's ongoing tasks.
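
One way to picture the tracking information is the sketch below. The text does not say how the modification ratio is measured; here it is approximated as one minus the similarity ratio from Python's standard `difflib.SequenceMatcher`, which is an assumption, as are the weighted-sum score and the names `TrackingInfo`, `modification_ratio`, and `tracking_score`.

```python
import difflib
from dataclasses import dataclass


@dataclass
class TrackingInfo:
    modification_count: int    # how many times the content was modified
    modification_ratio: float  # share of the content that changed, 0.0 to 1.0


def modification_ratio(original: str, current: str) -> float:
    """Approximate the changed share as 1 - similarity (assumed measure)."""
    similarity = difflib.SequenceMatcher(None, original, current).ratio()
    return 1.0 - similarity


def tracking_score(info: TrackingInfo, count_weight: float = 1.0,
                   ratio_weight: float = 1.0) -> float:
    """Hypothetical magnitude of the tracking information."""
    return (count_weight * info.modification_count
            + ratio_weight * info.modification_ratio)
```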

In addition, the attention processing unit 557 of the present invention assigns an attention index to each used document object, assigning a relatively higher attention index as more users other than the user corresponding to the requester information have modified the content of that used document object.

Once the attention index has been generated, the selection processing unit 558 of the present invention may select the recommended document based on the magnitude of the attention index, and may be configured to select the recommended document based on a value computed from the attention index and at least one of the first to fourth rank information or the tracking information described above.

A document object that is frequently used or modified by persons other than the information requester is likely to be a document prepared jointly by several people or team members, so there is a strong need to prompt the requester to check it periodically.
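
A minimal sketch of the attention index counts the distinct users, other than the requester, who modified a document. Using the raw count as the index, and the name `attention_index`, are assumptions; the text only requires that the index increase with that number of users.

```python
def attention_index(modifier_ids: list, requester_id: str) -> int:
    """Number of distinct users, excluding the requester, who modified the document."""
    return len({user for user in modifier_ids if user != requester_id})
```

Repeated edits by the same colleague count once, so the index reflects how many people are involved rather than how often the document was edited.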

As shown in FIG. 11, the user-profile-based similar document recommendation system 1000 (hereinafter the 'similar document recommendation system') according to the third embodiment, which implements the document centralization system of the present invention, may include a document DB unit 1100, an embedding vector generation unit 1200, a request input unit 1300, a similar document selection unit 1400, a main processing unit 1500, a first ranking information generation unit 1600, a profile generation unit 1700, a learning object generation unit 1800, and a learning processing unit 1900.

The similar document recommendation system 1000 according to the third embodiment of the present invention improves document utilization more effectively by providing the user, through a GUI environment or the like, with a plurality of documents that are similar to the document the user is currently using or a document the user has designated (hereinafter the 'target document'), and/or list information about them.

Like the document DB unit 103 of the first embodiment and the document object DB unit 510 of the second embodiment described above, the document DB unit 1100 of the present invention stores, in a storage space, document objects, which are documents collected through document centralization (S700, see FIG. 12).

As illustrated in FIG. 14, the document objects are linked to one another and are stored in respective storage spaces having a hierarchical tree structure of S upper and lower levels (S is a natural number of 2 or more). Inside the system, the location information indicating where a specific file is stored is kept in association with that file, for example in a form that links the hierarchical relationships of the corresponding series of tree levels.

FIG. 14 shows a DB structure having a hierarchical tree structure of four levels (S=4), but this is only an example, and a different number of upper and lower levels may of course be used.

In addition, as described below, in order to implement the vector function processing of the storage path information more effectively, the space (directory, archive, etc.) at the top level of the tree structure may be set to length 1 or the like, with the length increasing sequentially at each lower level, and the name of the storage space at each level may of course be stored as digital information that independently distinguishes and represents it.

As described above, the embedding vector generation unit 1200 of the present invention applies an artificial-intelligence-based embedding model to each document object stored in the document DB unit 1100 to generate a multidimensional numeric embedding vector representing the content of that document object (S710).

The embedding vector generation unit 1200 is preferably configured to regenerate the embedding vector of each document object whenever the content of a stored document object is modified or edited, or at a specific, configurable interval, and may of course be configured to generate an embedding vector automatically for a newly created document object.

The request input unit 1300 of the present invention receives from the user a request signal for similar document recommendation (S720).

The request signal includes personal information about the user who requested similar documents (the requester), ID information identifying the target document, and the like. When a user opens a specific document, the request signal may be generated automatically from the user's access information and the information on the currently opened document. Depending on the embodiment, it may also be input in various other ways, such as presenting an interface for entering request information in a pop-up window or the like and receiving a selection through it.

When the request signal is input, the similar document selection unit 1400 of the present invention performs first similarity processing, such as cosine similarity, between the embedding vector of the target document and the embedding vectors of the document objects stored in the document DB unit 1100 (S730), and selects k similar documents (k is a natural number of 2 or more) whose content is similar to that of the target document (S740).
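
The selection of the k most similar documents by cosine similarity can be sketched in plain Python as follows. The dictionary layout of the stored embeddings, the optional exclusion of the target document itself, and the names `cosine_similarity` and `top_k_similar` are assumptions made for this illustration.

```python
import math


def cosine_similarity(a: list, b: list) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(target_vec: list, embeddings: dict, k: int,
                  exclude_id: str = None) -> list:
    """Return the k (document id, similarity) pairs closest to the target."""
    scored = [(doc_id, cosine_similarity(target_vec, vec))
              for doc_id, vec in embeddings.items() if doc_id != exclude_id]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

The similarity value returned for each document can serve directly as a quantitative similarity score for ranking.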

In this case, the first ranking information generation unit 1600 of the present invention may generate similarity ranking information that expresses, as a quantitative value, the similarity between each of the k similar documents and the target document.

Once the k similar documents have been selected and similarity ranking information has been generated for each of them, the main processing unit 1500 of the present invention provides the requester with similar list information containing information on the k similar documents (S750). Depending on the embodiment, it may provide list information sorted according to the similarity ranking information, or may provide the rank or a similarity indicator derived from the similarity ranking information together with the list.

The processing for providing similar documents described above may of course be performed cyclically and continuously as long as a specific termination condition is not met (S760).

Hereinafter, an embodiment of the present invention that reflects the similarity of similar documents more substantively will be described in detail with reference to FIG. 13 and other drawings.

The profile generation unit 1700 of the present invention generates and stores profile information that includes storage path information (access information) indicating where a document object is stored in the storage space (the space having a hierarchical tree structure of S upper and lower levels), use time information indicating when the document object was used, user information on the user who used the document object, and affiliation information of that user (S800).

Some functions of the profile generation unit 1700 correspond to those of the history information generation unit 520 of the second embodiment. In addition, as described above, the profile information may of course be continuously generated, updated, and stored through cyclic processing of the algorithm implemented inside the system, time stamps, and the like.

The profile information is therefore time-series information on each user's use of document objects and can serve as the raw data for analyzing each user's patterns, which change dynamically over time.

The learning object generation unit 1800 of the present invention generates k learning objects, one for each of the k similar documents that have been judged, using the embedding vectors, to be similar in content to the target document (S810).

Each learning object consists of elements for the storage path information of the similar document, the first time information at which the request signal was input, personal information about the requester, and the requester's affiliation information (department, team, etc.). To compute time correspondence precisely, the first time information is preferably configured to include month, day-of-week, and time-of-day information.

Specifically, the learning object generation unit 1800 may include an individual vector generation unit 1810 and an object generation unit 1820. The individual vector generation unit 1810 performs encoding on each element, namely the storage path information of the k similar documents, the first time information at which the request signal was input, the information about the requester, and the requester's affiliation information, to generate a numeric individual vector representing that element (S811).

More preferably, as illustrated in FIG. 15, the learning object generation unit 1800 generates an independent individual vector for each of the storage path information of the similar document, the first time information at which the request signal was input, the information about the requester, and the requester's affiliation information. For the storage path information of the k similar documents, it may be configured to generate an independent individual vector for each of the S levels that form the hierarchical tree structure and make up that storage path information.
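
Before each level of a storage path can be encoded separately, the path has to be split into its S level elements. A minimal sketch, assuming POSIX-style paths, S = 4, and padding with an empty string for shallower paths (all assumptions, as is the name `path_levels`):

```python
from pathlib import PurePosixPath


def path_levels(storage_path: str, s: int = 4, pad: str = "") -> list:
    """Split the directory part of a path into exactly s level names."""
    parts = [p for p in PurePosixPath(storage_path).parent.parts if p != "/"]
    if len(parts) > s:
        raise ValueError("path is deeper than the configured number of levels")
    return parts + [pad] * (s - len(parts))
```

For example, `/sales/2021/q4/reports/summary.docx` yields `["sales", "2021", "q4", "reports"]`, and each of the four names can then be encoded as its own individual vector.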

Among the elements of the learning object, the time information and the person-related information are not the time information, user information, and user affiliation information recorded in the profile information of the similar document. They are the time at which the request signal was input, the requester information, and the requester's affiliation information.

When learning objects are generated in this way, a learning model can be used to produce, and make comprehensive use of, results such as the probability that profile information identical or similar to all nine (in this example) pieces of information contained in the learning object (that is, the information converted to vector form) exists, or a numerical measure of the probability or degree of similarity with which elements corresponding to each of zero to nine of those elements exist.

Once an individual vector has been generated for each element, the object generation unit 1820 of the present invention generates a multidimensional final vector by combining the individual vectors and generates the k learning objects using the generated multidimensional final vectors (S813).

For the learning object illustrated in FIG. 15, the multidimensional final vector is a 703-dimensional vector (10+50+100+200+12+7+24+100+200) formed by combining the nine individual vectors.
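
The 703-dimensional layout can be sketched by concatenating nine individual vectors. The mapping of the nine sizes to elements (four path levels, then month, day of week, and hour, then requester and affiliation) is inferred from the order of the sum and from S = 4, and is an assumption, as are the one-hot encoding of the time fields and the names `DIMS`, `one_hot`, `encode_time`, and `build_final_vector`.

```python
from datetime import datetime

# Assumed assignment of the nine sizes in 10+50+100+200+12+7+24+100+200.
DIMS = {
    "path_level_1": 10, "path_level_2": 50, "path_level_3": 100, "path_level_4": 200,
    "month": 12, "weekday": 7, "hour": 24,
    "requester": 100, "affiliation": 200,
}


def one_hot(index: int, size: int) -> list:
    """Vector of zeros with a single 1.0 at the given position."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def encode_time(ts: datetime) -> dict:
    """One-hot encode the month, day of week, and hour of the request time."""
    return {"month": one_hot(ts.month - 1, 12),
            "weekday": one_hot(ts.weekday(), 7),
            "hour": one_hot(ts.hour, 24)}


def build_final_vector(parts: dict) -> list:
    """Concatenate the nine individual vectors in the order given by DIMS."""
    final = []
    for name, size in DIMS.items():
        if len(parts[name]) != size:
            raise ValueError(f"{name} must have {size} dimensions")
        final.extend(parts[name])
    return final
```

The path, requester, and affiliation vectors would come from whatever encoder the system uses; only their sizes matter for the concatenation shown here.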

Once k learning objects have been generated for the k similar documents (S810), the learning processing unit 1900 of the present invention runs, on the k learning objects, a learning algorithm that uses the profile information generated for each document object (S520), and generates, for each of the k similar documents, weight information representing the probability that profile information corresponding to the relevant learning object exists among the profile information (S830).

As explained above, this weight information represents the probability that, among the profile information of the document objects collected through document centralization, there exists profile information containing one or more elements corresponding to the learning object, or a numerical measure of similarity to the learning object. It therefore indicates the probability that the requester will open (access, etc.) the similar document.

For example, the processing of the present invention can feed back, as a quantitative computed result, whether and to what degree the information requester's affiliation information (department, team, etc.) is related, so that the relevance of affiliation can serve as one criterion for recommending similar documents to the information requester.

That is, the learning processing unit 1900 of the present invention generates the weight information by applying, in combination, the number of corresponding elements and the degree of similarity of the corresponding elements among the elements for the storage path information of the k similar documents, the first time information at which the request signal was input, the information about the requester, and the requester's affiliation information.

Once the learning processing unit 1900 has generated the weight information, the main processing unit 1500 of the present invention generates final ranking information that comprehensively reflects the weight information and the similarity ranking information (the quantitative value of content similarity) (S840), and is configured to provide the requester with similar list information according to the final ranking information (S850).
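
The text does not specify how the weight information and the similarity ranking information are combined. As one possible illustration, the sketch below uses a weighted average controlled by a parameter `alpha`. The combination rule, `alpha`, and the name `final_ranking` are assumptions.

```python
def final_ranking(similarity: dict, weight: dict, alpha: float = 0.5) -> list:
    """Return (document id, score) pairs sorted by descending combined score.

    score = alpha * content similarity + (1 - alpha) * profile-based weight
    """
    scores = {doc: alpha * sim + (1.0 - alpha) * weight.get(doc, 0.0)
              for doc, sim in similarity.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)
```

With `alpha = 0.5`, a document that is slightly less similar in content but much more likely to be opened by the requester can move ahead of a closer textual match.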

Although the present invention has been described above with reference to limited embodiments and drawings, it is not limited to them, and those of ordinary skill in the art to which the present invention pertains may of course make various modifications and variations within the technical idea of the present invention and the scope of equivalents of the claims set forth below.

In the above description of the present invention, modifiers such as 'first' and 'second' are merely instrumental terms used to distinguish components from one another, and should not be interpreted as indicating any particular order, priority, or the like.

The drawings attached to illustrate the description of the present invention and its embodiments may be drawn in somewhat exaggerated form to emphasize or highlight the technical content of the present invention, but it should be understood as self-evident that, in view of the foregoing description and what is shown in the drawings, a person of ordinary skill in the art could apply various modifications.

The category classification method, document recommendation method, and similar document recommendation method of the present invention described above may be implemented as computer-readable code on a computer-readable recording medium. The computer-readable recording medium includes all kinds of recording devices in which computer-readable data is stored (CD-ROM, RAM, ROM, floppy disk, magnetic disk, hard disk, magneto-optical disk, etc.), and also includes servers for wired or wireless Internet transmission.

100: document classification system 200: user terminal
300: web server 400: mobile terminal
103: document DB unit 105: category setting unit
107: text extraction unit 110: first input unit
120: vector generation unit 130: representative vector setting unit
140: target vector generation unit 145: data processing unit
150: clustering unit 160: second representative vector setting unit
165: second data processing unit 170: naming unit
171: selection unit 173: calculation processing unit
175: naming selection unit 180: monitoring unit
190: processing control unit 195: feedback processing unit
500: document recommendation system 510: document object DB unit
520: history information generation unit 530: request information input unit
540: first selection unit 550: recommendation document selection unit
551: periodic weight processing unit 552: continuous weight processing unit
553: proximity weight processing unit 554: text selection unit
555: frequency weight processing unit 556: tracking unit
557: attention processing unit 558: selection processing unit
560: list providing unit
1000: similar document recommendation system 1100: document DB unit
1200: embedding vector generation unit 1300: request input unit
1400: similar document selection unit 1500: main processing unit
1600: first ranking information generation unit 1700: profile generation unit
1800: learning object generation unit 1810: individual vector generation unit
1820: object generation unit 1900: learning processing unit

Claims

A customized document recommendation system using dynamic change of time-series pattern information, comprising:
a document object DB unit in which document objects, which are documents collected through document centralization, are stored;
a history information generation unit that, when a document object is used by a user, stores history information, including user information on the user who used the document object and use time information indicating when the document object was used, in association with the document object;
a request information input unit that receives a request signal including requester information and at least one item of request period information among a plurality of different set periods;
a first selection unit that uses the history information to select, from among the document objects, used document objects, which are document objects used by the user corresponding to the requester information;
a recommendation document selection unit that selects, from among the used document objects, one or more recommended documents, which are document objects whose use time information is relatively close to the reference time point of the request period information; and
a list providing unit that generates recommendation list information corresponding to the recommended documents and provides the generated recommendation list information to the requester.

The customized document recommendation system of claim 1, wherein the recommendation document selection unit comprises:
a periodic weight processing unit that assigns first rank information to each used document object, assigning relatively higher first rank information as the interval between the use time information of the used document object and the reference time point of the request period information is smaller; and
a selection processing unit that selects the recommended document from among the used document objects based on the magnitude of the first rank information.

The customized document recommendation system of claim 2, wherein the recommendation document selection unit
further comprises a continuous weight processing unit that assigns second rank information to each used document object, assigning relatively higher second rank information as the period of continuous use is longer or the number of uses is greater, and
wherein the selection processing unit
selects the recommended document from among the used document objects based on a value computed from the first and second rank information.

The customized document recommendation system of claim 3, wherein the recommendation document selection unit
further comprises a proximity weight processing unit that extracts the last use time information from the use time information included in the history information of the used document object and assigns relatively higher third rank information as the last use time information is closer to the current time point, and
wherein the selection processing unit
selects the recommended document from among the used document objects based on a value computed from the first to third rank information.

The customized document recommendation system of claim 4, wherein the recommendation document selection unit further comprises:
a text selection unit that performs morpheme parsing on the file name information of the used document objects to select a plurality of texts corresponding to nouns and selects, from among the selected texts, the top m main texts (m is a natural number of 2 or more) by frequency of use; and
a frequency weight processing unit that assigns fourth rank information to each used document object, assigning relatively higher fourth rank information as more of the top m main texts are included in the file name of that used document object, and
wherein the selection processing unit
selects the recommended document from among the used document objects based on a value computed from the first to fourth rank information.

The customized document recommendation system of claim 2, wherein the recommendation document selection unit
further comprises a tracking unit that generates tracking information including at least one of modification count information, indicating how many times the content of the used document object has been modified, and modification ratio information, indicating the proportion of the used document object's entire content that has been modified, and
wherein the selection processing unit
selects the recommended document from among the used document objects based on a value computed from the first rank information and the tracking information.

The customized document recommendation system of claim 2, wherein the recommendation document selection unit
further comprises an attention processing unit that assigns an attention index to each used document object, assigning a relatively higher attention index as more users other than the user corresponding to the requester information have modified the content of that used document object, and
wherein the selection processing unit
selects the recommended document from among the used document objects based on a value computed from the first rank information and the attention index.