KR20200053334A

KR20200053334A - Method and System for the Researcher Map to Promote the Convergence Research

Info

Publication number: KR20200053334A
Application number: KR1020180136832A
Authority: KR
Inventors: 남영광; 황상원; 서강원; 류원철
Original assignee: 연세대학교 원주산학협력단
Priority date: 2018-11-08
Filing date: 2018-11-08
Publication date: 2020-05-18

Abstract

The present invention relates to a system that compares and analyzes research data such as papers and intellectual property to assess the potential for convergence research among researchers who belong to different research fields, and that constructs a visualization map of the researchers related through convergence research. According to the present invention, the system comprises: a data extraction unit that extracts researchers and research summaries from the research data to construct a database; a morpheme analysis unit that extracts nouns and adjectives from the research summary to generate a word set; a topic modeling unit that uses a corpus to extract, from the word set, the topic words that represent the subject of the research data, and groups related topic words to form topic groups; a user query unit that receives a search query from a user; a similarity measurement unit that uses the search query and the topic groups to find a key researcher, and compares the topic group extracted from the key researcher's research data with topic groups extracted from other research data to find related researchers who are judged to perform research similar to that of the key researcher; and a visualization unit that forms and displays a network in which the key researcher and the related researchers found by the similarity measurement unit are interconnected.

Description

Method and System for the Researcher Map to Promote the Convergence Research

The present invention relates to a method and system for constructing a researcher map to promote convergence research. More specifically, it relates to a system that compares and analyzes research data such as papers and intellectual property rights to assess the possibility of convergence research between researchers in different research fields, and that constructs a visualization map of researchers who have a convergence research relationship.

In modern society, policies and institutions that emphasize transformative change, such as convergence, interdisciplinarity, innovation, and creation, are spreading. In science and technology, convergence technology has become a common feature of modern innovation. Most products and services that have recently been developed and commercialized successfully are regarded as products of convergence technology, and every actor in national science and technology innovation, including companies, research institutes, universities, and governments, is focused on innovation in convergence technology. Worldwide, national governments recognize the importance of technology convergence research and are pursuing related policies, such as establishing departments to promote convergence research and running convergence research and development programs. Companies and universities in particular are drawing attention as drivers of convergence technology innovation, and considerable policy support is provided to them.

To build such a relationship map between researchers, the paper 'Analytical Study on the Relationship between Centralities of Research Networks and Research Performances' analyzed the association between research performance and the centrality observed in co-author, author co-citation, and author bibliographic coupling networks. That study focuses on analyzing researchers' performance, so it is difficult to use it to determine whether convergence research is feasible. In addition, the Stanford Natural Language Processing Group performed a keyword-based network analysis of fields related to corporate education and industrial education research in order to analyze HRD (Human Resource Development) research trends. Because the researchers in that study generated the nodes and links themselves and analyzed only a limited field, the approach is not suitable for analyzing and checking networks of convergence researchers across various fields in real time.

Korean Registered Patent Publication No. 10-1426765

Accordingly, the present invention has been proposed to solve the problems of the prior art described above. An object of the present invention is to provide a system that compares and analyzes research data such as papers and intellectual property rights to assess the possibility of convergence research between researchers in different research fields, and that constructs a visualization map of researchers who have a convergence research relationship.

Another object of the present invention, noting the role that universities can play in the development of convergence technology, is to construct a relationship map between researchers that gives companies the information they need to pursue convergence-oriented industry-academia cooperation with universities.

To achieve the above objects, a researcher map construction system for promoting convergence research according to the technical idea of the present invention comprises: a data extraction unit that extracts researchers and research summaries from research data to construct a database; a morpheme analysis unit that extracts nouns and adjectives from the research summary to generate a word set; a topic modeling unit that uses a corpus to extract, from the word set, topic words that represent the subject of the research data, and groups related topic words to form topic groups; a user query unit that receives a search query from a user; a similarity measurement unit that finds a key researcher using the search query and the topic groups, and finds related researchers judged to perform research similar to that of the key researcher by comparing the topic group extracted from the key researcher's research data with topic groups extracted from other research data; and a visualization unit that constructs and displays a researcher network in which the key researcher and the related researchers found by the similarity measurement unit are connected to each other.

In addition, when the research data is a document composed of images, the data extraction unit may perform image processing to extract characters from the image. When the research data is an imaged paper, it may set a start coordinate at the point where the abstract begins and an end coordinate at the point where the abstract ends, replace the x-coordinate of the end coordinate with the rightmost x-coordinate of the abstract paragraph, and then extract the text contained in the rectangle whose opposite corners are the start coordinate and the end coordinate.

In addition, the data extraction unit may have a preset space threshold width for judging the spaces between words, and may determine that a text line has ended when the width of the space to the left or right of a word exceeds the space threshold width.

In addition, where W denotes a word, n the number of extracted words, and WA the set of all words, the region of the i-th extracted word may be defined by its start coordinate and its end coordinate, each of which is an (x, y) pair. When a stated condition on these coordinates is satisfied, the width of the space between the (i-1)-th and i-th words is calculated from them. (The symbols and formulas for the region, the coordinates, the condition, and the space width appear as images in the original publication and are not reproduced in this text.)

In addition, the morpheme analysis unit may read the sentences in the research summary character by character (char), obtain the integer (int) value of each character, and determine that the research summary is written in Korean if the integer value is greater than 0x3131 and less than 0xD7A3.
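
The character-range check described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names `is_hangul_char` and `is_korean_text` are hypothetical, and treating a text as Korean as soon as one character falls in the range is an assumption, since the text only states the comparison against 0x3131 and 0xD7A3.

```python
HANGUL_LOW = 0x3131   # lower bound named in the text
HANGUL_HIGH = 0xD7A3  # upper bound named in the text


def is_hangul_char(ch: str) -> bool:
    """Return True if the character's integer value lies strictly between the bounds."""
    code = ord(ch)
    return HANGUL_LOW < code < HANGUL_HIGH


def is_korean_text(text: str) -> bool:
    """Assumption: a single character in range is enough to treat the text as Korean."""
    return any(is_hangul_char(ch) for ch in text)
```

Note that the range between 0x3131 and 0xD7A3 also contains CJK ideographs such as U+4E00, so a check of this form on its own does not separate Chinese characters from Hangul.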

In addition, if the research summary is in Korean, the morpheme analysis unit may analyze morphemes using KOMORAN and extract words that are common nouns, proper nouns, or Chinese characters to form the word set; if the research summary is in English, it may analyze morphemes using CoreNLP and extract words that are nouns or adjectives to form the word set.

In addition, the topic modeling unit may construct a corpus from the word sets and apply the Latent Dirichlet Allocation (LDA) algorithm to the corpus to generate topic groups.

In addition, the topic modeling unit may randomly assign one of K topic words to each item of research data and then, for each item of research data (d), all the words (w) contained in it, and each topic word (t) present among those words, compute p(t|d), the proportion of topic word (t) within the word set (w) of research data (d), and p(w|t), the proportion assigned to topic word (t) across all research data, and newly select a topic word (t) according to the product of p(t|d) and p(w|t).
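
The resampling by the product p(t|d) · p(w|t) can be illustrated with a small Python sketch. It follows the common per-word formulation of collapsed Gibbs sampling for LDA, which is an assumption about the intended procedure; the function name `lda_gibbs`, the smoothing values `alpha` and `beta`, the iteration count, and the seed are all hypothetical and do not come from the text.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Assign a topic to every word by repeatedly resampling from p(t|d) * p(w|t).

    docs is a list of word lists (the word sets). alpha, beta, iterations and
    seed are illustrative assumptions; the text does not specify them.
    """
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d with topic t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # count of word w in topic t
    topic_total = [0] * num_topics                              # words assigned to topic t
    assignments = []

    # Step 1: randomly assign one of the K topics to each word.
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    # Step 2: resample each word's topic in proportion to p(t|d) * p(w|t).
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                weights = []
                for t in range(num_topics):
                    p_t_given_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    p_w_given_t = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p_t_given_d * p_w_given_t)

                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return assignments, topic_word
```

After the loop, the words with the highest counts in each entry of `topic_word` could serve as the topic words of one topic group.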

In addition, when a researcher related to the search query is found, the topic modeling unit may extract the topic groups from the research data that include that researcher and perform query-topic mapping by linking the topic words that match the search query.

In addition, to address the problem that the meaning of a search query may be ambiguous, the topic modeling unit may perform the similarity comparison using a combination of the topic words of a topic group and the keywords given in the paper.

In addition, the similarity measurement unit may find related researchers by computing the similarity between the key researcher's topic group and other researchers' topic groups using the Jaccard algorithm, the SL (Scaled Levenshtein) algorithm, and the Soft TF/IDF algorithm.
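
Two of the three measures named above can be sketched in plain Python. This is an illustration under stated assumptions: the function names are hypothetical, the scaling of the edit distance by the longer string's length is an assumed definition of "scaled" that the text does not give, and Soft TF/IDF is omitted for brevity.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections of topic words: |A ∩ B| / |A ∪ B|."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def levenshtein(s, t):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def scaled_levenshtein(s, t):
    """Edit distance scaled into [0, 1], where 1 means identical strings.

    Assumption: the distance is divided by the length of the longer string.
    """
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest
```

For example, `jaccard(["lda", "topic", "graph"], ["topic", "graph", "ocr"])` compares two topic groups as sets, while `scaled_levenshtein` compares two individual topic words character by character.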

In addition, the similarity measurement unit may perform the researcher similarity computation a number of times determined by the number of researchers (N). (The expression for this count appears as an image in the original publication and is not reproduced in this text.)

In addition, the similarity measurement unit may determine that a researcher compared with the key researcher is a related researcher performing similar research if the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm are all 1.

In addition, when comparing the key researcher with another researcher, the similarity measurement unit may exclude the other researcher from the related researchers if any one of the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm is 0.

In addition, when comparing the key researcher with another researcher, the similarity measurement unit may determine that there is no semantic similarity and exclude the other researcher from the related researchers if two or more of the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm are less than 0.5.
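
The three decision rules stated for the similarity values can be combined in a short Python sketch. The function name `is_related` and the order in which the rules are checked are assumptions, and the text does not say what happens when none of the three rules applies, so the sketch returns `None` in that case rather than guessing.

```python
def is_related(jaccard_sim, sl_sim, soft_tfidf_sim):
    """Apply the three stated rules; return True, False, or None when no rule applies."""
    scores = (jaccard_sim, sl_sim, soft_tfidf_sim)
    if all(s == 1 for s in scores):
        return True    # all three similarity values are 1
    if any(s == 0 for s in scores):
        return False   # at least one similarity value is 0
    if sum(s < 0.5 for s in scores) >= 2:
        return False   # two or more similarity values are below 0.5
    return None        # case not covered by the stated rules
```

A caller would need its own policy for the `None` case, for example ranking the remaining candidates by their scores.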

In addition, the researcher network may include shapes, each labeled with a researcher's name, and connecting lines, each joining one shape to another and having a thickness determined by the number of related topic words shared by the researchers corresponding to the two shapes. When a shape is selected, the visualization unit may display that researcher's topic words around the periphery of the shape; when a connecting line is selected, it may display the topic words related between the two researchers.
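
The relation between shared topic words and line thickness can be sketched in Python as follows. The names `build_researcher_network` and `edge_width`, the exact-match test for "related" topic words, and the linear thickness scale with its `base` and `step` values are all illustrative assumptions; the text only states that thickness corresponds to the number of related topic words.

```python
from itertools import combinations


def build_researcher_network(topics_by_researcher):
    """Return edges (name_a, name_b, shared_words) for researchers sharing topic words."""
    edges = []
    for a, b in combinations(sorted(topics_by_researcher), 2):
        shared = sorted(set(topics_by_researcher[a]) & set(topics_by_researcher[b]))
        if shared:
            edges.append((a, b, shared))
    return edges


def edge_width(shared_words, base=1.0, step=0.5):
    """Line thickness grows with the number of shared topic words (linear scale assumed)."""
    return base + step * len(shared_words)
```

The `shared_words` list stored on each edge is also what a viewer could display when the connecting line is selected.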

Meanwhile, to achieve the above objects, a method of constructing a researcher map to promote convergence research according to the technical idea of the present invention comprises: a step in which the data extraction unit extracts researchers and research summaries from research data to construct a database; a step in which the morpheme analysis unit extracts nouns and adjectives from the research summary to generate a word set; a step in which the topic modeling unit uses a corpus to extract, from the word set, topic words that represent the subject of the research data, and groups related topic words to form topic groups; a step in which the user query unit receives a search query from a user; a step in which the similarity measurement unit finds a key researcher using the search query and the topic groups; a step in which the similarity measurement unit finds related researchers judged to perform research similar to that of the key researcher by comparing the topic group extracted from the key researcher's research data with topic groups extracted from other research data; and a step in which the visualization unit constructs and displays a researcher network in which the key researcher and the related researchers are connected to each other.

The method and system for constructing a researcher map to promote convergence research according to the present invention provide the following effects.

First, the present invention builds a researcher network by matching researchers who perform similar research or use similar technologies, which strengthens cooperation between researchers and thereby supports the development of convergence technology.

Second, even when the research data is an imaged document, image processing is performed automatically to extract the main research summary, so analysis is possible regardless of the format of the research data.

Third, using the space threshold width during image processing makes it possible to distinguish spaces between words, line breaks, and the end of a paragraph, so exactly the required research summary can be extracted.

Fourth, identifying the language of the research data allows the most suitable morpheme analysis system to be applied for each language, which yields an optimal word set for each language.

Fifth, analyzing the word sets with the LDA algorithm reveals the topic words of each item of research data and the frequency of each topic word, and this information makes it easy to find research data related to researchers performing similar research.

Sixth, because the search for related researchers combines the Jaccard, SL, and Soft TF/IDF algorithms, which have different performance characteristics, similarity is judged from several angles, so researchers and research data that are similar from various points of view can be found.

Seventh, the relationship between the key researcher and the related researchers is displayed in the researcher network, so researchers performing similar research can be found intuitively; and because clicking (selecting) a researcher displays the topic words extracted from that researcher's research data, the researcher's field and the technologies used can be grasped easily.

FIG. 1 is a block diagram of a researcher map construction system for promoting convergence research according to an embodiment of the present invention.
FIG. 2 is an example showing rectangular ranges set on the abstract region and the keyword region of a paper.
FIG. 3 is an example showing a word set extracted from a research summary, topic groups generated from the word set, and the process of matching a search query to topic groups.
FIG. 4 shows an example display of the researchers found by a search query, each researcher's topic words, and the research data information shown when a topic word is selected.
FIG. 5 is an example showing the researcher network displayed when a specific researcher's topic is selected in FIG. 4 and the search is run again.
FIG. 6 is an example showing the main topic words displayed around a shape when one of the researchers is clicked (selected), and the full list of topic words displayed when the gray area around the shape is clicked.
FIG. 7 is a flowchart showing the process of extracting topic groups from research data in a method of constructing a researcher map to promote convergence research according to an embodiment of the present invention.
FIG. 8 is a flowchart showing the process from input of a search query to construction of the researcher network in a method of constructing a researcher map to promote convergence research according to an embodiment of the present invention.
FIG. 9 is a flowchart showing the detailed process of step S120.
FIG. 10 is a flowchart showing the detailed process of step S140.
FIG. 11 is a flowchart showing the detailed process of step S260.

A method and system for constructing a researcher map to promote convergence research according to embodiments of the present invention will now be described in detail with reference to the accompanying drawings. The present invention can be modified in various ways and can take various forms; specific embodiments are illustrated in the drawings and described in detail in the text. This is not intended to limit the present invention to a specific disclosed form, and the invention should be understood to include all modifications, equivalents, and substitutes that fall within its spirit and technical scope. Similar reference numerals are used for similar components in describing the drawings.

Unless otherwise defined, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries should be interpreted as having meanings consistent with their meanings in the context of the related art, and are not to be interpreted in an idealized or excessively formal sense unless expressly so defined in this application.

The researcher map construction system for promoting convergence research according to an embodiment of the present invention may be implemented with all of its components in a single computer device, or with its components distributed across multiple computer devices that are interconnected.

Referring to FIG. 1, the researcher map construction system for promoting convergence research according to an embodiment of the present invention comprises: a data extraction unit that extracts researchers and research summaries from research data to construct a database; a morpheme analysis unit that extracts nouns and adjectives from the research summary to generate a word set; a topic modeling unit that uses a corpus to extract, from the word set, topic words that represent the subject of the research data, and groups related topic words to form topic groups; a user query unit that receives a search query from a user; a similarity measurement unit that finds a key researcher using the search query and the topic groups, and finds related researchers judged to perform research similar to that of the key researcher by comparing the topic group extracted from the key researcher's research data with topic groups extracted from other research data; and a visualization unit that constructs and displays a researcher network in which the key researcher and the related researchers found by the similarity measurement unit are connected to each other.

An embodiment of the present invention further includes a researcher database in which the research data and the researcher names, research summaries, topic groups, and other items extracted from the research data are stored together as one set.

Research data includes papers, research projects, patents, utility models, copyrights, and the like. In this embodiment, research data may be input as files in well-known formats such as PDF, MS Word, and the Hangul word processor format (HWP).

A researcher may be an author listed in a paper or an inventor listed in a patent or utility model.

The research summary includes the title of the paper, the title of the intellectual property, the name of the research field, and proper nouns such as the principles, laws, algorithms, and phenomena that are used.

The data extraction unit extracts the information required for analysis from the research data. If the research data is a paper, it extracts the regions corresponding to the abstract and the keywords.

When the research data is a document composed of images, such as an image file or an imaged PDF file, the data extraction unit performs image processing to extract characters from the image. When a paper is an imaged document, it sets a start coordinate at the point where the abstract begins and an end coordinate at the point where the abstract ends, replaces the x-coordinate of the end coordinate with the rightmost x-coordinate of the abstract paragraph, and then extracts the text contained in the rectangle whose opposite corners are the start coordinate and the end coordinate (see FIG. 2).

The data extraction unit of this embodiment extracted text by analyzing imaged paper PDF files with PDFBox, developed by the Apache Software Foundation. If the word '초록' (Korean for 'abstract') or 'abstract' is found in the extracted text, the page is judged to contain the abstract, and a rectangular region (region of the abstract) is set for extracting the abstract content. Likewise, if '키워드' (Korean for 'keyword') or 'keywords' is present in the extracted text, the page is judged to contain the keywords, and a rectangular region is set for extracting the keyword terms.

To set the rectangular region on the abstract content, the start coordinate (x1, y1) and end coordinate (x2, y2) of the abstract are extracted using the PDFBox library. The start coordinate of the abstract is the upper-left x, y coordinate of the word '초록' or 'Abstract' on the page where it appears. The end coordinate of the abstract is given by the rightmost x-coordinate and the bottom y-coordinate of the abstract content. In an image-format paper, the abstract is generally laid out as multiple text lines separated by line breaks, so the x-coordinate of the end coordinate is the largest of the rightmost x-coordinates of the last words of the text lines.

In addition, the data extraction unit presets a space threshold width for judging the spaces between words, and when the blank to the left or right of a word is wider than this threshold, the corresponding text line is judged to have ended. Let W denote a word, n the number of extracted words, WA the set of all words, and W_i the i-th extracted word. The region of W_i is defined by a pair of coordinates: its start coordinate (x1, y1) and its end coordinate (x2, y2). For consecutive words, the width of the space between the (i-1)-th and i-th words, SX_i, is calculated as the horizontal distance from the end coordinate of the (i-1)-th word to the start coordinate of the i-th word. The maximum of these values, that is, the largest of the spaces SX from the 1st to the n-th, is used as the space threshold width recognized as a space between words. A blank is defined as a pixel whose RGB value is closer to white (0xFFFFFF) than to black (0x000000). With this threshold it is possible to decide whether the blank after the x-coordinate where a word ends is a space separating it from the next word, the end of a text line at a line break, or the right margin of the page. If the blank to the right of a word is wider than the threshold, the x-coordinate of that word's end coordinate is set as the rightmost x-coordinate of the abstract content.
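
The right-edge search described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes word bounding boxes have already been obtained (for example from a PDF text-position extractor), and the names `WordBox`, `inter_word_gaps`, `abstract_right_edge`, and the `page_right` parameter are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class WordBox:
    text: str
    x1: float  # left edge (start x)
    y1: float  # top edge (start y)
    x2: float  # right edge (end x)
    y2: float  # bottom edge (end y)


def inter_word_gaps(line):
    """Horizontal gaps (SX) between consecutive words of one text line."""
    return [cur.x1 - prev.x2 for prev, cur in zip(line, line[1:])]


def abstract_right_edge(lines, page_right):
    """Return (space threshold, rightmost x) for a list of text lines.

    `page_right` is an assumed x-coordinate of the page's right boundary,
    used here to measure the blank that follows the last word of a line.
    """
    gaps = [g for line in lines for g in inter_word_gaps(line)]
    threshold = max(gaps, default=0.0)  # largest ordinary inter-word space
    right_edge = None
    for line in lines:
        last = line[-1]
        blank_after = page_right - last.x2
        if blank_after > threshold:  # wider than any word space: line ends here
            right_edge = last.x2 if right_edge is None else max(right_edge, last.x2)
    return threshold, right_edge
```

The same pattern, applied to vertical gaps between lines, would give the bottom edge of the region.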

The bottom y-coordinate of the abstract is obtained in much the same way as the right-end x-coordinate. For each word, the vertical gap to the word on the next line is calculated, a threshold height is derived from these gaps, and the blanks below each line are then examined in order. If the blank below a word is taller than the threshold height, it is judged that the abstract has ended or that the bottom margin of the page has been reached, and the y-coordinate of that word's end coordinate is set as the lowest y-coordinate of the abstract. Once the extracted x and y coordinates are set as the end coordinates, a rectangular region is formed with the start and end coordinates as its upper-left and lower-right vertices, and image processing is performed on the text inside the region. The extracted text is finally taken to be the abstract.

The keywords of a paper are handled in the same way as the abstract: the start and end coordinates are extracted, a rectangular region is set, and image processing on that region yields the words listed as keywords.

If, after a line break, a word that marks the start of another section appears, such as Introduction (Intro) or Research scope, the abstract is judged to have ended.

The data extraction unit also removes unnecessary words and characters from the extracted abstract and keyword text. These include '초록', 'Abstract', '키워드', 'Keywords', and 'Key words', as well as characters such as the comma (',').

The morpheme analysis unit first classifies whether the research summary is written in Korean or in English. It reads the sentences of the research summary one character (char) at a time and extracts the integer (int) character code of each character. Research data written in Korean commonly mixes in English, whereas research data written in English rarely contains Korean. Therefore, if any character has a code greater than 0x3131 and less than 0xD7A3, the research summary and the research data are judged to be Korean; research data containing no character code in that range is judged to be English.
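
The code-point test described above can be sketched in a few lines. The bounds are the ones stated in the text, applied as strict inequalities; the function names and the "ko"/"en" labels are illustrative assumptions.

```python
HANGUL_LOW = 0x3131   # lower bound stated in the text
HANGUL_HIGH = 0xD7A3  # upper bound stated in the text


def contains_korean(text):
    """True if any character code lies strictly between the two bounds."""
    return any(HANGUL_LOW < ord(ch) < HANGUL_HIGH for ch in text)


def detect_language(text):
    """Label a research summary as Korean ("ko") or English ("en")."""
    return "ko" if contains_korean(text) else "en"
```

A single Korean character is enough to classify the summary as Korean, which matches the observation that Korean texts often contain English but not the reverse.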

Once the language of the research summary has been determined, the morpheme analysis unit analyzes the summary with the analysis method for that language. This embodiment uses publicly available morpheme analysis libraries and collects the words needed for analysis from the part-of-speech tags they output.

Referring to FIG. 3, when the research summary is in Korean, this embodiment analyzes morphemes with KOMORAN from Shineware. KOMORAN performs morpheme analysis based on 63 common nouns in the user dictionary file and 307,435 proper nouns obtained by analyzing wiki titles. From the analysis result, words tagged as common noun (NNG), proper noun (NNP), or Chinese character (SH) are extracted. In addition, when a hyphen (SS) appears at the end of a tag, the word is judged to have been split by a line break, and the word of the preceding tag and the word of the following tag are joined into a single tag.

When the research summary is in English, this embodiment analyzes morphemes with CoreNLP (ver. 3.8) from the Stanford NLP Group. Words tagged as noun (NN) or adjective (JJ) are extracted.

The common nouns, proper nouns, Chinese characters, nouns, and adjectives extracted by the morpheme analysis unit form the word set of the research data. If the research data is a paper, the word set also includes the keywords extracted by the data extraction unit.
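
As a rough illustration of how a word set could be assembled from tagger output, the sketch below filters (word, tag) pairs by the tags named in the text. It does not call KOMORAN or CoreNLP; the input format and the function name `build_word_set` are assumptions, and the hyphen-joining step is omitted.

```python
KOREAN_TAGS = {"NNG", "NNP", "SH"}  # common noun, proper noun, Chinese character
ENGLISH_TAGS = {"NN", "JJ"}         # noun, adjective


def build_word_set(tagged_tokens, language, keywords=()):
    """Keep words whose tag is in the set for `language`, then add keywords.

    `tagged_tokens` is assumed to be a list of (word, tag) pairs already
    produced by a part-of-speech tagger.
    """
    keep = KOREAN_TAGS if language == "ko" else ENGLISH_TAGS
    words = [word for word, tag in tagged_tokens if tag in keep]
    words.extend(keywords)  # keywords are included when the source is a paper
    return words
```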

The topic modeling unit clusters the individual words of the word set to form topic groups, each representing a subject. The topic modeling unit uses a corpus, which is built from the word sets. The Latent Dirichlet Allocation (LDA) algorithm is applied to this corpus.

The LDA algorithm extracts the topics present in each document of a collection of text documents made up of natural language or sets of words. That is, LDA derives a probabilistic topic model from a collection of natural-language text documents through a generative probabilistic model; it is a probability model of which topics are present in each document.

Every document has topics, a document is related to multiple topics, and the words appearing in a document are regarded as the elements that make up those topics. LDA models words as forming topics and topics as combining to form documents. Hidden topics are derived by converting the co-occurrence frequencies of words within documents into probabilities. LDA ignores the order in which words appear and considers only how many times each occurs. Based on the per-topic word distributions, it analyzes the distribution of word counts in each document and predicts which topics the document covers.

In this embodiment, the generated corpus is set as the document for LDA analysis, and topics are extracted from it through topic modeling. For example, suppose an initial analysis of a project called JHotDraw produces a corpus consisting of words such as 'Mycobacterium, tuberculosis-induced, granulocyte-macrophage, colony, stimulating, factor, ...'. Running LDA topic modeling on this corpus extracts words such as declining, macrophage, and cells together with their variance values of 0.154, 0.74, and 0.065, respectively. The variance value can be regarded as the importance of each word in the whole corpus, and the words whose values exceed a certain level are used as the list of main functions or features of the project.

The LDA algorithm here is based on Gibbs sampling. Gibbs sampling first assigns one of K topic words at random to each piece of research data, so that each document has topic words and a distribution over them. Because these initial distribution values are erroneous, a further process is run to improve them. For each piece of research data (d), the word set (w) it contains, and each topic word (t) present in the word set, two quantities are calculated. The first is p(t|d), the proportion of the word set (w) of research data (d) that is assigned to topic word (t). The second is p(w|t), the proportion of assignments to topic word (t) across all research data. The topic word (t) is then newly selected according to the product of p(t|d) and p(w|t). Under the generative model, this product can be regarded as the probability that topic word (t) generated the word set (w), so the topic word of each piece of research data is reset according to that probability. In other words, this step assumes that every topic word other than the one currently being measured is correctly assigned, and updates the current one from the calculated probability. When this process has been repeated enough to reach a stable state, the topic words present in each document and their distribution can be read off.
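
The resampling loop can be illustrated with a small collapsed Gibbs sampler. This is a sketch under stated assumptions, not the MALLET implementation the embodiment uses: topics are assigned per word occurrence, and the smoothing constants `alpha` and `beta`, the function name `lda_gibbs`, and its parameters are all illustrative.

```python
import random


def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents (lists of words)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    vocab_size = len(vocab)
    # z[d][i] is the topic currently assigned to the i-th word of document d.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [[0] * num_topics for _ in docs]                        # n(d, t)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(num_topics)]   # n(t, w)
    topic_total = [0] * num_topics                                      # n(t)
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            t = z[d][i]
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                # Remove the current assignment; all others are treated as fixed.
                doc_topic[d][t] -= 1
                topic_word[t][w] -= 1
                topic_total[t] -= 1
                # Weight for topic k is proportional to p(k|d) * p(w|k);
                # the p(k|d) denominator is the same for every k and is dropped.
                weights = [
                    (doc_topic[d][k] + alpha)
                    * (topic_word[k][w] + beta) / (topic_total[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                t = rng.choices(range(num_topics), weights=weights)[0]
                z[d][i] = t
                doc_topic[d][t] += 1
                topic_word[t][w] += 1
                topic_total[t] += 1
    return z, doc_topic, topic_word
```

After enough iterations, `doc_topic` gives each document's topic counts and `topic_word` gives the word counts from which a topic group's top words can be read.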

LDA-based topic modeling assumes that each piece of research data contains one or more of k topic words. A topic group, the result of the modeling, is a set of arbitrary topic words. For example, suppose topic modeling on a newspaper yields a first group containing the words Kim Gyeong-ho, violin, drum, and concert, and a second group containing Dokdo, North Korea, summit, and nuclear. The first group is then a topic group related to 'music' and the second a topic group related to 'politics'. In other words, the topic modeling unit clusters words according to their distribution values and assembles sets of topic words that represent particular subjects, thereby forming topic groups.

The topic modeling unit uses the MALLET Topic Modeling Toolkit library to perform topic modeling.

When a researcher related to the search query is found, the topic modeling unit extracts the topic groups from the research data that include that researcher and links the topic words that match the search query, producing a query-to-topic mapping.

Because a search query made up of a single word or several words can be ambiguous, the similarity comparison combines the topic words of the topic groups with the keywords listed in the papers.

If a word in a topic group also appears in the search query, that topic group is linked to the query word. For example, suppose the search query is 'GM-CSF, MEK1, Mycobacterium tuberculosis, P38 MAPK, PI3-K'. The first topic group may consist of protein, k/mek, suggest, kinase, and induction; the second of kinase, mapk, treated, inmma, and increase; the third of gm-csf, infection, factor, mediated, and up-refulation; and the fourth of bymtb, mma, mapk-associatedsignaling, mek, and thp. In the second topic group, mapk is mapped to P38 MAPK in the query, and in the third topic group, gm-csf is mapped to GM-CSF in the query. This mapping process completes the query-topic map (see FIG. 3).
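
One way to reproduce the mapping in this example is exact, case-insensitive matching between whitespace-separated query tokens and topic words. The matching rule and the function name `map_query_to_topics` are assumptions made for illustration; the text does not specify how matches are detected.

```python
def map_query_to_topics(query_terms, topic_groups):
    """Link each topic group to the query through the topic words they share.

    `query_terms` is a list of query phrases; `topic_groups` maps a group
    name to its list of topic words. Matching is exact and case-insensitive.
    """
    tokens = {tok.lower() for term in query_terms for tok in term.split()}
    mapping = {}
    for name, words in topic_groups.items():
        hits = [w for w in words if w.lower() in tokens]
        if hits:
            mapping[name] = hits
    return mapping
```

With the query and the four topic groups from the example, only the second group (through mapk) and the third group (through gm-csf) are linked.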

The user query unit receives a search query written by the user. A search query can consist of a single word or multiple words. The entered query is part-of-speech tagged by the morpheme analysis unit and reduced to the words usable for searching.

The user query unit provides, on a monitor device, an interface for entering a search query containing a researcher's name or words. This embodiment provides a researcher input box for the researcher's name and a separate keyword input box for words (see FIG. 4). When a search query is entered in the user query unit, the visualization unit builds a researcher network and displays it on the monitor.

When part of a name is typed into the researcher input box, researcher names containing those characters are shown in an autocomplete list, which appears as a layer directly below the input box. Selecting one of the names in the list can additionally display that researcher's school, department, e-mail address, and so on.

Words to be searched for are entered in the keyword input box.

The user query unit can also restrict the research data to be searched to papers, projects, or intellectual property rights.

The similarity measurement unit searches for related researchers by computing the similarity between the topic groups of the key researcher and those of other researchers using the Jaccard algorithm, the Scaled Levenshtein (SL) algorithm, and the Soft TF/IDF algorithm.

If the similarity values of the Jaccard, SL, and Soft TF/IDF algorithms are all 1, the researcher compared with the key researcher is determined to be a related researcher performing similar research.

The similarity measurement unit performs two main functions. First, it searches for key researchers and words based on the search query. Second, it measures the research similarity between researchers based on the topic groups extracted from each researcher's research data.

The search for key researchers and words can be carried out by running a LIKE search (string search) against the database using the morpheme-analyzed search query. The results are produced as a researcher list and a word list, which are passed to the visualization unit.
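
A LIKE search of this kind might look like the following sketch, which uses an in-memory SQLite database. The table name `researcher`, its columns, and the placeholder rows are hypothetical; the text does not describe the actual database schema.

```python
import sqlite3


def search_researchers(conn, term):
    """Return researcher names containing `term`, using a SQL LIKE pattern."""
    pattern = f"%{term}%"  # % matches any run of characters on either side
    rows = conn.execute(
        "SELECT name FROM researcher WHERE name LIKE ? ORDER BY name",
        (pattern,),
    ).fetchall()
    return [name for (name,) in rows]


# Hypothetical schema and placeholder rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE researcher (name TEXT, department TEXT)")
conn.executemany(
    "INSERT INTO researcher VALUES (?, ?)",
    [
        ("Researcher Kim A", "Biology"),
        ("Researcher Lee B", "Computer Science"),
        ("Researcher Kim C", "Physics"),
    ],
)
print(search_researchers(conn, "Kim"))
```

Passing the term as a bound parameter keeps user input out of the SQL text itself.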

The similarity between researchers is measured from the topic groups produced by the topic modeling unit. The Jaccard, SL, and Soft TF/IDF algorithms are used for this measurement.

Jaccard similarity measures the similarity between sets. The Jaccard similarity J(A, B) is defined as the size of the intersection of the two sets divided by the size of their union, as expressed in Equation 1.

[Equation 1]

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
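
The Jaccard similarity of two topic groups can be computed directly from its definition as intersection size over union size. In this sketch the function name `jaccard` and the convention of returning 0.0 for two empty sets are assumptions.

```python
def jaccard(a, b):
    """Jaccard similarity: size of the intersection over size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # convention assumed here for two empty topic groups
    return len(a & b) / len(a | b)
```

For example, the groups {kinase, mapk, mek} and {mapk, mek, protein} share two of four distinct words, giving a similarity of 0.5.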

The SL algorithm corrects the result of the Levenshtein algorithm, also known as the edit distance algorithm. For two strings a and b, it calculates how many operations are needed to make a and b identical, where an operation is an insertion, a deletion, or a substitution. For two strings a and b whose lengths are |a| and |b|, the distance lev_{a,b}(|a|, |b|) is given by Equation 2.

[Equation 2]

$$\operatorname{lev}_{a,b}(i,j) = \begin{cases} \max(i,j) & \text{if } \min(i,j) = 0, \\ \min \begin{cases} \operatorname{lev}_{a,b}(i-1,j) + 1 \\ \operatorname{lev}_{a,b}(i,j-1) + 1 \\ \operatorname{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases} & \text{otherwise.} \end{cases}$$

The result of Equation 2 is an integer. Because this value varies widely with word length and the number of operations, the SL algorithm corrects it to a value between 0 and 1. With d denoting the Levenshtein distance obtained from Equation 2 and n the larger of the lengths of strings a and b, the correction is given by Equation 3.

[Equation 3]

$$SL(a, b) = 1 - \frac{d}{n}$$
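
The edit distance and its scaled form can be sketched as below. The scaling 1 - d/n is assumed here as the correction, chosen because it maps identical strings to 1 and fits the stated range of 0 to 1; the function names are illustrative.

```python
def levenshtein(a, b):
    """Edit distance counting insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free if equal)
            ))
        prev = cur
    return prev[-1]


def scaled_levenshtein(a, b):
    """Map the edit distance d to [0, 1] as 1 - d / max(len(a), len(b))."""
    n = max(len(a), len(b))
    if n == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / n
```

A value of 1 means the two strings are identical, and a value of 0 means every character position would need to change.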

Soft TF/IDF is a weighting method that extends TF/IDF to take partial matches into account. Given a collection of documents, TF/IDF expresses statistically how important a particular word is within a particular document. TF is the term frequency, the number of times a word appears in a document; the higher it is, the more important the word is considered. If the word is also used frequently in other documents, it is simply a common word; this is measured by the document frequency (DF). The reciprocal of DF is called the inverse document frequency (IDF), and the TF/IDF result is the product of TF and IDF. However, TF/IDF does not allow for typos, so tokens that differ even slightly are treated as different words. To compensate, Soft TF/IDF performs both character-based and token-based similarity measurement. For the token similarity measurement, a string comparison algorithm such as Jaro or Jaro-Winkler and a threshold value can be supplied as inputs.

To compute similarity, the topic groups of the N researchers received from the topic modeling unit are compared in sequence. The researchers are compared with one another, and the total number of comparisons is determined by N. Each comparison yields three values: the Jaccard similarity, the SL similarity, and the Soft TF/IDF similarity.

The first researcher (the key researcher) is compared with a second researcher, and if any one of the three similarity results is 0, the second researcher is excluded from the related researchers. If all of the comparison values are 0, the words of the topic groups are compared; if this also yields 0, the first and second researchers are judged to have no similarity and the second researcher is excluded from the related researchers.

If the sum of the three similarity values is 3, that is, if all three algorithms return 1, the similarity between the first and second researchers is judged to be very high. In addition, if the comparison of topic groups yields a value at or above a certain level, the second researcher is determined to be a related researcher. Once the second researcher has been determined to be a related researcher, the words of the topic groups are compared to compute the similarity strength.

For all other values, the decision is made by comparing the similarity values of the three algorithms.

If two or more of the three similarity values are below 0.5, the second researcher is excluded from the related-researcher candidates. Words with a similarity value below 0.5 can hardly be said to be semantically similar; in experiments, values below 0.5 were found to arise merely because the words share some identical characters.

Conversely, if two or more of the three similarity values are 0.5 or higher, the second researcher is judged likely to be a related researcher, and the final decision is made by comparing the words of the topic groups. If the second researcher is determined to be a related researcher, the words of the topic groups are compared to compute the similarity strength.
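
The decision rules above can be collected into one function. This is an illustrative reading of the text, not a definitive implementation: the function name, the returned labels, and the order in which the rules are checked are assumptions, and the final word-level comparison of topic groups is left out.

```python
def classify_candidate(jaccard_sim, sl_sim, soft_tfidf_sim, threshold=0.5):
    """Classify a second researcher from the three similarity values.

    Returns "excluded", "very_similar", or "candidate". A "candidate" would
    still need the topic-word comparison described in the text.
    """
    scores = (jaccard_sim, sl_sim, soft_tfidf_sim)
    if any(s == 0 for s in scores):
        return "excluded"        # any zero removes the researcher
    if all(s == 1 for s in scores):
        return "very_similar"    # the three values sum to 3
    below = sum(1 for s in scores if s < threshold)
    if below >= 2:
        return "excluded"        # two or more values under 0.5
    return "candidate"           # two or more values at 0.5 or above
```

Because there are three values, either at least two fall below the threshold or at least two reach it, so the last two rules cover every remaining case.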

Information on related researchers is stored in the researcher database.

Referring to FIGS. 4 to 6, the user query unit and the visualization unit of this embodiment are implemented as web pages based on JSP and jQuery.

When the user enters a search query containing a researcher's name or words through the user query interface displayed on the monitor device, the visualization unit builds a researcher network and displays it on the monitor.

For example, when a search is run after a researcher's name is entered in the user query unit, the visualization unit outputs the main topics extracted from that researcher's papers, projects, and intellectual property rights, ordered by frequency or by initial consonant. When a word search is run, the visualization unit outputs the researchers whose research data contains the word, together with those researchers' related research output. The result list also shows icons for viewing the researcher network, the related topics, and the researcher's information.

Clicking the network icon shown next to a topic in the result list makes the visualization unit output a network screen in which the researchers related to a given researcher on the basis of that topic are connected by solid lines. The network is built from the similarity strength values computed from the researchers' topics.

The top of the screen shows the researcher's name and the representative topics that make up the network. At the upper right are buttons that filter the source of the keywords by paper, project, and intellectual property right; selecting or deselecting them limits the types of output from which keywords are shown.

In the researcher network, each shape, such as a circle or polygon, represents a researcher and is labeled with the researcher's name. Clicking (selecting) a shape displays that researcher's topic words around its perimeter. Because space is limited, only the most frequent topic words are shown at first; clicking the gray area around the shape displays the remaining ones. To the right of each topic word is the number of times it appears in the research data, and clicking a topic word displays the titles, keywords, authors, and related topic lists of the research data containing it.

The thickness of each connecting line that links researchers into the network is determined by the number of related topic words (the similarity strength) shared by the researchers corresponding to the two shapes. The number of related topic words is displayed on the line, and clicking (selecting) the line displays the list of those topic words.
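
The link between shared topic words and line thickness can be sketched as follows. The count of shared words follows the text; the linear mapping to a line width and the parameters `base` and `step` are purely illustrative assumptions.

```python
def similarity_strength(topics_a, topics_b):
    """Return the number of shared topic words and the sorted shared words."""
    shared = sorted(set(topics_a) & set(topics_b))
    return len(shared), shared


def edge_width(strength, base=1.0, step=0.5):
    """Hypothetical mapping from similarity strength to a line width."""
    return base + step * strength
```

The shared-word list returned alongside the count is what would be shown when the connecting line is clicked.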

The window on the left of the screen lists all topics of the researcher selected (clicked) by the user, in order of frequency. Entering a word in the text box at the bottom of the window and pressing the search button below it, or ticking the checkboxes of some of the listed topics and pressing the search button, builds a new researcher network based on the entered word or the selected topics.

At the bottom of the screen is a department filter that groups researchers by department. Clicking a department displays the researchers of that department with the same background color.

Next, a method of constructing a researcher map for promoting convergence research according to an embodiment of the present invention is described.

Referring to FIGS. 7 and 8, a method of constructing a researcher map for promoting convergence research according to an embodiment of the present invention includes: a step (S120) in which the data extraction unit extracts researchers and research summaries from research data to build a database; a step (S140) in which the morpheme analysis unit extracts nouns and adjectives from the research summaries to generate word sets; a step (S160) in which the topic modeling unit uses a corpus to extract, from the word sets, the topic words that are the subjects of the research data and groups related topic words into topic groups; a step (S220) in which the user query unit receives a search query from the user; a step (S240) in which the similarity measurement unit searches for a key researcher using the search query and the topic groups; a step (S260) in which the similarity measurement unit compares the topic groups extracted from the key researcher's research data with the topic groups extracted from other research data to find related researchers judged to be performing research similar to that of the key researcher; and a step (S280) in which the visualization unit builds and displays a researcher network in which the key researcher and the related researchers are connected to one another.

Referring to FIG. 9, step S120 specifically includes a step (S121) of classifying the research data by document type, such as a paper, a research project, or an intellectual property right, and a step (S122) of identifying whether the classified document is an image-based document.

If the paper is an image, step S120 includes a step (S126) of performing image processing on the document to locate the words corresponding to 'Abstract' and 'Keywords', and a step (S127) of setting a rectangular range over the regions where the abstract and the keywords appear and extracting the text within that range.

If text can be extracted from the paper directly, the text of the abstract and the keywords is extracted (S128).

Step S120 further includes a step (S129) of removing unnecessary words and characters, such as '초록', 'Abstract', '키워드', 'Keywords', 'Key words', and the comma (','), from the text extracted in step S127 or step S128.
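The cleanup of step S129 can be illustrated with a minimal Python sketch. The function name `clean_extracted_text` and the label list are assumptions made for illustration; the description names these labels only as examples and does not specify how the removal is implemented.

```python
import re

# Labels named as examples in step S129. The complete list used in the
# embodiment is not given, so this list is an assumption.
LABELS = ["초록", "Abstract", "키워드", "Keywords", "Key words"]


def clean_extracted_text(text: str) -> str:
    """Remove section labels and commas, then collapse repeated whitespace."""
    # Try longer labels first so that "Keywords" is matched as a whole.
    ordered = sorted(LABELS, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(label) for label in ordered) + r")\b"
    text = re.sub(pattern, " ", text, flags=re.IGNORECASE)
    text = text.replace(",", " ")
    return " ".join(text.split())
```

For example, `clean_extracted_text("Abstract Topic models group words. Keywords LDA, similarity")` leaves only the body text and the keyword terms.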

If the research data is determined in step S121 not to be a paper, the research summary is extracted from the research data by a method suited to that document type (S123).

Referring to FIG. 10, step S140 specifically includes: a step (S145) of classifying whether the research summary is written in Korean or in English; a step (S146) of, if the research summary is in Korean, using KOMORAN to extract common nouns, proper nouns, and Chinese characters from the research summary and composing the extracted words into a word set; and a step (S147) of, if the research summary is in English, using CoreNLP to extract nouns and adjectives from the research summary and composing the extracted words into a word set.
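The language classification of step S145 can be sketched in Python using the code-point test recited in the claims (an integer value greater than 0x3131 and less than 0xD7A3). The function names are hypothetical, and treating the summary as Korean as soon as one character falls in the range is an assumption; the source does not state how many characters must match.

```python
HANGUL_LOW = 0x3131   # lower bound recited in the claims (exclusive)
HANGUL_HIGH = 0xD7A3  # upper bound recited in the claims (exclusive)


def is_korean(summary: str) -> bool:
    """Return True if any character's code point lies strictly inside the range.

    Assumption: a single matching character is enough to classify the text.
    """
    return any(HANGUL_LOW < ord(ch) < HANGUL_HIGH for ch in summary)


def choose_analyzer(summary: str) -> str:
    """Name the analyzer that steps S146 and S147 would use for this summary."""
    return "KOMORAN" if is_korean(summary) else "CoreNLP"
```

The sketch returns only the analyzer name; the actual calls into KOMORAN or CoreNLP are left out because the source does not describe them.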

Referring to FIG. 11, step S260 specifically includes: a step (S264) of computing the similarity between the topic group of the central researcher and the topic group of another researcher using the Jaccard algorithm, the SL (Scaled Levenshtein) algorithm, and the Soft TF/IDF algorithm; a step (S265) of classifying the similarity values of the three algorithms according to a preset classification; and a step (S266) of selecting the other researcher as a related researcher, or excluding the other researcher from the related researchers, according to the preset classification.
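Two of the three measures named in step S264 can be sketched in Python as follows. This is an illustration under stated assumptions, not the embodiment's implementation: the Jaccard measure is applied to topic groups treated as sets of words, the scaled Levenshtein measure is applied to a pair of strings using the common normalization of one minus the edit distance divided by the longer length, and Soft TF/IDF is omitted. How the source applies each measure to a whole topic group is not specified.

```python
def jaccard(a: set, b: set) -> float:
    """Size of the intersection divided by the size of the union."""
    if not a and not b:
        return 0.0  # assumption: two empty topic groups are treated as dissimilar
    return len(a & b) / len(a | b)


def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance, one row at a time."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def scaled_levenshtein(s: str, t: str) -> float:
    """Edit distance rescaled to [0, 1], where 1 means identical strings."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest
```

Both functions return values between 0 and 1, which matches the thresholds of 0, 0.5, and 1 used in the classification of step S266.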

Step S266, in which another researcher is selected as a related researcher or excluded from the related researchers according to the preset classification, includes: a step (S266a) of selecting the other researcher as a related-researcher candidate if the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm are all 1; a step (S266b) of selecting the other researcher as a related-researcher candidate if two or more of the three similarity values are 0.5 or greater; a step (S266c) of excluding the other researcher from the related researchers if the three similarity values are all 0; and a step (S266d) of excluding the other researcher from the related researchers if two or more of the three similarity values are less than 0.5.
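The four rules of step S266 can be written as a small Python decision function. The function name `classify` and the string labels are hypothetical. With three scores, either at least two are 0.5 or greater or at least two are below 0.5, so rules S266b and S266d together cover every case, and S266a and S266c are their extreme forms.

```python
def classify(jaccard_score: float, sl_score: float, soft_tfidf_score: float) -> str:
    """Return "candidate" or "excluded" following steps S266a to S266d."""
    scores = (jaccard_score, sl_score, soft_tfidf_score)
    if all(s == 1 for s in scores):
        return "candidate"  # S266a: all three measures report 1
    if all(s == 0 for s in scores):
        return "excluded"   # S266c: all three measures report 0
    if sum(s >= 0.5 for s in scores) >= 2:
        return "candidate"  # S266b: at least two measures are 0.5 or greater
    return "excluded"       # S266d: at least two measures are below 0.5
```

A candidate returned here is not yet a related researcher; the description applies a further similarity-strength check in step S268.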

Another researcher selected as a related-researcher candidate in step S266a or step S266b is chosen as a related researcher when the similarity strength, obtained by comparing the topic words of that researcher's topic group with those of the central researcher, is equal to or greater than a preset threshold (S268).

Preferred embodiments of the present invention have been described above, but the present invention may employ various changes, modifications, and equivalents. It is clear that the present invention can be applied in the same manner by appropriately modifying the above embodiments. Therefore, the above description does not limit the scope of the present invention, which is defined by the limits of the following claims.

110: data extraction unit
120: morpheme analysis unit
130: topic modeling unit
140: similarity measurement unit
150: user query unit
160: visualization unit
180: researcher database

Claims

A researcher map construction system for facilitating convergence research, comprising:
a data extraction unit that extracts researchers and research summaries from research data to build a database;
a morpheme analysis unit that extracts nouns and adjectives from the research summary to generate a word set;
a topic modeling unit that uses a corpus to extract, from the word set, topic words representing the subject of the research data, and groups related topic words to form a topic group;
a user query unit that receives a search statement from a user;
a similarity measurement unit that searches for a central researcher using the search statement and the topic group, and compares the topic group extracted from the central researcher's research data with topic groups extracted from other research data to search for related researchers judged to be performing research similar to that of the central researcher; and
a visualization unit that constructs and displays a researcher network in which the central researcher and the related researchers found by the similarity measurement unit are connected to each other.

The system of claim 1,
wherein the data extraction unit extracts characters from an image by performing image processing when the research data is an image-based document, and,
when the research data is an image-based paper, sets a start coordinate at the point where the abstract begins and an end coordinate at the point where the abstract ends, replaces the x-coordinate of the end coordinate with the x-coordinate of the rightmost edge of the abstract paragraph, and extracts the text contained in the rectangular range whose opposite corners are the start coordinate and the end coordinate.
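The rectangle construction recited above can be sketched in Python. The sketch assumes that image processing has already produced one bounding box per word of the abstract, in reading order, as `(x0, y0, x1, y1)` tuples; the box format and the function name `abstract_rectangle` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) with the origin at the top left


def abstract_rectangle(word_boxes: List[Box]) -> Box:
    """Build the extraction rectangle from the first and last word of the abstract."""
    first, last = word_boxes[0], word_boxes[-1]
    start_x, start_y = first[0], first[1]  # start coordinate: where the abstract begins
    end_x, end_y = last[2], last[3]        # end coordinate: where the abstract ends
    # The last line is usually shorter than the paragraph, so the end x-coordinate
    # is replaced with the rightmost edge found anywhere in the paragraph.
    end_x = max(box[2] for box in word_boxes)
    return (start_x, start_y, end_x, end_y)
```

Without the replacement of the end x-coordinate, a short final line would yield a rectangle that cuts off the right-hand part of the earlier lines.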

The system of claim 2,
wherein a space threshold width for identifying the space between words is preset in the data extraction unit, and the data extraction unit determines that a text line has ended when the width of the space to the left or right of a word exceeds the space threshold width.
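A minimal Python sketch of the line-termination test follows. It assumes the words of one text line are available as bounding boxes sorted from left to right, and it measures a gap as the left edge of a word minus the right edge of the preceding word; that gap definition and the function name are assumptions, since the formula itself is not reproduced in this text.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1)


def words_until_line_end(boxes: List[Box], space_threshold: int) -> List[Box]:
    """Keep words from the left until a gap wider than the threshold is found."""
    kept: List[Box] = []
    for box in boxes:
        if kept and box[0] - kept[-1][2] > space_threshold:
            break  # the gap is too wide to be an ordinary space: the line ends here
        kept.append(box)
    return kept
```

A test of this kind is useful on multi-column pages, where the gutter between columns is much wider than a normal space between words.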

The system of claim 3,
wherein, with W denoting a word, n the number of extracted words, WA the set of all words, and with the i-th extracted word, its region, its start coordinates, and its end coordinates each denoted by a symbol [symbols not reproduced in the source text], the data extraction unit calculates the width of the space between the (i-1)-th and i-th words, when a condition [not reproduced in the source text] is satisfied, by a formula [not reproduced in the source text].

The system of claim 1,
wherein the morpheme analysis unit reads the sentences in the research summary one char at a time, obtains the integer value of each character, and determines that the research summary is in Korean if the integer value is greater than 0x3131 and less than 0xD7A3.

The system of claim 5,
wherein the morpheme analysis unit, if the research summary is in Korean, analyzes morphemes using KOMORAN and extracts the words that are common nouns, proper nouns, or Chinese characters to compose a word set, and,
if the research summary is in English, analyzes morphemes using CoreNLP and extracts the words that are nouns or adjectives to compose a word set.

The system of claim 1,
wherein the topic modeling unit builds a corpus from the word set and applies the corpus to a Latent Dirichlet Allocation algorithm to generate the topic group.

The system of claim 7,
wherein the topic modeling unit randomly assigns one of K topic words to each research data item,
for each research data item (d), all words (w) contained in that research data item (d), and the topic words (t) present among those words (w),
calculates the proportion p(t|d) of topic word (t) within the word set (w) of the research data item (d),
calculates the proportion p(w|t) assigned to topic word (t) over all research data, and
newly selects the topic word (t) according to the product of p(t|d) and p(w|t).
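The reassignment loop recited above resembles collapsed Gibbs sampling for Latent Dirichlet Allocation, and can be sketched in Python under that reading. The function name, the smoothing constants `alpha` and `beta`, the per-word (rather than per-document) topic assignment, and sampling in proportion to the product are all assumptions drawn from the standard technique, not details confirmed by the source.

```python
import random
from collections import defaultdict


def lda_gibbs(docs, k, iterations=100, alpha=0.1, beta=0.01, seed=0):
    """Assign one of k topics to every word occurrence in docs (lists of words)."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * k for _ in docs]                # words of doc d assigned to t
    topic_word = [defaultdict(int) for _ in range(k)]  # occurrences of w assigned to t
    topic_total = [0] * k
    assign = []
    for d, doc in enumerate(docs):                     # random initial assignment
        row = []
        for w in doc:
            t = rng.randrange(k)
            row.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assign.append(row)
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assign[d][i]                     # remove the current assignment
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                weights = []
                for t in range(k):
                    p_t_d = (doc_topic[d][t] + alpha) / (len(doc) - 1 + k * alpha)
                    p_w_t = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p_t_d * p_w_t)      # product of p(t|d) and p(w|t)
                new = rng.choices(range(k), weights=weights)[0]
                assign[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return assign, topic_word
```

After the loop, the words with the highest counts in each `topic_word[t]` table could serve as the topic words of a topic group; that final selection step is likewise an assumption.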

The system of claim 1,
wherein, when a researcher related to the search statement is found, the topic modeling unit extracts a topic group from the research data that includes that researcher and performs search-term-to-topic mapping by linking the topic words that match the search statement.

The system of claim 9,
wherein, to address the problem that the meaning of the search statement may be ambiguous, the topic modeling unit performs the similarity comparison by combining the topic words of the topic group with the keywords stated in the paper.

The system of claim 1,
wherein the similarity measurement unit searches for related researchers by computing the similarity between the topic group of the central researcher and the topic group of another researcher using the Jaccard algorithm, the SL (Scaled Levenshtein) algorithm, and the Soft TF/IDF algorithm.

The system of claim 11,
wherein the similarity measurement unit performs the similarity computation between researchers a number of times determined by the number (N) of researchers [expression not reproduced in the source text].

The system of claim 11,
wherein, if the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm are all 1, the similarity measurement unit determines that the researcher compared with the central researcher is a related researcher performing similar research.

The system of claim 11,
wherein the similarity measurement unit compares the central researcher with another researcher and, if any of the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm is 0, excludes the other researcher from the related researchers.

The system of claim 11,
wherein the similarity measurement unit compares the central researcher with another researcher and, if two or more of the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm are less than 0.5, determines that there is no semantic similarity and excludes the other researcher from the related researchers.

The system of claim 1,
wherein the visualization unit composes the researcher network from figures, each bearing a researcher's name, and connecting lines, each connecting one figure to another figure and having a thickness determined by the number of related topic words between the researchers corresponding to the two figures, and
the visualization unit displays a researcher's topic words around the periphery of a figure when the figure is selected, and displays the related topic words between the researchers when a connecting line is selected.
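The relation between connecting-line thickness and shared topic words can be sketched in Python. The function name `build_network`, the dictionary-based edge format, and the rule that a line is drawn only when at least one topic word is shared are assumptions for illustration; the source states only that thickness corresponds to the number of related topic words.

```python
from itertools import combinations
from typing import Dict, List, Set


def build_network(topics_by_researcher: Dict[str, Set[str]]) -> List[dict]:
    """Create one edge per researcher pair that shares at least one topic word."""
    edges = []
    for a, b in combinations(sorted(topics_by_researcher), 2):
        shared = topics_by_researcher[a] & topics_by_researcher[b]
        if shared:
            edges.append({
                "source": a,
                "target": b,
                "thickness": len(shared),         # thicker line for more shared words
                "shared_topics": sorted(shared),  # shown when the line is selected
            })
    return edges
```

A rendering layer could map `thickness` to a stroke width and show `shared_topics` when the user selects a line.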

A data extraction unit that extracts researchers and research summaries from research data to build a database;
a morpheme analysis unit that extracts nouns and adjectives from the research summary to generate a word set;
a topic modeling unit that uses a corpus to extract, from the word set, topic words representing the subject of the research data, and groups related topic words to form a topic group;
a user query unit that receives a search statement from a user;
a similarity measurement unit that searches for a central researcher using the search statement and the topic group, and compares the topic group extracted from the central researcher's research data with topic groups extracted from other research data to search for related researchers judged to be performing research similar to that of the central researcher; and
a visualization unit that constructs and displays a researcher network in which the central researcher and the related researchers found by the similarity measurement unit are connected to each other, wherein
the data extraction unit extracts characters from an image by performing image processing when the research data is an image-based document, and, when the research data is an image-based paper, sets a start coordinate at the point where the abstract begins and an end coordinate at the point where the abstract ends, replaces the x-coordinate of the end coordinate with the x-coordinate of the rightmost edge of the abstract paragraph, and extracts the text contained in the rectangular range whose opposite corners are the start coordinate and the end coordinate,
a space threshold width for identifying the space between words is preset in the data extraction unit, and the data extraction unit determines that a text line has ended when the width of the space to the left or right of a word exceeds the space threshold width,
with W denoting a word, n the number of extracted words, WA the set of all words, and with the i-th extracted word, its region, its start coordinates, and its end coordinates each denoted by a symbol [symbols not reproduced in the source text], the data extraction unit calculates the width of the space between the (i-1)-th and i-th words, when a condition [not reproduced in the source text] is satisfied, by a formula [not reproduced in the source text],
the morpheme analysis unit reads the sentences in the research summary one char at a time, obtains the integer value of each character, and determines that the research summary is in Korean if the integer value is greater than 0x3131 and less than 0xD7A3,
the morpheme analysis unit, if the research summary is in Korean, analyzes morphemes using KOMORAN and extracts the words that are common nouns, proper nouns, or Chinese characters to compose a word set, and, if the research summary is in English, analyzes morphemes using CoreNLP and extracts the words that are nouns or adjectives to compose a word set,
the topic modeling unit builds a corpus from the word set and applies the corpus to a Latent Dirichlet Allocation algorithm to generate the topic group,
the topic modeling unit randomly assigns one of K topic words to each research data item and, for each research data item (d), all words (w) contained in that research data item (d), and the topic words (t) present among those words (w), calculates the proportion p(t|d) of topic word (t) within the word set (w) of the research data item (d), calculates the proportion p(w|t) assigned to topic word (t) over all research data, and newly selects the topic word (t) according to the product of p(t|d) and p(w|t),
when a researcher related to the search statement is found, the topic modeling unit extracts a topic group from the research data that includes that researcher and performs search-term-to-topic mapping by linking the topic words that match the search statement,
to address the problem that the meaning of the search statement may be ambiguous, the topic modeling unit performs the similarity comparison by combining the topic words of the topic group with the keywords stated in the paper,
the similarity measurement unit searches for related researchers by computing the similarity between the topic group of the central researcher and the topic group of another researcher using the Jaccard algorithm, the SL (Scaled Levenshtein) algorithm, and the Soft TF/IDF algorithm,
the similarity measurement unit performs the similarity computation between researchers a number of times determined by the number (N) of researchers [expression not reproduced in the source text],
if the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm are all 1, the similarity measurement unit determines that the researcher compared with the central researcher is a related researcher performing similar research,
the similarity measurement unit compares the central researcher with another researcher and, if any of the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm is 0, excludes the other researcher from the related researchers,
the similarity measurement unit compares the central researcher with another researcher and, if two or more of the similarity values of the Jaccard algorithm, the SL algorithm, and the Soft TF/IDF algorithm are less than 0.5, determines that there is no semantic similarity and excludes the other researcher from the related researchers,
the visualization unit composes the researcher network from figures, each bearing a researcher's name, and connecting lines, each connecting one figure to another figure and having a thickness determined by the number of related topic words between the researchers corresponding to the two figures, and
the visualization unit displays a researcher's topic words around the periphery of a figure when the figure is selected, and displays the related topic words between the researchers when a connecting line is selected.

A visualization unit is included that displays figures, each bearing a researcher's name, and connecting lines, each connecting one figure to another figure and having a thickness determined by the number of related topic words between the researchers corresponding to the two figures, and
the visualization unit displays a researcher's topic words around the periphery of a figure when the figure is selected, and displays the related topic words between the researchers when a connecting line is selected.

A method of constructing a researcher map for facilitating convergence research, comprising:
a step in which a data extraction unit extracts researchers and research summaries from research data to build a database;
a step in which a morpheme analysis unit extracts nouns and adjectives from the research summary to generate a word set;
a step in which a topic modeling unit uses a corpus to extract, from the word set, topic words representing the subject of the research data, and groups related topic words to form a topic group;
a step in which a user query unit receives a search statement from a user;
a step in which a similarity measurement unit searches for a central researcher using the search statement and the topic group;
a step in which the similarity measurement unit compares the topic group extracted from the central researcher's research data with topic groups extracted from other research data to search for related researchers judged to be performing research similar to that of the central researcher; and
a step in which a visualization unit constructs and displays a researcher network in which the central researcher and the related researchers are connected to each other.

A method of constructing a researcher map for facilitating convergence research, comprising:
a step in which a user query unit receives a search statement from a user; and
a step in which a visualization unit displays, on a monitor, figures bearing the names of the researchers found and connecting lines, each connecting one figure to another figure and having a thickness determined by the number of related topic words between the researchers corresponding to the two figures.