KR102467096B1

KR102467096B1 - Method and apparatus for checking dataset to learn extraction model for metadata of thesis

Info

Publication number: KR102467096B1
Application number: KR1020200143835A
Authority: KR
Inventors: 정희석; 설재욱; 황혜경; 최성필; 고건우; 김선우; 지선영
Original assignee: 한국과학기술정보연구원; 경기대학교 산학협력단
Priority date: 2020-10-30
Filing date: 2020-10-30
Publication date: 2022-11-15
Also published as: KR20220058257A

Abstract

A method of reviewing a dataset for training a paper metadata area classification model is provided. According to an embodiment of the present invention, the method may include: receiving paper data representing a paper that contains academic information, together with metadata of the paper extracted by a machine-learned paper metadata area classification model; displaying the metadata area text of the metadata and the paper at the same time; modifying the metadata area text according to a user input for modifying it; and processing data on the modification of the metadata into training data for the metadata area classification model.

Description

Method and apparatus for checking dataset to learn extraction model for metadata of thesis

The present invention relates to a method and apparatus for reviewing a dataset for training a paper metadata area classification model. More specifically, it provides a dataset review method and apparatus in which metadata area text extracted from a paper is reviewed with a web-based review tool, the corrections are built into a dataset, and that dataset can be used to train the paper metadata area classification model.

Existing paper services have built databases of standardized bibliographic metadata and have shared papers mainly as PDF files. Recently, however, a growing number of services, led by high-impact journals and Open Access journals, provide full-text XML to readers, and there are increasing attempts to build high-value-added services that go beyond existing information services by applying big-data analysis and similar techniques to the large-scale text of papers.

Building a database of bibliographic metadata for a vast number of papers has been done mostly by hand, which is costly and time-consuming. Technical and social demands, however, call for a technology that can quickly collect full-text XML from existing PDF files and bibliographic databases.

Papers, however, span diverse technical fields, and in most cases their content is complex and novel. Because this vast literature is scattered, a large paper database needs to be standardized and built so that papers can be used efficiently.

Most conventional papers are shared as Adobe PDF files, and papers provided in PDF or other formats are often in a form from which text is difficult to extract. For example, because PDF paper files have no standardized layout, the position of the text differs from paper to paper, and even papers in the same technical field are mostly written in different formats. In a paper contained in a PDF file, the title and authors are usually placed at the top of the first page, but the other categories, such as the abstract, keywords, and author information, are structured differently in each paper. In addition, the body of some papers is set in one column and that of others in two columns.

To extract text from papers with such differing structures, image processing techniques such as OCR have conventionally been used, but they extracted the text sequentially from left to right without regard to which category each piece of the paper's metadata belonged to. Because text from different sections was then mixed together, accurate metadata reflecting the document structure could not be extracted.

To address this, various models capable of extracting the metadata of papers are being developed, but even with such models it has not been easy in practice to extract accurate metadata from PDF papers whose structures all differ. As a result, the structure of the text within a paper could not be recognized accurately, text could not be extracted with high quality, and a kind of post-processing was required in which part of the extracted metadata had to be corrected.

Because this work requires comparing the paper and the metadata item by item, it takes time and money, to the point that the effort amounts to producing the metadata over again. Moreover, the corrections recorded during this work are a history of supplementing and fixing metadata that the existing metadata area classification model failed to extract properly, so they could serve as excellent data for training the metadata area classification model. In practice, however, this useful resource went unused and was wasted.

The extracted metadata therefore needed to be corrected, but when metadata such as an XML file is compared against the paper, the many tags and other information inherent in the XML format make visual checking inefficient. Review or correction tools did exist, but they were not easy to apply to papers, because the structure of each paper differs and most papers contain novel content.

Accordingly, there is a need for a technology that assists the comparison so that the paper and its metadata can be checked at a glance and that provides a platform on which review and correction are possible. There is a further need for a technology that, once the review is complete, records the corrections made during the review and stores them in a training data format, so that a training dataset usable for training the metadata area classification model can be built.

Korean Registered Patent Publication No. 10-1500598, "XML Generation System and Method" (registered on March 3, 2020)

A technical problem to be solved by the present invention is to provide a method and apparatus for reviewing a dataset for training a paper metadata area classification model, with which metadata area text extracted from a paper can be reviewed using a web-based review tool.

Another technical problem to be solved by the present invention is to provide a method and apparatus for reviewing a dataset for training a paper metadata area classification model, in which the corrected content is built into a dataset and the built dataset can be used to train the paper metadata area classification model.

The technical problems of the present invention are not limited to those mentioned above, and other technical problems not mentioned will be clearly understood by those skilled in the art from the description below.

To solve the above technical problems, a method of reviewing a dataset for training a paper metadata area classification model according to an embodiment of the present invention is performed by a computing device and may include: receiving paper data representing a paper that contains academic information, together with metadata of the paper extracted by a machine-learned paper metadata area classification model; displaying the metadata area text of the metadata and the paper at the same time; modifying the metadata area text according to a user input for modifying it; and processing data on the modification of the metadata into training data for the metadata area classification model.

In another embodiment, an apparatus for reviewing a dataset for training a paper metadata area classification model includes a processor, a network interface, a memory that loads a computer program executed by the processor, and a storage that stores the computer program. The computer program may include: an instruction for receiving paper data representing a paper that contains academic information, together with metadata of the paper extracted by a machine-learned paper metadata area classification model; an instruction for displaying the metadata area text of the metadata together with the paper; an instruction for modifying the metadata area text according to a user input for modifying it; and an instruction for processing data on the modification of the metadata into training data for the metadata area classification model.

FIG. 1 is an exemplary diagram of a method of reviewing a dataset for training a paper metadata area classification model according to an embodiment of the present invention.
FIG. 2 is a flowchart of a method of reviewing a dataset for training a paper metadata area classification model according to an embodiment of the present invention.
FIG. 3 is a diagram illustrating an example in which metadata area text and a paper are displayed and training data for the metadata area classification model is processed.
FIG. 4 is a diagram explaining step S300 of FIG. 2 in detail.
FIG. 5 is a diagram showing an example of part of a paper PDF.
FIG. 6 is a diagram showing metadata area text output under the matching metadata items on the data review screen.
FIG. 7 is a diagram showing an example, in step S310, of correcting the classification of a text element whose metadata area classification was erroneous.
FIG. 8 is a diagram showing an example of correcting the order of text elements in step S320.
FIG. 9 is a diagram explaining the data input to the paper metadata area classification model and the results output by it.
FIG. 10 is a diagram schematically explaining the process by which the paper metadata area classification model extracts the metadata of a paper from a PDF paper.
FIG. 11 is a hardware configuration diagram of an apparatus for classifying the metadata area of a paper according to another embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings. The advantages and features of the present invention, and the methods of achieving them, will become clear from the embodiments described in detail below together with the accompanying drawings. The present invention is not, however, limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the present invention belongs are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Unless otherwise defined, all terms used in this specification (including technical and scientific terms) have the meanings commonly understood by those of ordinary skill in the art to which the present invention belongs. Terms defined in commonly used dictionaries are not to be interpreted ideally or excessively unless they are explicitly and specifically defined. The terminology used in this specification is for describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural unless the phrase specifically states otherwise.

Hereinafter, some embodiments of the present invention are described with reference to the drawings.

FIG. 1 is an exemplary diagram of a method of reviewing a dataset for training a paper metadata area classification model according to an embodiment of the present invention.

Referring to FIG. 1, when paper data and metadata area text are input to the paper metadata review apparatus (100), the apparatus (100) may provide a tool for reviewing the metadata of the paper.

The paper (10) may consist of paper data representing a paper that contains academic information and, in one embodiment, may be a paper in PDF file format. The paper (10) may be a document organized, following a typical structure, into a title, English title, author information, English abstract, Korean abstract, body, notes, and so on. The metadata is data extracted by a machine-learned paper metadata area classification model and may be in XML, TXT, or JATS-XML, although the format is not limited to these.

In some embodiments, the metadata and the paper (10) input to the paper metadata review apparatus (100) may be displayed together through the web, with a page of the paper (10) placed alongside the metadata area text (20) corresponding to that page. The metadata area text (20) displayed on the web may be modified by user input, and the modified content may be built into training data. The dataset review method according to the present invention is described in more detail with reference to FIG. 2.

FIG. 2 is a flowchart of a method of reviewing a dataset for training a paper metadata area classification model according to an embodiment of the present invention, and FIG. 3 is a diagram illustrating an example in which a paper and metadata area text are displayed and training data for the metadata area classification model is processed.

In step S100 of FIG. 2, paper data and the metadata of the paper may be input. Specifically, paper data representing a paper that contains academic information and the metadata of the paper extracted by a machine-learned paper metadata area classification model may be input. The paper is typically input as a PDF file, and the paper file may be data supplied by the user (a local file) or data managed in an integrated paper service database (a file managed on a server). The metadata area text of the paper may take the form of a metadata table classified according to the sections of the paper. For example, the metadata areas of a paper may be data classified into title, English title, author information, English abstract, Korean abstract, Korean keywords, English keywords, and other (body, etc.).
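As a non-limiting illustration, the input of step S100 could be held in a structure such as the following minimal sketch. The names `METADATA_FIELDS`, `ReviewInput`, and `make_review_input` are hypothetical and do not appear in the specification; only the field categories and the local/server distinction come from the description above.

```python
from dataclasses import dataclass, field

# Field categories taken from the description; the identifiers are illustrative.
METADATA_FIELDS = (
    "title", "english_title", "author_info", "english_abstract",
    "korean_abstract", "korean_keywords", "english_keywords", "other",
)


@dataclass
class ReviewInput:
    paper_path: str   # location of the PDF paper
    source: str       # "local" (user-supplied) or "server" (managed database)
    metadata: dict = field(
        default_factory=lambda: {name: "" for name in METADATA_FIELDS}
    )


def make_review_input(paper_path, source, extracted):
    """Combine a paper reference with the metadata extracted by the model."""
    if source not in ("local", "server"):
        raise ValueError(f"unknown source: {source}")
    unknown = set(extracted) - set(METADATA_FIELDS)
    if unknown:
        raise ValueError(f"unknown metadata fields: {sorted(unknown)}")
    item = ReviewInput(paper_path, source)
    item.metadata.update(extracted)  # fields the model did not fill stay empty
    return item
```

Starting every field as an empty string means a category that is absent from a given paper still appears as a row in the metadata table shown to the reviewer.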

In step S200, the metadata area text and the paper may be displayed together on the web, with a page of the paper placed alongside the metadata area corresponding to that page. In one embodiment, the paper is displayed on the left and the metadata on the right. The first page of the paper may be displayed, and the metadata area table displayed may be the table containing the content that corresponds to that first page.

Referring to FIG. 3, the paper (10) and the metadata area text (20) may be displayed on the web. For example, a web-based paper review platform built with Python or JavaScript may receive input for modifying the metadata area text. When the paper (10) and the metadata area text (20) are input, the paper (10) may be displayed on the left of the web page and the metadata area text (20) on the right.

The metadata area text (20) may be classified, according to the sections of the paper (10), into title (2-1), English title (2-2), author information (2-3), notes (2-4), English abstract (2-5), Korean abstract (2-6), English keywords (2-7), Korean keywords (2-8), and so on.

When the edit button displayed on the web page is clicked, the metadata can be modified, and clicking the save button stores the modified metadata. The stored data may be processed into training data (30) for training the metadata area classification model.

In step S300, the metadata area text (20) may be modified according to user input. Here, user input may refer to keyboard input, mouse input, and the like on the web, and it serves to correct misclassification of metadata area text (or text elements).

The metadata area text (20) that the user has confirmed or corrected may be converted in step S400 into training data for the paper metadata area classification model. In one embodiment, the classification of metadata area text that is divided into several text elements is refined by the user, and then, based on the data attributes of the outermost coordinates among those metadata area texts, the coordinates of the rectangle enclosing the metadata area are stored as training data along with the text and the font attributes of the text. In this way, the metadata area text (20) can be processed into training data.
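As a non-limiting illustration of step S400, the enclosing rectangle can be computed from the outermost coordinates of the text elements assigned to one metadata area. The sketch below assumes each text element carries its own box `(x0, y0, x1, y1)` and font attributes; the `TextElement` class, the record keys, and `build_training_record` are hypothetical names, and the actual stored format is not specified in the description.

```python
from dataclasses import dataclass


@dataclass
class TextElement:
    text: str
    x0: float  # left edge
    y0: float  # top edge
    x1: float  # right edge
    y1: float  # bottom edge
    font: str
    size: float


def build_training_record(label, elements):
    """Build one training record for a reviewed metadata area."""
    if not elements:
        raise ValueError("a metadata area needs at least one text element")
    # The rectangle enclosing the area is defined by the outermost coordinates.
    bbox = (
        min(e.x0 for e in elements),
        min(e.y0 for e in elements),
        max(e.x1 for e in elements),
        max(e.y1 for e in elements),
    )
    return {
        "label": label,
        "bbox": bbox,
        "texts": [e.text for e in elements],  # kept in reviewed order
        "fonts": sorted({(e.font, e.size) for e in elements}),
    }
```

For example, two elements with boxes `(10, 20, 50, 30)` and `(5, 32, 60, 42)` yield the enclosing rectangle `(5, 20, 60, 42)`.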

The training data obtained through the review method according to the present invention can be used as a dataset for training the paper metadata area classification model. That is, the data on the modification of the metadata area text is a training dataset that is input, together with the paper, to the paper metadata area classification model in order to train the model to output refined metadata areas for a paper when that paper is input.
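As a non-limiting illustration, the data on the modification could be derived by comparing the metadata as extracted by the model with the metadata as saved by the reviewer. The sketch below assumes both are plain dictionaries mapping a field name to its text; `collect_corrections` and the `before`/`after` keys are hypothetical and are not taken from the specification.

```python
def collect_corrections(extracted, reviewed):
    """Return the fields whose text the reviewer changed, with both versions."""
    corrections = {}
    for name in sorted(set(extracted) | set(reviewed)):
        before = extracted.get(name, "")
        after = reviewed.get(name, "")
        if before != after:
            corrections[name] = {"before": before, "after": after}
    return corrections
```

Keeping both versions preserves what the model got wrong as well as the corrected target, which is the information the description identifies as valuable for training.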

By supplementing the metadata extracted by the paper metadata area classification model, the review method according to the present invention can augment the dataset on which the model is trained, so that the model learns to extract metadata more accurately.

The method of reviewing a dataset for training a paper metadata area classification model according to this embodiment has been outlined above. This embodiment may be performed by a computing device. For example, the computing device may be the paper metadata area classification apparatus described with reference to FIG. 1. In describing this embodiment, the subject performing some operations may be omitted; in such cases, the performing subject is the computing device.

Hereinafter, an example in which the metadata area text is modified according to a user input for modifying the metadata is described with reference to FIGS. 4 to 9.

In FIG. 4, when step S300 is performed, step S310 or step S320 may be performed selectively. In step S310, when the classification of a text element in the PDF file is determined to be erroneous, the classification of that text element may be corrected using the paper metadata area classification model. In step S320, the classification of text elements that are laid out differently from a preset parsing method may be corrected.

FIG. 5 is an example of part of a paper PDF. Each text element is drawn as a dotted box, and the metadata areas to be found are shown with a gray background. The 'Korean abstract' label at the top is a metadata area consisting of a single text element. The Korean abstract metadata area below it consists of six text elements. In the figure, what looks like a single line to a person may consist of several text elements, and a single sentence may be split across text elements. Because a metadata area made up of several text elements is computed from the coordinates of its outermost elements, the order of the elements is first extracted correctly, and the corresponding keywords can then be entered in the Korean keyword area located below it.

In FIG. 6, the extracted metadata area text is output on the data review screen (table) under the matching metadata items. In the figure, the texts of the Korean abstract area are arranged from top to bottom in the order a person would read them and output in the 'Korean abstract' field. Likewise, the text of the Korean keyword area is output in the 'Korean keywords' field. Finding the correct order of the text elements is important because it is needed (a) to delimit the metadata area based on the coordinates of the outermost elements and (b) to increase the accuracy of metadata extraction. As a result, the text of the metadata area can be extracted in the normal order, as shown in FIG. 6.

FIG. 7 shows an example, in step S310, of correcting the classification of a text element whose metadata area classification is wrong. As shown in the figure, the 'Korean keywords' field is empty, and the text element of the Korean keyword area has been output under 'Korean abstract'. In this case, the classification of the text element in the PDF file is determined to be erroneous, so the classification of that text element may be corrected using the paper metadata area classification model.
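As a non-limiting illustration of the correction in FIG. 7, moving a misclassified text element from one metadata field to another could look like the sketch below. It assumes each field holds an ordered list of text elements (plain strings here for brevity) and that each correction is appended to a log; `reclassify` and the log keys are hypothetical names, and the sketch shows a reviewer-driven move rather than any model-driven correction.

```python
def reclassify(fields, element, source, target, log):
    """Move one text element from the `source` field to the `target` field."""
    if element not in fields.get(source, []):
        raise ValueError(f"{element!r} is not in field {source!r}")
    fields[source].remove(element)
    fields.setdefault(target, []).append(element)
    # Record the correction so it can later be turned into training data.
    log.append({"element": element, "from": source, "to": target})
```

In the situation of FIG. 7, the keyword text would be moved out of the Korean abstract field and into the empty Korean keywords field, and the log would gain one entry describing that move.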

FIG. 8 shows an example of correcting the order of text elements in step S320. Depending on the PDF file, the classification of text elements that are laid out differently from the preset parsing method may be corrected. Specifically, the classification of the text elements may be corrected so that it conforms to the preset parsing method. This improves the accuracy both of determining the outermost elements used to compute the metadata area and of extracting the metadata text.
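As a non-limiting illustration, one possible reading order is top to bottom and, within a line, left to right. The sketch below assumes each text element is a dictionary with `top` and `left` coordinates measured from the top-left corner of the page, and treats elements whose `top` values differ by at most `line_tolerance` as one line. The function name, the keys, and the tolerance are assumptions; the specification does not define the preset parsing method.

```python
def reading_order(elements, line_tolerance=2.0):
    """Order text elements top to bottom, then left to right within a line."""
    ordered = sorted(elements, key=lambda e: (e["top"], e["left"]))
    lines = []
    for element in ordered:
        # Start a new line unless this element sits at nearly the same height
        # as the first element of the current line.
        if lines and abs(element["top"] - lines[-1][0]["top"]) <= line_tolerance:
            lines[-1].append(element)
        else:
            lines.append([element])
    return [e for line in lines for e in sorted(line, key=lambda e: e["left"])]
```

Grouping by line before sorting by `left` keeps a slightly raised fragment on the right of a line from being read before the fragment to its left.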

The corrected data is used to compute the coordinates of the metadata areas and to obtain refined metadata text. This data can be used as training data for the metadata area classification model. It can also be used to build a database, with author details, keywords, and the like each separated out through metadata item extraction by artificial intelligence or hard-coded rules.

Referring to FIG. 9, when the paper (10) is input to the paper metadata area classification model, the metadata area text (20) for each category of the paper may be output. Specifically, a paper may contain a title, English title, author information, English abstract, Korean abstract, body, and so on, and the paper metadata area classification model (200) may classify this content in the paper (10) into the corresponding sections. The model (200) may classify the text element at the very top as the paper title, the text element below it as the English title, and the text element below that as author information, and may likewise classify the remaining elements as English abstract, Korean abstract, body, or notes according to the section to which each belongs.

The metadata area text can be extracted using the result of this automatic classification of metadata areas. The metadata (20) may be a table made up of the content corresponding to the following items: paper title (2-1), English title (2-2), author information (2-3), notes (2-4), English abstract (2-5), Korean abstract (2-6), English keywords (2-7), and Korean keywords (2-8).
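One way to picture the metadata table is as a record with one field per item. The sketch below mirrors the eight items listed above. The class name and field names are assumptions for illustration, and each field is shown as a plain string for simplicity.

```python
from dataclasses import dataclass, asdict


@dataclass
class PaperMetadata:
    title: str = ""             # paper title (2-1)
    english_title: str = ""     # English title (2-2)
    author_info: str = ""       # author information (2-3)
    notes: str = ""             # notes (2-4)
    english_abstract: str = ""  # English abstract (2-5)
    korean_abstract: str = ""   # Korean abstract (2-6)
    english_keywords: str = ""  # English keywords (2-7)
    korean_keywords: str = ""   # Korean keywords (2-8)


def as_table(metadata):
    """Return (item, text) rows in the order the items are listed."""
    return list(asdict(metadata).items())
```

Empty strings represent items that a given paper does not contain.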

The format of a paper differs from journal to journal and, even within the same journal, may change over time. Although the relative order of positions follows a rule, the exact position of each metadata item varies from paper to paper, and whether a particular metadata item is present at all depends on the paper. The output of a metadata area classifier run under these conditions may therefore contain incorrect classifications or an incorrect text order. Misclassified metadata area text, or text arranged in the wrong order, can cause errors in the spatial distances between objects, in their positional relationships, and in word reconstruction by post-processing (for example, the word '메타데이터' ("metadata") in the middle of the abstract in FIG. 5 is split into '메타데' + '이터'). These errors must be corrected accurately to improve the training performance of the metadata area classifier and the accuracy of database construction.
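The word reconstruction problem mentioned above can be illustrated with a small sketch. The source does not describe the post-processing algorithm, so this example assumes a known vocabulary and merges two adjacent fragments whenever their concatenation is a known word. `merge_fragments` and `vocabulary` are hypothetical.

```python
def merge_fragments(fragments, vocabulary):
    """Merge adjacent fragments whose concatenation is a known word."""
    merged = []
    i = 0
    while i < len(fragments):
        joined = fragments[i] + fragments[i + 1] if i + 1 < len(fragments) else None
        if joined is not None and joined in vocabulary:
            merged.append(joined)
            i += 2  # both fragments were consumed
        else:
            merged.append(fragments[i])
            i += 1
    return merged
```

If the text elements are in the wrong order, the two halves of a word are no longer adjacent and a merge like this cannot recover it, which is why order errors have to be corrected first.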

Conventionally, this process required directly editing the text (or XML) file output by the metadata area classifier, which left room for reduced accuracy caused by poor readability and typos. With the method and apparatus for reviewing metadata area classification data according to the present invention, classification results can be corrected under conditions and in an environment that minimize human error, which increases the accuracy of the data review results.

The dataset review method for training the paper metadata area classification model (200) according to an embodiment of the present invention is provided as a web-based tool implemented in Python or JavaScript. Because it does not depend on any particular platform or OS, it can be implemented so that various users review papers on various devices. In addition, when this embodiment is provided from a server, it may be implemented so that multiple users connect to the server and use it.

So far, the paper metadata area classification model of the present invention has been described in detail with reference to FIGS. 10 and 11.

FIG. 12 is a diagram illustrating an example of training the paper metadata area classification model with a PDF paper and a training dataset produced by correcting metadata.

The paper PDF file (10) described above and the training data (30) produced by the review method of this embodiment can be used to train the paper metadata area classification model. For example, when the paper PDF file (10) and the training data (30) are input, the model may be trained by using the training data (30) to calculate the error relative to the metadata extracted by the existing model and backpropagating that error to update the weights. The way the paper metadata area classification model is trained is not limited to this.
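The error-driven weight update can be sketched with a deliberately small model. The source does not specify the architecture, so this example assumes a single-layer softmax classifier over hand-made feature vectors, for which backpropagation reduces to one gradient step. The reviewer-corrected label is the target. All names here are hypothetical.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def train_step(weights, features, target, lr=0.1):
    """One gradient step; returns the cross-entropy loss before the update.

    weights: one weight vector per label (updated in place)
    features: feature vector of a text element, e.g. scaled coordinates
    target: index of the reviewer-corrected label
    """
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, features)) for w in weights]
    probs = softmax(logits)
    loss = -math.log(probs[target])
    for k, w in enumerate(weights):
        # Gradient of the loss with respect to logit k is p_k - 1[k == target].
        grad = probs[k] - (1.0 if k == target else 0.0)
        for j, x_j in enumerate(features):
            w[j] -= lr * grad * x_j
    return loss
```

Repeating the step on the same corrected example lowers its loss, which is the sense in which the corrections feed back into the model.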

An exemplary computing device (500) capable of implementing the devices described in the various embodiments of the present invention is described below with reference to FIG. 11.

FIG. 11 is an exemplary hardware configuration diagram of the computing device (500).

As shown in FIG. 11, the computing device (500) may include one or more processors (510), a bus (550), a communication interface (570), a memory (530) into which a computer program (591) executed by the processor (510) is loaded, and a storage (590) that stores the computer program (591). FIG. 11 shows only the components related to the embodiment of the present invention. Those skilled in the art to which the present invention pertains will therefore understand that other general-purpose components may be included in addition to those shown in FIG. 11.

The processor (510) controls the overall operation of each component of the computing device (500). The processor (510) may include at least one of a CPU (Central Processing Unit), an MPU (Micro Processor Unit), an MCU (Micro Controller Unit), a GPU (Graphics Processing Unit), or any other type of processor well known in the technical field of the present invention. The processor (510) may also perform operations for at least one application or program for executing the methods/operations according to the various embodiments of the present invention. The computing device (500) may have one or more processors.

The memory (530) stores various data, commands, and/or information. The memory (530) may load one or more programs (591) from the storage (590) in order to execute the methods/operations according to the various embodiments of the present invention. For example, when the computer program (591) is loaded into the memory (530), logic (or modules) such as that shown in FIG. 4 may be implemented on the memory (530). The memory (530) may be, for example, RAM, but is not limited to this.

The bus (550) provides communication between the components of the computing device (500). The bus (550) may be implemented as various types of bus, such as an address bus, a data bus, and a control bus.

The communication interface (570) supports wired and wireless Internet communication of the computing device (500). The communication interface (570) may also support communication methods other than Internet communication. To this end, the communication interface (570) may include a communication module well known in the technical field of the present invention.

The storage (590) may store one or more computer programs (591) in a non-transitory manner. The storage (590) may include non-volatile memory such as ROM (Read Only Memory), EPROM (Erasable Programmable ROM), EEPROM (Electrically Erasable Programmable ROM), or flash memory, a hard disk, a removable disk, or any other type of computer-readable recording medium well known in the technical field to which the present invention pertains.

The computer program (591) may include one or more instructions that implement the methods/operations according to the various embodiments of the present invention. When the computer program (591) is loaded into the memory (530), the processor (510) may perform those methods/operations by executing the one or more instructions.

In one embodiment, a device includes a processor, a network interface, a memory into which a computer program executed by the processor is loaded, and a storage that stores the computer program. The computer program may include: an instruction for receiving paper data representing a paper that contains academic information, together with metadata of the paper extracted by a machine-learned paper metadata area classification model; an instruction for displaying the metadata and the paper together; an instruction for modifying the metadata according to a user input for modifying the metadata area text; and an instruction for processing data on the modification of the metadata area text into training data for training the metadata classification model.

In one embodiment, the instruction for displaying the metadata and the paper together may include an instruction for arranging a page of the paper and the metadata corresponding to that page.

In one embodiment, the instruction for modifying the metadata according to a user input for modifying the metadata may include an instruction for receiving the input for modifying the metadata on a web-based paper review platform.
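The instruction that processes corrections into training data can be sketched as a conversion from review records to a serialized format. The source only states that corrections are converted into a form the model can take as input, so the JSON Lines layout and every field name below are assumptions made for illustration.

```python
import json


def to_training_record(paper_id, page, text, bbox, model_label, corrected_label):
    """Build one training example from a reviewed text element."""
    return {
        "paper_id": paper_id,
        "page": page,
        "text": text,
        "bbox": list(bbox),        # (x0, y0, x1, y1) of the text element
        "label": corrected_label,  # the reviewer's label is the training target
        "was_corrected": model_label != corrected_label,
    }


def to_jsonl(records):
    """Serialize records as JSON Lines, keeping Korean text readable."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Keeping the `was_corrected` flag lets a later stage weight or filter the examples where the model's output and the reviewer's label disagree.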

The methods according to the embodiments of the present invention described so far may be performed by executing a computer program implemented as computer-readable code. The computer program may be transmitted from a first computing device to a second computing device over a network such as the Internet and installed on the second computing device, where it can then be used. The first computing device and the second computing device include server devices, physical servers belonging to a server pool for a cloud service, and stationary computing devices such as desktop PCs.

The computer program may also be stored on a recording medium such as a DVD-ROM or a flash memory device.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the invention may be practiced in other specific forms without changing its technical idea or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive.

Claims

A method, performed by a computing device, for reviewing a dataset for training a paper metadata area classification model, the method comprising:
receiving paper data representing a paper that contains academic information, together with metadata of the paper extracted by a machine-learned paper metadata area classification model;
displaying, together on one screen, the paper and a result of classifying the metadata by the paper metadata area classification model;
correcting a misclassification of the metadata according to a user input for modifying the metadata; and
processing data on the correction of the metadata into training data for retraining the machine-learned paper metadata area classification model.

The method of claim 1,
wherein the paper metadata area classification model identifies the metadata area text contained in the paper using coordinates according to the arrangement of the text elements of the paper and the properties of the text elements, determines the area in which the metadata area text is arranged to be the metadata area, and extracts text from the metadata area, and
wherein the paper metadata area classification model is machine-learned to classify the metadata area of the metadata area text based on the coordinates according to the arrangement of the text elements of the paper and the properties of the text elements.

The method of claim 2,
wherein the paper metadata area classification model is a model that classifies the metadata area based on the coordinates of the outermost elements among a plurality of text elements.

The method of claim 2,
wherein the paper metadata area classification model is a model that classifies the section of the paper to which the metadata area text belongs, using the arrangement of the text elements and the properties of the text.

The method of claim 1,
wherein displaying the classification result and the paper together on one screen comprises arranging a page of the paper and the classification result corresponding to that page.

The method of claim 1,
wherein correcting the misclassification of the metadata according to the user input comprises, when the classification of a text element of the paper is determined to be an error, correcting the classification of that text element using the paper metadata area classification model.

The method of claim 1,
wherein correcting the misclassification of the metadata according to the user input comprises modifying the classification of those text elements of the paper that are arranged in a way that differs from a preset parsing method, so that the classification conforms to the preset parsing method.

The method of claim 1,
wherein correcting the misclassification of the metadata according to the user input comprises receiving the input for correcting the misclassification of the metadata on a web-based paper review platform.

The method of claim 1,
wherein processing the data on the correction of the metadata into training data comprises converting the data on the correction of the metadata into a format that can be input to the paper metadata area classification model, and storing the converted data.

A dataset review device for training a paper metadata area classification model, the device comprising:
a processor;
a network interface;
a memory into which a computer program executed by the processor is loaded; and
a storage that stores the computer program,
wherein the computer program includes:
an instruction for receiving paper data representing a paper that contains academic information, together with metadata of the paper extracted by a machine-learned paper metadata area classification model;
an instruction for displaying, together on one screen, the paper and a result of classifying the metadata by the paper metadata area classification model;
an instruction for correcting a misclassification of the metadata according to a user input for modifying the metadata; and
an instruction for processing data on the correction of the metadata into training data for retraining the machine-learned paper metadata area classification model.

The device of claim 10,
wherein the paper metadata area classification model identifies the metadata area text contained in the paper using coordinates according to the arrangement of the text elements of the paper and the properties of the text elements, determines the area in which the metadata area text is arranged to be the metadata area, and extracts text from the metadata area, and
wherein the paper metadata area classification model is machine-learned to classify the metadata area of the metadata area text based on the coordinates according to the arrangement of the text elements of the paper and the properties of the text elements.

The device of claim 10,
wherein the paper metadata area classification model is a model that classifies the metadata area based on the coordinates of the outermost elements among a plurality of text elements.

A computer program stored on a computer-readable recording medium and operating in combination with a computing device to execute the steps of:
receiving paper data representing a paper that contains academic information, together with metadata of the paper extracted by a machine-learned paper metadata area classification model;
displaying, together on one screen, the paper and a result of classifying the metadata by the paper metadata area classification model;
correcting a misclassification of the metadata according to a user input for modifying the metadata; and
processing data on the correction of the metadata into training data for retraining the machine-learned paper metadata area classification model.