KR20160064306A

KR20160064306A - Automatic construction system of references

Info

Publication number: KR20160064306A
Application number: KR1020140167341A
Authority: KR
Inventors: 윤정원; 박규태
Original assignee: 손죠 주식회사
Priority date: 2014-11-27
Filing date: 2014-11-27
Publication date: 2016-06-08
Also published as: KR101640428B1

Abstract

The present invention relates to an automatic reference construction system that extracts the reference information included in a paper, automatically structures it into reference information conforming to the international standard for building academic information databases, and uses the structured reference information to automatically build a journal authority. The system extracts the reference information contained in a paper PDF file and structures it according to the international standard, and comprises: a reference format & separator designating unit that designates and sets the format and separators of each reference according to the type of reference, since references are written in a different format for each journal; a reference original form automatic extraction unit that receives a paper PDF file, recognizes the reference region included in it, and extracts the reference information in its original form; and a reference automatic structuring unit that splits the reference information extracted by the reference original form automatic extraction unit according to the format and separators designated through the reference format & separator designating unit, and builds a reference database composed of the items prescribed in the NISO JATS DTD, the international standard for building academic information databases.

Description

AUTOMATIC CONSTRUCTION SYSTEM OF REFERENCES

The present invention relates to an automatic reference construction system, and more particularly to a system that extracts the reference information included in a paper, automatically structures it into reference information based on an international standard for building academic information databases, and uses the structured reference information to automatically build a journal authority.

A paper is a systematic piece of writing in which an author sets out academic research results, opinions, or arguments on a topic logically, consistently, and in a fixed format. Papers include theses and dissertations written to obtain a master's or doctoral degree, academic papers published in journals or presented at conferences, and papers written for publication.

In Korea, unlike abroad, submitted papers generally arrive in formats such as hwp and doc, and supplementary files are provided in formats such as xls and ppt. After the paper information supplied in these formats is collected and edited for publication, the final output format is PDF. Paper files written in these various formats are converted to PDF with a PDF conversion program, and the converted files are then built into a paper database through text extraction. To extract text accurately from a text-based PDF paper file, the PDF is typically saved as an XML file using Adobe Acrobat. The resulting XML structure, however, varies widely depending on the program that generated the PDF and on the version of that program.

Because the XML structure varies with the PDF generation program and its version, it has been difficult to structure paper information automatically from such XML files. Extracting text from XML files of differing structure and structuring it into a database requires a different structuring program for each XML structure. Producing these programs raises development costs, each program must be version-managed individually, and an additional program must be written whenever a new PDF conversion program or a new version is adopted. As a result, overall development and maintenance take considerable time and cost.

Paper information also includes reference information, and the format in which references are written varies with the school or organization publishing the paper. To structure the reference information in a paper into a database, a person therefore has to check each reference, interpret its content according to its format, and rearrange it to fit a standardized format, which requires a great deal of labor and time.

Korean Registered Patent Publication No. 10-03197567 (registered December 21, 2001)

The present invention has been proposed to solve the above problems of the prior art. An object of the invention is to provide an automatic reference construction system that, instead of saving a paper's PDF file as XML, extracts text in the order of the paper through the open-source PDFBOX library, recognizes the reference region within the extracted text, and automatically structures the reference information into reference information based on an international standard so that a reference database can be built.

Another object of the present invention is to provide an automatic reference construction system that automatically extracts the journal DOI from the internationally circulated DOI information associated with the automatically constructed reference information, so that the system can automatically register representative journal names and variant journal names through the journal DOI.

In the present invention, the reference information included in a paper is automatically extracted as reference text by a PDF-to-text extraction program and then structured into reference information based on the international standard for building academic information databases. From the structured reference information, the journal name, publication year, volume, and start page of the paper are sent over HTTP to the CrossRef API, and the journal DOI is automatically picked out of the returned information, so that a unique journal code is automatically assigned to the journal name. The journal names and journal codes are then grouped and subordinated so that the representative journal name and the variant journal names are set automatically, producing the journal authority.

To this end, the automatic reference construction system according to the present invention extracts the reference information included in a paper PDF file and structures it into reference information based on an international standard, and comprises: a reference format & separator designating unit that designates and sets the format and separators of each reference according to the type of reference, since references are written in a different format for each journal; a reference original form automatic extraction unit that receives a paper PDF file, recognizes the reference region in it, and extracts the reference information in its original form; and a reference automatic structuring unit that automatically splits the extracted reference information according to the format and separators designated through the reference format & separator designating unit, arranges it into the items prescribed by the NISO JATS DTD, the international standard for building academic information databases, and builds a reference database.

The reference format & separator designating unit defines the material types of references written in various formats through the pattern registration function of a reference pattern registration screen provided to the administrator. For each material type it defines the order of the component items according to the reference format, and it defines the separators that occur between component items (space, ", ', 「, 」, comma, ≪, ≫, 『, 』, and so on). These definitions are registered or updated in the reference format and separator database.
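The registration step described above can be pictured as a small lookup table keyed by journal and material type. The following is a minimal sketch under stated assumptions: the class names `ReferencePattern` and `PatternRegistry`, the `journal_id` key, and the in-memory dictionary standing in for the reference format and separator database are all hypothetical and do not come from the patent.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ReferencePattern:
    """One registered reference format (all field names are illustrative)."""
    material_type: str            # e.g. "journal", "conference", "patent"
    items: Tuple[str, ...]        # component items, in the order they appear
    separators: Tuple[str, ...]   # separators[i] sits between items[i] and items[i + 1]

    def __post_init__(self):
        # A pattern with n items needs exactly n - 1 separators.
        if len(self.separators) != len(self.items) - 1:
            raise ValueError("need exactly one separator between adjacent items")


class PatternRegistry:
    """In-memory stand-in for the reference format and separator database."""

    def __init__(self):
        self._patterns: Dict[Tuple[str, str], ReferencePattern] = {}

    def register(self, journal_id: str, pattern: ReferencePattern) -> None:
        # Registering the same (journal, material type) again updates the entry.
        self._patterns[(journal_id, pattern.material_type)] = pattern

    def lookup(self, journal_id: str, material_type: str) -> Optional[ReferencePattern]:
        return self._patterns.get((journal_id, material_type))
```

Keying on both the journal and the material type reflects the statement that reference formats differ per journal and per material type; a real deployment would persist these records rather than hold them in memory.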

Here, the material type of a reference is one of journal, conference, monograph, report, thesis, patent, and web. Depending on the material type setting, the component items of a reference preferably include several of the following: family name and given name of one or more authors, paper title, journal name, volume and issue, start page, end page, conference name, conference date, conference venue, publisher, publisher location, publication year, report number, patent number, and patent items.

The reference original form automatic extraction unit receives a paper PDF file, extracts text through the PDFTextStripper object of PDFBOX, searches the extracted text for a string that marks the references in order to locate the reference region, and extracts the located reference region in its original form.

Here, the reference original form automatic extraction unit preferably searches the text extracted through the PDFTextStripper object of PDFBOX for a string containing any one of References, Citation, 인용문헌 (cited literature), 참고문헌 (references), 引用文獻, and 參考文獻 to locate the reference region. It then recognizes the start and end of each single reference found there and arranges each recognized reference in its original form, counts the number of single references recognized in the whole text, creates that many text boxes, and displays the original form of each reference in the created text boxes.
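The region detection and per-reference splitting described above can be sketched on already-extracted plain text. This is a minimal illustration, not the patented extractor: it assumes the heading stands alone on its own line (stricter than the "string containing" match in the text), assumes each reference begins with a bracketed number such as `[1]`, and uses the hypothetical function name `extract_references`. The PDFBOX text extraction itself is outside the sketch.

```python
import re

# Heading keywords listed in the description.
HEADINGS = ("References", "Citation", "인용문헌", "참고문헌", "引用文獻", "參考文獻")

# Assumption: every reference starts with a bracketed number, e.g. "[12] ".
ENTRY_START = re.compile(r"\[\d+\]\s*")


def extract_references(text):
    """Return the references found after the last heading line, one string each."""
    lines = text.splitlines()
    wanted = {h.lower() for h in HEADINGS}

    # Locate the last line that consists only of a heading keyword.
    start = None
    for i, line in enumerate(lines):
        if line.strip().rstrip(":").strip().lower() in wanted:
            start = i
    if start is None:
        return []

    # A numbered line opens a new reference; other lines continue the previous one.
    entries = []
    for line in lines[start + 1:]:
        line = line.strip()
        if not line:
            continue
        if ENTRY_START.match(line) or not entries:
            entries.append(ENTRY_START.sub("", line, count=1))
        else:
            entries[-1] += " " + line
    return entries
```

The length of the returned list corresponds to the count of single references that the description uses to decide how many text boxes to create.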

The reference automatic structuring unit, following the reference format and separators designated by the reference format & separator designating unit, counts the authors included in the reference information extracted by the reference original form automatic extraction unit and creates as many text boxes for entering author family names and given names as there are authors. According to the material type of the reference, it also creates text boxes for the detailed items, which include several of the following depending on the material type setting: paper title, journal name, publication year, volume and issue, start page, end page, conference name, conference location, conference date, report number, report issuing organization, publisher, publisher location, patent number, country of patent application, URL, and DOI. It then enters the corresponding values into the created text boxes and stores them in the reference database.
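As a rough illustration of separator-driven structuring, the sketch below walks a raw reference string left to right, cutting at each registered separator in turn. It is a simplification under stated assumptions: the function names are hypothetical, authors are assumed to be comma-separated in "Surname Initials" form, and the field keys borrow JATS element names (`article-title`, `source`, `year`, `volume`, `fpage`, `lpage`, `surname`, `given-names`) as one plausible mapping that the patent does not spell out.

```python
def split_by_pattern(raw, items, separators):
    """Cut raw into len(items) fields using one separator between adjacent items."""
    fields = {}
    rest = raw.strip()
    for item, sep in zip(items, separators):
        head, found, rest = rest.partition(sep)
        if not found:
            raise ValueError(f"separator {sep!r} not found after item {item!r}")
        fields[item] = head.strip()
    # Whatever remains is the last item; drop a trailing full stop.
    fields[items[-1]] = rest.strip().rstrip(".")
    return fields


def structure_reference(raw, items, separators):
    """Split a reference and expand its author list into surname / given names."""
    fields = split_by_pattern(raw, items, separators)
    authors = []
    for name in fields.pop("authors", "").split(","):
        name = name.strip()
        if name:
            surname, _, given = name.partition(" ")
            authors.append({"surname": surname, "given-names": given})
    fields["authors"] = authors
    return fields


# Hypothetical journal pattern and reference string.
ITEMS = ("authors", "article-title", "source", "year", "volume", "fpage", "lpage")
SEPARATORS = (". ", ". ", ". ", ";", ":", "-")
record = structure_reference(
    "Kim J, Lee S. Automatic reference parsing. J Test Sci. 2014;29:101-110.",
    ITEMS,
    SEPARATORS,
)
```

The number of entries in `record["authors"]` plays the role of the author count that determines how many author text boxes are created. Real references are messier (initials with full stops, separators inside titles), so a production parser would need more than a single left-to-right pass.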

The automatic reference construction system according to the present invention further comprises: a journal DOI extraction unit that obtains, through query and response, the DOI information for the reference information built by the reference automatic structuring unit, and extracts from the obtained DOI information the DOI information for the journal; and a journal authority automatic construction unit that analyzes the journal DOI information identified by the journal DOI extraction unit, determines the representative journal name and variant names for the journal, and builds a journal authority database.

The journal DOI extraction unit queries CrossRef through its API for those references, among the reference information built by the reference automatic structuring unit, whose material type is journal, and obtains the DOI information for each reference. From the obtained DOI information it combines the Prefix, which is the unique ID of the organization, with the journal management code contained in the Suffix, which consists of journal information and paper-specific information, to form a single journal DOI.

Here, for references whose material type is journal among the reference information automatically built by the reference automatic structuring unit, the journal DOI extraction unit queries CrossRef with a combination of the journal name, publication year, volume, and start page. It parses the XML file returned by CrossRef, obtains from the Prefix and Suffix of the resulting DOI a journal code that identifies the journal, defines Prefix + journal code as the journal DOI, and sets the defined journal DOI as a code that corresponds one-to-one to the queried journal name.
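The "Prefix + journal code" construction can be sketched once an article DOI is in hand. The patent does not say how the journal code is located inside the Suffix, and publishers lay out suffixes differently, so the sketch below rests on a loud assumption: the journal code is the first token of the suffix before a delimiter (a full stop by default). The function name `journal_doi`, the joining slash, and the sample DOI are hypothetical.

```python
def journal_doi(article_doi, code_delimiter="."):
    """Derive 'Prefix/journal code' from an article DOI.

    Assumption: the suffix starts with the journal code, followed by
    code_delimiter and the paper-specific part (year, volume, page, ...).
    """
    prefix, slash, suffix = article_doi.strip().partition("/")
    if not slash or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI of the form 10.xxxx/suffix: {article_doi!r}")
    journal_code = suffix.split(code_delimiter, 1)[0]
    return f"{prefix}/{journal_code}"


# Hypothetical article DOI: registrant prefix 10.1234, journal code "jts".
assert journal_doi("10.1234/jts.2014.29.1.101") == "10.1234/jts"
```

Two references from the same journal then map to the same journal DOI even when the journal name is abbreviated differently in each, which is what allows the later grouping of name variants.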

The journal authority automatic construction unit checks whether the journal DOI extracted by the journal DOI extraction unit exists in the tables of the journal authority database, which manages the representative and variant names of journals. If a journal name with a different form but the same DOI structure exists, the journal name for the extracted journal DOI is registered as the representative journal name or as a variant journal name according to the text length of the journal name. Representative and variant journal names are thus grouped and managed by journal DOI to build the journal's authority information.

Here, when the journal DOI obtained by the journal DOI extraction unit already exists in the journal authority database, the journal authority automatic construction unit preferably looks up the string length of the journal name stored for that journal DOI. If the stored name is shorter than the queried journal name, the representative journal name stored in the representative journal name table is changed to a variant journal name and the queried journal name is registered in that table as the new representative journal name. If the stored name is longer than the queried journal name, the queried journal name is registered in the variant journal table as a variant journal name. The set of representative and variant journal names thus grows in proportion to the number of reference DOIs obtained.
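The length-based rule above amounts to keeping the longest name seen so far as the representative name for each journal DOI. A minimal sketch follows, with two in-memory dictionaries standing in for the representative and variant tables. The class name `JournalAuthority` is hypothetical, and the handling of a different name of equal length (stored as a variant) is an assumption, since the description only covers the shorter and longer cases.

```python
class JournalAuthority:
    """In-memory stand-in for the journal authority database."""

    def __init__(self):
        self.representative = {}   # journal DOI -> representative journal name
        self.variants = {}         # journal DOI -> set of variant journal names

    def register(self, journal_doi, name):
        current = self.representative.get(journal_doi)
        if current is None:
            # First name seen for this journal DOI becomes the representative.
            self.representative[journal_doi] = name
            self.variants[journal_doi] = set()
        elif name == current:
            return
        elif len(current) < len(name):
            # Stored name is shorter: demote it and promote the queried name.
            self.variants[journal_doi].add(current)
            self.representative[journal_doi] = name
        else:
            # Stored name is longer (or equal length, by assumption): add a variant.
            self.variants[journal_doi].add(name)


authority = JournalAuthority()
for journal_name in ("J Test Sci", "Journal of Test Science", "JTS"):
    authority.register("10.1234/jts", journal_name)
```

After the three registrations, "Journal of Test Science" is the representative name and the two shorter forms are variants. Preferring the longest string is a simple heuristic for choosing the unabbreviated title.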

With the automatic reference construction system according to the present invention, text is extracted directly from a text-based paper PDF file. It is therefore no longer necessary to keep developing, for each PDF generation program and each program version, the programs conventionally used to convert a PDF file to XML and extract text from it, so development costs can be expected to fall.

In addition, because the administrator registers and manages the reference material types, which differ from journal to journal, directly on the screen of the management program provided to the administrator, reference information is easy to decompose structurally, and the same rules carry over to later volumes and issues, which makes the work easier to manage efficiently. In terms of structuring reference information, the system also contributes to work efficiency: the database can be built quickly and accurately, the cost of additional system development is reduced, and work can be handled efficiently through the management interface.

Furthermore, the representative and variant journal names that form the basis of a journal authority can be managed automatically by the system through the journal name information and journal DOIs in the automatically constructed reference information. The resulting journal authority can be used for citation analysis, statistics, and the like, and thus in analysis-based academic information services.

FIG. 1 is an overall functional block diagram of the automatic reference construction system according to the present invention.
FIG. 2 is a detailed block diagram of the automatic reference construction system according to the present invention.
FIG. 3 is a flowchart showing the process of building references and the journal authority through the components of the automatic reference construction system according to the present invention.
FIG. 4 is a flowchart showing the process by which reference format and separator information is registered through the reference format & separator designating unit of the automatic reference construction system according to the present invention.
FIG. 5 is a diagram of the user interface for registering and managing the functions of the reference format & separator designating unit of FIG. 4.
FIG. 6 is a flowchart showing the process by which the original form of a reference is extracted through the reference original form automatic extraction unit according to the present invention.
FIG. 7 is a flowchart showing the process of building reference information into a database in internationally standardized form through the reference automatic structuring unit according to the present invention.
FIG. 8 is a flowchart showing the process of extracting the journal DOI for a reference through the journal DOI extraction unit according to the present invention.
FIG. 9 is a flowchart showing the process by which the journal authority is built through the journal authority automatic construction unit according to the present invention.

Hereinafter, preferred embodiments of the present invention are described in detail with reference to the accompanying drawings.

FIG. 1 shows an overall functional block diagram of the automatic reference construction system according to an embodiment of the present invention.

As shown in FIG. 1, the automatic reference construction system according to the present invention comprises a reference construction unit (10), which receives a text-based paper PDF file, automatically extracts the reference information, and builds a reference database, and a journal authority construction unit (40), which extracts the journal DOI from the DOI information associated with the reference information built by the reference construction unit (10) and analyzes it to build a journal authority database.

The reference construction unit (10) recognizes the reference region in a paper PDF file, extracts it in its original form, splits it according to the reference formats and separators stored in the reference format and separator database, and automatically structures it into reference information based on the international standard for building academic information databases, thereby building the reference database. In this embodiment, the standard applied is the NISO JATS Ver 1.0 DTD, the international standard for building academic information databases proposed by the National Library of Medicine (hereinafter "NLM"). Because such international standards may be upgraded or changed over time, the present invention is not limited to this standard.

The journal authority construction unit (40) obtains the DOI information for each reference by querying CrossRef, which manages DOI information, with the reference information registered in the reference database built by the reference construction unit (10). It extracts the journal DOI information from the obtained DOI information, manages the journal DOIs as groups for each journal appearing in the references, and automatically sets the representative and variant journal names to build the journal authority database. A DOI (Digital Object Identifier) is a unique identification scheme that assigns a unique identifier (letters and digits) to digital content (here, academic papers) so that anyone can easily reach the content online. A DOI consists of a Prefix, which is the unique code of the organization that registers the DOI, and a Suffix, which combines a code for the journal with information such as publication year, volume, and start page to form a value unique to the paper. The journal DOI is the combination of Prefix + journal code (a code that identifies the journal, such as an ISSN or a journal abbreviation): it groups the unique number (Prefix) assigned to a society or organization with the journal code and is defined as the unique journal identifier assigned to a journal published by that organization. In this embodiment, this journal DOI information is automatically obtained and analyzed to build the journal authority automatically.

FIG. 2 shows a detailed block diagram of the automatic reference construction system according to an embodiment of the present invention, and FIG. 3 is a flowchart showing the process of building references and the journal authority through the components of the system.

As shown in FIGS. 2 and 3, the automatic reference construction system according to the present invention comprises: the reference construction unit (10), which restructures the reference information included in a paper PDF file according to the international standard; the journal authority construction unit (40), which obtains the journal DOI information for the reference information built by the reference construction unit (10) and builds the journal authority; a database (50), in which the reference information and journal authority information generated by the reference construction unit (10) and the journal authority construction unit (40) are registered; and a central control unit (60), which controls each of these components.

The reference construction unit (10) is a program that extracts the reference information included in a paper and converts it into a reference form conforming to the international standard, structuring it automatically. It comprises the reference format & separator designating unit (100), which designates and sets the reference format and separators according to the type of reference; the reference original form automatic extraction unit (200), which recognizes the reference information in a paper PDF file and extracts it in its original form; and the reference automatic structuring unit (300), which automatically structures the original-form reference information extracted by the extraction unit (200) into reference information based on the international standard for building academic information databases, according to the reference format and separators set for each reference material type through the designating unit (100), and builds the reference database.

The reference format & separator designating unit (100) lets the administrator directly set, for each reference material type, the separators that divide one component item from the next (space, ", ', 「, 」, comma, ≪, ≫, 『, 』, and so on). Through this unit the administrator registers, modifies, and carries over reference formats and separators of various kinds to build the reference format and separator database. When a new form of reference appears, its format and separators are identified and then registered, modified, or carried over in the reference format and separator database so that they can be used.

The journal authority construction unit (40) is a program that analyzes the reference information built by the reference building unit (10) to identify journal DOI information, and uses that information to build journal authority information automatically. The journal authority construction unit (40) comprises: a journal DOI extracting unit (400), which, for those references built by the reference automatic structuralization unit (300) whose data type is journal, queries the journal name, publication year, volume, and start page over the HTTP protocol to obtain DOI information, and extracts the journal-level DOI information from the DOI information obtained; and a journal authority automatic constructing unit (500), which analyzes the journal DOI information identified by the journal DOI extracting unit (400), determines the representative journal name and the variant journal names of each journal, and builds the journal authority database.

The database (50) holds the reference format and separator database, the reference database, and the journal authority database, which are built through the reference building unit (10) and the journal authority construction unit (40).

The central control unit (60) controls and manages each component of the automatic construction system of references. It is provided with ordinary hardware such as a central processing unit (CPU), RAM, and ROM, together with software that recognizes and drives that hardware, and it controls the overall operation of the system. Although not shown in Fig. 2, the automatic construction system of references also includes an input device and a display device for data input and output, and a communication device and an interface device for exchanging data with external devices.

The process by which references and journal authority records are built automatically through the automatic construction system of references configured as above is described below.

Fig. 4 is a flowchart showing the process by which reference format and separator information is registered through the reference format & separator designating unit of the automatic construction system of references according to an embodiment of the present invention.

Steps S110, S120: The reference format & separator designating unit (100) of the automatic construction system of references according to the present invention is a program module operated by an administrator. The administrator first accesses the administrator page provided by the designating unit (100) and selects a journal, issued by an academic society, in which theses are published (S110). When the journal is selected, the screen layer of the thesis PDF file is defined and set; this screen layer definition includes a function for defining the order in which text information is extracted from the PDF file (S120).

Steps S130, S140, S150: After the journal is selected, the publication year of the theses contained in that journal is selected (S130), a volume and issue are added and selected (S140), and the data type of the reference is then selected (S150).

Steps S160, S170: When a reference data type is selected, the data type is registered and the order of its component items is defined. Defining the item order defines the format of the reference, that is, the order in which the items to be built appear for each data type (S160). The separators for the reference items are also registered: the separators that appear between the items defined in the item order are registered and managed (S170). The component items and the separators between them, registered for each reference data type, serve as the key criteria for automatically structuring a reference and displaying it on the user's web screen.

Through the above process, the administrator uses the reference format & separator designating unit (100) to register and manage, in accordance with the NISO JATS Ver 1.0 DTD presented by NLM, the data types of references, the items to be built for each data type, and the separators that delimit those items.

Fig. 5 shows the layout of the user interface for registering and managing the functions of the reference format & separator designating unit of Fig. 4.

As shown in Fig. 5, when a reference data type is set through the reference format & separator designating unit (100) according to the present invention, references are registered under data types such as journal, book, conference, report, thesis, patent, and web.

After a reference data type is selected, the order of the component items that make up that data type can be defined. The administrator can either call up a pattern already registered in the system (example selection) or choose direct configuration and specify the order of each component item. If the administrator chooses direct configuration, the order of the component items, such as author (Au), article title (A-Title), journal name (J-Title), publication year (P-Year), volume (Vol), issue (Iss), start page (S-Page), end page (E-Page), institution name (Col), publisher location (P-City), publication date (Op-Day), publisher name (Op-City), patent number (Patent), Url, and DOI, can be registered directly at the bottom of the screen, and the registered component items can be checked there.
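As an illustration of what one registered entry in the reference format and separator database might hold, the sketch below models a data type, an ordered list of component items, and the separator that ends each item. The class name `ReferenceFormat`, its field names, and the sample pattern are assumptions made for this example; only the item codes (Au, A-Title, J-Title, and so on) come from the description.

```python
from dataclasses import dataclass, field


@dataclass
class ReferenceFormat:
    """One registered pattern: a data type plus an ordered item/separator list."""
    journal: str
    data_type: str
    # Each entry is (item code, separator that ends the item).
    items: list = field(default_factory=list)

    def item_order(self):
        """Return only the item codes, in the order they appear in a reference."""
        return [code for code, _ in self.items]


# Hypothetical pattern for a journal-type reference written as:
#   Kim A, Lee B. Title of article. Journal Name, 2014; 12: 34-56
journal_format = ReferenceFormat(
    journal="Example Journal",
    data_type="journal",
    items=[
        ("Au", ". "),
        ("A-Title", ". "),
        ("J-Title", ", "),
        ("P-Year", "; "),
        ("Vol", ": "),
        ("S-Page", "-"),
        ("E-Page", ""),  # last item: nothing follows it
    ],
)

print(journal_format.item_order())
```

Storing the separator next to the item it terminates keeps the item order and the separators in a single list, so a new journal style can be registered by editing one pattern.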

Fig. 6 is a flowchart showing the process by which the original form of the references is extracted through the reference original form automatic extractor according to an embodiment of the present invention.

Step S210: To build the reference information contained in thesis information into a standardized database, the reference original form automatic extractor (200) according to the present invention has the target thesis PDF file selected and uploaded, and then calls the Java library for PDF files. This call means that, when the user uploads a thesis PDF file to the automatic construction system of references, the PDFTextStripper Object provided by PDFBOX, an open-source package, is invoked so that text can be extracted sequentially from the PDF file.

Step S220: Once the PDFTextStripper Object has been called for the thesis PDF file, text information is extracted sequentially from the thesis PDF file through that object.

Step S230: Once the text information has been extracted sequentially from the thesis PDF file, the reference region is searched for. This search looks through the extracted text information for a single text line in one of the forms defined by the reference format & separator designating unit (100), such as 참고문헌, 인용문헌, References, Citation, 引用文獻, or 參考文獻, and thereby locates the reference region.
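The search in step S230 can be sketched as a scan for a line that consists only of one of the registered headings. This is a minimal illustration, not the patented implementation: the function name `find_reference_region`, the exact-match rule after trimming, and the case-insensitive comparison are assumptions; the heading strings are the ones named in the description.

```python
# Heading strings named in the description.
HEADINGS = ("참고문헌", "인용문헌", "References", "Citation", "引用文獻", "參考文獻")
_NORMALIZED = {heading.casefold() for heading in HEADINGS}


def find_reference_region(lines):
    """Return the lines after the first heading-only line, or [] if none is found."""
    for index, line in enumerate(lines):
        # Assumption: a heading line holds nothing but the heading itself.
        if line.strip().casefold() in _NORMALIZED:
            return lines[index + 1:]
    return []


sample = [
    "1. Introduction",
    "Earlier work lists its references by hand.",
    "REFERENCES",
    "1. Kim A. First title. Journal A, 2010; 1: 1-2",
    "2. Lee B. Second title. Journal B, 2011; 2: 3-4",
]
print(find_reference_region(sample))
```

Requiring the heading to stand alone on its line avoids matching body sentences that merely contain the word "references".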

Step S240: Once the reference region has been found, tags are automatically placed on the left and right of each reference paragraph. That is, tags are added automatically to the left and right of the reference text in the detected reference region so that each reference can be recognized. In addition, within the information made up of single lines, strings beginning with 1., 2., and so on are recognized automatically to identify the beginning and end of each reference, so that each single reference is recognized as one entry.

Step S250: Once the tags have been placed on the left and right of the reference paragraphs, the paragraphs are merged. In this step, the paragraph information belonging to each reference, tagged on its left and right, is merged into a single reference.

Step S260: Once the reference paragraphs have been merged, the merged references are counted. That is, the number of entries that have each been recognized as one reference is counted.
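Steps S240 to S260 can be illustrated together: recognize the start of each entry from its leading number, merge wrapped lines into the entry they belong to, and count the result. In this sketch the left and right tags are represented simply by list boundaries, and the function name `merge_references` and the regular expression are assumptions. A continuation line that happens to begin with a number and a period (for example "2014. ") would be mistaken for a new entry, so a real system would need a stricter rule.

```python
import re

# An entry starts with a number, a period, and whitespace: "1. ", "2. ", ...
ENTRY_START = re.compile(r"^\d+\.\s+")


def merge_references(region_lines):
    """Merge the wrapped lines of a reference region into one string per entry."""
    references = []
    for line in region_lines:
        text = line.strip()
        if not text:
            continue  # skip blank lines between entries
        if ENTRY_START.match(text) or not references:
            references.append(text)        # beginning of a new reference
        else:
            references[-1] += " " + text   # continuation of the previous one
    return references


region = [
    "1. Kim A. First title.",
    "Journal A, 2010; 1: 1-2",
    "",
    "2. Lee B. Second title.",
    "Journal B, 2011; 2: 3-4",
]
merged = merge_references(region)
print(len(merged))
```

The length of the returned list is the count used in the next step to decide how many text boxes to create.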

Step S270: Once the references have been counted, their original form is returned. In this step, as many text boxes as the counted number of references are created on the user interface screen of the automatic construction system of references, and the original-form information of each reference is placed in the text box created for it.

Through the above process, the reference original form automatic extractor (200) recognizes the reference region contained in the thesis PDF file and extracts the references in their original form, presented in text boxes.

Fig. 7 is a flowchart showing the process by which reference information is built into a database of internationally standardized form through the reference automatic structuralization unit according to an embodiment of the present invention.

Steps S260, S261, S270: As described with reference to Fig. 6, the reference original form automatic extractor (200) identifies the reference region contained in the thesis and merges the paragraphs, counts the references (S260), creates as many text boxes on the user interface screen as the counted number of references (S261), and places the original-form information of each reference in its text box (S270).

Step S310: To build the reference database, the reference automatic structuralization unit (300) converts the original-form references extracted by the reference original form automatic extractor (200) into the form of the NISO JATS Ver 1.0 DTD, the international standard for building academic information DBs proposed by NLM. For this purpose the reference data type is set first. Setting the reference data type means choosing, for each reference displayed on the user interface, a data type such as journal, book, conference, report, patent, web, or thesis.

Step S320: Once the data type of the reference has been set, the reference format and separator format are called. This means calling the reference format and separator format that correspond to the reference data type set through the reference data type setting unit (310) described above with reference to Fig. 4.

Steps S330, S331, S332: Once the reference format and separator format have been called, the reference is decomposed on the basis of the item order and the separators. Using the called format and separator information and the original-form information of the reference, this decomposition recognizes and separates, according to the selected data type, items such as author surname, author given name, article title, journal name, publication year, volume, issue, start page, end page, conference name, conference date, conference location, publisher name, publisher location, patent number, patent country, Url, DOI, and report number (S330). Once the reference has been decomposed by item and order, the number of authors is counted from the decomposed information (S331); as many author name input boxes as the counted number of authors are created on the user interface, and input boxes for the other items are created on the user interface in the item order defined by the called reference format and separator format (S332).
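A minimal sketch of the decomposition in steps S330 and S331 follows, assuming the registered format is an ordered list of (item code, separator) pairs and that authors within the author field are separated by ", ". The function names, the sample reference, and the sample separators are hypothetical; the description does not fix a concrete algorithm.

```python
def split_reference(raw, items):
    """Split raw reference text into fields using an ordered (item, separator) list."""
    fields = {}
    rest = raw
    for code, separator in items:
        if separator and separator in rest:
            # Cut at the first occurrence of this item's closing separator.
            value, rest = rest.split(separator, 1)
        else:
            # Last item, or separator missing: take whatever is left.
            value, rest = rest, ""
        fields[code] = value.strip()
    return fields


def split_authors(author_field, author_separator=", "):
    """Return the individual author names found in the author field."""
    return [name.strip() for name in author_field.split(author_separator) if name.strip()]


# Hypothetical registered pattern and raw reference (numbering already removed).
ITEMS = [("Au", ". "), ("A-Title", ". "), ("J-Title", ", "),
         ("P-Year", "; "), ("Vol", ": "), ("S-Page", "-"), ("E-Page", "")]
RAW = "Kim A, Lee B. Automatic extraction of references. Journal of Examples, 2014; 12: 34-56"

fields = split_reference(RAW, ITEMS)
authors = split_authors(fields["Au"])
print(fields["J-Title"], len(authors))
```

The author count decides how many author name input boxes are created, and the keys of `fields`, in pattern order, decide the remaining input boxes. Because the split is purely positional, an article title that itself contains ". " would be cut short, which is one reason the later review step by the user matters.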

Step S340: Once the author name input boxes and the item input boxes have been created on the user interface, the author names and item information recognized during the decomposition by item and order are classified automatically according to the items prescribed in NLM's NISO JATS Ver 1.0 DTD, and the classified text is displayed in the corresponding text input field for each data type of reference.
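For illustration, the sketch below shows how decomposed fields for a journal-type reference could be written out with JATS element names (`element-citation`, `person-group`, `name`, `surname`, `given-names`, `article-title`, `source`, `year`, `volume`, `issue`, `fpage`, `lpage`). The mapping from the item codes to these elements and the function name `to_element_citation` are assumptions; the description states only that items are classified according to the NISO JATS Ver 1.0 DTD.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from the item codes used above to JATS element names.
ITEM_TO_JATS = (
    ("A-Title", "article-title"),
    ("J-Title", "source"),
    ("P-Year", "year"),
    ("Vol", "volume"),
    ("Iss", "issue"),
    ("S-Page", "fpage"),
    ("E-Page", "lpage"),
)


def to_element_citation(fields, authors):
    """Build a JATS-style <element-citation> for a journal reference.

    `authors` is a list of (surname, given names) pairs.
    """
    citation = ET.Element("element-citation", {"publication-type": "journal"})
    group = ET.SubElement(citation, "person-group", {"person-group-type": "author"})
    for surname, given_names in authors:
        name = ET.SubElement(group, "name")
        ET.SubElement(name, "surname").text = surname
        ET.SubElement(name, "given-names").text = given_names
    for code, tag in ITEM_TO_JATS:
        if fields.get(code):  # skip items this reference does not have
            ET.SubElement(citation, tag).text = fields[code]
    return citation


citation = to_element_citation(
    {"A-Title": "Automatic extraction of references", "J-Title": "Journal of Examples",
     "P-Year": "2014", "Vol": "12", "S-Page": "34", "E-Page": "56"},
    [("Kim", "A"), ("Lee", "B")],
)
print(ET.tostring(citation, encoding="unicode"))
```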

Steps S350, S360: Once the corresponding item information of the reference has been entered and displayed in each input box, the user reviews and supplements it (S350), after which it is stored and managed in the reference database (S360).

Through the above process, the reference automatic structuralization unit (300) automatically classifies the reference information extracted in its original form according to the items prescribed in NLM's NISO JATS Ver 1.0 DTD, composes each component item, and thereby builds the reference database.

Fig. 8 is a flowchart showing the process by which the journal DOI of a reference is extracted through the journal DOI extracting unit according to an embodiment of the present invention.

Step S410: Once the reference database has been built through the reference automatic structuralization unit (300) (S360), the journal DOI extracting unit (400) looks up the CrossRef DOI of each reference using its journal name, publication year, volume, and start page, in order to build the journal authority records (S410). The CrossRef DOI lookup means that, for those references built by the reference automatic structuralization unit (300) whose data type is journal, a query is sent to the CrossRef server over the HTTP protocol and the DOI of each reference is obtained from the response.
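The description says only that the journal name, publication year, volume, and start page are sent to the CrossRef server over HTTP. As a hedged sketch, the query could be assembled as below; the endpoint and the parameter names (`pid`, `title`, `date`, `volume`, `spage`, `redirect`) are assumptions modeled on CrossRef's OpenURL query interface and should be checked against current CrossRef documentation before use. The code only builds the URL and sends nothing.

```python
from urllib.parse import urlencode

# Assumed endpoint; not stated in the description.
OPENURL_ENDPOINT = "https://doi.crossref.org/openurl"


def build_doi_query(journal_name, year, volume, start_page, account_email):
    """Build a DOI lookup URL from the four items named in step S410."""
    params = {
        "pid": account_email,   # assumed: identifies the querying account
        "title": journal_name,
        "date": year,
        "volume": volume,
        "spage": start_page,
        "redirect": "false",    # assumed: ask for metadata instead of a redirect
    }
    return OPENURL_ENDPOINT + "?" + urlencode(params)


url = build_doi_query("Journal of Examples", "2014", "12", "34", "user@example.org")
print(url)
```

Keeping URL construction separate from the network call makes the query easy to log and to test without contacting the server.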

Step S420: Once the DOI of each reference has been obtained from the CrossRef server's response, the journal DOI information is extracted from it. Journal DOI extraction means decomposing the DOI information obtained through the CrossRef DOI lookup and retaining the journal DOI information, which consists of the DOI Prefix and the journal code.
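The decomposition in step S420 can be sketched as follows. A DOI has the form prefix/suffix; how the journal code is embedded in the suffix varies by publisher and is not specified in the description, so this example assumes, purely for illustration, that the journal code is the part of the suffix before the first period. The function name and the sample DOI are hypothetical.

```python
def journal_doi(doi):
    """Split an article DOI into (prefix, journal code, journal DOI)."""
    prefix, slash, suffix = doi.partition("/")
    if not slash or not suffix:
        raise ValueError(f"not a prefix/suffix DOI: {doi!r}")
    # Assumption: the journal code is the suffix segment before the first period.
    journal_code = suffix.split(".", 1)[0]
    return prefix, journal_code, f"{prefix}/{journal_code}"


# Hypothetical article DOI.
print(journal_doi("10.5555/jexa.2014.12.34"))
```

Two articles from the same journal then share the same journal DOI, which is what lets it serve as the grouping key for journal names.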

Through the above process, the journal DOI extracting unit (400) extracts and retains the journal DOI information for each reference held in the reference database.

Fig. 9 is a flowchart showing the process by which journal authority records are built through the journal authority automatic constructing unit according to an embodiment of the present invention.

Step S510: Once the journal DOI information for each reference has been extracted by the journal DOI extracting unit (400), the journal authority automatic constructing unit (500) looks up the journal DOI information registered in the representative journal name table and the variant journal name table of the journal authority database. These tables manage the representative journal name of each journal and its variant journal names, which are other forms of the same journal's name. The lookup checks whether the journal DOI information in question is already registered in the representative journal name table of the journal authority database.

Steps S520, S560: If the journal DOI information does not exist in the journal authority database (S520), the journal is judged to be new, and a new representative journal name and its journal DOI information are registered in the representative journal name and variant journal name tables of the journal authority database (S560).

Steps S530, S540, S550: If the journal DOI information is already registered in the journal authority database (S520), the string length of the existing journal name is retrieved (S530) and checked (S540), and if it is shorter than the new journal name, the new journal name is stored in the representative journal name table (S550). In other words, when the journal DOI information obtained by the journal DOI extracting unit (400) exists in the journal authority database, the string length of the journal name corresponding to that journal DOI is retrieved; if it is shorter than the string length of the queried journal name, the representative journal name stored in the representative journal name table is changed to a variant journal name, and the queried journal name is registered in the representative journal name table as the new representative journal name.

Step S570: If, on the other hand, the retrieved journal name is longer than the new journal name, the already registered journal name remains the representative journal name and the new journal name is registered and managed in the variant journal name table. In other words, when the string length of the stored name is greater than that of the queried journal name, the queried journal name is registered in the variant journal name table as a variant journal name. Accordingly, the number of representative and variant journal names grows in proportion to the number of reference DOIs obtained.
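The registration rule of steps S510 to S570 can be sketched with an in-memory dictionary standing in for the representative and variant journal name tables. The function name, the dictionary layout, and the return values are assumptions. The description covers only the "shorter" and "longer" cases; treating an equal-length name as a variant and ignoring a name that is already registered are choices made for this example.

```python
def register_journal_name(authority, journal_doi, name):
    """Register `name` under `journal_doi` and report the role it was given.

    `authority` maps journal DOI -> {"representative": str, "variants": set}.
    """
    entry = authority.get(journal_doi)
    if entry is None:
        # S520/S560: unknown journal DOI, so register a new representative name.
        authority[journal_doi] = {"representative": name, "variants": set()}
        return "representative"
    if name == entry["representative"] or name in entry["variants"]:
        return "unchanged"  # assumption: an already registered name is left alone
    if len(entry["representative"]) < len(name):
        # S530-S550: the stored name is shorter, so demote it to a variant
        # and promote the queried name to representative.
        entry["variants"].add(entry["representative"])
        entry["representative"] = name
        return "representative"
    # S570: the stored name is longer (or, by assumption, equally long).
    entry["variants"].add(name)
    return "variant"


authority = {}
roles = [
    register_journal_name(authority, "10.5555/jexa", "J Ex"),
    register_journal_name(authority, "10.5555/jexa", "Journal of Examples"),
    register_journal_name(authority, "10.5555/jexa", "J Examples"),
]
print(roles, authority["10.5555/jexa"]["representative"])
```

The effect of the rule is that the longest name seen so far, which is usually the unabbreviated title, ends up as the representative journal name, while abbreviations accumulate as variants.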

Through the above process, the journal authority automatic constructing unit (500) takes the journal DOI information extracted for each reference, compares the journal name with the names already registered, and registers and manages it as either a representative journal name or a variant journal name.

As described above, the automatic construction system of references according to the present invention extracts reference information in its original form from a thesis PDF file, automatically classifies the extracted reference information according to the items prescribed in NLM's NISO JATS Ver 1.0 DTD, and rearranges it into the standard form, thereby building the reference database. Using the reference information thus built, it extracts the journal DOI information of each reference, compares the journal name with the names already registered, and registers and manages it in the journal authority database as a representative journal name or a variant journal name.

The present invention is not limited to the embodiment described above. It goes without saying that a person of ordinary skill in the art to which the present invention pertains may make various modifications and variations within the technical idea of the present invention and the range of equivalents of the claims set out below.

10: reference building unit 40: journal authority construction unit
50: database 60: central control unit
100: reference format & separator designating unit 200: reference original form automatic extractor
300: reference automatic structuralization unit 400: journal DOI extracting unit
500: journal authority automatic constructing unit

Claims

1. An automatic construction system of references that extracts the reference information contained in a thesis PDF file and structures it as reference information based on an international standard, the system comprising:
a reference format & separator designating unit (100) that designates and sets the format and separators of each reference according to the type of reference, references being written in a different format for each journal;
a reference original form automatic extractor (200) that receives a thesis PDF file, recognizes the reference region contained in the thesis PDF file, and extracts the reference information in its original form; and
a reference automatic structuralization unit (300) that classifies the reference information extracted by the reference original form automatic extractor (200) according to the format and separators designated through the reference format & separator designating unit (100), and builds a reference database composed of the items prescribed in the NISO JATS DTD, the international standard for building academic information DBs.

2. The system according to claim 1,
wherein the reference format & separator designating unit (100)
defines the data types of references written in various formats, defines the order of the component items according to the reference format for each data type, defines the separators between the component items, and registers or updates them in the database through a separator registration screen.

3. The system according to claim 2,
wherein the data type of the reference includes any one of journal, conference, book, report, thesis, patent, and web, and
wherein, according to the data type set for the reference, the component items include at least one of author surname and given name, article title, journal name, volume and issue, start page, end page, conference name, conference date, conference location, publisher name, publisher location, publication year, report number, patent number, and patent country.

4. The system according to claim 1,
wherein the reference original form automatic extractor (200)
extracts text information through the PDFTextStripper Object of PDFBOX, identifies the reference region by searching the extracted text information for a character string that denotes references, and extracts the identified reference region in its original form.

5. The system according to claim 4,
wherein the reference original form automatic extractor (200)
identifies the reference region by searching the text information extracted through the PDFTextStripper Object of PDFBOX for a character string containing any one of References, Citation, 인용문헌, 참고문헌, 引用文獻, and 參考文獻,
recognizes the beginning and end of each single reference and composes the recognized single reference in its original form, and
counts the number of single references recognized in the whole text information, creates as many text boxes as the counted number, and displays the original-form information of each reference in the created text boxes.

6. The system according to claim 1,
wherein the reference automatic structuralization unit (300)
counts the number of authors contained in the reference information extracted by the reference original form automatic extractor (200), according to the reference format and separators designated through the reference format & separator designating unit (100), and creates as many text boxes for entering author names as the counted number, and
according to the data type of the reference, creates text boxes for entering detailed items including at least one of article title, journal name, publication year, volume, start page, end page, conference name, patent country, Url, and DOI, enters the corresponding items in the created text boxes, and stores them in the reference database.

7. The system according to claim 1, further comprising:
a journal DOI extracting unit (400) that obtains, through query and response, DOI information for the reference information built by the reference automatic structuralization unit (300), and extracts the DOI information for the journal from the obtained DOI information; and
a journal authority automatic constructing unit (500) that analyzes the journal DOI information identified by the journal DOI extracting unit (400), determines the representative journal name and variant journal name information for the journal, and builds a journal authority database.

8. The system according to claim 7,
wherein the journal DOI extracting unit (400)
queries CrossRef through its API for those references built by the reference automatic structuralization unit (300) whose data type is journal, and obtains the DOI information for each reference, and
constructs the journal DOI by combining, from the obtained DOI information, the Prefix, which is the unique ID of the institution, with the journal management code contained in the Suffix, which consists of journal information and article-specific information.

9. The system according to claim 8,
wherein the journal DOI extracting unit (400) queries CrossRef by combining information including the journal name, publication year, volume, and start page of the references automatically built by the reference automatic structuralization unit (300),
obtains a journal code that identifies the journal from the Prefix information and Suffix information in the DOI information obtained by parsing the XML file returned from CrossRef, and defines Prefix + journal code as the journal DOI, so that the journal DOI is a code corresponding one-to-one to the queried journal name.

10. The system according to claim 7,
wherein the journal authority automatic constructing unit (500)
checks whether the journal DOI information extracted by the journal DOI extracting unit (400) exists in the tables of the journal authority database that manage the representative journal names and variant journal names of journals,
if a journal with the same journal DOI exists but its journal name differs, registers the journal name of the extracted journal DOI as a representative journal name or a variant journal name according to the string length of the journal name, and
groups and manages the representative journal name and variant journal names on the basis of the journal DOI, thereby building the authority information of the journal.

11. The system according to claim 10,
wherein the journal authority automatic constructing unit (500),
if the journal DOI information obtained by the journal DOI extracting unit (400) exists in the journal authority database, retrieves the string length of the journal name corresponding to that journal DOI,
if that string length is smaller than the string length of the queried journal name, changes the representative journal name stored in the representative journal name table of the journal authority database to a variant journal name and registers the queried journal name as the new representative journal name in the representative journal name table of the journal authority database, and
if that string length is larger than the string length of the queried journal name, registers the queried journal name in the journal authority database as a variant journal name,
whereby the number of representative journal names and variant journal names increases in proportion to the number of reference DOIs obtained.