KR101620631B1

KR101620631B1 - Nuclear energy-related similar technical document search system and its method

Info

Publication number: KR101620631B1
Application number: KR1020140137800A
Authority: KR
Inventors: 신동훈; 태재웅; 양승효; 최선도; 정승호
Original assignee: 한국원자력통제기술원
Priority date: 2014-10-13
Filing date: 2014-10-13
Publication date: 2016-05-24
Also published as: KR20160043608A

Abstract

Provided is a similar document search method comprising the steps of: extracting keywords from information about a target document; assigning the target document to at least one classification system of a database using the keywords; determining a text-based similarity between the target document and the documents stored under the at least one classification system; extracting images included in the target document; assigning the target document to at least one classification system of the database using the images; determining an image-based similarity between the target document and the documents stored under the at least one classification system; and retrieving similar documents using the text-based similarity and the image-based similarity.

Description

NUCLEAR ENERGY-RELATED SIMILAR TECHNICAL DOCUMENT SEARCH SYSTEM AND ITS METHOD

The present invention relates to a system and method for searching for similar technical documents related to nuclear energy, and more particularly to a system and method that use text and image similarity to retrieve, from among technical documents whose review has been completed, the similar technical documents needed for an examination.

Strategic items are goods, software, and technologies that can be used for the manufacture, development, use, or storage of conventional weapons, weapons of mass destruction, and the missiles that deliver them. Because such items can threaten international peace and national security, their import and export are subject to certain restrictions.

An individual or company intending to import or export strategic items must obtain an export license from the relevant licensing authority beforehand. Even items that are not strategic items may be exported only with the authority's permission if there is a risk that they will be diverted to use as strategic items.

In Korea, in accordance with the Foreign Trade Act and the principles of the multilateral international export control regimes, the Minister of Trade, Industry and Energy specifies strategic items in Attached Tables 2 and 3 of the Public Notice on Export and Import of Strategic Items, designates a licensing authority for each strategic item, and thereby implements import and export controls.

The Public Notice on Export and Import of Strategic Items specifies the names of items that qualify as strategic items and the technical criteria, such as performance and specifications, that an item bearing such a name must meet to qualify.

However, the volume of material on strategic items in the Public Notice is very large and, especially for technology exports, the control criteria are abstract and broad. It is therefore not easy for a given individual or company to determine whether the goods or technology it handles correspond to a strategic item specified in the Public Notice.

Accordingly, when an individual or company is unsure whether the goods or technology it handles are strategic items, it can ask the relevant licensing authority for a determination. The authority refers the information on the item to a strategic-item technical advisory group composed of experts in the field, and the advisory group decides whether the item is a strategic item. The subjective judgment of the advisory group inevitably enters this decision, and consistent and accurate results are hard to expect when the experts making up the group change.

In particular, when carrying out a project in which a large volume of technical documents is transferred, such as a nuclear power plant export, every technical document must be examined within a limited time. To keep the examination results consistent and objective, it is essential to search, compare, and check the results of earlier examinations, such as those conducted for the past export of nuclear power plants to the UAE.

However, when past examination results are searched in an existing database system, the only option is a keyword search on the document title or on the item name given by the applicant. This can make it difficult for a reviewer to find the documents to compare or consult among the past examination records (for example, about 5,000 cases).

In addition, many of the technical documents under examination consist purely of images (drawings and the like) and contain no keywords at all, so a keyword-only search may make it difficult for a reviewer to find the material to consult.

The present invention therefore seeks to provide a system and method that extract information on the text and images contained in a document newly submitted for examination, compare it with the information on text and images extracted from technical documents whose examination has been completed, and retrieve the technical documents most similar to the new document.

The invention provides a similar document search method comprising the steps of: extracting keywords from information about a target document; assigning the target document to at least one classification system of a database using the keywords; determining a text-based similarity between the target document and the documents stored under the at least one classification system; extracting images included in the target document; assigning the target document to at least one classification system of the database using the images; determining an image-based similarity between the target document and the documents stored under the at least one classification system; and retrieving similar documents using the text-based similarity and the image-based similarity.

The step of extracting keywords from the information about the target document may further include inputting the information about the target document, converting the information into text, and extracting the keywords from the text.

The information about the target document may include at least one of text and images.

The keywords may be words related to nuclear energy.

The step of assigning the target document to at least one classification system of the database using the keywords may further include determining at least one membership degree for the at least one classification system of the database, using the frequency of the keywords and the weights for the at least one classification system, and assigning at least one classification system to the target document using the at least one membership degree.

The at least one classification system of the database may include reactor coolant, reactor vessel, safety injection and shutdown cooling, steam generator, neutron flux monitoring, control rod assembly, nuclear fuel assembly, chemical and volume control, control, component cooling, turbine and generator, feedwater, steam, condensate, gas, emergency diesel generator, reactor containment building, waste, monitoring, plant protection, drainage, electric power, fuel handling and transfer, and other systems.

The step of determining the text-based similarity between the target document and the documents stored under the at least one classification system may further include calculating a cosine similarity using the keywords and keyword frequencies of the stored documents and the keywords and keyword frequencies of the target document.

The step of assigning the target document to at least one classification system of the database using the images may further include: extracting local feature points of an image; calculating the frequency of matching points of the image using the local feature points and the feature points of previously examined images; calculating the frequency of matching points of the images stored in the database using the local feature points of those stored images and the feature points of the previously examined images; determining the similarity between the image and the images stored in the database using the matching-point frequencies; determining similar images among the images stored in the database using the similarity; and obtaining the at least one classification system of the database that corresponds to the similar images.

The step of determining similar images among the images stored in the database using the similarity may further include filtering out images whose similarity to separately stored images of low relevance to the examination is at or above a certain value.

The step of determining the image-based similarity between the target document and the documents stored under the at least one classification system may further include calculating it using the number of similar images between the stored documents and the target document and the total number of images in the target document.

The invention also provides a similar document search system comprising: a database containing at least one classification system and the documents stored under the at least one classification system; a keyword extraction unit that extracts keywords from information about a target document; an image extraction unit that extracts images from the information about the target document; a system designation unit that assigns the target document to at least one classification system of the database using the keywords and the images; a similarity determination unit that determines a text-based similarity and an image-based similarity between the target document and the documents stored under the at least one classification system; and a search unit that retrieves similar documents using the text-based similarity and the image-based similarity.

The database may include matching-point information constructed by extraction from, and clustering of, the images included in past examination documents.

The system designation unit may store in the database the at least one classification system assigned to the target document using the keywords and the images.

The similarity determination unit may store the text-based similarity and the image-based similarity in the database.

The document search unit may store in the database the similar documents retrieved using the text-based similarity and the image-based similarity.

With the similar document search system and method according to the present invention, a reviewer can quickly and consistently search, compare, and check the documents to be compared or consulted.

FIG. 1 is a flowchart illustrating a similar document search method according to an example of the present invention.
FIG. 2 is a flowchart illustrating a text-based similarity determination method according to an example of the present invention.
FIG. 3 is a flowchart illustrating an image-based similarity determination method according to an example of the present invention.
FIG. 4 is a block diagram illustrating a similar document search system according to an example of the present invention.
FIG. 5 is an example of calculating the cosine similarity between extracted images according to an example of the present invention.
FIG. 6 is an example of a database in which the system information of extracted images is stored according to an example of the present invention.

Hereinafter, a similar document search system and method according to an example of the present invention will be described in detail with reference to the accompanying drawings.

Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by a person of ordinary skill in the art to which the present invention belongs. Terms such as those defined in commonly used dictionaries should be interpreted as having a meaning consistent with their meaning in the context of the related art and, unless expressly defined in this application, are not to be interpreted in an idealized or overly formal sense.

Terms such as first, second, and third may be used to describe various components, but the components are not limited by these terms. The terms are used only to distinguish one component from another. For example, without departing from the scope of the present invention, a first component may be named a second or third component, and similarly a second or third component may be renamed interchangeably.

In this specification, a "target document" is a document input for a new examination; it may contain strategic-item or nuclear-related information in the form of text or images.

In this specification, a "system" is a grouping classified according to a certain principle or rule.

In this specification, a "similar document" is a document, among the documents whose review has been completed or that are stored in the database, that contains the keywords or images included in the target document, or images similar to those included in it.

In this specification, a "reviewer" is an individual expert or group of experts at the relevant export licensing authority who actually determines whether the target document is a strategic-item or nuclear-related document.

FIG. 1 is a flowchart illustrating a similar document search method according to an example of the present invention.

Referring to FIG. 1, the similar document search method according to an example of the present invention may include: extracting keywords from information about a newly input target document (S100); assigning the target document to at least one classification system in the database using the extracted keywords (S110); determining a text-based similarity between the target document and the documents stored in the database under the assigned classification system or systems (S120); extracting images included in the target document (S130); assigning the target document to at least one classification system in the database using the extracted images (S140); determining an image-based similarity between the target document and the documents stored in the database under the assigned classification system or systems (S150); and retrieving similar documents using the text-based similarity and the image-based similarity (S160).

FIG. 2 is a flowchart illustrating a text-based similarity determination method according to an example of the present invention.

Referring to FIG. 2, the step of extracting keywords from information about the target document (S100) may include inputting the information about the target document (S200), converting the input information into text (S210), and extracting keywords from the converted text (S220).

In the step of inputting information about the target document (S200), the information may be input to a server or personal computer through a scanner or document reader. The input information may also be stored in a database on the server or personal computer. The target document may be an electronic document, such as an HWP, MS Office, or PDF file, or a paper document. In general, the target document is a document input or submitted for a new examination.

In the step of converting the input information into text (S210), the input information may be processed by OCR (Optical Character Reader) and converted into recognizable text. The converted text may be stored in the database as a txt file.

If the input target document is an electronic document such as an HWP, MS Office, or PDF file, text can be extracted directly from it without OCR processing. The extracted text may be stored in the database as a txt file. If the input target document contains an image and that image contains text, the text in the image can be extracted through OCR. For example, if a target document in PDF format contains a design drawing in JPEG format together with text describing that drawing, the text included in the design drawing can be extracted by applying OCR to the drawing. When the volume of target documents is large and it is difficult to check whether each one contains images, the examiner may apply OCR to all input target documents in a batch to extract the text.

In the step of extracting keywords from the converted text (S220), keywords related to nuclear energy may be searched for and extracted. Examples of such keywords include reactor, nuclear fuel, moderator, coolant, boron, cadmium, and control rod. Nuclear-related keywords may be extracted by (1) extracting all words from the converted text, excluding roots and endings, and (2) removing from the extracted words those unrelated to nuclear energy. The extracted keywords may be stored in the database together with their frequency of occurrence.
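The keyword step (S220) can be sketched as follows. This is a minimal illustration, not the patent's implementation: the vocabulary `NUCLEAR_TERMS`, the function `extract_keywords`, and the simple regular-expression tokenizer are all assumptions, and the morphological handling of roots and endings mentioned in the text is not modeled.

```python
import re
from collections import Counter

# Hypothetical single-word vocabulary; the patent gives such terms only as examples.
NUCLEAR_TERMS = {"reactor", "fuel", "moderator", "coolant", "boron", "cadmium"}

def extract_keywords(text, vocabulary=NUCLEAR_TERMS):
    """Return a Counter of vocabulary words found in text (case-insensitive)."""
    # Step 1: split the text into lowercase alphabetic tokens.
    tokens = re.findall(r"[a-z]+", text.lower())
    # Step 2: drop every token that is not a nuclear-related term, then count.
    return Counter(token for token in tokens if token in vocabulary)

sample = "The reactor coolant pump circulates coolant through the reactor core."
print(extract_keywords(sample))
```

The resulting counts are the keyword frequencies that the later membership and similarity calculations take as input.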

The step of assigning the target document to at least one classification system of the database using the keywords extracted from the target document (S110) may further include the following two steps: (1) determining the membership degree for each classification system of the database using the frequency of the extracted keywords and the weights for the classification systems (S230); and (2) assigning the classification system of the target document using the determined membership degrees (S240).

The membership degree of the target document for each classification system may be calculated by Formula 1 below.

[Formula 1]

Specifically, the database may store weight information in three levels, for example 1, 2, and 3, for each keyword included in each system. A larger value means a higher weight. The examiner may also assign the per-keyword weights for each classification system directly.

To calculate the membership degree for each classification system, first, the keywords extracted from the target document are matched against the keywords stored in the database to find the weight of each extracted keyword. Second, for each extracted keyword, its frequency is multiplied by its weight, and the products are summed; that is, the product of keyword frequency and keyword weight is summed over all extracted keywords. Besides the raw keyword frequency, a weighting value such as TF-IDF (Term Frequency-Inverse Document Frequency) may also be used here. Finally, the sum obtained above is divided by the total of the weights of the system concerned.

For example, if the keywords extracted from target document A are reactor, coolant, vessel, generator, and pump with the frequencies shown in Table 1 below, and the per-system keyword weights in the database are as shown in Table 2 below, the membership degree for system 1 can be calculated by Formula 2 below.

[Table 1]
Keyword      Frequency
Reactor      2
Coolant      2
Vessel       0
Generator    1
Pump         1

[Table 2]
Keyword      System 1   System 2   System 3   System 4   System 5
Reactor      3          2          0          0          0
Coolant      3          1          0          1          0
Steam        1          1          0          3          0
Vessel       0          3          1          0          0
Hydrogen     0          0          3          0          2
Generator    0          0          0          0          3
Pump         3          0          0          0          0

[Formula 2]
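The formula images are not reproduced in this text, so the following is only a sketch of the calculation as the prose describes it: sum frequency times weight over the document's keywords, then divide by the total of the system's weights. The function name `membership` and the dictionary layout are assumptions; the numbers are those of Tables 1 and 2.

```python
# Keyword frequencies of target document A (Table 1).
doc_a = {"reactor": 2, "coolant": 2, "vessel": 0, "generator": 1, "pump": 1}

# Per-system keyword weights (Table 2); list index i holds the weight for system i + 1.
weights = {
    "reactor":   [3, 2, 0, 0, 0],
    "coolant":   [3, 1, 0, 1, 0],
    "steam":     [1, 1, 0, 3, 0],
    "vessel":    [0, 3, 1, 0, 0],
    "hydrogen":  [0, 0, 3, 0, 2],
    "generator": [0, 0, 0, 0, 3],
    "pump":      [3, 0, 0, 0, 0],
}

def membership(freqs, weights, system):
    """Weighted keyword count for one system, divided by that system's total weight."""
    col = system - 1
    # Sum of frequency * weight over the keywords found in the document.
    weighted = sum(n * weights[k][col] for k, n in freqs.items() if k in weights)
    # Total of all keyword weights defined for this system.
    total = sum(w[col] for w in weights.values())
    return weighted / total if total else 0.0

scores = {system: membership(doc_a, weights, system) for system in range(1, 6)}
print(scores)
```

Under this literal reading, system 1 gives (2·3 + 2·3 + 1·3) / (3 + 3 + 1 + 3) = 15 / 10 = 1.5. A raw value can therefore exceed 1, so a normalization step would be needed to keep scores in a fixed range; the value in the patent's own Formula 2 may differ.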

A high membership degree for a classification system means that the target document contains many of the nuclear-related keywords for that system. The membership degree may range from 0 to 1, although the range can be adjusted through normalization as needed. To assign the classification system of the target document using the determined membership degrees, the search system may display the classification systems to the reviewer in descending order of membership degree. Referring to the displayed list, the reviewer can select the classification system of the target document based on his or her knowledge and experience. In another example of the present invention, the system may automatically display the top three classification systems in the membership list to the reviewer.
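Ordering the systems by membership degree and keeping the top three might look like the sketch below. The function `top_systems` and the score values are hypothetical and serve only to illustrate the ranking.

```python
def top_systems(scores, k=3):
    """Return the k classification systems with the highest membership degree, best first."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Illustrative membership degrees (made-up values, not taken from the patent).
example_scores = {
    "reactor coolant": 0.82,
    "reactor vessel": 0.40,
    "feedwater": 0.10,
    "turbine and generator": 0.55,
}

print(top_systems(example_scores))
```

The reviewer would then confirm or override the suggested systems, as the text describes.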

The at least one classification system stored in the database may include reactor coolant, reactor vessel, safety injection and shutdown cooling, steam generator, neutron flux monitoring, control rod assembly, nuclear fuel assembly, chemical and volume control, control, component cooling, turbine and generator, feedwater, steam, condensate, gas, emergency diesel generator, reactor containment building, waste, monitoring, plant protection, drainage, electric power, fuel handling and transfer, and other systems. Besides the classification systems listed above, the database may include other classification systems depending on the system.

By assigning a classification system to the target document, the search system can considerably reduce the number of documents to be searched and the search time. For example, suppose the membership calculation assigns the target document to systems 1, 2, and 3, and the database holds 1,600 documents in total, of which 100 belong to system 1, 200 to system 2, and 300 to system 3. The search system then needs to search only the documents for systems 1, 2, and 3 rather than every document in the database.
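The narrowing effect in this example can be illustrated with a small index from classification system to document IDs. The functions `build_index` and `candidates` are hypothetical, and the sketch assumes each stored document belongs to exactly one system, which the patent does not require.

```python
from collections import defaultdict

def build_index(documents):
    """Map each classification system to the IDs of the stored documents assigned to it."""
    index = defaultdict(set)
    for doc_id, systems in documents.items():
        for system in systems:
            index[system].add(doc_id)
    return index

def candidates(index, target_systems):
    """Return the stored documents that share at least one system with the target."""
    found = set()
    for system in target_systems:
        found |= index.get(system, set())
    return found

# Hypothetical store mirroring the example: 1,600 documents in total, of which
# 100, 200 and 300 belong to systems 1, 2 and 3 (the rest are put in system 4).
documents = {}
next_id = 0
for system, count in [(1, 100), (2, 200), (3, 300), (4, 1000)]:
    for _ in range(count):
        documents[next_id] = {system}
        next_id += 1

index = build_index(documents)
print(len(documents), len(candidates(index, [1, 2, 3])))
```

Only 600 of the 1,600 stored documents remain as candidates for the similarity calculation.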

Once at least one classification system has been assigned to the target document, step S250 determines the text-based similarity between the target document and the documents stored under that classification system. Cosine similarity may be used as the text-based similarity. It may be calculated using the keywords and keyword frequencies of the stored documents and those of the target document.

Cosine similarity is the degree of similarity between two vectors of an inner product space, measured by the cosine of the angle between them. It can be applied in any number of dimensions and is often used to measure similarity in a multidimensional positive space. In information retrieval, for example, each word forms its own dimension, and a document is a vector whose components are the number of times each word appears. The keyword frequencies of the target document and those of a document stored in the database under the target document's classification system can thus each be expressed as a vector. The cosine similarity between the two documents is the dot product of the two vectors divided by the product of their lengths.

For example, if the keyword frequencies of the target document are as shown in Table 3, and the keyword frequencies of a document stored in the database under the classification system of the target document are as shown in Table 4, the similarity between the two documents can be calculated as in Formula 3 below.

[Table 3] Keyword frequencies of the target document
Keyword            Frequency
Reactor            2
Coolant            2
Steam generator    0
Hydrogen           1
Pump               1

[Table 4] Keyword frequencies of the document stored in the database
Keyword            Frequency
Reactor            3
Coolant            1
Steam generator    1
Hydrogen           3
Pump               0

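[Formula 3]
$$\text{similarity} = \frac{2\cdot 3 + 2\cdot 1 + 0\cdot 1 + 1\cdot 3 + 1\cdot 0}{\sqrt{2^2+2^2+0^2+1^2+1^2}\,\sqrt{3^2+1^2+1^2+3^2+0^2}} = \frac{11}{\sqrt{10}\,\sqrt{20}} \approx 0.78$$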
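The calculation of Formula 3 can be reproduced with a short sketch. The function name and the dictionary representation of the keyword frequencies are illustrative assumptions; the frequencies themselves are those of Tables 3 and 4.

```python
import math

def cosine_similarity(freq_a, freq_b):
    """Cosine similarity of two keyword-frequency dictionaries."""
    keywords = set(freq_a) | set(freq_b)
    dot = sum(freq_a.get(k, 0) * freq_b.get(k, 0) for k in keywords)
    norm_a = math.sqrt(sum(v * v for v in freq_a.values()))
    norm_b = math.sqrt(sum(v * v for v in freq_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a document with no keywords is treated as dissimilar
    return dot / (norm_a * norm_b)

# Keyword frequencies from Table 3 (target) and Table 4 (stored document).
target = {"reactor": 2, "coolant": 2, "steam generator": 0, "hydrogen": 1, "pump": 1}
stored = {"reactor": 3, "coolant": 1, "steam generator": 1, "hydrogen": 3, "pump": 0}

print(round(cosine_similarity(target, stored), 4))  # 0.7778
```

The inner product is 11 and the vector lengths are √10 and √20, which gives 11 / √200 ≈ 0.78.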

The similarity may be defined to range from 0 to 1, although other ranges are possible depending on the system. In addition to the keyword frequencies of a document, the TF-IDF weights of the document may be used in calculating the similarity. When displaying the similarity results to the reviewer, the search system may first calculate the similarity between the target document and each document stored in the database under the classification system of the target document, and then list the documents in descending order of similarity. The reviewer can refer to the sorted similarity list and select similar documents on the basis of existing knowledge and experience. In another example of the present invention, the system may automatically select a document as a similar document, without reviewer intervention, when its similarity is equal to or greater than a predetermined value.
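The two presentation modes described above, a sorted list for the reviewer and automatic selection above a threshold, might look as follows. This is a sketch only: the function names, document identifiers, scores, and the threshold of 0.75 are all hypothetical.

```python
def rank_by_similarity(similarities):
    """Sort (document, similarity) pairs from most to least similar."""
    return sorted(similarities.items(), key=lambda item: item[1], reverse=True)

def auto_select(similarities, threshold):
    """Select documents whose similarity is at or above the threshold."""
    return [doc for doc, score in rank_by_similarity(similarities) if score >= threshold]

# Hypothetical per-document similarities to the target document.
scores = {"doc-A": 0.42, "doc-B": 0.78, "doc-C": 0.91}

print(rank_by_similarity(scores))  # [('doc-C', 0.91), ('doc-B', 0.78), ('doc-A', 0.42)]
print(auto_select(scores, 0.75))   # ['doc-C', 'doc-B']
```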

FIG. 3 is a flowchart illustrating an image-based similarity determination method according to an example of the present invention.

In the step of extracting the images included in the target document (S130) according to an example of the present invention, the images may be extracted using a program that selects the image objects in the target document and saves them as separate files, or a program that outputs each page of the document as a JPEG file. The extracted images may include layout drawings, photographs, drawings, tables, graphs, and the like.

Referring to FIG. 3, the step of designating the target document as at least one classification system of the database using the image extracted from the target document (S140) according to an example of the present invention may include: extracting local feature points of the image (S310); calculating the frequency of matching points of the image using the local feature points and the feature points of previously reviewed images (S320); calculating the frequency of matching points of an image stored in the database using the local feature points of the stored image and the feature points of the previously reviewed images (S325); determining the similarity between the extracted image and the image stored in the database using the matching point frequencies (S330); determining similar images among the images stored in the database using the determined similarity (S340); and designating at least one classification system of the database corresponding to the similar images (S350).

In the step of extracting the local feature points of the image extracted from the target document (S310), the local feature points may be extracted using an algorithm such as SIFT (Scale-Invariant Feature Transform) or SURF (Speeded-Up Robust Features). In general, the SIFT algorithm extracts feature points using DoG (Difference of Gaussians) and computes a 128-dimensional feature vector whose components take values from 0 to 255. Detectors other than DoG may also be used to extract the feature points. Because a feature vector extracted by the DoG method can take 256¹²⁸ possible values, two feature vectors rarely match exactly.

In the step of calculating the frequency of matching points of the image using the local feature points and the feature points of previously reviewed images (S320), and in the step of calculating the frequency of matching points of an image stored in the database using the local feature points of the stored image and the feature points of the previously reviewed images (S325), the feature points of the previously reviewed images may be implemented as a Visual Word dictionary. The Visual Word dictionary is built by extracting feature points from the images included in the previously reviewed documents in the database and clustering them. The frequency of matching points can be calculated by comparing the feature points of the Visual Word dictionary with the feature points of the target image or of an image stored in the database.

The Visual Word dictionary contains 128-dimensional SIFT feature vectors, and the K-means algorithm may be used for the clustering. The number of clusters for the K-means algorithm may be specified in advance by the reviewer or the developer.
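The construction of a Visual Word dictionary and the counting of matching points can be sketched in a few lines. This is an illustration under stated assumptions, not the patented implementation: the function names are hypothetical, the cluster centers are seeded with the first k descriptors for determinism, a "matching point" is taken to be the nearest visual word of each descriptor, and 2-dimensional toy descriptors stand in for 128-dimensional SIFT vectors.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two descriptors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(vector, centers):
    """Index of the visual word (cluster center) closest to the vector."""
    return min(range(len(centers)), key=lambda i: squared_distance(vector, centers[i]))

def build_dictionary(descriptors, k, iterations=20):
    """Plain k-means; the first k descriptors seed the cluster centers."""
    centers = [list(d) for d in descriptors[:k]]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for d in descriptors:
            clusters[nearest(d, centers)].append(d)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous center
                centers[i] = [sum(column) / len(members) for column in zip(*members)]
    return centers

def matching_point_frequencies(descriptors, centers):
    """Count how many descriptors of one image fall on each visual word."""
    histogram = [0] * len(centers)
    for d in descriptors:
        histogram[nearest(d, centers)] += 1
    return histogram

# Toy descriptors pooled from previously reviewed images (hypothetical data).
past_descriptors = [(0, 0), (10, 10), (0, 1), (10, 11), (1, 0), (11, 10)]
dictionary = build_dictionary(past_descriptors, k=2)

# Descriptors of one image extracted from the target document (hypothetical).
target_descriptors = [(0.5, 0.5), (9, 9), (10, 12)]
print(matching_point_frequencies(target_descriptors, dictionary))  # [1, 2]
```

The resulting frequency vector has one component per visual word, so every image is described by a vector of the same length regardless of how many feature points it contains.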

In the step of determining the similarity between the extracted image and an image stored in the database (S330), the similarity may be determined by calculating the cosine similarity, taking as input the matching point frequencies of the image extracted from the target document and the matching point frequencies of the image stored in the database. Cosine similarity indicates how similar two vectors are, measured by the cosine of the angle between the two vectors in an inner product space. Since the matching point frequencies of the image extracted from the target document and those of the image stored in the database can each be expressed as a vector, the cosine similarity between the images can be calculated as (the inner product of the two vectors) / (the product of the lengths of the two vectors). The two vectors may also be expressed using TF-IDF weights.
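A TF-IDF-weighted version of this comparison could be sketched as follows. The patent does not specify which TF-IDF variant is used; the smoothed inverse document frequency below is one common choice and is an assumption, as are the function names and the 3-word toy histograms.

```python
import math

def idf_weights(histograms):
    """Smoothed inverse document frequency of each visual word (assumed variant)."""
    n_images = len(histograms)
    weights = []
    for word in range(len(histograms[0])):
        containing = sum(1 for h in histograms if h[word] > 0)
        weights.append(math.log((1 + n_images) / (1 + containing)) + 1.0)
    return weights

def tfidf(histogram, weights):
    """Weight the relative frequency of each visual word by its IDF."""
    total = sum(histogram)
    if total == 0:
        return [0.0] * len(histogram)
    return [count / total * w for count, w in zip(histogram, weights)]

def cosine(u, v):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Hypothetical matching-point frequencies of three images stored in the database.
stored = [[3, 0, 1], [0, 2, 1], [1, 0, 2]]
weights = idf_weights(stored)

# Hypothetical matching-point frequencies of an image from the target document.
target = [3, 0, 1]
scores = [cosine(tfidf(target, weights), tfidf(h, weights)) for h in stored]
best = max(range(len(stored)), key=lambda i: scores[i])
print(best)  # 0: the first stored image has the same frequencies as the target
```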

FIG. 5 shows an example in which the cosine similarity is calculated taking as input the matching point frequencies of an image extracted from document A under review and the matching point frequencies of an image extracted from comparison document B stored in the database. SIFT vectors normally have 128 dimensions, but for convenience FIG. 5 uses three-dimensional vectors to calculate the similarity. Also for convenience, the similarity in FIG. 5 is calculated on the assumption that documents A and B each contain a single image.

The step of retrieving similar images from the images stored in the database using the determined similarity (S340) may include presenting to the reviewer the images whose similarity is equal to or greater than a certain value, or having the system automatically select such images as similar images and present them to the reviewer. Step S340 may also include filtering out images whose similarity to separately stored images of low relevance to the review is equal to or greater than a certain value. For example, when the similar images consist only of images of low relevance to the review, such as logos or background pictures, the filtering process can be used to exclude them from review. Specifically, the feature points of low-relevance images such as logos and background pictures are stored separately and compared with the image under review; if the similarity is equal to or greater than a certain value, the image is judged to be of low relevance to the review and may be excluded.
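The filtering step can be sketched as a comparison against a separately stored set of low-relevance reference images. The function names, the histogram representation, and the threshold of 0.9 are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Inner product divided by the product of the vector lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def filter_low_relevance(images, low_relevance_refs, threshold):
    """Keep only images that are not too similar to any low-relevance reference."""
    kept = []
    for name, histogram in images.items():
        if all(cosine(histogram, ref) < threshold for ref in low_relevance_refs):
            kept.append(name)
    return kept

# Hypothetical matching-point frequencies; the reference represents a logo.
logo_reference = [5, 0, 0]
candidates = {"figure-1": [1, 4, 2], "logo-copy": [5, 0, 1]}

print(filter_low_relevance(candidates, [logo_reference], threshold=0.9))  # ['figure-1']
```

Here "logo-copy" has a similarity of about 0.98 to the logo reference and is excluded, while "figure-1" (about 0.22) is kept.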

The step of designating at least one classification system of the database corresponding to the retrieved similar images (S350) may include designating the classification system of an image included in the target document using the system information tagged to the images retrieved as similar images according to the similarity. To provide the system information tagged to the retrieved images, the database may store information on the images extracted from past review materials, as shown in FIG. 6.
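One way to picture this step is as a lookup of the tags attached to each retrieved image. The record layout of FIG. 6 is not reproduced here, so the dictionary of tags, the image identifiers, and the function name below are hypothetical.

```python
from collections import Counter

def designate_systems(similar_image_ids, image_tags):
    """Collect the classification systems tagged on the retrieved similar images,
    ordered from most to least frequently tagged."""
    counts = Counter()
    for image_id in similar_image_ids:
        counts.update(image_tags.get(image_id, []))
    return [system for system, _ in counts.most_common()]

# Hypothetical tags stored with images extracted from past review materials.
image_tags = {
    "img-1": ["steam generator"],
    "img-2": ["steam generator", "reactor coolant"],
    "img-3": ["turbine and generator"],
}

print(designate_systems(["img-1", "img-2"], image_tags))
# ['steam generator', 'reactor coolant']
```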

The step of determining the image-based similarity between the target document and a document stored in the database under the at least one classification system (S150) according to an example of the present invention may include calculating the similarity using the number of similar images between the stored document and the target document and the total number of images in the target document (S360). That is, once the classification systems for the images of the target document have been determined, the system can search the documents stored in the database under the designated classification systems for the document that contains the most images similar to those in the target document. The image-based similarity between a retrieved document and the target document can be calculated as (the number of similar images) / (the total number of images in the target document).
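The ratio above can be sketched as follows. One interpretive assumption is made and should be read as such: a target image counts as a "similar image" if at least one image in the stored document is judged similar to it. The function name and the stand-in similarity test are hypothetical.

```python
def image_based_similarity(target_images, stored_images, is_similar):
    """(number of similar images) / (total number of images in the target document)."""
    if not target_images:
        return 0.0
    similar = sum(
        1 for t in target_images if any(is_similar(t, s) for s in stored_images)
    )
    return similar / len(target_images)

# Stand-in similarity test for illustration: images are labels, equal labels match.
same_label = lambda a, b: a == b

target_images = ["a", "b", "c", "d"]
stored_images = ["a", "c", "x"]
print(image_based_similarity(target_images, stored_images, same_label))  # 0.5
```

In practice `is_similar` would be the thresholded cosine similarity of the matching point frequencies.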

The step of retrieving similar documents using the text-based similarity and the image-based similarity according to an example of the present invention may present to the reviewer the result of combining the text-based similar document search results with the image-based similar document search results. For example, the combined result may be defined as (text-based similarity + image-based similarity) / 2.
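A minimal sketch of the averaging follows. Treating a document that is missing from one of the two result sets as having a similarity of 0 in that set is an assumption, and the function name and scores are hypothetical.

```python
def combine(text_scores, image_scores):
    """Average the text-based and image-based similarity of each document."""
    documents = set(text_scores) | set(image_scores)
    return {
        doc: (text_scores.get(doc, 0.0) + image_scores.get(doc, 0.0)) / 2
        for doc in documents
    }

# Hypothetical similarities of stored documents to the target document.
text_scores = {"doc-A": 0.78, "doc-B": 0.60}
image_scores = {"doc-A": 0.50, "doc-B": 0.90, "doc-C": 0.40}

combined = combine(text_scores, image_scores)
most_similar = max(combined, key=combined.get)
print(most_similar)  # doc-B
```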

In the next step, based on the combined result, the reviewer can use the document recommended as the most similar document in a new review. The search system may also repeat the text-based and image-based similar document searches. If a keyword that was not used in the system classification process is found during the document search, it may be displayed to the reviewer.

The reviewer can select keywords presented by the system, assign a new weight to each keyword, and store the new keywords and their weights in the database. After the final review is completed, the text, images, and system information of the reviewed document may be stored in the database.

FIG. 4 is a block diagram illustrating a similar document retrieval system according to an example of the present invention.

Referring to FIG. 4, a similar document retrieval system according to an example of the present invention may include: a database 460 including at least one classification system and documents stored under the at least one classification system; a keyword extraction unit 410 that extracts keywords from information on a target document; an image extraction unit 420 that extracts images from the information on the target document; a system designation unit 430 that designates the target document as at least one classification system of the database using the keywords and the images; a similarity determination unit 440 that determines the text-based similarity and the image-based similarity between the target document and the documents stored under the at least one classification system; and a document retrieval unit 450 that retrieves similar documents using the text-based similarity and the image-based similarity.

The database 460 may store, in the form of a Visual Word dictionary, the matching point information constructed by extracting feature points from the images included in previously reviewed documents and clustering them, and may provide the matching point information when the system designation unit 430 or the similarity determination unit 440 requests it.

The system designation unit 430 may designate at least one classification system of the target document using the keywords extracted by the keyword extraction unit 410 and the images extracted by the image extraction unit 420, and may store the designated classification system in the database 460. When requested by the similarity determination unit 440 or the document retrieval unit 450, the database 460 may provide the stored classification system of the target document to the requesting unit.

The similarity determination unit 440 may store the text-based similarity and the image-based similarity it has determined in the database 460, and the database 460 may provide the determined text-based similarity and image-based similarity to the document retrieval unit 450 when the document retrieval unit 450 requests them.

The document retrieval unit 450 may store in the database 460 the similar documents retrieved using the text-based similarity and the image-based similarity. Although not shown in FIG. 4, the similar document retrieval system according to an example of the present invention may further include a display unit for showing the reviewer the information stored in the database 460, such as the at least one classification system of the target document, the text-based similarity, the image-based similarity, and the similar documents retrieved using the text-based similarity and the image-based similarity.

Although examples of the present invention have been described above with reference to the accompanying drawings, a person of ordinary skill in the art to which the present invention pertains will understand that the invention may be embodied in other specific forms without changing its technical idea or essential features. The examples described above should therefore be understood as illustrative in all respects and not restrictive.

400: target document
410: keyword extraction unit
420: image extraction unit
430: system designation unit
440: similarity determination unit
450: document retrieval unit
460: database

Claims

A similar document retrieval method comprising:
extracting a keyword from information on a target document;
designating the target document as at least one classification system of a database using the keyword;
determining a text-based similarity between a document stored in the at least one classification system and the target document;
extracting an image included in the target document;
designating the target document as at least one classification system of the database using the image;
determining an image-based similarity between the document stored in the at least one classification system and the target document; and
retrieving a similar document using the text-based similarity and the image-based similarity,
wherein determining the image-based similarity between the document stored in the at least one classification system and the target document comprises:
calculating the image-based similarity using the number of similar images between the document stored in the at least one classification system and the target document and the total number of images of the target document.

The method of claim 1, wherein extracting the keyword from the information on the target document comprises:
inputting the information on the target document;
converting the information into text; and
extracting the keyword from the text.

The method of claim 2, wherein the information on the target document includes at least one of text and an image.

The method of claim 2, wherein the keyword is a word related to nuclear energy.

The method of claim 1, wherein designating the target document as at least one classification system of the database using the keyword comprises:
determining at least one degree of belonging of the target document to at least one classification system of the database, using the frequency of the keyword and a weight of the keyword for the at least one classification system of the database; and
designating at least one classification system of the target document using the at least one degree of belonging.

The method of claim 5, wherein the at least one classification system of the database includes at least one of the following systems: reactor coolant, reactor vessel, safety injection and quench cooling, steam generator, neutron flux monitoring, control rod assembly, fuel assembly, a cooling system, a cooling system, turbine and generator, feedwater, steam, condensate, gas, emergency diesel generator, reactor containment building, waste, and monitoring.

The method of claim 1, wherein determining the text-based similarity between the document stored in the at least one classification system and the target document comprises:
calculating a cosine similarity using the keywords and keyword frequencies of the document stored in the at least one classification system and the keywords and keyword frequencies of the target document.

The method of claim 1, wherein designating the target document as at least one classification system of the database using the image comprises:
extracting local feature points of the image;
calculating a frequency of matching points of the image using the local feature points and feature points of previously reviewed images;
calculating a frequency of matching points of an image stored in the database using local feature points of the image stored in the database and the feature points of the previously reviewed images;
determining a similarity between the image and the image stored in the database using the frequencies of matching points;
determining a similar image among the images stored in the database using the similarity; and
designating at least one classification system of the database corresponding to the similar image.

The method of claim 8, wherein determining the similar image among the images stored in the database using the similarity comprises:
filtering out an image whose similarity to a separately stored image of low relevance to the review is equal to or greater than a predetermined value.

delete

A similar document retrieval system comprising:
a database including at least one classification system and a document stored in the at least one classification system;
a keyword extraction unit that extracts a keyword from information on a target document;
an image extraction unit that extracts an image from the information on the target document;
a system designation unit that designates the target document as the at least one classification system using the keyword and the image;
a similarity determination unit that determines a text-based similarity and an image-based similarity between the document stored in the at least one classification system and the target document; and
a document retrieval unit that retrieves a similar document using the text-based similarity and the image-based similarity,
wherein the similarity determination unit calculates the image-based similarity using the number of similar images between the document stored in the at least one classification system and the target document and the total number of images of the target document.

The similar document retrieval system according to claim 11, wherein the database includes matching point information constructed by extracting feature points from images included in previously reviewed documents and clustering them.

The similar document retrieval system according to claim 11, wherein the system designation unit stores, in the database, the at least one classification system of the target document designated using the keyword and the image.

The similar document retrieval system according to claim 11, wherein the similarity determination unit stores the text-based similarity and the image-based similarity in the database.

The similar document retrieval system according to claim 11, wherein the document retrieval unit stores the similar document retrieved using the text-based similarity and the image-based similarity in the database.