KR20070090321A

KR20070090321A - System for extracting information from document, method thereof and recording medium thereof

Info

Publication number: KR20070090321A
Application number: KR1020060019897A
Authority: KR
Inventors: 권찬용
Original assignee: (주)윕스
Priority date: 2006-03-02
Filing date: 2006-03-02
Publication date: 2007-09-06

Abstract

A system and a method for extracting information from a document, and a recording medium storing the same, are provided. Information obtained by recognizing patent documents is built into a database so that data can be offered to a user quickly, and a new drawing can be generated by combining a drawing with the drawing information extracted from the patent document. A recognition unit (71) recognizes codes in the drawing part of a document that consists of a description part, which is a text area, and a drawing part, which is an image area. An extraction unit (72) extracts the information related to the codes from the description part. A database (73) stores the patent documents and the information extracted by the extraction unit. The information includes at least one of the number and order of the drawings forming the drawing part, the codes, and the code names. The extraction unit extracts the information by using location information of per-drawing text blocks, each of which is the detailed explanation of one drawing.

Description

System for Extracting Information from Document, Method Thereof and Recording Medium Thereof

FIG. 1 is a view briefly showing the structure of a patent document according to an embodiment of the present invention;

FIG. 2 is a block diagram showing a document information extraction system according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating a document information extraction method according to an embodiment of the present invention.

* Explanation of codes for the main parts of the drawings *

10, 30, 50: description part

20, 40, 60: drawing part

70: information extraction system

71: recognition unit

72: extraction unit

73: database

151, 357, 560: drawing description section

156, 355, 570: configuration description section

The present invention relates to a document information extraction system, an extraction method thereof, and a recording medium on which the method is recorded, and more particularly to a document information extraction system that extracts drawing information from a patent document in order to generate a new drawing combining the drawing with the drawing information, an extraction method thereof, and a recording medium on which the method is recorded.

In general, document recognition is a technology that reads printed matter, divided into picture areas and text areas, in the form of an image, analyzes it, and converts the characters of the text areas into text that can be edited in a document editor. It emerged so that the portions a user wants from existing printed matter could be entered into a computer automatically, and it is a subfield of character recognition.

Character recognition is a field of pattern recognition in which characters are recognized from visual information and their meaning is understood; it has been put to practical use in areas such as optical character recognition (OCR) and drawing recognition. Recently it has been advancing through combination with neural networks, fuzzy theory, genetic algorithms, and natural language processing from the field of artificial intelligence, as well as with psychology and cognitive science. OCR, a representative example of character recognition, is in practical use in many fields, including automatic mail sorting, inspection and sorting of industrial products, drawing recognition, character recognition for faxes, and automatic entry of checks.

One example of such a document and image editing technique is disclosed in US Patent No. 5,623,681 (registered April 22, 1997; method and apparatus for synchronizing, displaying, and manipulating text and image documents).

The technique disclosed in US Patent No. 5,623,681 is an apparatus that generates an equivalent text file of a document from a source text file and a source image file. It comprises: extraction means for extracting the source text file and the source image file from at least one storage medium; page dividing means for paginating the source text file by using the source image file of the document in order to generate the equivalent text file; a display device for displaying text and images; a central processing unit (CPU); storage means, coupled to the CPU, for storing at least one text document composed of the equivalent text file and at least one image document composed of the stored image file; and a user interface generated by the CPU for display on the display device. According to that patent, the equivalent text file of the document is generated from the source text file and the source image file through a step of extracting the source text file and the source image file from at least one storage medium and a step of paginating the source text file by using the source image file of the document.

In addition, an example of a technique for providing information composed of text and pictures is disclosed in Korean Patent No. 0495034 (registered June 2, 2005; information providing system and method using an infobox).

The technique disclosed in Korean Patent No. 0495034 is an information providing method using an infobox, performed by a specialized DB system and an information providing server. The specialized DB system periodically collects data for specialized DBs and collects given contents in a contents master database, and the information providing server is connected to at least one client terminal through a network. The method comprises a step in which an infobox processing module of the information providing server generates an extraction variable file for each specialized DB in a rule-based manner, a keyword extraction step that uses the extraction variable file, and a step of outputting the infobox information of the contents master DB created through the keyword extraction step (where the infobox information includes at least a search keyword and detailed information). According to that patent, users reading content such as a news article can, by means of the infobox processing module, easily check information related to a given activated text or picture simply by moving the mouse over it.

However, in the prior art, including the techniques disclosed in the above publications, purchasing an OCR product increases the cost of producing the service.

In addition, in a service that provides document information combining images and text in real time, the document must be recognized and the requested information extracted every time a user requests it, so a fast service cannot be provided to users who need an immediate response.

Furthermore, when many users request such a real-time service at the same time, the system is overloaded and the service is delayed or interrupted.

An object of the present invention is to solve the problems described above by providing a document information extraction system that builds the information obtained through document recognition into a database and provides it to users, an extraction method thereof, and a recording medium on which the method is recorded.

Another object of the present invention is to provide a document information extraction system that can provide a faster service to users through the constructed database, an extraction method thereof, and a recording medium on which the method is recorded.

To achieve the above objects, a document information extraction system according to the present invention comprises: a recognition unit that recognizes codes from the drawing part of a document consisting of a description part, which is a text area, and a drawing part, which is a picture area; an extraction unit that extracts information related to the codes from the description part; and a database in which a plurality of such documents and the information extracted by the extraction unit are stored.

Further, in the document information extraction system according to the present invention, the information includes at least one of the number and order of the drawings constituting the drawing part, the codes, and the code names.

Further, in the document information extraction system according to the present invention, the extraction unit extracts the information by using location information of per-drawing text blocks, each of which is the detailed explanation of one drawing.

Further, in the document information extraction system according to the present invention, the extraction unit regards content written in the description part in any one of three formats as corresponding to the code recognized by the recognition unit, and reads it. (The three formats are inline images in the original publication and are not reproduced in this text.)

Further, in the document information extraction system according to the present invention, the extraction unit regards the noun form located immediately before or after the code in the description part as the code name, and reads it.

Further, in the document information extraction system according to the present invention, one or more positions of the per-drawing text block can be set for each drawing.

To achieve the above objects, a document information extraction method according to the present invention comprises the steps of: (a) selecting one document from a database storing a plurality of documents, each consisting of a description part, which is a text area, and a drawing part, which is a picture area; (b) recognizing codes from the drawing part; (c) extracting the number and order of the drawings from the drawing description section within the description part; (d) storing the number and order of the drawings in the database as drawing information; (e) setting the starting position of the configuration description section within the description part as the initial search position; (f) determining, in the configuration description section, the location information of the per-drawing text blocks, each of which is the detailed explanation of one drawing; (g) reading, from the per-drawing text blocks, the codes corresponding to the codes recognized in step (b) together with their code names; and (h) adding the codes and code names to the drawing information of step (d).

Further, in the document information extraction method according to the present invention, when the document is a Korean patent document, the drawing description section is the part in which the 'Brief Description of the Drawings' is written.

Further, in the document information extraction method according to the present invention, when the document is a Japanese patent document, the drawing description section is the part in which the 'Brief Description of the Drawings' is written.

Further, in the document information extraction method according to the present invention, when the document is a US patent document, the drawing description section is the part in which the 'BRIEF DESCRIPTION OF THE DRAWINGS' is written.

Further, in the document information extraction method according to the present invention, when the document is a Korean patent document, the configuration description section is the part in which the 'Constitution and Operation of the Invention' is written.

Further, in the document information extraction method according to the present invention, when the document is a Japanese patent document, the configuration description section is the part in which the 'Embodiments of the Invention' is written.

Further, in the document information extraction method according to the present invention, when the document is a US patent document, the configuration description section is the part in which the 'DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS' is written.

Further, in the document information extraction method according to the present invention, step (g) includes a step of regarding content written in any one of three formats as a code and reading it. (The three formats are inline images in the original publication and are not reproduced in this text.)

Further, in the document information extraction method according to the present invention, step (g) further includes a step of regarding the noun form located immediately before or after the code as the code name and reading it.

Further, in the document information extraction method according to the present invention, one or more positions of the per-drawing text block can be set for each drawing in step (f).

Further, to achieve the above objects, a computer-readable recording medium according to the present invention records a program for executing the steps of: (a) selecting one document from a database storing a plurality of documents, each consisting of a description part, which is a text area, and a drawing part, which is a picture area; (b) recognizing codes from the drawing part; (c) extracting the number and order of the drawings from the drawing description section within the description part; (d) storing the number and order of the drawings in the database as drawing information; (e) setting the starting position of the configuration description section within the description part as the initial search position; (f) determining, in the configuration description section, the location information of the per-drawing text blocks, each of which is the detailed explanation of one drawing; (g) reading, from the per-drawing text blocks, the codes corresponding to the codes recognized in step (b) together with their code names; and (h) adding the codes and code names to the drawing information of step (d).

Hereinafter, the most preferred embodiment of the present invention is described in detail with reference to the accompanying drawings so that a person of ordinary skill in the art can easily carry out the invention. In describing the invention, identical parts are given identical codes and their repeated description is omitted.

FIG. 1 is a view briefly showing the structure of a patent document according to an embodiment of the present invention.

FIG. 1A shows the structure of a Korean patent document, FIG. 1B that of a Japanese patent document, and FIG. 1C that of a US patent document.

As shown in FIGS. 1A to 1C, Korean, Japanese, and US patent documents each have a specific structure. That is, a patent document has a distinguishable form and can therefore be stored and processed. A patent document consists broadly of a description part (10, 30, 50), which is a text area, and a drawing part (20, 40, 60), which is a picture area.

First, the description part (10) of a Korean patent document consists of bibliographic data (110), an abstract (120), a representative drawing (130), index terms (140), a specification (150), and claims (160). The bibliographic data (110) includes an application number (110a), a filing date (110b), an applicant (110c), a title of the invention (110d), and the like. The specification (150) consists of a brief description of the drawings (151) and a detailed description of the invention (152). The detailed description of the invention (152) consists of the object of the invention (153), the constitution and operation of the invention (156), and the effect of the invention (157). The object of the invention (153) in turn consists of the technical field of the invention and the prior art in that field (154) and the technical problem to be solved by the invention (155). The drawing part (20) consists of drawings (210).

Next, the description part (30) of a Japanese patent document consists of bibliographic data (310), an abstract (320), a representative drawing (330), claims (340), and a detailed description of the invention (350). The bibliographic data (310) includes an application number (310a), a filing date (310b), an applicant (310c), a title of the invention (310d), and the like. The detailed description of the invention (350) consists of the technical field of the invention (351), the prior art (352), the problem to be solved by the invention (353), the means for solving the problem (354), the embodiments of the invention (355), the effect of the invention (356), and a brief description of the drawings (357). The drawing part (40) consists of drawings (410).

Finally, the description part (50) of a US patent document consists of bibliographic data (510), an abstract (520), a representative drawing (530), the background of the invention (540), the summary of the invention (550), the BRIEF DESCRIPTION OF THE DRAWINGS (560), the DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS (570), and the CLAIMS (580). The bibliographic data (510) includes an application number (510a), a filing date (510b), an applicant (510c), a title of the invention (510d), and the like. The background of the invention (540) consists of the field of the invention (541) and the background art (542). The drawing part (60) consists of drawings (610).

The structures of Korean, Japanese, and US patent documents have been described above, but a document may of course have a more detailed structure than the one described, and section names with meanings similar to those mentioned may also be used.

Within this structure there are parts that describe the drawings and parts that describe the constitution of the invention; these are called the drawing description section and the configuration description section, respectively. The drawing description section is the brief description of the drawings (151, 357) in a Korean or Japanese patent document and the BRIEF DESCRIPTION OF THE DRAWINGS (560) in a US patent document. The configuration description section is the constitution and operation of the invention (156) in a Korean patent document, the embodiments of the invention (355) in a Japanese patent document, and the DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS (570) in a US patent document. Hereinafter these are referred to as the 'drawing description section (151, 357, 560)' and the 'configuration description section (156, 355, 570)'.
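The mapping from a document's country to its drawing description section and configuration description section can be pictured as a small lookup table. The following is only a minimal sketch: the table layout, the function name `locate_sections`, and the use of English section titles for the Korean and Japanese documents are assumptions made for illustration, since real documents carry these titles in their own language and may use similar but different names.

```python
# Section titles per country, following the description above.
# The English titles for KR and JP stand in for the native-language titles
# (an assumption made to keep the sketch readable).
SECTION_TITLES = {
    "KR": ("Brief Description of the Drawings",
           "Constitution and Operation of the Invention"),
    "JP": ("Brief Description of the Drawings",
           "Embodiments of the Invention"),
    "US": ("BRIEF DESCRIPTION OF THE DRAWINGS",
           "DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS"),
}


def locate_sections(text, country):
    """Return the character offset of each section title, or -1 if absent."""
    drawing_title, config_title = SECTION_TITLES[country]
    return {
        "drawing_description": text.find(drawing_title),
        "configuration_description": text.find(config_title),
    }
```

A title that does not occur in the text yields -1, which a caller would have to treat as "section not found" before extracting anything from it.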

Next, a document information extraction system according to an embodiment of the present invention is described with reference to FIG. 2.

FIG. 2 is a block diagram showing a document information extraction system according to an embodiment of the present invention.

As shown in FIG. 2, the document information extraction system (70) consists of a recognition unit (71) that recognizes codes from the drawing part (20, 40, 60) of a patent document made up of a description part (10, 30, 50), which is a text area, and a drawing part (20, 40, 60), which is a picture area; an extraction unit (72) that extracts, from the description part (10, 30, 50), the drawing information corresponding to the codes recognized by the recognition unit (71); and a database (73) that stores a plurality of documents and the drawing information extracted by the extraction unit (72). The drawing part (20, 40, 60) is a picture area made up of pictures (211, 411, 611) and characters that point to specific portions of the pictures (for example, 100, 200, and 300 shown in the pictures of the drawing parts in FIGS. 1A, 1B, and 1C); these characters are the codes of the drawings. That is, the codes recognized by the recognition unit (71) are the reference numbers (211a, 411a, 611a) written in the drawing part (20, 40, 60). The drawing information extracted from the description part (10, 30, 50) refers to the number of drawings, the order of the drawings, the codes, the code names, and the like.
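The three components can be sketched as a small data model. This is an illustrative arrangement rather than the patent's implementation: the class names `DrawingInfo` and `ExtractionSystem`, the callable signatures, and the in-memory dictionary standing in for the database (73) are all assumptions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class DrawingInfo:
    """Drawing information: number of drawings, their order, codes and names."""
    count: int = 0
    order: List[str] = field(default_factory=list)
    code_names: Dict[str, str] = field(default_factory=dict)


@dataclass
class ExtractionSystem:
    # Recognition unit (71): drawing image -> codes found in it (hypothetical signature).
    recognize: Callable[[bytes], List[str]]
    # Extraction unit (72): description text + codes -> drawing information.
    extract: Callable[[str, List[str]], DrawingInfo]
    # Database (73), modelled here as a plain dictionary keyed by document id.
    database: Dict[str, DrawingInfo] = field(default_factory=dict)

    def process(self, doc_id, drawing_image, description_text):
        codes = self.recognize(drawing_image)
        info = self.extract(description_text, codes)
        self.database[doc_id] = info  # stored once, served without re-recognition
        return info
```

Storing the result at processing time is what lets later requests be answered from the database instead of repeating recognition for every request.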

The extraction unit (72) shown in FIG. 2 extracts the number and order of the drawings from the drawing description section (151, 357, 560) of the description part (10, 30, 50). The drawing description section of a patent document contains a brief explanation of each drawing, such as 'FIG. 1 is ...' or 'FIG. 2 is ...'. From these entries it is possible to determine how many drawings the patent document contains and in what order they are listed.
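Counting the drawings and recovering their order from such entries could look roughly like the sketch below. The regular expression is an assumption: it accepts the English form 'FIG. N is' and the Korean forms '도 N은' / '도 N는', and the function name `figure_count_and_order` is hypothetical.

```python
import re

# Matches "FIG. 1 is", "도 1은", "도 2는", and lettered figures such as "FIG. 1a is".
FIGURE_ENTRY = re.compile(r"(?:FIG\.|도)\s*(\d+[a-z]?)\s*(?:is|은|는)")


def figure_count_and_order(drawing_description):
    """Return (number of drawings, figure labels in order of first appearance)."""
    order = []
    for match in FIGURE_ENTRY.finditer(drawing_description):
        label = match.group(1)
        if label not in order:  # ignore repeated mentions of the same figure
            order.append(label)
    return len(order), order
```

Applied to the three entries listed at the top of this document, the sketch yields three drawings in the order 1, 2, 3.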

In addition, the extraction unit (72) extracts codes and code names from the body text of the configuration description section (156, 355, 570) of the description part (10, 30, 50). The codes are found by searching for and reading the text that corresponds to the codes recognized by the recognition unit (71), and the drawing information is extracted by using the location information of the per-drawing text blocks, each of which is the detailed explanation of one drawing. The location information is obtained from identification numbers in Korean or Japanese patent documents and from line numbers in US patents. That is, the explanations of each drawing written in the configuration description section (156, 355, 570) in the form 'FIG. 1 ...', 'FIG. 2 ...' are treated as per-drawing text blocks, and the drawing information for FIG. 1 is searched for within the text block of FIG. 1, which makes the extraction more effective.

The extraction unit (72) extracts the drawing information corresponding to the codes recognized by the recognition unit (71): it regards content written in the description part (10, 30, 50) in any one of three formats as corresponding to a code recognized by the recognition unit (71), and reads it. The three formats, and the symbol that the original uses in them to denote a blank (space), are inline images in the original publication and are not reproduced in this text. For example, when the recognition unit (71) recognizes the code '100' in the drawing part (20, 40, 60), the extraction unit (72) reads content written in the description part in any of those formats as corresponding to the code '100'. Korean, Japanese, and US patent documents each mainly use one of the three formats.
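Because the exact formats are not reproduced here, the sketch below substitutes assumed ones purely for illustration: a code inside parentheses with optional blanks, such as '(100)' or '( 100 )', and a bare code that is not adjacent to another digit or ASCII letter, such as the '200' in '모니터200를'. The function name `find_code` is hypothetical, and these patterns should not be read as the formats defined in the patent.

```python
import re


def find_code(text, code):
    """Return every occurrence of `code` written in one of the assumed formats."""
    c = re.escape(code)
    pattern = re.compile(
        rf"\(\s*{c}\s*\)"                        # assumed format: (100) or ( 100 )
        rf"|(?<![\dA-Za-z]){c}(?![\dA-Za-z])"    # assumed format: bare 100
    )
    return [match.group(0) for match in pattern.finditer(text)]
```

The look-around guards keep the code '100' from matching inside a longer number such as '1000', which is the main way a naive substring search would produce false hits.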

After reading a code from the description part (10, 30, 50), the extraction unit (72) reads the code name. The noun form located immediately before or after the code read from the description part is regarded as the code name. For example, when the code read is '(100)', the code name is searched for before or after '(100)'. If the code '(100)' belongs to the word '컴퓨터(100)는' ('the computer (100)'), the code name is '컴퓨터' ('computer'), the noun form located immediately before '(100)'. As another example, when the code read is '200' and it belongs to the word '모니터200를' ('the monitor 200'), the code name is '모니터' ('monitor').
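The before-or-after lookup can be approximated without a morphological analyzer by taking the contiguous run of Hangul or Latin letters next to the code. This is a simplification and an assumption: a real system would need part-of-speech analysis to confirm that the run is a noun form, and the function name `code_name` is hypothetical.

```python
import re

LETTERS = "[가-힣A-Za-z]+"  # a run of Hangul syllables or Latin letters


def code_name(text, code):
    """Return the word immediately before the code, else the word after it."""
    c = re.escape(code)
    before = re.search(rf"({LETTERS})\s*\(?\s*{c}\s*\)?(?!\d)", text)
    if before:
        return before.group(1)
    after = re.search(rf"(?<!\d)\(?\s*{c}\s*\)?\s*({LETTERS})", text)
    return after.group(1) if after else None
```

With the two examples from the text, '컴퓨터(100)는' gives '컴퓨터' and '모니터200를' gives '모니터'; the particle that follows the code is never part of the captured run because the preceding word is tried first.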

More than one text block position can be set for each drawing when extracting drawing information. For example, after the text block of FIG. 1 beginning with 'FIG. 1 ...' and the text block of FIG. 2 beginning with 'FIG. 2 ...', another text block of FIG. 1 beginning with 'FIG. 1 ...' may occur. Also, if not a single code is found in the first text block of FIG. 1, the codes may have to be searched for in the second text block. For this reason, one or more text block positions can be set per drawing.
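Keeping several text blocks per drawing, and falling back to a later block when the first contains no code, might be sketched as follows. The sketch assumes English-style 'FIG. N' markers and character offsets as the location information, and it treats every marker as the start of a new block; the function names are hypothetical.

```python
import re
from collections import defaultdict


def text_blocks(config_text):
    """Map each figure number to the list of its text blocks, in document order."""
    marks = list(re.finditer(r"FIG\. (\d+)", config_text))
    blocks = defaultdict(list)
    for i, mark in enumerate(marks):
        # A block runs from its marker to the next marker (or the end of the text).
        end = marks[i + 1].start() if i + 1 < len(marks) else len(config_text)
        blocks[int(mark.group(1))].append(config_text[mark.start():end])
    return dict(blocks)


def first_block_with_code(blocks, figure, code):
    """Return the first text block of `figure` that mentions `code`, if any."""
    for block in blocks.get(figure, []):
        if re.search(rf"(?<!\d){re.escape(code)}(?!\d)", block):
            return block
    return None
```

Restricting the search for a figure's codes to that figure's own blocks is the point of the location information: a code that appears elsewhere in the section is not attributed to the wrong drawing.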

After the drawing information has been extracted, the drawing information obtained from the drawing description units 151, 357, and 560 and the configuration description units 156, 355, and 570, such as the number of drawings, drawing order, codes, and code names, is stored in the database 73.

Next, a document information extraction method according to an embodiment of the present invention will be described with reference to FIG. 3.

FIG. 3 is a flowchart illustrating a document information extraction method according to an embodiment of the present invention.

As shown in FIG. 3, one document to be worked on is first selected from among the plurality of documents stored in the database 73 (ST 3010). Next, the recognition unit 71 recognizes codes from the drawing units 20, 40, and 60 of the document (ST 3020). The extraction unit 72 then extracts, from the description units 10, 30, and 50, the text corresponding to the codes recognized in the drawing units 20, 40, and 60. The extraction process is as follows.

The extraction unit 72 extracts the number and order of the drawings from the drawing description units 151, 357, and 560 of the description units 10, 30, and 50, that is, from the Brief Description of the Drawings (151, 357) or the BRIEF DESCRIPTION OF THE DRAWINGS (560) (ST 3030). In other words, the total number of drawings and their order are obtained from entries such as '도 1은~' ('FIG. 1 is ...') and '도 2는~' ('FIG. 2 is ...'). The extraction unit 72 stores the extracted number and order of drawings in the database 73 as drawing information (ST 3040).

Next, the extraction unit 72 sets as the initial search position the point at which the configuration description units 156, 355, and 570 of the description units 10, 30, and 50 begin, namely the Configuration and Operation of the Invention (156), the Embodiments of the Invention (355), or the DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS (570) (ST 3050). This reduces search time and server load. Starting from the initial search position, the extraction unit 72 then determines the position of the text block for each drawing within the configuration description units 156, 355, and 570 (ST 3060). For example, it searches for each position at which '도 1~', '도 2~', and '도 3~' appear and stores the positions found. The content between '도 1~' and '도 2~' describes FIG. 1 and therefore forms the text block of FIG. 1. Drawing information is extracted within the configuration description units 156, 355, and 570. One or more text block positions may be set per drawing.

Using the position information obtained, the codes and code names are then read from the text block of each drawing (ST 3070). Only forms corresponding to the codes recognized in step (b) (ST 3020) need to be searched for. Accordingly, the extraction unit 72 treats content written in any of the three code formats (shown as images in the original publication) as a code and reads it, and then reads the noun form immediately before or after that code as the code name. When reading is complete, the codes and code names are added to the drawing information stored in the database 73 (ST 3080). This completes the drawing information.

Next, a more specific embodiment of the present invention will be described.

The document information extraction method is broadly carried out in five steps.

1. Step 1

First, after one of the patent documents stored in the database 73 is selected (ST 3010), the recognition unit 71 recognizes codes in the drawing units 20, 40, and 60 (ST 3020). Next, the extraction unit 72 obtains, from the drawing description units 151, 357, and 560, the order of the drawings in each document and the number of drawing data fields per document (ST 3030). For example, assume that patent documents A and B are as follows.

Patent Document A

"[Brief Description of the Drawings]

FIG. 1 shows a frame of a video display screen in which the present invention is used;

FIG. 2 is a block diagram of a memory circuit constructed in accordance with the principles of the present invention;

FIG. 3 is a block diagram of a first alternative embodiment of the address generator portion of a memory circuit constructed in accordance with the principles of the present invention;

FIG. 4 is a block diagram of a second alternative embodiment of the address generator portion of a memory circuit constructed in accordance with the principles of the present invention;

FIG. 5 is a block diagram of an address sequencer used by the address generator portion of a memory circuit constructed in accordance with the principles of the present invention."

Patent document A has five drawings, arranged in the order FIG. 1, FIG. 2, FIG. 3, FIG. 4, and FIG. 5.

Patent Document B

"[Brief Description of the Drawings]

(a) Embodiment 1 of the block diagram

(b) Embodiment 2 of the block diagram

FIG. 4 is a block diagram of an address sequencer used by the address generator portion of a memory circuit constructed in accordance with the principles of the present invention."

Patent document B has five drawings, arranged in the order FIG. 1, FIG. 2 (a), FIG. 2 (b), FIG. 3, FIG. 4, and FIG. 5.
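As a rough sketch of how the number and order of drawings could be read from a Brief Description section, the snippet below collects the drawing numbers from entries that open with '도 N' or 'FIG. N'. The pattern `DRAWING_LINE` and the function `drawing_order` are hypothetical names; sub-drawings such as (a) and (b) in patent document B are not handled here.

```python
import re
from typing import List

# Assumed entry format: a line opening with '도 N' (Korean) or 'FIG. N' / 'Fig N'.
DRAWING_LINE = re.compile(r"^\s*(?:도|FIG\.?|Fig\.?)\s*(\d+)", re.MULTILINE)


def drawing_order(brief_description: str) -> List[int]:
    """Return drawing numbers in order of first appearance."""
    order: List[int] = []
    for match in DRAWING_LINE.finditer(brief_description):
        number = int(match.group(1))
        if number not in order:
            order.append(number)
    return order


brief = (
    "도 1은 프레임을 도시한 도면,\n"
    "도 2는 메모리 회로의 블록도,\n"
    "도 3은 어드레스 발생기의 블록도."
)
order = drawing_order(brief)
print(len(order), order)  # 3 [1, 2, 3]
```

The drawing count is simply the length of the returned list, so the two pieces of information stored in ST 3040 come from one pass over the section.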

The extraction unit 72 automatically stores the drawing information generated for each document in the database 73, as shown in Table 1 (ST 3040).

| Document number | Drawing | Code | Code name | Code | Code name | Code | Code name |
|---|---|---|---|---|---|---|---|
| Document A | FIG. 1 | 100 | computer | 200 | monitor | 300 | keyboard |
| | FIG. 2 | ... | ... | ... | ... | ... | ... |
| | FIG. 3 | ... | ... | ... | ... | ... | ... |
| | FIG. 4 | ... | ... | ... | ... | ... | ... |
| | FIG. 5 | ... | ... | ... | ... | ... | ... |
| Document B | FIG. 1 | ... | ... | ... | ... | ... | ... |
| | FIG. 2 (a) | ... | ... | ... | ... | ... | ... |
| | FIG. 2 (b) | ... | ... | ... | ... | ... | ... |
| | FIG. 3 | ... | ... | ... | ... | ... | ... |
| | FIG. 4 | ... | ... | ... | ... | ... | ... |
| | FIG. 5 | ... | ... | ... | ... | ... | ... |
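One possible in-memory shape for a row group of Table 1 is sketched below. The class `DrawingInfo` and its field names are assumptions made for illustration; the patent does not specify a schema. The sample values are the FIG. 1 entries of Document A from the table.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class DrawingInfo:
    """Drawing information for one document (hypothetical layout)."""
    document: str
    order: List[str]  # drawing labels in order, e.g. ["FIG. 1", "FIG. 2"]
    codes: Dict[str, List[Tuple[str, str]]] = field(default_factory=dict)

    @property
    def count(self) -> int:
        """Number of drawing entries recorded for the document."""
        return len(self.order)

    def add_code(self, drawing: str, code: str, name: str) -> None:
        """Attach a (code, code name) pair to a drawing already listed in `order`."""
        if drawing not in self.order:
            raise KeyError(f"unknown drawing: {drawing}")
        self.codes.setdefault(drawing, []).append((code, name))


info = DrawingInfo("Document A", ["FIG. 1", "FIG. 2", "FIG. 3", "FIG. 4", "FIG. 5"])
info.add_code("FIG. 1", "100", "computer")
info.add_code("FIG. 1", "200", "monitor")
info.add_code("FIG. 1", "300", "keyboard")
print(info.count, info.codes["FIG. 1"])
```

Storing the order first and adding codes later follows the sequence in the description, where the count and order are saved at ST 3040 and the codes and code names are appended at ST 3080.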

2. Step 2

The extraction unit 72 sets the position at which the configuration description units 156, 355, and 570 begin as the initial search position (ST 3050). The initial search position is set in order to reduce the time spent searching for text in the description units 10, 30, and 50 of the patent document and to reduce server load.
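Setting the initial search position can be sketched as a lookup of the section heading. The heading strings below are the ones quoted in the description (the Japanese heading appears there only in Korean translation, so its use here is an assumption), and falling back to position 0 when no heading is found is also an assumption.

```python
SECTION_HEADINGS = (
    "발명의 구성 및 작용",                            # Korean documents (156)
    "발명의 실시 형태",                               # Japanese documents (355), as rendered in Korean
    "DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS",  # US documents (570)
)


def initial_search_position(description: str) -> int:
    """Return the offset of the earliest configuration-section heading, or 0."""
    found = [pos for pos in (description.find(h) for h in SECTION_HEADINGS) if pos != -1]
    return min(found) if found else 0


doc = (
    "BRIEF DESCRIPTION OF THE DRAWINGS\n"
    "FIG. 1 shows a frame.\n"
    "DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS\n"
    "FIG. 1 shows a computer (100)."
)
print(initial_search_position(doc))
```

Starting later searches at this offset skips the Brief Description entries, which also contain 'FIG. N' markers but no code descriptions.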

3. Step 3

Starting just after the initial search position, the extraction unit 72 searches for the first occurrence of each of '도 1', '도 2', and '도 3' (or 'Fig 1', 'Fig 2', and 'Fig 3') and stores the position of each string found (ST 3060). That is, in order to store the information for each drawing, the text blocks containing each drawing's information are delimited. The aim is to separate the text blocks from one another, for example the block lying between the position where '도 1' is found and the position where '도 2' is found; the order of the drawings is taken from the drawing order information obtained in Step 1, and the block positions are determined accordingly.
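The block delimitation of Step 3 can be illustrated with the sketch below, which records the first occurrence of each drawing marker and ends each block at the next marker. The names `MARKER` and `block_spans` are hypothetical, the sketch keeps only one block per drawing, and it orders blocks by position rather than consulting the stored drawing order, so it is a simplification of the step described.

```python
import re
from typing import Dict, Tuple

# Assumed marker format: '도 N' or 'Fig N' / 'Fig. N'.
MARKER = re.compile(r"(?:도|Fig\.?)\s*(\d+)")


def block_spans(text: str, start: int = 0) -> Dict[int, Tuple[int, int]]:
    """Map each drawing number to the (start, end) span of its first text block."""
    first: Dict[int, int] = {}
    for match in MARKER.finditer(text, start):
        # Keep only the first position at which each drawing number appears.
        first.setdefault(int(match.group(1)), match.start())
    # Order the markers by position; each block runs up to the next marker.
    ordered = sorted(first.items(), key=lambda item: item[1])
    spans: Dict[int, Tuple[int, int]] = {}
    for i, (number, pos) in enumerate(ordered):
        end = ordered[i + 1][1] if i + 1 < len(ordered) else len(text)
        spans[number] = (pos, end)
    return spans


text = "도 1은 컴퓨터(100)를 도시한다. 도 2는 모니터(200)를 도시한다."
for number, (begin, end) in block_spans(text).items():
    print(number, text[begin:end])
```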

4. Step 4

The extraction unit 72 separates the text for '도 1', '도 2', and '도 3' into individual blocks, treating the content between the position where '도 1' is found and the position where '도 2' is found as the block containing the information for FIG. 1. If that block contains strings in any of the three code formats (shown as images in the original publication) that correspond to codes recognized by the recognition unit 71, each such string is recognized separately, regarded as a code, and read. If, according to the information obtained in step ST 3030, the drawing being processed is the last drawing, its text block is limited to the configuration description units 156, 355, and 570. That is, for the last drawing, 'FIG. 5', either a default text block value is specified, or only the content within the Configuration and Operation of the Invention (156), the Embodiments of the Invention (355), or the DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS (570) is designated as the block from which drawing information is obtained.
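The bound on the last drawing's text block might be computed as in the sketch below. The function name `last_block_span` and its parameters are hypothetical. The description offers a default block value or the section boundary as alternatives; clipping a default length to the section boundary, as done here, is an assumption that combines the two.

```python
from typing import Optional, Tuple


def last_block_span(block_start: int, section_end: int,
                    default_length: Optional[int] = None) -> Tuple[int, int]:
    """Return the (start, end) span for the last drawing's text block."""
    end = section_end
    if default_length is not None:
        # A default block size never extends past the section boundary.
        end = min(section_end, block_start + default_length)
    return block_start, end


print(last_block_span(500, 2000))       # bounded by the section end
print(last_block_span(500, 2000, 300))  # bounded by the default length
```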

5. Step 5

The extraction unit 72 reads the noun form immediately before or after each code extracted in Step 4 as the code name to be matched with that code (ST 3070). Here, the block position for each drawing can be set at one or more places in the patent document, because the text block assumed to contain the information for FIG. 1 may not contain a single string in a code format.

Finally, the extraction unit 72 adds the codes and code names it has read to the drawing information stored in the database 73 (ST 3080). This completes the drawing information.

In this way, through morphological analysis of the text in the drawing information, the present invention can obtain and process information on the positions and sizes of the drawing components relative to one another. That is, by analyzing the patent document directly, data can be obtained such as whether component 110 is located on top of component 100, whether component 300 of the same size crosses over the top of component 200, or whether, among components 100, 200, 300, and 400, component 400 is positioned so as to connect components 100 and 300.

Although the invention made by the present inventor has been described in detail with reference to the above embodiment, the present invention is not limited to that embodiment and can of course be modified in various ways without departing from its gist.

As described above, the document information extraction system, the extraction method, and the recording medium on which the method is recorded according to the present invention make it possible to build a database from the information obtained through document recognition and provide it to users.

In addition, the document information extraction system, the extraction method, and the recording medium according to the present invention make it possible to provide a faster service to users through the database thus built.

Furthermore, the document information extraction system, the extraction method, and the recording medium according to the present invention make it possible to reduce the cost of purchasing OCR products, owing to the construction of the database.

Claims

A document information extraction system comprising:

a recognition unit for recognizing a code from the drawing unit of a document that includes a description unit, which is a text area, and a drawing unit, which is a picture area;

an extraction unit for extracting information associated with the code from the description unit; and

a database in which a plurality of the documents are stored and in which the information extracted by the extraction unit is stored.

The system of claim 1,

wherein the information includes at least one of the number and order of the drawings constituting the drawing unit, the codes, and the code names.

The system of claim 1,

wherein the extraction unit extracts the information using position information of the text block for each drawing, the text block being the detailed description of that drawing.

The system according to claim 1 or 3,

wherein the extraction unit reads content written in the description unit in any of three code formats (shown as images in the original publication) as corresponding to the code recognized by the recognition unit.

The system of claim 4,

wherein the extraction unit reads, in the description unit, the noun form immediately before or after the code as the code name.

The system of claim 3,

wherein one or more positions of the text block can be set for each drawing.

A document information extraction method comprising:

(a) selecting one document from a database storing a plurality of documents, each including a description unit, which is a text area, and a drawing unit, which is a picture area;

(b) recognizing a code from the drawing unit;

(c) extracting the number and order of the drawings from the drawing description unit contained in the description unit;

(d) storing the number and order of the drawings in the database as drawing information;

(e) setting the position at which the configuration description unit contained in the description unit begins as the initial search position;

(f) determining, within the configuration description unit, position information of the text block for each drawing, the text block being the detailed description of that drawing;

(g) reading, in the text block for each drawing, the code corresponding to the code recognized in step (b) and its code name; and

(h) adding the code and the code name to the drawing information of step (d).

The method of claim 7,

wherein the drawing description unit is the part in which the 'Brief Description of the Drawings' is given when the document is a Korean patent document.

The method of claim 7,

wherein the drawing description unit is the part in which the 'Brief Description of the Drawings' is given when the document is a Japanese patent document.

The method of claim 7,

wherein the drawing description unit is the part in which the 'BRIEF DESCRIPTION OF THE DRAWINGS' is given when the document is a US patent document.

The method of claim 7,

wherein the configuration description unit is the part in which the 'Configuration and Operation of the Invention' is given when the document is a Korean patent document.

The method of claim 7,

wherein the configuration description unit is the part in which the 'Embodiments of the Invention' are given when the document is a Japanese patent document.

The method of claim 7,

wherein the configuration description unit is the part in which the 'DETAILED DESCRIPTION OF EXEMPLARY EMBODIMENTS' is given when the document is a US patent document.

The method of claim 7,

wherein step (g) includes regarding content written in any of three code formats (shown as images in the original publication) as a code and reading it.

The method of claim 14,

wherein step (g) further includes reading the noun form immediately before or after the code as the code name.

The method of claim 7,

wherein, in step (f), one or more positions of the text block can be set for each drawing.

(b) recognizing a sign from the drawing part,

(h) a computer-readable recording medium having recorded thereon a program for executing the step of adding the code and the code name to the drawing information of the step (d).