KR20180033786A

KR20180033786A - System of searching and providing original document image file and method thereof

Info

Publication number: KR20180033786A
Application number: KR1020160123177A
Authority: KR
Inventors: 오종배; 이병용
Original assignee: 오종배; 이병용
Priority date: 2016-09-26
Filing date: 2016-09-26
Publication date: 2018-04-04

Abstract

A system and method for searching and providing an original document image file are disclosed. The system comprises: an original document image file database in which original document image files are stored in advance; an OCR recognition / metadata extraction module that scans an original document image file stored in the original document image file database, performs OCR (optical character reader) recognition, and extracts metadata indicating the format of the original document image file; an XML conversion module that converts the original document data of the OCR-recognized original document image file into an XML document using the extracted metadata; an XML document file database in which the XML document converted by the XML conversion module is stored; an XML document retrieval module that searches the XML documents stored in the XML document file database for the original document data requested by a user terminal; and an original document image file providing module that retrieves, from the original document image file database, the original document image file corresponding to the original document data found by the XML document retrieval module and provides it to the user terminal.

Description

The present invention relates to a system and method for searching and providing a file, and more particularly, to a system and method for searching and providing an original document image file.

Government offices, securities companies, and banks keep all of their documents, such as contract documents, personal information documents, and application documents. These documents are scanned by a scanner and stored as image files. In addition to written documents, fax documents and e-mails that contain personal information are also kept as scanned image files.

Because the volume of documents is very large and grows constantly, it is not practical to keep them as document files, so they are stored as scanned image files.

However, when a personal information document or another document must be searched later, the characters in the image file cannot be read, so the achievable level of search is low and usability is very poor.

When a scanned image file is stored, only basic information such as the type of the scanned image file, the title, the date, and related personal information is stored with it. Therefore, the scanned image file can be searched only within the range covered by this basic information.

Since scanned image files contain a wide variety of contents, there is a limit to performing a specific search over a wide range, for example, collecting real estate contract documents or finding the real estate contract documents of a specific individual. For example, one can search specifically for Person A's document dated September 23, 2016, but a broader and more varied search based on the body text is not possible.

Of course, scanned image files may be processed with OCR (optical character reader) recognition and stored as document files such as PDF. However, most of these documents contain personal information, and there is a very high risk that the personal information and security-related contents will be illegally leaked through the document files or exposed by hacking.

In addition, when each document is stored separately as a document file such as PDF, it takes considerable time to open and read each file to check its contents, so access is very inconvenient.

Accordingly, there is a demand for a solution that addresses both the inconvenience and limited search capability of the existing scanned image file method and the security risk and poor accessibility that arise when documents are stored as document files.

0160215 10-2005-0075301

An object of the present invention is to provide a system for searching and providing an original document image file.

It is another object of the present invention to provide a method for searching and providing an original document image file.

According to one aspect of the present invention, there is provided a system for searching and providing an original document image file, comprising: an original document image file database in which an original document image file is stored in advance; an OCR recognition / metadata extraction module for scanning an original document image file stored in the original document image file database, performing OCR (optical character reader) recognition, and extracting metadata indicating the format of the original document image file; an XML conversion module for converting the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module into an XML document using the extracted metadata; an XML document file database in which the XML document converted by the XML conversion module is stored; an XML document retrieval module for searching the XML document stored in the XML document file database for original document data requested by a user terminal; and an original document image file providing module for retrieving, from the original document image file database, the original document image file corresponding to the original document data found by the XML document retrieval module and providing the original document image file to the user terminal.

The system may further include a tag extraction module for automatically extracting a tag for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module.

In this case, the XML document search module may be configured to search the XML document using the tag extracted from the tag extraction module for the text data requested to be searched by the user terminal.

According to another aspect of the present invention, there is provided a method for searching and providing an original document image file, comprising: an OCR recognition / metadata extraction module scanning an original document image file stored in advance in an original document image file database to perform OCR recognition and extracting metadata representing a format of the original document image file; an XML conversion module converting the original document data of the original document image file on which OCR recognition has been performed into an XML document and storing the converted XML document in an XML document file database; an XML document retrieval module searching the XML document stored in the XML document file database for the original document data requested by a user terminal; and an original document image file providing module retrieving, from the original document image file database, the original document image file corresponding to the original document data found by the XML document retrieval module and providing the original document image file to the user terminal.

Here, the step in which the XML conversion module converts the original document data into an XML document and stores the converted XML document in the XML document file database may further include a step in which a tag extraction module automatically extracts a tag for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module.

The step in which the XML document retrieval module searches the XML document stored in the XML document file database for the original document data requested by the user terminal may include searching the XML document using the tag extracted by the tag extraction module.

According to the above-described system and method for searching and providing original document image files, a separate XML document file database is constructed by performing OCR recognition on the original document image files while the existing original document image file database is kept as built. The search is performed in the XML document file database, and the search result is used to select and read the corresponding original document image file from the original document image file database. As a result, a desired original document image file can be found among a large number of original document image files using a variety of search methods and then read.

In addition, the XML document file database only provides search results and is configured not to directly access the document body to obtain information, thereby preventing personal information or security information contained in the document body from being leaked.

FIG. 1 is a block diagram of a system for searching and providing an original document image file according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of searching for and providing an original document image file according to an embodiment of the present invention.

While the invention is susceptible to various modifications and alternative forms, specific embodiments thereof are shown by way of example in the drawings and are described in detail herein. It should be understood, however, that the invention is not intended to be limited to the particular embodiments, but includes all modifications, equivalents, and alternatives falling within the spirit and scope of the invention. Like reference numerals are used for like elements in describing each drawing.

The terms first, second, A, B, etc. may be used to describe various elements, but the elements should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

It is to be understood that when an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or other elements may be present in between. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no other elements in between.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. The singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, the terms "comprises" or "having" and the like are used to specify that there is a feature, a number, a step, an operation, an element, a component, or a combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined in the present application.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system for searching and providing an original document image file according to an embodiment of the present invention.

Referring to FIG. 1, an original document image file searching and providing system 100 according to an embodiment of the present invention may include an original document image file database 110, an optical character reader (OCR) recognition / metadata extraction module 120, an XML conversion module 130, an XML document file database 140, a tag extraction module 150, an XML document search module 160, and an original document image file provision module 170.

The system 100 for searching and providing an original document image file is configured so that the original document image file database 110 and the XML document file database 140 are constructed separately. With the two databases kept separate, the system 100 can be implemented as two operations: searching in the XML document file database 140 and providing files from the original document image file database 110.

In particular, there is no risk that the body text or the personal information it contains will be leaked, and because each document is implemented as an XML document, searching is very convenient and easy.

Hereinafter, the detailed configuration will be described.

The original document image file database 110 is a configuration in which an original document image file is stored in advance.

The original document image file database 110 holds image files obtained by scanning original documents that include personal information, at a securities company, a bank, a government office, a court, a hospital, and the like. Such databases have already been built for a variety of uses.

The OCR recognition / metadata extraction module 120 may be configured to perform optical character reader (OCR) recognition by scanning an original document image file stored in the original document image file database 110.

The OCR recognition / metadata extraction module 120 can be configured to recognize characters of an original document image file and convert them into readable character codes such as ASCII codes.

In addition, the OCR recognition / metadata extraction module 120 may be configured to extract metadata representing a format of an original document image file.

The OCR recognition / metadata extraction module 120 may be configured to extract metadata, such as data, schema, contents, and links, for representing the original document in XML format.

Meanwhile, the tag extraction module 150 may be configured to automatically extract a tag from the contents of the original document image file recognized by the OCR recognition / metadata extraction module 120.

The tags can be defined by predetermined algorithms and can be determined according to the characteristics of the documents. The tags can be automatically extracted from the contents of the document. In the case of a bank, for example, a document title such as an account opening agreement or a cash card application, the contract month, a name, and the like can be automatically designated as tags. In addition, the tags may consist of predetermined keywords such as contract, opening, account, card, loan, credit loan, security, and the like.
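The keyword-based tagging described above can be illustrated with a minimal sketch. The keyword list, the function name `extract_tags`, and the whole-word matching rule are assumptions made for illustration; the specification does not define the extraction algorithm.

```python
import re

# Hypothetical keyword list, modeled on the examples given in the text.
PREDEFINED_KEYWORDS = ("contract", "opening", "account", "card", "loan", "security")


def extract_tags(ocr_text: str, keywords=PREDEFINED_KEYWORDS) -> list[str]:
    """Return the predefined keywords that occur as whole words in the OCR text."""
    # Lowercase the text and split it into alphabetic words.
    words = set(re.findall(r"[a-z]+", ocr_text.lower()))
    # Keep keyword order so the result is deterministic.
    return [kw for kw in keywords if kw in words]


sample = "Account Opening Agreement. The customer applies for a cash card."
tags = extract_tags(sample)
```

In this sketch, multi-word keywords such as "credit loan" are left out for simplicity; handling them would require phrase matching rather than single-word lookup.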

These tags can be used to perform searches on vast amounts of document content. The tag can also be used for searching for big data analysis.

The XML conversion module 130 may be configured to convert the original document data of the original document image file recognized by the OCR recognition / metadata extraction module 120 into an XML document using the metadata. An XML document is implemented in a format that is configured to facilitate retrieval as well as accessing its contents. Multiple XML documents can be interlinked and configured to enable full content retrieval from a remote location.

The XML document converted by the XML conversion module 130 may be stored in the XML document file database 140. The XML document converted by the XML conversion module 130 may be stored in a one-to-one correspondence with the original document image file already stored in the original document image file database 110.
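As a rough illustration of the conversion and the one-to-one correspondence, the following sketch builds an XML document from recognized text and extracted metadata using Python's standard `xml.etree.ElementTree`. The element names (`document`, `metadata`, `tags`, `body`), the `image_id` attribute, and the sample values are hypothetical; the specification does not prescribe a schema.

```python
import xml.etree.ElementTree as ET


def to_xml_document(image_id: str, metadata: dict, body_text: str, tags: list) -> str:
    """Build an XML document that points back to exactly one original image file."""
    # The image_id attribute carries the one-to-one link to the image database.
    root = ET.Element("document", attrib={"image_id": image_id})
    meta_el = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        # Assumes metadata keys are valid XML element names.
        ET.SubElement(meta_el, key).text = str(value)
    tags_el = ET.SubElement(root, "tags")
    for tag in tags:
        ET.SubElement(tags_el, "tag").text = tag
    ET.SubElement(root, "body").text = body_text
    return ET.tostring(root, encoding="unicode")


xml_doc = to_xml_document(
    image_id="IMG-000001",
    metadata={"doc_type": "account_opening_agreement", "date": "2016-09-23"},
    body_text="Account Opening Agreement ...",
    tags=["account", "opening"],
)
```

Keeping the image identifier on the root element is one possible way to let a search hit be mapped back to its scanned original without exposing the body text.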

Meanwhile, the XML document search module 160 may be configured to receive a search request from the user terminal 10 at a remote location.

The XML document retrieval module 160 may be configured to search the XML document stored in the XML document file database 140 for the original document data requested by the user terminal 10.

The original document image file providing module 170 does not provide the XML document found by the XML document retrieval module 160 directly to the user terminal 10. Instead, it retrieves the original document image file corresponding to the found XML document from the original document image file database 110 and provides it to the user terminal 10.

The XML document search module 160 searches the body text itself but does not provide the retrieved contents to the user terminal 10 as they are. Access is restricted in this way because, if the body of the XML document could be viewed or read as character codes, there would be a risk of leaking personal information or important security information.
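The separation between searching the XML body and providing only the image can be sketched as follows. The in-memory dictionaries standing in for the two databases, the identifiers, and the function names are all hypothetical; this is one possible arrangement under those assumptions, not the implementation described in the specification.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for the XML document file database (140)
# and the original document image file database (110).
XML_DB = {
    "IMG-000001": '<document image_id="IMG-000001"><body>loan contract for person A</body></document>',
    "IMG-000002": '<document image_id="IMG-000002"><body>cash card application</body></document>',
}
IMAGE_DB = {
    "IMG-000001": b"<scanned image bytes 1>",
    "IMG-000002": b"<scanned image bytes 2>",
}


def search_image_ids(term: str) -> list[str]:
    """Search XML bodies but return only identifiers, never the recognized text."""
    hits = []
    for image_id, xml_text in XML_DB.items():
        body = ET.fromstring(xml_text).findtext("body", default="")
        if term.lower() in body.lower():
            hits.append(image_id)
    return hits


def provide_images(term: str) -> list[bytes]:
    """Return the original image files that correspond to the search hits."""
    return [IMAGE_DB[image_id] for image_id in search_image_ids(term)]
```

A call such as `provide_images("loan")` returns only scanned image data; the character-code text stays inside the search function.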

The XML document retrieval module 160 may be configured to search for the original document data requested by the user terminal 10 by using the tag extracted by the tag extraction module 150.

For example, the contents of an XML document can be retrieved by using search terms such as X, XX, contract, and loan. Multiple XML documents containing these search terms can be retrieved.

On the other hand, the retrieval method using metadata can be utilized for big data analysis. Even in big data analysis, the body text is not provided in the form of readable character codes, so the XML document search module 160 can provide search results while limiting exposure of the body text. For example, search results such as the number of mortgage loan contracts in 2016, or the number of cases by region and age, can be compiled, and a big data analysis module (not shown) can perform big data analysis using these search results. The XML document retrieval module 160 provides only counts, such as the number of contracts and the number of cases by region and age, without exporting personal information. Searchers and big data analysts can therefore carry out searches or big data analysis without personal information or security information being exposed.

That is, the content of the XML document itself is configured not to leak out of the XML document file database 140. Only information indicating the retrieval result (such as the number of hits) or the original document image file corresponding to the retrieved XML document is provided outside the XML document file database 140.
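The count-only output used for big data analysis can be illustrated with a short sketch. The record layout, field names, and sample values below are invented for illustration and do not come from the specification; the point is only that aggregated counts, not personal fields, leave the function.

```python
from collections import Counter

# Hypothetical metadata records, one per XML document; values are made up.
RECORDS = [
    {"doc_type": "mortgage_loan", "year": 2016, "region": "Seoul", "name": "A"},
    {"doc_type": "mortgage_loan", "year": 2016, "region": "Busan", "name": "B"},
    {"doc_type": "mortgage_loan", "year": 2016, "region": "Seoul", "name": "C"},
    {"doc_type": "cash_card", "year": 2016, "region": "Seoul", "name": "D"},
]


def count_by(records, doc_type: str, year: int, field: str) -> Counter:
    """Return hit counts grouped by one field; no individual record is exposed."""
    return Counter(
        r[field] for r in records if r["doc_type"] == doc_type and r["year"] == year
    )


counts = count_by(RECORDS, "mortgage_loan", 2016, "region")
total = sum(counts.values())
```

In a real deployment the grouping field would itself need to be restricted to non-identifying attributes, which this sketch does not enforce.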

FIG. 2 is a flowchart illustrating a method of searching for and providing an original document image file according to an embodiment of the present invention.

Referring to FIG. 2, the optical character reader (OCR) recognition / metadata extraction module 120 scans an original document image file stored in advance in the original document image file database 110 to perform OCR recognition, and extracts metadata representing the format of the original document image file (S101).

Next, the XML conversion module 130 converts the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module 120 into an XML document using the extracted metadata, and stores the converted XML document in the XML document file database 140 (S102).

Next, the tag extraction module 150 automatically extracts a tag for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module 120 (S103).

Next, the XML document search module 160 searches the XML document stored in the XML document file database 140 for the text data to be searched by the user terminal (S104).

At this time, the XML document retrieving module 160 may retrieve the original document data that is requested to be retrieved by the user terminal 10 by using the tag extracted from the tag extracting module 150.

Next, the original document image file providing module 170 searches the original document image file database 110 for the original document image file corresponding to the original text data retrieved from the XML document retrieval module 160, and provides the original document image file to the user terminal 10 (S105).
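Steps S101 to S105 can be tied together in a compact sketch. The OCR step is replaced by a stub (`fake_ocr`) that returns fixed text, because a real OCR engine is outside the scope of this illustration; the dictionaries, identifiers, and the whitespace-based tagging rule are likewise assumptions.

```python
import xml.etree.ElementTree as ET

IMAGE_DB = {"IMG-1": b"...image bytes..."}  # original document image file database (110)
XML_DB = {}                                 # XML document file database (140)
TAG_INDEX = {}                              # tag -> set of image identifiers


def fake_ocr(image_bytes: bytes):
    """Stub standing in for a real OCR engine; returns (text, metadata)."""
    return "loan contract", {"doc_type": "contract"}


def ingest(image_id: str) -> None:
    text, metadata = fake_ocr(IMAGE_DB[image_id])          # S101: OCR and metadata
    root = ET.Element("document", image_id=image_id)       # S102: convert to XML
    ET.SubElement(root, "doc_type").text = metadata["doc_type"]
    ET.SubElement(root, "body").text = text
    XML_DB[image_id] = ET.tostring(root, encoding="unicode")
    for tag in text.split():                               # S103: naive tag extraction
        TAG_INDEX.setdefault(tag, set()).add(image_id)


def search_and_provide(tag: str) -> list:
    ids = sorted(TAG_INDEX.get(tag, ()))                   # S104: search by tag
    return [IMAGE_DB[i] for i in ids]                      # S105: provide image only


ingest("IMG-1")
result = search_and_provide("loan")
```

The ingest path (S101 to S103) and the query path (S104 to S105) are kept as separate functions here to mirror the split between building the XML document file database and serving search requests.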

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined in the following claims.

110: Original document image file database
120: OCR recognition / metadata extraction module
130: XML conversion module
140: XML document file database
150: tag extraction module
160: XML document retrieval module
170: Original document image file provision module

Claims

1. A system for searching and providing an original document image file, comprising:
an original document image file database in which an original document image file is stored in advance;
an OCR recognition / metadata extraction module for scanning an original document image file stored in the original document image file database, performing OCR (optical character reader) recognition, and extracting metadata indicating the format of the original document image file;
an XML conversion module for converting the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module into an XML document using the extracted metadata;
an XML document file database in which the XML document converted by the XML conversion module is stored;
an XML document retrieval module for searching the XML document stored in the XML document file database for original document data requested by a user terminal; and
an original document image file providing module for retrieving, from the original document image file database, the original document image file corresponding to the original document data found by the XML document retrieval module and providing the original document image file to the user terminal.

2. The system according to claim 1,
further comprising a tag extraction module for automatically extracting a tag for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module,
wherein the XML document retrieval module
searches the XML document for the original document data requested by the user terminal using the tag extracted by the tag extraction module.

3. A method for searching and providing an original document image file, comprising:
an OCR (optical character reader) recognition / metadata extraction module scanning an original document image file stored in advance in an original document image file database to perform OCR recognition and extracting metadata representing a format of the original document image file;
an XML conversion module converting the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module into an XML document and storing the converted XML document in an XML document file database;
an XML document retrieval module searching the XML document stored in the XML document file database for the original document data requested by a user terminal; and
an original document image file providing module retrieving, from the original document image file database, the original document image file corresponding to the original document data found by the XML document retrieval module and providing the retrieved original document image file to the user terminal.

4. The method of claim 3,
wherein the step of converting the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module into the XML document and storing the converted XML document in the XML document file database
further comprises a step of a tag extraction module automatically extracting a tag for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module, and
wherein, in the step of the XML document retrieval module searching the XML document stored in the XML document file database for the original document data requested by the user terminal,
the XML document retrieval module searches the XML document for the original document data requested by the user terminal using the tag extracted by the tag extraction module.