KR20180033786A - System of searching and providing original document image file and method thereof - Google Patents

System of searching and providing original document image file and method thereof

Info

Publication number
KR20180033786A
Authority
KR
South Korea
Prior art keywords
original document
image file
module
xml
document image
Prior art date
Application number
KR1020160123177A
Other languages
Korean (ko)
Inventor
오종배
이병용
Original Assignee
오종배
이병용
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 오종배, 이병용
Priority to KR1020160123177A
Publication of KR20180033786A

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/17: Details of further file system functions
    • G06F16/178: Techniques for file synchronisation in file systems
    • G06F16/1794: Details of file format conversion
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F16/58: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/583: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80: Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84: Mapping; Conversion
    • G06F16/86: Mapping to a database

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A system and method for searching and providing an original document image file are disclosed. The system comprises: an original document image file database in which original document image files are stored in advance; an OCR recognition / metadata extraction module that scans an original document image file stored in the original document image file database, performs OCR (optical character reader) recognition on it, and extracts metadata representing the format of the original document image file; an XML conversion module that converts the original document data of the OCR-recognized original document image file into an XML document using the extracted metadata; an XML document file database in which the XML document converted by the XML conversion module is stored; an XML document search module that searches the XML documents stored in the XML document file database for the original document data requested by a user terminal; and an original document image file provision module that retrieves, from the original document image file database, the original document image file corresponding to the original document data found by the XML document search module and provides it to the user terminal.

Description

The present invention relates to a system and method for searching for and providing files, and more particularly to a system and method for searching for and providing an original document image file.

Government offices, securities companies, and banks retain all manner of documents, such as contracts, personal information documents, and applications. These documents are scanned and stored as image files. Besides handwritten documents, faxes and e-mails that contain personal information are also kept as scanned image files.

Because the volume of documents is large and grows constantly, keeping them as document files is impractical, so they are stored as scanned image files, which is easier.

However, when a personal information document or some other document must be found later, the characters in an image file cannot be read, so search capability is limited and the files are of little practical use.

When a scanned image file is stored, only basic information such as the document type, title, date, and related personal information is stored with it. The file can therefore be searched only within the range that this basic information allows.

Because scanned image files hold varied content, broad but specific searches are limited, for example collecting all real estate contract documents or finding the real estate contract of a specific individual. It is possible to look up Person A's document dated September 23, 2016, for instance, but a more varied and broader search based on the body text is not possible.

Scanned image files can of course be processed by OCR (optical character reader) and stored as document files such as PDF. However, most of these documents contain personal information, and there is a very high risk that the personal and security-related content will be leaked illegally through the document files or exposed by hacking.

In addition, when each document is stored separately as a file such as a PDF, each file must be opened and read to check its contents, which takes considerable time and makes access very inconvenient.

Accordingly, there is a demand for a way to resolve both the inconvenience and limited search capability of the existing scanned image file approach and the security risk and poor accessibility that arise when documents are stored as document files.

0160215 10-2005-0075301

An object of the present invention is to provide a system for searching and providing an original document image file.

It is another object of the present invention to provide a method for searching and providing an original document image file.

According to one aspect of the present invention, there is provided a system for searching and providing an original document image file, comprising: an original document image file database in which original document image files are stored in advance; an OCR recognition / metadata extraction module that scans an original document image file stored in the original document image file database, performs OCR (optical character reader) recognition on it, and extracts metadata representing the format of the original document image file; an XML conversion module that converts the original document data of the OCR-recognized original document image file into an XML document using the extracted metadata; an XML document file database in which the XML document converted by the XML conversion module is stored; an XML document search module that searches the XML documents stored in the XML document file database for the original document data requested by a user terminal; and an original document image file provision module that retrieves, from the original document image file database, the original document image file corresponding to the original document data found by the XML document search module and provides it to the user terminal.

The system may further include a tag extraction module that automatically extracts tags for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module.

In this case, the XML document search module may be configured to search the XML documents for the original document data requested by the user terminal using the tags extracted by the tag extraction module.

According to another aspect of the present invention, there is provided a method for searching and providing an original document image file, comprising: an OCR recognition / metadata extraction module scanning an original document image file stored in advance in an original document image file database, performing OCR recognition on it, and extracting metadata representing the format of the original document image file; an XML conversion module converting the original document data of the OCR-recognized original document image file into an XML document and storing the converted XML document in an XML document file database; an XML document search module searching the XML documents stored in the XML document file database for the original document data requested by a user terminal; and an original document image file provision module retrieving, from the original document image file database, the original document image file corresponding to the original document data found by the XML document search module and providing it to the user terminal.

Here, the step in which the XML conversion module converts the original document data of the OCR-recognized original document image file into an XML document and stores the converted XML document in the XML document file database may further include a tag extraction module automatically extracting tags for searching the contents of the OCR-recognized original document image file.

The step in which the XML document search module searches the XML documents stored in the XML document file database for the original document data requested by the user terminal may be configured so that the XML document search module searches the XML documents using the tags extracted by the tag extraction module.

According to the system and method described above, the existing original document image file database is kept as built, a separate document file database is constructed by performing OCR recognition on the original document image files, and the search results from that database are used to select and read out original document image files from the original document image file database. A desired original document image file can thus be found among a large body of original document image files using a variety of search methods and then read.

In addition, the XML document file database only provides search results and is configured not to directly access the document body to obtain information, thereby preventing personal information or security information contained in the document body from being leaked.

FIG. 1 is a block diagram of a system for searching and providing an original document image file according to an embodiment of the present invention.
FIG. 2 is a flowchart illustrating a method of searching for and providing an original document image file according to an embodiment of the present invention.

The present invention may be modified in various ways and may take various forms; specific embodiments are illustrated in the drawings and described in detail below. It should be understood, however, that the invention is not limited to the particular embodiments and includes all modifications, equivalents, and alternatives falling within its spirit and scope. Like reference numerals are used for like elements in describing each drawing.

The terms first, second, A, B, etc. may be used to describe various elements, but the elements should not be limited by the terms. The terms are used only for the purpose of distinguishing one component from another. For example, without departing from the scope of the present invention, the first component may be referred to as a second component, and similarly, the second component may also be referred to as a first component. The term "and/or" includes any combination of a plurality of related listed items or any one of a plurality of related listed items.

When an element is referred to as being "connected" or "coupled" to another element, it may be directly connected or coupled to the other element, or intervening elements may be present. On the other hand, when an element is referred to as being "directly connected" or "directly coupled" to another element, it should be understood that there are no other elements in between.

The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the invention. Singular expressions include plural expressions unless the context clearly dictates otherwise. In the present application, terms such as "comprises" or "having" specify the presence of a feature, number, step, operation, element, component, or combination thereof described in the specification, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, or combinations thereof.

Unless defined otherwise, all terms used herein, including technical or scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. Terms such as those defined in commonly used dictionaries are to be interpreted as having a meaning consistent with their meaning in the context of the related art, and are not to be interpreted in an ideal or overly formal sense unless explicitly so defined in the present application.

Hereinafter, preferred embodiments according to the present invention will be described in detail with reference to the accompanying drawings.

FIG. 1 is a block diagram of a system for searching and providing an original document image file according to an embodiment of the present invention.

Referring to FIG. 1, an original document image file searching and providing system 100 according to an embodiment of the present invention includes an original document image file database 110, an OCR (optical character reader) recognition / metadata extraction module 120, an XML conversion module 130, an XML document file database 140, a tag extraction module 150, an XML document search module 160, and an original document image file provision module 170.

The system 100 for searching and providing an original document image file builds the original document image file database 110 and the XML document file database 140 separately, so that it can be implemented as a dual arrangement in which the two databases are operated independently.

In particular, there is no concern that the body text or the personal information it contains will be leaked, and because each document is implemented as an XML document, searching is very convenient and easy.

Hereinafter, the detailed configuration will be described.

The original document image file database 110 is a configuration in which an original document image file is stored in advance.

The original document image file database 110 holds original documents containing personal information, from securities companies, banks, government offices, courts, hospitals, and the like, that have been scanned and stored as image files. It is a database of the kind already built for a variety of uses.

The OCR recognition / metadata extraction module 120 may be configured to scan an original document image file stored in the original document image file database 110 and perform OCR (optical character reader) recognition on it.

The OCR recognition / metadata extraction module 120 can be configured to recognize characters of an original document image file and convert them into readable character codes such as ASCII codes.

In addition, the OCR recognition / metadata extraction module 120 may be configured to extract metadata representing a format of an original document image file.

The OCR recognition / metadata extraction module 120 may be configured to extract the metadata, such as data, schema, contents, and links, needed to represent the original document in XML format.

Meanwhile, the tag extraction module 150 may be configured to automatically extract a tag from the contents of the original document image file recognized by the OCR recognition / metadata extraction module 120.

The tags can be defined by predetermined algorithms and can be determined according to the characteristics of the documents. A tag can be extracted automatically from the contents of a document. In the case of a bank, for example, a document title such as an account opening agreement or a cash card application, the contract month, a name, and the like can be designated automatically as tags. A tag may also be one of a set of predetermined keywords such as contract, opening, account, card, loan, credit loan, security, and the like.
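The keyword-based tagging described above can be sketched as follows. This is a minimal illustration and not the patent's algorithm: the keyword list is taken from the examples in the text, while the function name `extract_tags`, the single-word matching rule, and the sample text are assumptions made for the example.

```python
import re

# Hypothetical keyword list, based on the single-word examples given in the text.
TAG_KEYWORDS = ("contract", "opening", "account", "card", "loan", "security")


def extract_tags(ocr_text: str, keywords=TAG_KEYWORDS) -> list[str]:
    """Return the predetermined keywords that occur as whole words in the OCR text."""
    # Lower-case the text and split it into alphabetic words for matching.
    words = set(re.findall(r"[a-z]+", ocr_text.lower()))
    # Keep the order of the keyword list so the result is deterministic.
    return [kw for kw in keywords if kw in words]


# Sample OCR output (invented for illustration).
sample = "Account Opening Agreement. Contract month: 2016-09. Name: Person A"
tags = extract_tags(sample)
```

A production tagger would also need to handle multi-word keywords such as "credit loan" and OCR misreads, which this sketch ignores.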

These tags can be used to perform searches on vast amounts of document content. The tag can also be used for searching for big data analysis.

The XML conversion module 130 may be configured to convert the original document data of the original document image file recognized by the OCR recognition / metadata extraction module 120 into an XML document using the metadata. An XML document is implemented in a format that facilitates both access to its contents and retrieval. Multiple XML documents can be interlinked and configured to enable full content retrieval from a remote location.

The XML document converted by the XML conversion module 130 may be stored in the XML document file database 140. Each XML document may be stored in one-to-one correspondence with an original document image file already stored in the original document image file database 110.
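As a rough sketch of the conversion and the one-to-one storage described above, the following uses Python's standard `xml.etree.ElementTree`. The element names (`document`, `metadata`, `body`), the `image_id` attribute, and the in-memory dictionary standing in for the XML document file database 140 are all assumptions; the patent does not specify an XML schema.

```python
import xml.etree.ElementTree as ET


def to_xml(image_id: str, ocr_text: str, metadata: dict[str, str]) -> str:
    """Build an XML document from OCR text and metadata, keyed to its source image."""
    # The image_id attribute carries the one-to-one link back to the image file.
    root = ET.Element("document", {"image_id": image_id})
    meta = ET.SubElement(root, "metadata")
    for key, value in metadata.items():
        # Assumes each metadata key is a valid XML element name.
        ET.SubElement(meta, key).text = value
    ET.SubElement(root, "body").text = ocr_text
    return ET.tostring(root, encoding="unicode")


# Hypothetical stand-in for the XML document file database 140:
# one XML document per image identifier.
xml_db: dict[str, str] = {}
xml_db["img-0001"] = to_xml(
    "img-0001",
    "Account Opening Agreement",
    {"doc_type": "agreement", "date": "2016-09-23"},
)
```

Keying both databases by the same identifier is one simple way to keep the correspondence; any stable mapping between an XML document and its image file would serve the same purpose.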

Meanwhile, the XML document search module 160 may be configured to receive a search request from the user terminal 10 at a remote location.

The XML document search module 160 may be configured to search the XML documents stored in the XML document file database 140 for the original document data requested by the user terminal 10.

The original document image file provision module 170 does not provide the XML document found by the XML document search module 160 directly to the user terminal 10. Instead, it retrieves the original document image file corresponding to that XML document from the original document image file database 110 and provides it to the user terminal 10.

The XML document search module 160 searches the body text itself but does not provide the retrieved content to the user terminal 10 as is. Access is restricted in this way because, if the body of an XML document could be viewed or read as character codes, personal information or important security information could be leaked.

The XML document search module 160 may be configured to search for the original document data requested by the user terminal 10 using the metadata (tags) extracted by the tag extraction module 150.

For example, the contents of an XML document can be retrieved by using search terms such as X, XX, contract, and loan. Multiple XML documents containing these search terms can be retrieved.
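The separation between searching the XML text and handing back only image files can be sketched as below. The function names, the `body` element, the substring matching rule, and the sample data are assumptions for illustration; the two dictionaries stand in for the XML document file database 140 and the original document image file database 110.

```python
import xml.etree.ElementTree as ET


def search_image_ids(xml_db: dict[str, str], terms: list[str]) -> list[str]:
    """Return identifiers of documents whose body contains every search term."""
    hits = []
    for image_id, xml_text in xml_db.items():
        body = ET.fromstring(xml_text).findtext("body", default="").lower()
        if all(term.lower() in body for term in terms):
            hits.append(image_id)  # only the identifier leaves this function
    return hits


def provide_images(image_db: dict[str, bytes], image_ids: list[str]) -> list[bytes]:
    """Look up the original image files for the matching identifiers."""
    return [image_db[image_id] for image_id in image_ids]


# Invented sample data.
xml_db = {
    "img-1": "<document><body>Loan contract for Person A</body></document>",
    "img-2": "<document><body>Cash card application</body></document>",
}
image_db = {"img-1": b"scan-1", "img-2": b"scan-2"}

ids = search_image_ids(xml_db, ["contract", "loan"])
images = provide_images(image_db, ids)
```

The search function returns identifiers rather than text, so the caller never receives the OCR-recognized body in readable form.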

Meanwhile, the metadata-based search method can also be used for big data analysis. Because the body text is not provided as readable character codes even for big data analysis, the XML document search module 160 can run searches and provide their results while limiting exposure of the body text. For example, search results such as the number of mortgage loan contracts in 2016, broken down by region or by age, can be compiled, and a big data analysis module (not shown) can perform big data analysis on them. The XML document search module 160 provides only hit counts, such as the number of contracts, the number by region, or the number by age, without exporting personal information, so searchers and big data analysts can do their work without exposing personal or security information.

That is, the content of an XML document itself is configured not to leave the XML document file database 140. Only the search results (such as the number of hits) or an identifier indicating the original document image file that corresponds to a retrieved XML document is provided outside the XML document file database 140.
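A count-only query of the kind described above might look like the following sketch. The `metadata/region` element, the function name `count_by_field`, and the sample documents are hypothetical; the point is only that the return value contains aggregate counts and no body text.

```python
from collections import Counter
import xml.etree.ElementTree as ET


def count_by_field(xml_docs: list[str], term: str, field: str) -> dict[str, int]:
    """Count documents whose body contains `term`, grouped by a metadata field."""
    counts: Counter = Counter()
    for xml_text in xml_docs:
        root = ET.fromstring(xml_text)
        body = (root.findtext("body") or "").lower()
        if term.lower() in body:
            # Group by the metadata value; the body text itself is never returned.
            counts[root.findtext(f"metadata/{field}") or "unknown"] += 1
    return dict(counts)


# Invented sample documents.
docs = [
    "<document><metadata><region>Seoul</region></metadata>"
    "<body>Mortgage loan contract, Person A</body></document>",
    "<document><metadata><region>Busan</region></metadata>"
    "<body>Mortgage loan contract, Person B</body></document>",
    "<document><metadata><region>Seoul</region></metadata>"
    "<body>Mortgage loan contract, Person C</body></document>",
    "<document><metadata><region>Seoul</region></metadata>"
    "<body>Cash card application</body></document>",
]
by_region = count_by_field(docs, "mortgage loan", "region")
```

Very small counts can still identify individuals, so a real deployment would likely need additional safeguards that the text does not discuss.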

FIG. 2 is a flowchart illustrating a method of searching for and providing an original document image file according to an embodiment of the present invention.

Referring to FIG. 2, the OCR (optical character reader) recognition / metadata extraction module 120 scans an original document image file stored in advance in the original document image file database 110, performs OCR recognition on it, and extracts metadata representing the format of the original document image file (S101).

Next, the XML conversion module 130 converts the original document data of the OCR-recognized original document image file into an XML document using the extracted metadata and stores the converted XML document in the XML document file database 140 (S102).

Next, the tag extraction module 150 automatically extracts a tag for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module 120 (S103).

Next, the XML document search module 160 searches the XML document stored in the XML document file database 140 for the text data to be searched by the user terminal (S104).

At this time, the XML document retrieving module 160 may retrieve the original document data that is requested to be retrieved by the user terminal 10 by using the tag extracted from the tag extracting module 150.

Next, the original document image file providing module 170 searches the original document image file database 110 for the original document image file corresponding to the original text data retrieved from the XML document retrieval module 160, and provides the original document image file to the user terminal 10 (S105).
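Steps S101 to S105 can be strung together in a small end-to-end sketch. Everything here is an assumption made for illustration: the class name, the XML layout, the substring-based tagging, and especially `fake_ocr`, which simply decodes bytes and stands in for a real OCR engine. Metadata extraction is omitted for brevity, and `ingest` also stores the image, whereas in the text the images are already in database 110.

```python
import xml.etree.ElementTree as ET
from typing import Callable


class DocumentImageSearchSystem:
    """Toy model of the flow S101 to S105 using two in-memory databases."""

    def __init__(self, ocr: Callable[[bytes], str], keywords: tuple[str, ...]):
        self.ocr = ocr
        self.keywords = keywords
        self.image_db: dict[str, bytes] = {}  # stands in for database 110
        self.xml_db: dict[str, str] = {}      # stands in for database 140

    def ingest(self, image_id: str, image: bytes) -> None:
        self.image_db[image_id] = image
        text = self.ocr(image)                                 # S101: OCR recognition
        root = ET.Element("document", {"image_id": image_id})  # S102: XML conversion
        ET.SubElement(root, "body").text = text
        tags = ET.SubElement(root, "tags")                     # S103: tag extraction
        for kw in self.keywords:
            if kw in text.lower():
                ET.SubElement(tags, "tag").text = kw
        self.xml_db[image_id] = ET.tostring(root, encoding="unicode")

    def search(self, tag: str) -> list[bytes]:
        hits = [                                               # S104: search XML by tag
            image_id
            for image_id, xml_text in self.xml_db.items()
            if tag in [t.text for t in ET.fromstring(xml_text).iter("tag")]
        ]
        return [self.image_db[i] for i in hits]                # S105: provide images


def fake_ocr(image: bytes) -> str:
    """Placeholder for a real OCR engine: treats the bytes as ASCII text."""
    return image.decode("ascii")


system = DocumentImageSearchSystem(fake_ocr, ("contract", "loan", "card"))
system.ingest("img-1", b"Loan contract for Person A")
system.ingest("img-2", b"Cash card application")
results = system.search("loan")
```

Only image bytes are returned from `search`; the XML text stays inside the object, which mirrors the access restriction described for the XML document file database 140.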

It will be apparent to those skilled in the art that various modifications and variations can be made in the present invention without departing from the spirit or scope of the invention as defined in the following claims.

110: Original document image file database
120: OCR recognition / metadata extraction module
130: XML conversion module
140: XML document file database
150: tag extraction module
160: XML document retrieval module
170: Original document image file provision module

Claims (4)

1. A system for searching and providing an original document image file, comprising:
an original document image file database in which original document image files are stored in advance;
an OCR recognition / metadata extraction module that scans an original document image file stored in the original document image file database, performs OCR (optical character reader) recognition on it, and extracts metadata representing the format of the original document image file;
an XML conversion module that converts the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module into an XML document using the extracted metadata;
an XML document file database in which the XML document converted by the XML conversion module is stored;
an XML document search module that searches the XML documents stored in the XML document file database for the original document data requested by a user terminal; and
an original document image file provision module that retrieves, from the original document image file database, the original document image file corresponding to the original document data found by the XML document search module and provides it to the user terminal.

2. The system according to claim 1,
further comprising a tag extraction module that automatically extracts tags for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module,
wherein the XML document search module searches the XML documents for the original document data requested by the user terminal using the tags extracted by the tag extraction module.

3. A method for searching and providing an original document image file, comprising:
an OCR (optical character reader) recognition / metadata extraction module scanning an original document image file stored in advance in an original document image file database, performing OCR recognition on it, and extracting metadata representing the format of the original document image file;
an XML conversion module converting the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module into an XML document and storing the converted XML document in an XML document file database;
an XML document search module searching the XML documents stored in the XML document file database for the original document data requested by a user terminal; and
an original document image file provision module retrieving, from the original document image file database, the original document image file corresponding to the original document data found by the XML document search module and providing the retrieved original document image file to the user terminal.

4. The method of claim 3,
wherein the step of converting the original document data of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module into the XML document and storing the converted XML document in the XML document file database
further comprises a tag extraction module automatically extracting tags for searching the contents of the original document image file on which OCR recognition has been performed by the OCR recognition / metadata extraction module, and
wherein, in the step of the XML document search module searching the XML documents stored in the XML document file database for the original document data requested by the user terminal,
the XML document search module searches the XML documents for the original document data requested by the user terminal using the tags extracted by the tag extraction module.

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020160123177A KR20180033786A (en) 2016-09-26 2016-09-26 System of searching and providing original document image file and method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
KR1020160123177A KR20180033786A (en) 2016-09-26 2016-09-26 System of searching and providing original document image file and method thereof

Publications (1)

Publication Number Publication Date
KR20180033786A (en) 2018-04-04

Family

ID=61975260

Family Applications (1)

Application Number Title Priority Date Filing Date
KR1020160123177A KR20180033786A (en) 2016-09-26 2016-09-26 System of searching and providing original document image file and method thereof

Country Status (1)

Country Link
KR (1) KR20180033786A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR102350111B1 (en) * 2021-04-02 2022-01-12 (주)광개토연구소 Method for issuing combined web page including both at least one specific identification phrase and at least one locator capable of allowing at least one docking result data corresponding to the specific identification phrase to be accessed and server using the same
WO2022211163A1 (en) * 2021-04-02 2022-10-06 (주)광개토연구소 Method for issuing combined web page including at least one specific identification phrase extracted from specific input data of user and at least one locator which can access at least one docking result data piece corresponding to same at least one specific identification phrase, and combined web page issuance server using same

Legal Events

Date Code Title Description
A201 Request for examination
E902 Notification of reason for refusal
E601 Decision to refuse application