CN108897862A - Government document picture retrieval method and system - Google Patents
Government document picture retrieval method and system
- Publication number
- CN108897862A (application number CN201810705428.4A)
- Authority
- CN
- China
- Prior art keywords
- picture
- text
- government
- terminal
- official document
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/14—Image acquisition
- G06V30/148—Segmentation of character regions
- G06V30/153—Segmentation of character regions using recognition of characters or words
Landscapes
- Engineering & Computer Science (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Character Discrimination (AREA)
Abstract
The invention discloses a method and system for retrieving government document pictures. It provides full-text search over government picture files so that the key information contained in a picture can be found quickly. With a keyword, a user can retrieve picture files as well as text files, which improves retrieval recall. The technique is tailored to how pictures are uploaded in the e-government sector: most uploaded pictures consist of text, and some are scanned copies of official documents. OCR recognition combined with algorithmic verification converts the characters in a picture into text data, and a full-text index is built together with a mapping between the index and the source pictures. Once global search over official document pictures is released, it greatly improves retrieval efficiency for government staff and raises retrieval recall: recall reaches 99.11% and precision exceeds 95%, which solves the problem that text inside official document pictures cannot be searched.
Description
Technical field
The invention relates to a picture retrieval method and system, and in particular to a method and system for retrieving government document pictures.
Background
Picture files account for about one third of all e-government file data. If full-text search over pictures is not supported, the recall of full-text search across all documents on an e-government management platform suffers directly. In practice, a considerable share of official documents exists only in image formats such as TIF, JPG and BMP, and general-purpose retrieval techniques cannot search the text inside a picture. Unless this problem is solved, the system is incomplete, and so is the information it provides to users. Most existing solutions retrieve structured data from a database and are essentially document-based. Picture-based retrieval applications do exist in the prior art, but they recognise the visual content of a picture, their algorithms are highly complex, and they place heavy demands on hardware. No existing recognition algorithm is tailored to the characteristics of the e-government sector and to the special characters found in government documents.
Picture-based recognition in the prior art is also simplistic: the recognition rate and accuracy of the extracted text are relatively low, retrieval relies mostly on database queries, recall is low, integration with e-government systems is weak, and the characteristics of government document source data are not reflected.
Summary of the invention
To address the shortcoming noted above, namely that prior-art picture retrieval cannot be used for retrieving government documents, the invention provides a method and system for retrieving government document pictures. It provides full-text search over government picture files so that the key information contained in a picture can be found quickly. With a keyword, a user can retrieve picture files as well as text files, which improves retrieval recall.
The technical solution adopted by the invention is a government document picture retrieval method comprising the following steps:
Step S1: An electronic official document picture is uploaded from a government terminal and transmitted to a background server over the Internet, a local area network or a telecommunications network.
Step S2: The background server monitors the content uploaded from the government terminal by means of a listener program. When the listener receives a picture file uploaded from the government terminal, it calls an OCR program component to recognise the text in the uploaded official document picture.
Step S3: Non-standard character elements in the official document picture are identified and converted by a comparison algorithm into special text, which is stored.
Step S4: The recognised text is segmented into words.
Step S5: The recognised text and the non-standard character elements are saved to a database.
Step S6: A full-text index library is built in the database, and the official document picture is mapped to the recognised text and the non-standard character elements.
Step S7: A keyword to be retrieved is entered on the government office platform, and a full-text search is run against the database using that keyword.
Step S8: The official document pictures matching the keyword are returned together with an attachment list.
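Step S4 does not name a segmentation algorithm. One common dictionary-based choice for Chinese text is forward maximum matching, and the Python sketch below assumes it. The function name, the mini-dictionary and the sample sentence are illustrative assumptions, not part of the invention.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    that appears in the dictionary; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical mini-dictionary; a real deployment would load a full lexicon.
DICTIONARY = {"政务", "公文", "图片", "检索", "全文检索"}

tokens = forward_max_match("政务公文图片全文检索", DICTIONARY)
print(tokens)
```

Because the longest match wins, "全文检索" is kept as one term instead of being split into "全文" and "检索"; whether that is desirable depends on how keywords are expected to be typed at retrieval time.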
A government document picture retrieval system comprises:
Input module: inputs electronic official document pictures and transmits them to the background server over the Internet, a local area network or a telecommunications network.
Monitoring module: monitors, on the background server, whether the government terminal has submitted a picture, and performs text recognition on submitted pictures. When the monitoring module receives a picture file uploaded from the government terminal, it calls an OCR program component to recognise the text in the uploaded official document picture.
Non-standard character element recognition module: identifies non-standard character elements in the official document picture and converts them by a comparison algorithm into special text, which is stored.
Database module: stores the text, the non-standard character elements and other auxiliary information.
Mapping module: builds a full-text index library in the database and maps the official document picture to the recognised text, the non-standard character elements and other auxiliary information.
Retrieval module: accepts a keyword to be retrieved on the government office platform and runs a full-text search against the database using that keyword.
Result return module: returns the retrieved official document pictures and the attachment list.
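The description does not specify the comparison algorithm used by the non-standard character element recognition module. As one possible reading, the Python sketch below compares a binarised crop of a seal or signature against registered templates by Hamming distance and emits the label of the closest template as the "special text". The bitmap representation, the labels and the distance threshold are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Rows of '0'/'1' characters; a hypothetical stand-in for a binarised image crop.
Bitmap = Tuple[str, ...]


def hamming(a: Bitmap, b: Bitmap) -> int:
    """Count differing pixels between two equally sized bitmaps."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def to_special_text(region: Bitmap,
                    templates: Dict[str, Bitmap],
                    max_distance: int = 2) -> Optional[str]:
    """Return the label of the closest template, or None if none is close enough."""
    best_label, best_dist = None, max_distance + 1
    for label, template in templates.items():
        dist = hamming(region, template)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label


# Hypothetical registered templates and their special-text labels.
TEMPLATES = {
    "[SEAL:round]": ("0110", "1001", "1001", "0110"),
    "[SIGNATURE:stroke]": ("1000", "0100", "0010", "0001"),
}

region = ("0110", "1001", "1011", "0110")  # differs from the round seal by one pixel
label = to_special_text(region, TEMPLATES)
print(label)
```

Storing a textual label such as `[SEAL:round]` lets the seal or signature be indexed and searched like any other term.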
The technical solution adopted by the invention further includes the following:
The government terminal is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld device terminal or a fixed device terminal.
The listener program is a multithreaded listener program.
The input module is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld terminal or a fixed terminal.
The system further comprises a word segmentation module, which segments the text recognised by the monitoring module into words.
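A multithreaded listener of the kind mentioned above could be arranged as a queue of uploaded picture paths drained by worker threads. The Python sketch below is a minimal illustration under that assumption; `start_listener`, `stop_listener` and the stand-in handler are hypothetical names, and the OCR call is replaced by a stub.

```python
import queue
import threading


def start_listener(handle, worker_count=4):
    """Start worker threads that process uploaded picture paths from a queue."""
    uploads = queue.Queue()

    def worker():
        while True:
            path = uploads.get()
            try:
                if path is None:  # sentinel: stop this worker
                    return
                handle(path)
            finally:
                uploads.task_done()

    threads = [threading.Thread(target=worker, daemon=True)
               for _ in range(worker_count)]
    for t in threads:
        t.start()
    return uploads, threads


def stop_listener(uploads, threads):
    """Send one sentinel per worker and wait for all workers to exit."""
    for _ in threads:
        uploads.put(None)
    for t in threads:
        t.join()


# Stand-in for the OCR step; a real handler would call the OCR component.
results = []
lock = threading.Lock()


def fake_ocr_handler(path):
    with lock:
        results.append(f"text of {path}")


uploads, threads = start_listener(fake_ocr_handler, worker_count=2)
for name in ["a.tif", "b.jpg", "c.bmp"]:
    uploads.put(name)
uploads.join()  # block until every queued upload has been handled
stop_listener(uploads, threads)
```

Setting `worker_count=1` gives the single-threaded variant with the same interface.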
The beneficial effects of the invention are as follows. The official document picture retrieval technique is tailored to how pictures are uploaded in the e-government sector: most uploaded pictures consist of text, and some are scanned copies of official documents. OCR recognition combined with algorithmic verification converts the characters in a picture into text data, and a full-text index is built together with a mapping between the index and the source pictures. Once global search over official document pictures is released, it greatly improves retrieval efficiency for government staff and raises retrieval recall: recall reaches 99.11% and precision exceeds 95%, which solves the problem that text inside official document pictures cannot be searched.
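The description gives recall and precision figures without stating how they were measured. Assuming the standard information-retrieval definitions (recall is the share of relevant documents that were retrieved, precision is the share of retrieved documents that are relevant), they can be computed as in the Python sketch below. The document IDs are invented for illustration and are unrelated to the figures quoted above.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query under the standard definitions."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision


# Illustrative query: 3 of the 4 relevant documents are found, plus 1 false hit.
recall, precision = recall_and_precision(
    retrieved=["doc1", "doc2", "doc3", "doc9"],
    relevant=["doc1", "doc2", "doc3", "doc4"],
)
print(recall, precision)
```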
The invention is further described below with reference to the drawing and specific embodiments.
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Detailed description of the embodiments
This embodiment is a preferred embodiment of the invention. Any other implementation whose principle and basic structure are identical or similar to those of this embodiment falls within the scope of protection of the invention.
The invention is primarily a government document picture retrieval method, which mainly comprises the following steps:
Step S1: An electronic official document picture (a scanned picture of a document or of reference material) is uploaded from a government terminal. In this embodiment the government terminal is typically a government office platform (a computer terminal, a mobile phone terminal, or another handheld or fixed terminal). The uploaded picture is transmitted to the background server over the Internet, a local area network or a telecommunications network.
Step S2: The background server monitors the content uploaded from the government terminal by means of a listener program. When the listener receives a picture file uploaded from the government terminal, it calls an OCR program component to recognise the text in the uploaded official document picture. OCR (Optical Character Recognition) is the process in which an electronic device such as a scanner or digital camera examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. In other words, for printed characters, the text in a paper document is optically converted into a black-and-white bitmap image file, and recognition software converts the text in the image into a text format that word-processing software can further edit and process. In this embodiment the listener program may be multithreaded; in a specific implementation a single-threaded listener may also be used.
Step S3: Non-standard character elements in the official document picture, such as electronic seals and handwritten signatures, are identified and converted by a comparison algorithm into special text, which is stored.
Step S4: The text recognised in step S2 is segmented into words. (Word segmentation is a technique in which a search engine, after processing the keyword string of a query submitted by a user, splits that keyword string into words using various matching methods.)
Step S5: The text segmented in step S4, the e-government official document ID identified in step S3 (the ID, short for identification, is the identifier of the official document) and other auxiliary information are saved to the database.
Step S6: A full-text index library is built in the database, and the official document picture is mapped to the recognised text, the special text, the official document ID and other auxiliary information.
Step S7: A keyword to be retrieved is entered on the government office platform, and a full-text search is run against the database using that keyword.
Step S8: The official document pictures matching the keyword are returned together with an attachment list (including but not limited to the text recognised in the corresponding picture, the official document ID and other auxiliary information).
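Steps S5 to S8 can be illustrated with a small inverted index kept in a relational database. The Python sketch below uses the standard-library `sqlite3` module purely as a stand-in; the embodiment does not name a database product, and the table names, column names, document IDs and file paths are hypothetical.

```python
import sqlite3


def open_index():
    """Create an in-memory database with a document table and a term index."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE document (doc_id TEXT PRIMARY KEY,
                               picture_path TEXT,
                               full_text TEXT);
        CREATE TABLE posting (term TEXT,
                              doc_id TEXT REFERENCES document(doc_id));
        CREATE INDEX posting_term ON posting(term);
    """)
    return db


def index_picture(db, doc_id, picture_path, tokens, special_texts=()):
    """S5/S6: store the recognised text and map every term to its picture."""
    terms = list(tokens) + list(special_texts)
    db.execute("INSERT INTO document VALUES (?, ?, ?)",
               (doc_id, picture_path, " ".join(terms)))
    db.executemany("INSERT INTO posting VALUES (?, ?)",
                   [(term, doc_id) for term in set(terms)])
    db.commit()


def search(db, keyword):
    """S7/S8: return (doc_id, picture_path, full_text) rows matching the keyword."""
    rows = db.execute(
        "SELECT d.doc_id, d.picture_path, d.full_text "
        "FROM posting p JOIN document d ON d.doc_id = p.doc_id "
        "WHERE p.term = ? ORDER BY d.doc_id",
        (keyword,),
    )
    return rows.fetchall()


db = open_index()
index_picture(db, "GW-001", "scans/notice.tif", ["budget", "notice"], ["[SEAL:round]"])
index_picture(db, "GW-002", "scans/memo.jpg", ["meeting", "notice"])
hits = search(db, "notice")
print(hits)
```

Each hit carries the picture path alongside the recognised text, so the result list can show the original scan together with its auxiliary information.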
The invention also protects a government document picture retrieval system, which mainly comprises:
Input module: inputs electronic official document pictures. In this embodiment, an electronic official document picture (a scanned picture of a document or of reference material) is uploaded from a government terminal, which is typically a government office platform (a computer terminal, a mobile phone terminal, or another handheld or fixed terminal). The uploaded picture is transmitted to the background server over the Internet, a local area network or a telecommunications network.
Monitoring module: monitors, on the background server, whether the government terminal has submitted a picture, and performs text recognition on submitted pictures. When the monitoring module receives a picture file uploaded from the government terminal, it calls an OCR program component to recognise the text in the uploaded official document picture. OCR (Optical Character Recognition) is the process in which an electronic device such as a scanner or digital camera examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text using a character recognition method. In other words, for printed characters, the text in a paper document is optically converted into a black-and-white bitmap image file, and recognition software converts the text in the image into a text format that word-processing software can further edit and process. In this embodiment the listener program may be multithreaded; in a specific implementation a single-threaded listener may also be used.
Non-standard character element recognition module: identifies non-standard character elements in the official document picture, such as electronic seals and handwritten signatures, and converts them by a comparison algorithm into special text, which is stored.
Word segmentation module: segments the text recognised by the monitoring module into words. (Word segmentation is a technique in which a search engine, after processing the keyword string of a query submitted by a user, splits that keyword string into words using various matching methods.)
Database module: stores the text segmented by the word segmentation module, the e-government official document ID identified by the non-standard character element recognition module (the ID, short for identification, is the identifier of the official document) and other auxiliary information.
Mapping module: builds a full-text index library in the database and maps the official document picture to the recognised text, the special text, the official document ID and other auxiliary information.
Retrieval module: accepts a keyword to be retrieved on the government office platform and runs a full-text search against the database using that keyword.
Result return module: returns the retrieved official document pictures and the attachment list (including but not limited to the text recognised in the corresponding picture, the official document ID and other auxiliary information).
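The way these modules hand data to one another can be sketched as a single class with the OCR and segmentation steps injected as callables. This Python sketch is only an illustration of the data flow: the class name, its fields and the stub OCR lookup are assumptions, and an in-memory dictionary stands in for the database and mapping modules.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass
class RetrievalSystem:
    ocr: Callable[[str], str]            # monitoring module's OCR call (injected)
    segment: Callable[[str], List[str]]  # word segmentation module (injected)
    index: Dict[str, Set[str]] = field(default_factory=dict)  # term -> picture paths
    texts: Dict[str, str] = field(default_factory=dict)       # picture path -> text

    def on_upload(self, picture_path: str) -> None:
        """Recognise, segment, store and map one uploaded picture."""
        text = self.ocr(picture_path)
        self.texts[picture_path] = text
        for term in self.segment(text):
            self.index.setdefault(term, set()).add(picture_path)

    def retrieve(self, keyword: str) -> List[dict]:
        """Return matching pictures together with their recognised text."""
        paths = sorted(self.index.get(keyword, set()))
        return [{"picture": p, "text": self.texts[p]} for p in paths]


# Stub OCR: a lookup table instead of a real recognition engine.
fake_ocr = {"a.tif": "budget notice", "b.jpg": "meeting notice"}.get

system = RetrievalSystem(ocr=fake_ocr, segment=str.split)
system.on_upload("a.tif")
system.on_upload("b.jpg")
results = system.retrieve("notice")
print(results)
```

Injecting `ocr` and `segment` keeps the sketch independent of any particular OCR engine or segmentation algorithm.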
The official document picture retrieval technique of the invention is tailored to how pictures are uploaded in the e-government sector: most uploaded pictures consist of text, and some are scanned copies of official documents. OCR recognition combined with algorithmic verification converts the characters in a picture into text data, and a full-text index is built together with a mapping between the index and the source pictures. Once global search over official document pictures is released, it greatly improves retrieval efficiency for government staff and raises retrieval recall: recall reaches 99.11% and precision exceeds 95%, which solves the problem that text inside official document pictures cannot be searched.
Claims (7)
1. A government document picture retrieval method, characterized in that the retrieval method comprises the following steps:
Step S1: An electronic official document picture is uploaded from a government terminal and transmitted to a background server over the Internet, a local area network or a telecommunications network.
Step S2: The background server monitors the content uploaded from the government terminal by means of a listener program. When the listener receives a picture file uploaded from the government terminal, it calls an OCR program component to recognise the text in the uploaded official document picture.
Step S3: Non-standard character elements in the official document picture are identified and converted by a comparison algorithm into special text, which is stored.
Step S4: The recognised text is segmented into words.
Step S5: The recognised text and the non-standard character elements are saved to a database.
Step S6: A full-text index library is built in the database, and the official document picture is mapped to the recognised text and the non-standard character elements.
Step S7: A keyword to be retrieved is entered on the government office platform, and a full-text search is run against the database using that keyword.
Step S8: The official document pictures matching the keyword are returned together with an attachment list.
2. The government document picture retrieval method according to claim 1, characterized in that the government terminal is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld device terminal or a fixed device terminal.
3. The government document picture retrieval method according to claim 1, characterized in that the listener program is a multithreaded listener program.
4. A government document picture retrieval system, characterized in that the system comprises:
Input module: inputs electronic official document pictures and transmits them to the background server over the Internet, a local area network or a telecommunications network.
Monitoring module: monitors, on the background server, whether the government terminal has submitted a picture, and performs text recognition on submitted pictures. When the monitoring module receives a picture file uploaded from the government terminal, it calls an OCR program component to recognise the text in the uploaded official document picture.
Non-standard character element recognition module: identifies non-standard character elements in the official document picture and converts them by a comparison algorithm into special text, which is stored.
Database module: stores the text, the non-standard character elements and other auxiliary information.
Mapping module: builds a full-text index library in the database and maps the official document picture to the recognised text, the non-standard character elements and other auxiliary information.
Retrieval module: accepts a keyword to be retrieved on the government office platform and runs a full-text search against the database using that keyword.
Result return module: returns the retrieved official document pictures and the attachment list.
5. The government document picture retrieval system according to claim 4, characterized in that the input module is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld terminal or a fixed terminal.
6. The government document picture retrieval system according to claim 4, characterized in that the listener program is a multithreaded listener program.
7. The government document picture retrieval system according to claim 4, characterized in that the system further comprises a word segmentation module, which segments the text recognised by the monitoring module into words.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810705428.4A CN108897862A (en) | 2018-07-02 | 2018-07-02 | Government document picture retrieval method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810705428.4A CN108897862A (en) | 2018-07-02 | 2018-07-02 | Government document picture retrieval method and system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108897862A true CN108897862A (en) | 2018-11-27 |
Family
ID=64347397
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810705428.4A Pending CN108897862A (en) | Government document picture retrieval method and system | 2018-07-02 | 2018-07-02 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108897862A (en) |
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101464903A (en) * | 2009-01-09 | 2009-06-24 | 江阴明伦科技有限公司 | OCR picture and text recognition and retrieval method and system through web mode |
CN102262640A (en) * | 2010-05-31 | 2011-11-30 | 中国移动通信集团贵州有限公司 | Method and device for full-text retrieval of document database |
CN107545391A (en) * | 2017-09-07 | 2018-01-05 | 安徽共生物流科技有限公司 | A kind of logistics document intellectual analysis and automatic storage method based on image recognition |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175256A (en) * | 2019-05-30 | 2019-08-27 | 上海联影医疗科技有限公司 | Image data retrieval method, apparatus, equipment and storage medium |
CN110175256B (en) * | 2019-05-30 | 2024-06-07 | 上海联影医疗科技股份有限公司 | Image data retrieval method, device, equipment and storage medium |
CN110516037A (en) * | 2019-07-29 | 2019-11-29 | 广东鼎义互联科技股份有限公司 | Bidding document analysis system for the government affairs field |
CN113806472A (en) * | 2020-06-17 | 2021-12-17 | 中国人寿资产管理有限公司 | Method and equipment for full-text retrieval of text, picture and image-type scanned documents |
CN113806472B (en) * | 2020-06-17 | 2023-12-26 | 中国人寿资产管理有限公司 | Method and equipment for full-text retrieval of text, picture and image-type scanned documents |
CN112949471A (en) * | 2021-02-27 | 2021-06-11 | 浪潮云信息技术股份公司 | Domestic CPU-based electronic official document identification reproduction method and system |
CN114611507A (en) * | 2022-03-10 | 2022-06-10 | 北京思源智通科技有限责任公司 | Text keyword analysis method, system and computer readable medium |
CN117688162A (en) * | 2024-01-16 | 2024-03-12 | 广东铭太信息科技有限公司 | Full text retrieval method and system based on OCR (optical character recognition) |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108897862A (en) | Government document picture retrieval method and system | |
US9767379B2 (en) | Systems, methods and computer program products for determining document validity | |
CN102622592B (en) | Name card recognition method based on cloud technology | |
US10192279B1 (en) | Indexed document modification sharing with mixed media reality | |
US7933453B2 (en) | System and method for capturing and processing business data | |
US9530050B1 (en) | Document annotation sharing | |
US20050100216A1 (en) | Method and apparatus for capturing paper-based information on a mobile computing device | |
US20070047008A1 (en) | System and methods for use of voice mail and email in a mixed media environment | |
WO2013004036A1 (en) | Business card recognition method combining character recognition and image matching | |
CN114445836A (en) | Information auditing method and device combining RPA and AI and electronic equipment | |
CN116665228B (en) | Image processing method and device | |
CN112418813A (en) | AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium | |
US10579653B2 (en) | Apparatus, method, and computer-readable medium for recognition of a digital document | |
CN112464907A (en) | Document processing system and method | |
CN114238731A (en) | Domestic CPU retrieval method, system, device and computer readable medium | |
US20080144106A1 (en) | Automated processing of paper forms using remotely-stored form content | |
CN113971810A (en) | Document generation method, device, platform, electronic equipment and storage medium | |
US20150030241A1 (en) | Method and system for data identification and extraction using pictorial representations in a source document | |
CN115640952B (en) | Method and system for importing and uploading data | |
KR101659886B1 (en) | business card ordering system and method | |
CN116152480A (en) | Data extraction and structuring processing system and implementation method | |
CN113516044A (en) | Paper contract credit enhancement method and system based on OCR and Hash algorithm | |
CN115392209A (en) | Method, equipment and medium for automatically generating civil case legal documents | |
CN113536831A (en) | Reading assisting method, device, equipment and computer readable medium based on image recognition | |
WO2024115773A1 (en) | Computer implemented method for an automated search of an article of a printed medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||