CN108897862A - Government document picture retrieval method and system - Google Patents

Government document picture retrieval method and system

Info

Publication number
CN108897862A
Authority
CN
China
Prior art keywords
picture
text
government
terminal
official document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810705428.4A
Other languages
Chinese (zh)
Inventor
李军
史玉洁
袁志远
吴恺
俞勋勋
雷久滩
蔡天祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
ZHUHAI FLYRISE SOFTWARE CO Ltd
Original Assignee
ZHUHAI FLYRISE SOFTWARE CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZHUHAI FLYRISE SOFTWARE CO Ltd
Priority to CN201810705428.4A
Publication of CN108897862A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Character Discrimination (AREA)

Abstract

The invention discloses a government document picture retrieval method and system that provide full-text search over government picture files, so that the key information contained in a picture can be found quickly. With a single keyword, a user can retrieve both text files and picture files, which improves retrieval recall. The technique targets the characteristics of pictures uploaded in the e-government sector: most of them consist of text, and some are scanned copies of official documents. OCR recognition followed by algorithmic verification converts the text in each picture into text data, and a full-text index is built together with its mapping to the source picture. After full-text search over official document pictures was introduced, retrieval became considerably more efficient for government staff: recall reaches 99.11% and precision exceeds 95%, which effectively solves the problem that text inside official document pictures could not be searched.

Description

Government document picture retrieval method and system
Technical field
The invention relates to a picture retrieval method and system, and in particular to a government document picture retrieval method and system.
Background
Picture files account for one third of the total volume of e-government file data. If full-text search over pictures is not supported, the recall of full-text search across all documents on an e-government management platform suffers directly. In practice, a considerable share of official documents exist in image formats such as TIF, JPG, and BMP, and general retrieval techniques cannot search the text inside pictures. Unless this problem is solved, the system is incomplete, and so is the information it provides to users. Most existing solutions retrieve over structured database data and are essentially document-based. Picture-based applications do exist, but their recognition targets the visual content of the picture; the algorithms are very complex and place high demands on hardware. No existing solution is tailored to the e-government sector and combines recognition with the special characters used in government affairs.
In the prior art, picture-based recognition is simplistic, and the recognition rate and accuracy of the recognized text are relatively low. Retrieval is mostly database-based, so recall is low, integration with e-government systems is poor, and the characteristics of government document source data are not reflected.
Summary of the invention
To address the shortcoming noted above, namely that prior-art picture retrieval cannot be applied to government document retrieval, the invention provides a government document picture retrieval method and system that support full-text search over government picture files and quickly find the key information contained in a picture. With a keyword, a user can retrieve text files as well as picture files, which improves retrieval recall.
The technical solution adopted by the invention to solve this technical problem is a government document picture retrieval method comprising the following steps:
Step S1: Upload an electronic official document picture through a government terminal and transmit it to a background server over the Internet, a local area network, or a telecommunications network;
Step S2: The background server monitors the content uploaded by the government terminal with a listener program; once it detects that a picture file uploaded by the government terminal has been received, it calls an OCR program component to recognize the text in the uploaded official document picture;
Step S3: Identify the non-standard character elements in the official document picture, and convert them into special text for storage by means of a comparison algorithm;
Step S4: Perform word segmentation on the recognized text;
Step S5: Save the recognized text and the non-standard character elements to a database;
Step S6: Build a full-text index library in the database, and map the official document picture to the recognized text and non-standard character elements;
Step S7: Enter the keyword to be searched in the government office platform, and run a full-text search in the database with that keyword;
Step S8: Return the official document pictures that match the keyword, together with the attachment list.
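Steps S4 to S8 can be illustrated with a minimal in-memory sketch. Everything named here is hypothetical: the patent does not specify a segmenter or an index structure, so the sketch substitutes overlapping character bigrams for word segmentation and a plain dictionary for the full-text index library, and the OCR output is passed in as a ready-made string.

```python
from collections import defaultdict


def normalize(text):
    """Remove all whitespace so line breaks in OCR output do not split words."""
    return "".join(text.split())


def bigrams(text):
    """Stand-in segmenter (assumption): overlapping character bigrams."""
    chars = normalize(text)
    pairs = [chars[i:i + 2] for i in range(len(chars) - 1)]
    return pairs or ([chars] if chars else [])


class PictureIndex:
    """Hypothetical full-text index that maps tokens to picture IDs."""

    def __init__(self):
        self.postings = defaultdict(set)   # token -> picture IDs (S6)
        self.records = {}                  # picture ID -> stored text (S5)

    def add(self, picture_id, recognized_text, special_text=""):
        # S5: keep the recognized text and the special text with the picture.
        parts = (normalize(recognized_text), normalize(special_text))
        self.records[picture_id] = "\n".join(parts)
        # S4 + S6: segment each part and index every token.
        for part in parts:
            for token in bigrams(part):
                self.postings[token].add(picture_id)

    def search(self, keyword):
        """S7 + S8: return the IDs of pictures whose text contains the keyword."""
        key = normalize(keyword)
        tokens = bigrams(key)
        if not tokens:
            return []
        # A candidate must contain every bigram of the keyword.
        hits = set.intersection(*(self.postings.get(t, set()) for t in tokens))
        # Confirm the literal keyword to drop bigram false positives.
        return sorted(p for p in hits if key in self.records[p])
```

For example, after `index.add("scan-001.jpg", "关于开展安全检查的通知")`, the call `index.search("安全检查")` returns the ID of that picture. A production system would more likely rely on the database's own full-text indexing than on a dictionary held in memory.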
A government document picture retrieval system, comprising:
Input module: for inputting an electronic official document picture and transmitting it to the background server over the Internet, a local area network, or a telecommunications network;
Monitoring module: for monitoring, on the background server, whether the government terminal has input a picture, and for performing text recognition on the input picture; once the monitoring module detects that a picture file uploaded by the government terminal has been received, it calls an OCR program component to recognize the text in the uploaded official document picture;
Non-standard character element recognition module: for identifying the non-standard character elements in the official document picture and converting them into special text for storage by means of a comparison algorithm;
Database module: for storing the text, the non-standard character elements, and other auxiliary information;
Mapping module: for building a full-text index library in the database and mapping the official document picture to the recognized text, the non-standard character elements, and other auxiliary information;
Retrieval module: for entering the keyword to be searched in the government office platform and running a full-text search in the database with that keyword;
Result return module: for returning the retrieved official document pictures and the attachment list.
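One way to picture how the database, mapping, retrieval, and result return modules fit together is the small SQLite sketch below. The table names, column names, and function names are all illustrative assumptions, and a plain `LIKE` substring match stands in for the full-text index library, which the patent does not describe in detail.

```python
import sqlite3


def open_store():
    """Create an in-memory store; table and column names are illustrative."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE pictures (
            picture_id INTEGER PRIMARY KEY,
            file_name  TEXT NOT NULL,
            doc_id     TEXT NOT NULL          -- official document ID
        );
        CREATE TABLE recognized (
            picture_id   INTEGER NOT NULL REFERENCES pictures(picture_id),
            body_text    TEXT NOT NULL,       -- OCR output
            special_text TEXT NOT NULL        -- e.g. seal or signature labels
        );
        """
    )
    return conn


def save_picture(conn, file_name, doc_id, body_text, special_text=""):
    """Database + mapping modules: store the text and link it to its picture."""
    cur = conn.execute(
        "INSERT INTO pictures (file_name, doc_id) VALUES (?, ?)",
        (file_name, doc_id),
    )
    conn.execute(
        "INSERT INTO recognized (picture_id, body_text, special_text) "
        "VALUES (?, ?, ?)",
        (cur.lastrowid, body_text, special_text),
    )
    conn.commit()
    return cur.lastrowid


def find_pictures(conn, keyword):
    """Retrieval + result modules: substring match, then join back to pictures."""
    pattern = f"%{keyword}%"   # note: % and _ in the keyword act as wildcards
    rows = conn.execute(
        "SELECT p.file_name, p.doc_id FROM pictures p "
        "JOIN recognized r ON r.picture_id = p.picture_id "
        "WHERE r.body_text LIKE ? OR r.special_text LIKE ? "
        "ORDER BY p.picture_id",
        (pattern, pattern),
    )
    return rows.fetchall()
```

Keeping the recognized text in its own table, keyed by `picture_id`, is what lets a text hit be traced back to the picture file and document ID that the result return module reports.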
The technical solution adopted by the invention further comprises the following:
The government terminal is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld device terminal, or a fixed device terminal.
The listener program is a multithreaded listener program.
The input module is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld terminal, or a fixed terminal.
The system further includes a word segmentation module for performing word segmentation on the text recognized by the monitoring module.
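A multithreaded listener of the kind mentioned above could be organized as a shared upload queue drained by worker threads. The sketch below is only an assumption about that shape: `fake_ocr` is a placeholder rather than a real OCR component, and the function names and the worker count are invented for the example.

```python
import queue
import threading


def fake_ocr(picture_name):
    """Placeholder for the OCR program component (assumption, not a real API)."""
    return f"text of {picture_name}"


def run_listener(uploads, worker_count=4, recognize=fake_ocr):
    """Drain an upload queue with several worker threads and collect OCR text."""
    pending = queue.Queue()
    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            name = pending.get()
            if name is None:          # sentinel: no more uploads
                pending.task_done()
                return
            text = recognize(name)
            with lock:                # guard the shared result dictionary
                results[name] = text
            pending.task_done()

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for t in threads:
        t.start()
    for name in uploads:
        pending.put(name)
    for _ in threads:
        pending.put(None)             # one sentinel per worker
    pending.join()
    for t in threads:
        t.join()
    return results
```

Calling `run_listener(names, worker_count=1)` gives the single-threaded variant, so the same code covers both modes.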
The beneficial effects of the invention are as follows. The official document picture retrieval technique targets the characteristics of pictures uploaded in the e-government sector: most of them consist of text, and some are scanned copies of official documents. OCR recognition followed by algorithmic verification converts the text in each picture into text data, and a full-text index is built together with its mapping to the source picture. After full-text search over official document pictures was introduced, retrieval became considerably more efficient for government staff: recall reaches 99.11% and precision exceeds 95%, which effectively solves the problem that text inside official document pictures could not be searched.
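The patent does not say how the recall and precision figures were measured. Assuming they follow the usual information-retrieval definitions (recall is the share of relevant pictures that were returned, precision is the share of returned pictures that are relevant), they could be computed per query as in this sketch; the function name and the sample data are illustrative only.

```python
def recall_and_precision(retrieved, relevant):
    """Return (recall, precision) for one query as fractions in [0, 1]."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)          # relevant pictures that came back
    recall = hits / len(relevant) if relevant else 0.0
    precision = hits / len(retrieved) if retrieved else 0.0
    return recall, precision
```

With four pictures returned, five relevant pictures in total, and three in common, the function yields a recall of 3/5 and a precision of 3/4.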
The invention is further described below with reference to the drawings and a specific embodiment.
Brief description of the drawings
Fig. 1 is a flow chart of the invention.
Detailed description
This embodiment is the preferred embodiment of the invention; any other embodiment whose principle and basic structure are identical or similar to those of this embodiment falls within the scope of protection of the invention.
The invention is chiefly a government document picture retrieval method, which mainly comprises the following steps:
Step S1: Upload an electronic official document picture (a document or a reference scan) through a government terminal. In this embodiment the government terminal is typically a government office platform (a computer terminal, a mobile phone terminal, or another handheld or fixed terminal), and the uploaded picture is transmitted to the background server over the Internet, a local area network, or a telecommunications network.
Step S2: The background server monitors the content uploaded by the government terminal with a listener program; once it detects that a picture file uploaded by the government terminal has been received, it calls an OCR program component to recognize the text in the uploaded official document picture. OCR (Optical Character Recognition) is the process in which an electronic device, such as a scanner or digital camera, examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text with a character recognition method. In other words, for printed characters, the text of a paper document is converted optically into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format that word-processing software can further edit and process. In this embodiment the listener program may be multithreaded; in a specific implementation a single-threaded listener program may also be used.
Step S3: Identify non-standard character elements in the official document picture, such as electronic seals and handwritten signatures, and convert them into special text for storage by means of a comparison algorithm.
Step S4: Perform word segmentation on the text recognized in step S2. (Word segmentation is a technique in which a search engine, when processing the keyword string a user submits as a query, splits that string into words using various matching methods.)
Step S5: Save the text processed in step S4, the electronic government document ID identified in step S3 (an ID is an identifier, i.e. the mark that identifies the document), and other auxiliary information to the database;
Step S6: Build a full-text index library in the database, and map the official document picture to the recognized text, the special text, the document ID, and other auxiliary information.
Step S7: Enter the keyword to be searched in the government office platform, and run a full-text search in the database with that keyword.
Step S8: Return the retrieved official document pictures that match the keyword, together with the attachment list (including but not limited to the text recognized in each picture, the document ID, and other auxiliary information).
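The comparison algorithm of step S3 is not disclosed, so the following is only a toy illustration of the general idea: compare a region cut from the picture against registered templates and, when one is close enough, store its label as special text. The tiny binary grids, the Hamming-distance measure, the threshold, and the label format are all assumptions; a real system would compare image features rather than 4x4 grids.

```python
def hamming(a, b):
    """Count differing cells between two equal-sized binary grids."""
    if len(a) != len(b) or any(len(x) != len(y) for x, y in zip(a, b)):
        raise ValueError("grids must have the same shape")
    return sum(ca != cb for ra, rb in zip(a, b) for ca, cb in zip(ra, rb))


def to_special_text(region, templates, max_distance=2):
    """Return the label of the closest registered template, or None.

    `templates` maps a label such as "[SEAL:office-a]" to a binary grid.
    A template only counts as a match within `max_distance` differing cells.
    """
    best_label, best_distance = None, max_distance + 1
    for label, grid in templates.items():
        distance = hamming(region, grid)
        if distance < best_distance:
            best_label, best_distance = label, distance
    return best_label
```

Because the returned label is ordinary text, it can be saved and indexed alongside the OCR output in steps S5 and S6, which is what makes a seal or signature searchable by keyword.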
The invention also protects a government document picture retrieval system, which mainly comprises:
Input module: for inputting an electronic official document picture. In this embodiment the picture (a document or a reference scan) is uploaded through a government terminal, typically a government office platform (a computer terminal, a mobile phone terminal, or another handheld or fixed terminal), and is transmitted to the background server over the Internet, a local area network, or a telecommunications network.
Monitoring module: for monitoring, on the background server, whether the government terminal has input a picture, and for performing text recognition on the input picture. Once the monitoring module detects that a picture file uploaded by the government terminal has been received, it calls an OCR program component to recognize the text in the uploaded official document picture. OCR (Optical Character Recognition) is the process in which an electronic device, such as a scanner or digital camera, examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then translates the shapes into computer text with a character recognition method. In other words, for printed characters, the text of a paper document is converted optically into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format that word-processing software can further edit and process. In this embodiment the listener program may be multithreaded; in a specific implementation a single-threaded listener program may also be used.
Non-standard character element recognition module: for identifying non-standard character elements in the official document picture, such as electronic seals and handwritten signatures, and converting them into special text for storage by means of a comparison algorithm.
Word segmentation module: for performing word segmentation on the text recognized by the monitoring module. (Word segmentation is a technique in which a search engine, when processing the keyword string a user submits as a query, splits that string into words using various matching methods.)
Database module: for storing the text segmented by the word segmentation module, the electronic government document ID identified by the non-standard character element recognition module (an ID is an identifier, i.e. the mark that identifies the document), and other auxiliary information;
Mapping module: for building a full-text index library in the database and mapping the official document picture to the recognized text, the special text, the document ID, and other auxiliary information.
Retrieval module: for entering the keyword to be searched in the government office platform and running a full-text search in the database with that keyword.
Result return module: for returning the retrieved official document pictures and the attachment list (including but not limited to the text recognized in each picture, the document ID, and other auxiliary information).
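The word segmentation module is described only as using "various matching methods". One classic dictionary-matching method is forward maximum matching, sketched below purely as an illustration: the dictionary contents and the `max_word_len` limit are made up for the example, and the patent does not say which method the system actually uses.

```python
def forward_max_match(text, dictionary, max_word_len=4):
    """Greedy left-to-right segmentation: take the longest dictionary word."""
    words, i = [], 0
    while i < len(text):
        # Try the longest window first and shrink it one character at a time.
        for size in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # Fall back to a single character when nothing longer matches.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words
```

With a dictionary containing 安全, 检查, 安全检查, and 通知, the string 安全检查的通知 is split into 安全检查, 的, and 通知: the longest entry wins at each position, and characters not covered by the dictionary are emitted one at a time.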
The official document picture retrieval technique of the invention targets the characteristics of pictures uploaded in the e-government sector: most of them consist of text, and some are scanned copies of official documents. OCR recognition followed by algorithmic verification converts the text in each picture into text data, and a full-text index is built together with its mapping to the source picture. After full-text search over official document pictures was introduced, retrieval became considerably more efficient for government staff: recall reaches 99.11% and precision exceeds 95%, which effectively solves the problem that text inside official document pictures could not be searched.

Claims (7)

1. A government document picture retrieval method, characterized in that the method comprises the following steps:
Step S1: Upload an electronic official document picture through a government terminal and transmit it to a background server over the Internet, a local area network, or a telecommunications network;
Step S2: The background server monitors the content uploaded by the government terminal with a listener program; once it detects that a picture file uploaded by the government terminal has been received, it calls an OCR program component to recognize the text in the uploaded official document picture;
Step S3: Identify the non-standard character elements in the official document picture, and convert them into special text for storage by means of a comparison algorithm;
Step S4: Perform word segmentation on the recognized text;
Step S5: Save the recognized text and the non-standard character elements to a database;
Step S6: Build a full-text index library in the database, and map the official document picture to the recognized text and non-standard character elements;
Step S7: Enter the keyword to be searched in the government office platform, and run a full-text search in the database with that keyword;
Step S8: Return the official document pictures that match the keyword, together with the attachment list.
2. The government document picture retrieval method according to claim 1, characterized in that the government terminal is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld device terminal, or a fixed device terminal.
3. The government document picture retrieval method according to claim 1, characterized in that the listener program is a multithreaded listener program.
4. A government document picture retrieval system, characterized in that the system comprises:
Input module: for inputting an electronic official document picture and transmitting it to the background server over the Internet, a local area network, or a telecommunications network;
Monitoring module: for monitoring, on the background server, whether the government terminal has input a picture, and for performing text recognition on the input picture; once the monitoring module detects that a picture file uploaded by the government terminal has been received, it calls an OCR program component to recognize the text in the uploaded official document picture;
Non-standard character element recognition module: for identifying the non-standard character elements in the official document picture and converting them into special text for storage by means of a comparison algorithm;
Database module: for storing the text, the non-standard character elements, and other auxiliary information;
Mapping module: for building a full-text index library in the database and mapping the official document picture to the recognized text, the non-standard character elements, and other auxiliary information;
Retrieval module: for entering the keyword to be searched in the government office platform and running a full-text search in the database with that keyword;
Result return module: for returning the retrieved official document pictures and the attachment list.
5. The government document picture retrieval system according to claim 4, characterized in that the input module is a government office platform, which includes a computer terminal, a mobile phone terminal, a handheld terminal, or a fixed terminal.
6. The government document picture retrieval system according to claim 4, characterized in that the listener program is a multithreaded listener program.
7. The government document picture retrieval system according to claim 4, characterized in that the system further includes a word segmentation module for performing word segmentation on the text recognized by the monitoring module.
CN201810705428.4A 2018-07-02 2018-07-02 Government document picture retrieval method and system Pending CN108897862A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810705428.4A CN108897862A (en) 2018-07-02 2018-07-02 Government document picture retrieval method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810705428.4A CN108897862A (en) 2018-07-02 2018-07-02 Government document picture retrieval method and system

Publications (1)

Publication Number Publication Date
CN108897862A true CN108897862A (en) 2018-11-27

Family

ID=64347397

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810705428.4A Pending CN108897862A (en) 2018-07-02 2018-07-02 Government document picture retrieval method and system

Country Status (1)

Country Link
CN (1) CN108897862A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101464903A (en) * 2009-01-09 2009-06-24 江阴明伦科技有限公司 OCR picture and text recognition and retrieval method and system through web mode
CN102262640A (en) * 2010-05-31 2011-11-30 中国移动通信集团贵州有限公司 Method and device for full-text retrieval of document database
CN107545391A (en) * 2017-09-07 2018-01-05 安徽共生物流科技有限公司 A kind of logistics document intellectual analysis and automatic storage method based on image recognition

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110175256A (en) * 2019-05-30 2019-08-27 上海联影医疗科技有限公司 A kind of image data retrieval method, apparatus, equipment and storage medium
CN110175256B (en) * 2019-05-30 2024-06-07 上海联影医疗科技股份有限公司 Image data retrieval method, device, equipment and storage medium
CN110516037A (en) * 2019-07-29 2019-11-29 广东鼎义互联科技股份有限公司 A kind of bidding document analysis system in government affairs field
CN113806472A (en) * 2020-06-17 2021-12-17 中国人寿资产管理有限公司 Method and equipment for realizing full-text retrieval of character, picture and image type scanning piece
CN113806472B (en) * 2020-06-17 2023-12-26 中国人寿资产管理有限公司 Method and equipment for realizing full-text retrieval of text picture and image type scanning piece
CN112949471A (en) * 2021-02-27 2021-06-11 浪潮云信息技术股份公司 Domestic CPU-based electronic official document identification reproduction method and system
CN114611507A (en) * 2022-03-10 2022-06-10 北京思源智通科技有限责任公司 Text keyword analysis method, system and computer readable medium
CN117688162A (en) * 2024-01-16 2024-03-12 广东铭太信息科技有限公司 Full text retrieval method and system based on OCR (optical character recognition)

Similar Documents

Publication Publication Date Title
CN108897862A (en) Government document picture retrieval method and system
US9767379B2 (en) Systems, methods and computer program products for determining document validity
CN102622592B (en) Name card recognition method based on cloud technology
US10192279B1 (en) Indexed document modification sharing with mixed media reality
US7933453B2 (en) System and method for capturing and processing business data
US9530050B1 (en) Document annotation sharing
US20050100216A1 (en) Method and apparatus for capturing paper-based information on a mobile computing device
US20070047008A1 (en) System and methods for use of voice mail and email in a mixed media environment
WO2013004036A1 (en) Business card recognition method combining character recognition and image matching
CN114445836A (en) Information auditing method and device combining RPA and AI and electronic equipment
CN116665228B (en) Image processing method and device
CN112418813A (en) AEO qualification intelligent rating management system and method based on intelligent analysis and identification and storage medium
US10579653B2 (en) Apparatus, method, and computer-readable medium for recognition of a digital document
CN112464907A (en) Document processing system and method
CN114238731A (en) Domestic CPU retrieval method, system, device and computer readable medium
US20080144106A1 (en) Automated processing of paper forms using remotely-stored form content
CN113971810A (en) Document generation method, device, platform, electronic equipment and storage medium
US20150030241A1 (en) Method and system for data identification and extraction using pictorial representations in a source document
CN115640952B (en) Method and system for importing and uploading data
KR101659886B1 (en) business card ordering system and method
CN116152480A (en) Data extraction and structuring processing system and implementation method
CN113516044A (en) Paper contract credit enhancement method and system based on OCR and Hash algorithm
CN115392209A (en) Method, equipment and medium for automatically generating civil case legal documents
CN113536831A (en) Reading assisting method, device, equipment and computer readable medium based on image recognition
WO2024115773A1 (en) Computer implemented method for an automated search of an article of a printed medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination