CN114328804A - Method and system for searching key words containing character pictures - Google Patents


Info

Publication number
CN114328804A
CN114328804A CN202011029418.7A CN202011029418A CN114328804A CN 114328804 A CN114328804 A CN 114328804A CN 202011029418 A CN202011029418 A CN 202011029418A CN 114328804 A CN114328804 A CN 114328804A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011029418.7A
Other languages
Chinese (zh)
Inventor
邓裕强
朱志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Jiubang Digital Technology Co Ltd
Original Assignee
Guangzhou Jiubang Digital Technology Co Ltd
Application filed by Guangzhou Jiubang Digital Technology Co Ltd filed Critical Guangzhou Jiubang Digital Technology Co Ltd
Priority to CN202011029418.7A priority Critical patent/CN114328804A/en
Publication of CN114328804A publication Critical patent/CN114328804A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method and a system for retrieving keywords from pictures that contain text. Through three stages (text recognition, text retrieval, and target-word positioning), the keywords a user needs, together with the page numbers on which they appear, can be retrieved from non-editable PDF documents and pictures. This solves the problem that users cannot find specific pages and specific content by keyword in large sets of pictures and non-editable PDF documents, and makes it easy to locate a required page by keyword while learning outdoors.

Description

Method and system for searching key words containing character pictures
Technical Field
The invention relates to the field of character recognition and retrieval, and in particular to a method and a system for retrieving keywords from pictures containing text.
Background
With the rise of reading on handheld devices, many working people have become used to spending their commuting time on online learning, reading novels, and the like. However, existing application software and mobile terminals cannot jump to the corresponding page and mark it, anytime and anywhere, on the basis of a keyword. In a .docx file, a user can use the "Find" function to locate and jump to content that exactly matches a keyword, long sentence, punctuation mark, or number. Reading-terminal software, PDF files, and similar formats, however, may contain picture-type, non-editable content, so the user cannot search with the "Find" function and cannot look up a keyword string directly as in a search engine, which causes considerable inconvenience.
Disclosure of Invention
The invention provides a method and a system for retrieving keywords from pictures containing text, which solve the above problems, save the user's time, and bring convenience to the user.
The technical solution disclosed by the invention is as follows:
A method and a system for retrieving keywords from pictures containing text, characterized in that, for a plurality of pictures taken with the camera of a mobile terminal or other device, or for a non-editable PDF file, the text elements in the pictures are recognized by OCR character recognition technology and a deep-learning-based system, and the keywords required by the user are retrieved.
In the first stage, OCR (optical character recognition) technology is used to recognize the element content on each page of the picture set or non-editable PDF document to be processed, and the content is arranged in order. In the second stage, the required keywords are retrieved from the resulting text document through a deep learning network. In the third stage, the keywords are located in the pictures and in the content of the non-editable PDF document by means of positioning coordinates and page numbers.
Further, the OCR character recognition model is trained continuously on similar pictures through a deep learning network; it extracts element information such as characters, numbers, and letters from the pictures and saves it in a .doc or .docx document.
In the first stage, OCR character recognition comprises the following steps:
Step 1: Read the elements of the picture set to be recognized in order, correct characters and formatting, and remove interfering elements.
The picture set may take the form of jpg or png pictures, a set of PDF document pictures, and the like.
Step 2: Mark and record the coordinates of each element in the picture.
Step 3: Generate a doc or docx document, mark page numbers on it, and convert the pictures into document pages in the order of the picture set.
Furthermore, each document page number corresponds to exactly one picture.
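The first-stage steps can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the OCR engine is replaced by a hypothetical `ocr_results` dictionary of already-recognized text, the `Element` record and `build_pages` function are names invented here, and the "document" is an in-memory list rather than a doc or docx file.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# (left, top, width, height) in picture pixels; this layout is an assumption.
BBox = Tuple[int, int, int, int]


@dataclass(frozen=True)
class Element:
    """One recognized text element together with where it was found."""
    text: str
    page: int   # 1-based document page number
    bbox: BBox  # coordinates of the element in the source picture


def build_pages(pictures: List[str],
                ocr_results: Dict[str, List[Tuple[str, BBox]]]) -> List[Element]:
    """Turn per-picture OCR output into page-numbered elements.

    `ocr_results` stands in for the output of a real OCR engine (hypothetical).
    Page numbers follow the order of `pictures`, one page per picture.
    """
    elements: List[Element] = []
    for page, picture in enumerate(pictures, start=1):
        for text, bbox in ocr_results.get(picture, []):
            cleaned = " ".join(text.split())  # collapse stray whitespace
            if cleaned:                       # drop empty fragments
                elements.append(Element(cleaned, page, bbox))
    return elements


# Hypothetical sample data: two pictures, one of them with an empty fragment.
pictures = ["scan_001.png", "scan_002.png"]
ocr_results = {
    "scan_001.png": [("Deep  learning ", (40, 30, 220, 28)), ("   ", (0, 0, 5, 5))],
    "scan_002.png": [("OCR recognition", (42, 31, 240, 28))],
}
elements = build_pages(pictures, ocr_results)
```

Keeping the page number and coordinates on every element is what lets the later stages map a text match back to a position in the original picture.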
In the second stage, the keywords are retrieved by using the deep learning network, which comprises the following steps:
Step 1: The deep learning network recognizes and accurately memorizes the characters in the doc or docx document.
Step 2: Determine the keywords to be retrieved, verify them, and input them into the deep learning network model.
Step 3: Locate the specific keywords through the trained deep learning network, mark them, and record the keyword coordinates and document page numbers.
Furthermore, each keyword corresponds to at least one coordinate; each recorded document page number contains at least one keyword coordinate; and each keyword is located on at least one document page.
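The second stage can be illustrated with a short Python sketch. The patent performs the search with a trained deep learning network; the sketch below deliberately swaps in a plain case-insensitive substring match so that it stays self-contained, and the `Element`, `Hit`, and `find_keyword` names are assumptions made for illustration only.

```python
from typing import List, NamedTuple, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, width, height), assumed layout


class Element(NamedTuple):
    text: str
    page: int
    bbox: BBox


class Hit(NamedTuple):
    keyword: str
    page: int
    bbox: BBox


def find_keyword(elements: List[Element], keyword: str) -> List[Hit]:
    """Return one hit (page number plus coordinates) per matching element.

    Substring matching is a stand-in for the deep learning network
    described in the text.
    """
    needle = keyword.strip().casefold()
    if not needle:  # verify the keyword before searching
        return []
    return [Hit(keyword, e.page, e.bbox)
            for e in elements
            if needle in e.text.casefold()]


# Hypothetical recognized elements from the first stage.
elements = [
    Element("Deep learning basics", 1, (40, 30, 300, 28)),
    Element("OCR recognition", 2, (42, 31, 240, 28)),
    Element("Training a deep learning network", 2, (42, 90, 420, 28)),
]
hits = find_keyword(elements, "deep learning")
```

Each hit carries both a page number and a coordinate, matching the requirement that every located keyword has at least one coordinate on at least one page.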
In the third stage, the original picture is located by using the coordinates and the page number, which comprises the following steps:
Step 1: Identify the keyword coordinates and the document page number, and locate the original picture from the document page number through the deep learning network.
Step 2: Locate the position of the keyword in the original picture through the keyword coordinates.
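The third stage amounts to two lookups, sketched below in Python. The sketch assumes that page numbers were assigned in picture order (one page per picture) and that coordinates are stored as `(left, top, width, height)`; the `Hit`, `Location`, and `locate` names are hypothetical, and a simple dictionary lookup replaces the deep learning network mentioned in the text.

```python
from typing import Dict, NamedTuple, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, width, height), assumed layout


class Hit(NamedTuple):
    page: int
    bbox: BBox


class Location(NamedTuple):
    picture: str
    corners: Tuple[int, int, int, int]  # (left, top, right, bottom)


def locate(hit: Hit, page_to_picture: Dict[int, str]) -> Location:
    """Map a hit back to its original picture and a rectangle inside it."""
    picture = page_to_picture[hit.page]      # step 1: page number -> picture
    left, top, width, height = hit.bbox      # step 2: coordinates -> rectangle
    return Location(picture, (left, top, left + width, top + height))


# Hypothetical picture set; page n corresponds to the n-th picture.
pictures = ["scan_001.png", "scan_002.png"]
page_to_picture = {page: name for page, name in enumerate(pictures, start=1)}
loc = locate(Hit(page=2, bbox=(42, 90, 420, 28)), page_to_picture)
```

A viewer could use `loc.picture` to open the right image and `loc.corners` to draw a highlight rectangle around the keyword.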
To implement the method, the invention also discloses a keyword retrieval system for pictures containing text, which comprises three modules:
An OCR recognition module: recognizes the element content on each page of the picture set or non-editable PDF document to be processed by using OCR (optical character recognition) technology, and arranges the content in order.
A keyword retrieval module: retrieves the required keywords from the text document through a deep learning network.
A keyword positioning module: locates the keywords in the pictures and in the content of the non-editable PDF document by means of positioning coordinates and page numbers.
The OCR recognition module performs OCR character recognition and mainly comprises the following sub-modules:
An element acquisition module: reads the elements of the picture set to be recognized in order, corrects characters and formatting, and removes interfering elements.
The picture set may take the form of jpg or png pictures, a set of PDF document pictures, and the like.
A coordinate marking module: marks and records the coordinates of each element in the picture.
A document generation module: generates a doc or docx document, marks page numbers on it, and converts the pictures into document pages in the order of the picture set.
Furthermore, each document page number corresponds to exactly one picture.
The keyword retrieval module retrieves keywords by using the deep learning network and comprises the following sub-modules:
A deep learning training module: the deep learning network recognizes and accurately memorizes the characters in the doc or docx document.
A checking module: determines the keywords to be retrieved, verifies them, and inputs them into the deep learning network model.
A keyword marking module: locates the specific keywords through the trained deep learning network, marks them, and records the keyword coordinates and document page numbers.
Furthermore, each keyword corresponds to at least one coordinate; each recorded document page number contains at least one keyword coordinate; and each keyword is located on at least one document page.
The keyword positioning module locates the original picture by using the coordinates and the page number and comprises the following sub-modules:
A page number positioning module: identifies the keyword coordinates and the document page number, and locates the original picture from the document page number through the deep learning network.
A word coordinate positioning module: locates the position of the keyword in the original picture through the keyword coordinates.
The invention discloses a method and a system for retrieving keywords from pictures that contain text. Through three stages (text recognition, text retrieval, and target-word positioning), the keywords a user needs, together with the page numbers on which they appear, can be retrieved from non-editable PDF documents and pictures. This solves the problem that users cannot find specific pages and specific content by keyword in large sets of pictures and non-editable PDF documents, and makes it easy to locate a required page by keyword while learning outdoors.
Drawings
Fig. 1 shows a flowchart of the keyword retrieval method for pictures containing text according to the present invention.
Fig. 2 shows a flowchart of the keyword retrieval method for pictures containing text according to the present invention.
Fig. 3 shows a flowchart of the keyword retrieval system for pictures containing text according to the present invention.
Detailed Description
With the spread of reading software, more and more e-books have come into public view. Some e-books are delivered to the application terminal by the software back end in formats such as txt, HTML, and HLP, and can be edited, copied, pasted, annotated, and so on through the reading software. However, if a paper text is scanned and stored as a PDF file, its text content, in particular content scanned as pictures, cannot be searched by such readers, and the pages and content containing a given keyword cannot be retrieved from the PDF document.
Similarly, when students study from PPT slides in class, or when users browse news on microblog websites, some information exists only as pictures containing text, and the end user can only save the pictures to a local folder. With a small number of such pictures, finding the desired content takes little effort. With tens, hundreds, or thousands of pictures, however, finding the target picture in a huge picture set is the problem that the invention aims to solve.
As shown in Fig. 1, in the method and system for retrieving keywords from pictures containing text, a plurality of pictures are taken with the camera of a mobile terminal or other device, or a non-editable PDF file is provided; the text elements in the pictures are recognized by OCR character recognition technology and a deep-learning-based system, and the keywords required by the user are retrieved.
In the first stage, OCR (optical character recognition) technology is used to recognize the element content on each page of the picture set or non-editable PDF document to be processed, and the content is arranged in order. In the second stage, the required keywords are retrieved from the resulting text document through a deep learning network. In the third stage, the keywords are located in the pictures and in the content of the non-editable PDF document by means of positioning coordinates and page numbers.
Further, the OCR character recognition model is trained continuously on similar pictures through a deep learning network; it extracts element information such as characters, numbers, and letters from the pictures and saves it in a .doc or .docx document.
As shown in Fig. 2, in the first stage, OCR character recognition comprises the following steps:
S101: Read the elements of the picture set to be recognized in order, correct characters and formatting, and remove interfering elements.
The picture set may take the form of jpg or png pictures, a set of PDF document pictures, and the like.
S102: Mark and record the coordinates of each element in the picture.
S103: Generate a doc or docx document, mark page numbers on it, and convert the pictures into document pages in the order of the picture set.
Furthermore, each document page number corresponds to exactly one picture.
In the second stage, the keywords are retrieved by using the deep learning network, which comprises the following steps:
S201: The deep learning network recognizes and accurately memorizes the characters in the doc or docx document.
S202: Determine the keywords to be retrieved, verify them, and input them into the deep learning network model.
S203: Locate the specific keywords through the trained deep learning network, mark them, and record the keyword coordinates and document page numbers.
Furthermore, each keyword corresponds to at least one coordinate; each recorded document page number contains at least one keyword coordinate; and each keyword is located on at least one document page.
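The relationship between keywords, coordinates, and page numbers recorded in S203 can be pictured as a small per-page index. The Python sketch below is only an illustration: the `(page, bbox)` tuple layout and the `group_by_page` helper are assumptions, not part of the patent.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, width, height), assumed layout


def group_by_page(hits: List[Tuple[int, BBox]]) -> Dict[int, List[BBox]]:
    """Group the recorded coordinates of one keyword by document page number."""
    pages: Dict[int, List[BBox]] = defaultdict(list)
    for page, bbox in hits:
        pages[page].append(bbox)
    # Sort by page number so results read in document order.
    return dict(sorted(pages.items()))


# Hypothetical recorded hits for a single keyword: (page number, coordinates).
hits = [(3, (10, 20, 80, 16)), (1, (40, 30, 90, 16)), (3, (10, 200, 80, 16))]
by_page = group_by_page(hits)
```

Because a page only appears in the index once a hit has been recorded on it, every listed page holds at least one coordinate, and a keyword with any hit is located on at least one page.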
In the third stage, the original picture is located by using the coordinates and the page number, which comprises the following steps:
S301: Identify the keyword coordinates and the document page number, and locate the original picture from the document page number through the deep learning network.
S302: Locate the position of the keyword in the original picture through the keyword coordinates.
To implement the method, the invention also discloses a keyword retrieval system for pictures containing text, which comprises three modules:
An OCR recognition module: recognizes the element content on each page of the picture set or non-editable PDF document to be processed by using OCR (optical character recognition) technology, and arranges the content in order.
A keyword retrieval module: retrieves the required keywords from the text document through a deep learning network.
A keyword positioning module: locates the keywords in the pictures and in the content of the non-editable PDF document by means of positioning coordinates and page numbers.
The OCR recognition module performs OCR character recognition and mainly comprises the following sub-modules:
An element acquisition module: reads the elements of the picture set to be recognized in order, corrects characters and formatting, and removes interfering elements.
The picture set may take the form of jpg or png pictures, a set of PDF document pictures, and the like.
A coordinate marking module: marks and records the coordinates of each element in the picture.
A document generation module: generates a doc or docx document, marks page numbers on it, and converts the pictures into document pages in the order of the picture set.
Furthermore, each document page number corresponds to exactly one picture.
The keyword retrieval module retrieves keywords by using the deep learning network and comprises the following sub-modules:
A deep learning training module: the deep learning network recognizes and accurately memorizes the characters in the doc or docx document.
A checking module: determines the keywords to be retrieved, verifies them, and inputs them into the deep learning network model.
A keyword marking module: locates the specific keywords through the trained deep learning network, marks them, and records the keyword coordinates and document page numbers.
Furthermore, each keyword corresponds to at least one coordinate; each recorded document page number contains at least one keyword coordinate; and each keyword is located on at least one document page.
The keyword positioning module locates the original picture by using the coordinates and the page number and comprises the following sub-modules:
A page number positioning module: identifies the keyword coordinates and the document page number, and locates the original picture from the document page number through the deep learning network.
A word coordinate positioning module: locates the position of the keyword in the original picture through the keyword coordinates.
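The way the three modules fit together can be sketched as a small Python pipeline. This is a structural illustration only: the class names mirror the module names in the text but are otherwise hypothetical, the OCR engine is assumed to have already produced the `recognized` text, and a substring match stands in for the deep learning network.

```python
from typing import List, Tuple

BBox = Tuple[int, int, int, int]  # (left, top, width, height), assumed layout
Record = Tuple[str, int, BBox]    # (text, page number, coordinates)


class OcrRecognitionModule:
    """Assigns page numbers to pre-recognized text (the OCR engine is assumed)."""

    def run(self, recognized: List[List[Tuple[str, BBox]]]) -> List[Record]:
        return [(text, page, bbox)
                for page, items in enumerate(recognized, start=1)
                for text, bbox in items]


class KeywordRetrievalModule:
    """Finds records containing the keyword (substring match, not a network)."""

    def run(self, records: List[Record], keyword: str) -> List[Record]:
        return [r for r in records if keyword in r[0]]


class KeywordPositioningModule:
    """Maps each match back to its source picture and coordinates."""

    def run(self, matches: List[Record],
            pictures: List[str]) -> List[Tuple[str, BBox]]:
        return [(pictures[page - 1], bbox) for _, page, bbox in matches]


class KeywordRetrievalSystem:
    """Chains the three modules in the order described in the text."""

    def __init__(self) -> None:
        self.ocr = OcrRecognitionModule()
        self.retrieval = KeywordRetrievalModule()
        self.positioning = KeywordPositioningModule()

    def search(self, pictures: List[str],
               recognized: List[List[Tuple[str, BBox]]],
               keyword: str) -> List[Tuple[str, BBox]]:
        records = self.ocr.run(recognized)
        matches = self.retrieval.run(records, keyword)
        return self.positioning.run(matches, pictures)


# Hypothetical input: one list of recognized elements per picture, in order.
pictures = ["p1.jpg", "p2.jpg"]
recognized = [[("table of contents", (10, 10, 200, 20))],
              [("keyword retrieval", (15, 40, 210, 20))]]
result = KeywordRetrievalSystem().search(pictures, recognized, "keyword")
```

Keeping each module behind a single `run` method means the substring matcher could later be replaced by a trained model without touching the other two modules.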
The embodiments described above represent only several embodiments of the present invention. Although they are described specifically and in detail, they are not to be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of protection of the present invention. Therefore, the scope of protection of this patent shall be subject to the appended claims.

Claims (8)

1. A method for retrieving keywords from pictures containing text, comprising the following steps:
S1: recognizing the element content on each page of the picture set or non-editable PDF document to be processed by using OCR (optical character recognition) technology, and arranging the content in order;
S2: retrieving the required keywords from the text document through a deep learning network;
S3: locating the keywords in the pictures and in the content of the non-editable PDF document by means of positioning coordinates and page numbers.
2. The method as claimed in claim 1, wherein step S1 comprises the following sub-steps:
S101: reading the elements of the picture set to be recognized in order, correcting characters and formatting, and removing interfering elements, wherein the picture set takes the form of jpg or png pictures, a set of PDF document pictures, and the like;
S102: marking and recording the coordinates of each element in the picture;
S103: generating a doc or docx document, marking page numbers on it, and converting the pictures into document pages in the order of the picture set; furthermore, each document page number corresponds to exactly one picture.
3. The method as claimed in claim 1, wherein step S2 comprises the following sub-steps:
S201: recognizing and accurately memorizing, by the deep learning network, the characters in the doc or docx document;
S202: determining the keywords to be retrieved, verifying them, and inputting them into the deep learning network model;
S203: locating the specific keywords through the trained deep learning network, marking them, and recording the keyword coordinates and document page numbers; furthermore, each keyword corresponds to at least one coordinate; each recorded document page number contains at least one keyword coordinate; and each keyword is located on at least one document page.
4. The method as claimed in claim 1, wherein step S3 comprises the following sub-steps:
S301: identifying the keyword coordinates and the document page number, and locating the original picture from the document page number through the deep learning network;
S302: locating the position of the keyword in the original picture through the keyword coordinates.
5. A keyword retrieval system for pictures containing text, the system comprising:
an OCR recognition module, which recognizes the element content on each page of the picture set or non-editable PDF document to be processed by using OCR (optical character recognition) technology, and arranges the content in order;
a keyword retrieval module, which retrieves the required keywords from the text document through a deep learning network;
a keyword positioning module, which locates the keywords in the pictures and in the content of the non-editable PDF document by means of positioning coordinates and page numbers.
6. The system of claim 5, wherein the OCR recognition module comprises:
an element acquisition module, which reads the elements of the picture set to be recognized in order, corrects characters and formatting, and removes interfering elements, wherein the picture set takes the form of jpg or png pictures, a set of PDF document pictures, and the like;
a coordinate marking module, which marks and records the coordinates of each element in the picture;
a document generation module, which generates a doc or docx document, marks page numbers on it, and converts the pictures into document pages in the order of the picture set.
7. The system of claim 5, wherein the keyword retrieval module comprises the following modules:
a deep learning training module, in which the deep learning network recognizes and accurately memorizes the characters in the doc or docx document;
a checking module, which determines the keywords to be retrieved, verifies them, and inputs them into the deep learning network model;
a keyword marking module, which locates the specific keywords through the trained deep learning network, marks them, and records the keyword coordinates and document page numbers; furthermore, each keyword corresponds to at least one coordinate; each recorded document page number contains at least one keyword coordinate; and each keyword is located on at least one document page.
8. The system of claim 5, wherein the keyword positioning module comprises the following modules:
a page number positioning module, which identifies the keyword coordinates and the document page number, and locates the original picture from the document page number through the deep learning network;
a word coordinate positioning module, which locates the position of the keyword in the original picture through the keyword coordinates.
CN202011029418.7A 2020-09-27 2020-09-27 Method and system for searching key words containing character pictures Pending CN114328804A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011029418.7A CN114328804A (en) 2020-09-27 2020-09-27 Method and system for searching key words containing character pictures

Publications (1)

Publication Number Publication Date
CN114328804A true CN114328804A (en) 2022-04-12

Family

ID=81011918

Country Status (1)

Country Link
CN (1) CN114328804A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117131301A (en) * 2023-10-24 2023-11-28 苏州阿基米德网络科技有限公司 Webpage end browsing method of medical equipment document
CN117131301B (en) * 2023-10-24 2024-01-05 苏州阿基米德网络科技有限公司 Webpage end browsing method of medical equipment document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination