CN112446297B - Electronic vision aid and intelligent mobile phone text auxiliary reading method applicable to same - Google Patents

Info

Publication number
CN112446297B
Authority
CN
China
Prior art keywords
text
image
recognition
frame
vision aid
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011198818.0A
Other languages
Chinese (zh)
Other versions
CN112446297A (en)
Inventor
郑雅羽
王豪
张子涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT
Priority to CN202011198818.0A
Publication of CN112446297A
Application granted
Publication of CN112446297B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/16 Sound input; Sound output
    • G06F 3/167 Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/418 Document matching, e.g. of document images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Abstract

The invention relates to an electronic vision aid and a smartphone text-assisted reading method suitable for it. When no gesture is recognized, a frame of image is captured and stored, and the text in the image is detected and recognized. If it is the first frame of image, the recognized text is output to the voice playing module for playback; otherwise it is checked for similarity against the recognized text of the previous frame, and based on the result either the new content is broadcast or the user is asked to readjust the page. The program is closed or iterated depending on whether closing information is received. With the invention, people with low vision can read text smoothly through the electronic vision aid in the mobile phone usage scenario; the text-reading experience of visually impaired users on smartphones is greatly improved; the problem of reading text in applications that do not support the barrier-free mode is alleviated to a certain extent; and text can be recognized and broadcast at the user's own choice, freeing visually impaired users from tedious operation and improving both the reading experience and reading efficiency.

Description

Electronic vision aid and intelligent mobile phone text auxiliary reading method applicable to same
Technical Field
The present invention relates to digital data processing, in particular to the scanning, transmission or reproduction of documents and the like, such as facsimile transmission, and parts thereof; more specifically, it relates to an electronic vision aid and a smartphone text-assisted reading method suitable for it.
Background
With the development of image processing technology and the popularization of intelligent terminal devices, the era of the Internet of Everything has arrived at full speed, and the smartphone, as the most important device entrance, has become an indispensable part of daily life: mobile payment, health codes, navigation, online shopping and smart-home control all rely on it. Even among visually impaired users, a large proportion use a smartphone every day and enjoy the convenience brought by this transformation. Device manufacturers have also taken blind users into account, and the barrier-free (accessibility) mode makes the graphics-oriented smartphone much easier for visually impaired users to operate. However, the barrier-free mode at the present stage still has limitations, which create obstacles for visually impaired people using smartphones.
At present there are at least 17 million visually impaired people in China, most of whom use smartphones in daily life while facing considerable inconvenience. The limitations of the barrier-free mode on smartphones show mainly in the following:
(1) Operation is complex: unlike the single-finger operation of a sighted person, a visually impaired user has to alternate between single-finger and two-finger gestures;
(2) Pop-up advertisements cannot be identified, so the visually impaired user cannot be guided to close them; in some cases even sighted users cannot close an advertisement, which makes the experience of visually impaired users poor;
(3) The scope of text recognition is small: most software providers do not optimize specifically for the barrier-free mode, so much text cannot be recognized by screen-reading software;
(4) Voice playback has no sequential order, which is a great obstacle to reading a text completely and smoothly.
Patent application No. 202010132964.7, an auxiliary control method for a blind computer and a smartphone adapted to it, adds a judgment of whether the smartphone can use barrier-free screen reading and takes the position indicated by a gesture as the text recognition range. However, using gestures to direct the blind computer to recognize and broadcast the characters at the gesture position cannot solve the problem of incoherent text reading, and a visually impaired user still cannot read a text completely.
Patent application No. 201811129893.4 provides a text conversion method and system for assisting blind reading, adding a process that splices the texts in the blind user's view and finally delivers them to voice output, which helps smooth reading. However, this method still cannot solve the problem of reading coherence: the pieces of text in the blind user's view may be unrelated to one another, so coherent text content cannot be guaranteed.
Disclosure of Invention
The invention addresses the problems in the prior art and provides an optimized electronic vision aid and a smartphone text-assisted reading method suitable for it. It focuses on the smooth reading of text, which visually impaired people encounter most, and is of positive significance for improving their user experience and sense of gain in daily life.
To achieve the purpose of the invention, the technical solution adopted is a smartphone text-assisted reading method suitable for an electronic vision aid, comprising the following steps:
step 1: starting the electronic vision aid and initializing the first-frame identifier;
step 2: detecting gestures with a trained video gesture recognition network; when no gesture is recognized, entering step 3, otherwise repeating step 2;
step 3: capturing a frame of image and storing it locally, preprocessing the image, extracting text features with a trained classifier to obtain the text detection and recognition result, and saving the result to a file;
step 4: judging from the first-frame identifier whether the text is the recognized text of the first frame of image; if so, outputting it to the voice playing module for playback and executing step 8; otherwise taking it as the recognized text of the current frame and executing step 5;
step 5: performing similar-text recognition between the recognized text of the current frame and that of the previous frame; if they are not similar, executing step 6, otherwise executing step 7;
step 6: discarding the current frame and the recognized text containing its detection and recognition result, prompting the user by voice to slide up slightly, and returning to step 2;
step 7: removing the similar text content contained in the recognized text content of the current frame, and delivering the final file to the voice playing module;
step 8: if closing information for the electronic vision aid is received, closing the program; otherwise, when the current playback finishes, prompting the user by voice to continue sliding the page and returning to step 2.
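The overall control flow of steps 1 to 8 can be summarized in the following minimal Python sketch; the helper functions (detect_gesture, capture_frame, recognize_text, is_similar, strip_overlap, speak, shutdown_requested) are hypothetical placeholders for the modules described below, not names used by the invention:

    def reading_loop():
        prev_text = None                      # step 1: first-frame identifier
        while True:
            if detect_gesture():              # step 2: user is still operating
                continue
            frame = capture_frame()           # step 3: snapshot and OCR
            text = recognize_text(frame)
            if prev_text is None:             # step 4: first frame, play directly
                speak(text)
            elif not is_similar(prev_text, text):
                speak("please slide up slightly")   # step 6: gap between frames
                continue                      # discard this frame, keep previous text
            else:
                speak(strip_overlap(prev_text, text))  # step 7: drop the overlap
            prev_text = text
            if shutdown_requested():          # step 8: close or iterate
                break
            speak("playback finished, please continue sliding the page")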
Preferably, in step 1, the first-frame identifier records whether a frame of image has already been stored: for the first frame the text is recognized and played directly, otherwise a similarity judgment is performed first.
Preferably, step 2 comprises the following steps:
step 2.1: extracting features from the video shot by the camera using the optical flow method, detecting moving objects, and separating them from the background by comparing motion changes between adjacent frames to obtain an optical flow information data stream;
step 2.2: obtaining video key frames using a clustering-based key frame extraction method;
step 2.3: inputting the optical flow data stream obtained in step 2.1 and the key-frame image data stream obtained in step 2.2 into the trained gesture recognition network and detecting whether a gesture appears, so as to judge whether the user is selecting the phone's barrier-free function or reading text with the electronic vision aid; entering step 3 if no gesture is recognized, otherwise repeating step 2.
Preferably, step 3 comprises the following steps:
step 3.1: capturing a frame of image and storing it locally, and cropping the upper and lower boundaries of the image;
step 3.2: sequentially applying histogram equalization, median filtering and foreground isolated-point removal to the cropped image to complete the image preprocessing;
step 3.3: inputting the preprocessed image into a trained image feature extraction network, and extracting and fusing features of different dimensions of the image to obtain a feature map;
step 3.4: inputting the feature map into a convolution layer, generating text prediction results based on text boxes from small to large, and solving the problem of separating adjacent texts with the PSENet progressive scale expansion method to obtain the predicted text detection result of the input picture;
step 3.5: for the predicted text detection result, removing irrelevant text recognition boxes by a minimum rectangular area;
step 3.6: passing the input picture through the encoding network and the decoding network to obtain the final recognition result, and saving the result to a file.
Preferably, in step 3.1, the upper and lower boundaries of the image are each inset by 1/20 of the original length.
Preferably, step 3.4 comprises the following steps:
step 3.4.1: the feature map passes through a convolution layer to obtain n text prediction results of different sizes, from small to large: W_1, W_2, …, W_n;
step 3.4.2: the smallest prediction result W_1 is divided into different text areas by line or text block;
step 3.4.3: using a breadth-first algorithm, W_1 is expanded pixel by pixel to W_2, the newly obtained W_2 is expanded to W_3, and so on up to W_n, giving the final text detection result.
Preferably, step 5 comprises the following steps:
step 5.1: performing word segmentation on the recognized content of the previous frame and of the current frame respectively, dividing each character string of a certain length into several parts to obtain a set of feature items, and removing from each segmentation the words and punctuation marks whose frequency of use exceeds a threshold;
step 5.2: transforming each obtained feature item into a signature value with a Hash algorithm to obtain a 128-bit Hash digit string;
step 5.3: calculating the weight of each digit string in the text representation vector with the TF-IDF similarity algorithm, then applying the weights to the signature data generated in step 5.2 and accumulating, obtaining a 128-bit Simhash value before dimension reduction;
step 5.4: reducing the Simhash value obtained in step 5.3 to a binary number consisting of 0s and 1s to obtain the Simhash signature;
step 5.5: comparing the Hamming distance of the two text signatures to obtain the similarity of the two texts; if the similarity exceeds a set similarity threshold, the texts are similar.
Preferably, in step 5.3, the weight of any feature item is calculated as follows:
step 5.3.1: calculating the TF weight TF_N of the feature item, representing the frequency with which the feature item appears in the text: TF_N = n/|l|, where l represents the set of all feature items, N is any feature item in l, and n is the number of times feature item N appears;
step 5.3.2: calculating the IDF weight, representing the frequency with which the feature item appears in the two texts: IDF_N = log(2/(n+1));
step 5.3.3: the weight of the feature item is then w_N = TF_N × IDF_N.
Preferably, step 7 comprises the following steps:
step 7.1: dividing the current text into several word sets in units of paragraphs;
step 7.2: forming a mapping relation between the divided word sets and the text of the previous frame;
step 7.3: intercepting the paragraphs whose mapping proportion is greater than a threshold and storing them in a file;
step 7.4: deleting the text of the similar paragraphs from the current frame's recognized content with an existing command, and delivering the final text to the voice playing module.
The electronic vision aid comprises a vision aid body fitted with one or more cameras; a key function area is arranged on the outside of the vision aid, and a controller cooperating with the cameras and the key function area is arranged inside the vision aid:
the controller includes:
the image preprocessing unit is used for preprocessing the image acquired by the camera;
the storage unit is used for caching the acquired images and storing the content after character recognition into a file;
the gesture recognition unit is used for detecting a user operation gesture;
an OCR unit for character detection and character recognition;
the text similarity detection unit is used for detecting the similarity between the recognized text of the current frame and that of the previous frame;
and the voice playing unit is used for converting the processed final recognized text content into voice for playback.
The invention provides an optimized electronic vision aid and a smartphone text-assisted reading method suitable for it. Gestures are detected with a trained video gesture recognition network. When no gesture is recognized, a frame of image is captured and stored locally, the image is preprocessed, text features are extracted with a trained classifier to obtain the text detection and recognition result, and the result is saved to a file. From the first-frame identifier it is judged whether the text is the recognized text of the first frame of image: if so, it is output to the voice playing module for playback; otherwise it is taken as the recognized text of the current frame and checked for similarity against the recognized text of the previous frame. If they are not similar, the current frame and its recognized text are discarded and the user is prompted to slide up slightly; otherwise the similar text content contained in the current frame's recognized content is removed and the final file is delivered to the voice playing module. The program is closed or repeated based on whether closing information for the electronic vision aid is received.
The beneficial effects of the invention are:
(1) Gestures are recognized with a machine learning method, so detection and recognition are fast, recognition accuracy is high in complex usage environments, most gestures for operating a smartphone can be recognized, and the adaptation to different phones is high;
(2) A caching operation is provided, so the captured pictures and the recognized text content can be stored and the temporal correlation between texts is preserved;
(3) Even when the smartphone does not support the barrier-free mode, the text can be detected and analyzed, and characters irrelevant to the text, such as the phone's status bar, comments and comment boxes, are removed, giving high recognition efficiency and accuracy;
(4) Correlation is provided, namely similarity detection between temporally correlated texts: the current text is analyzed for content correlation with the immediately preceding text; if they are similar, the repeated content is deleted, otherwise it is judged that adjacent text contents do not connect because the user slid too far, and an upward adjustment is fed back until the contexts connect. A visually impaired user therefore neither listens to the same content repeatedly nor reads an incomplete text whose middle section is lost; the continuity of text reading is improved, and the problems of disordered reading, incoherent content and excessive useless information that visually impaired users meet on smartphones are solved.
The invention combines text recognition with text similarity detection to operate on multi-frame text images, so that people with low vision can read text smoothly through the electronic vision aid in the mobile phone usage scenario. The text-reading experience of visually impaired users on smartphones is greatly improved; the problem of reading text in applications that do not support the barrier-free mode is alleviated to a certain extent; and text can be recognized and broadcast at the user's own choice even when the barrier-free mode is supported, freeing visually impaired users from tedious operation and improving both the reading experience and reading efficiency.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a schematic view of the vision aid of the present invention in use.
Detailed Description
The present invention will be described in further detail with reference to examples, but the scope of the present invention is not limited thereto.
The invention relates to a smartphone text-assisted reading method suitable for an electronic vision aid, which recognizes in real time the gestures of a user currently operating a smartphone and judges from them whether the user is operating the phone or waiting for the text content to be broadcast by voice. Text detection and recognition are performed on the captured images by machine learning and image processing, and the recognized content is handed to the subsequent text similarity detection module. If the contents are similar, the repeated text content of the current and the immediately preceding text is deleted; if they are dissimilar, adjacent text contents do not connect because the user slid too far, and an upward adjustment is fed back until the contexts connect. Finally the text content is played by the voice playing module, so that the text reading of a visually impaired user on the smartphone is more coherent.
The method comprises the following steps.
Step 1: the electronic vision aid is started, and the first frame identifier is initialized.
In step 1, the first-frame identifier records whether a frame of image is currently stored: for the first frame the text is recognized and played directly, otherwise a similarity judgment is performed first.
In the invention, the first-frame identifier marks whether an image was captured before this moment (i.e. whether a frame of image is currently stored), so as to confirm whether the current frame is the first frame. If it is the first frame, the identifier is 0, indicating that there is no earlier content that could be repeated, and the text can be recognized and played directly; if it is not the first frame, the identifier is 1, indicating that a text may already have been recognized before, so a similarity judgment is needed.
In the present invention, the identifier may be manually reset to 0 when the user reads a new text.
Step 2: detecting the gesture with the trained video gesture recognition network; entering step 3 when no gesture is recognized, otherwise repeating step 2.
The step 2 comprises the following steps:
step 2.1: extracting features from the RGB video shot by the camera using the optical flow method, detecting moving objects, and separating them from the background by comparing motion changes between adjacent frames to obtain an optical flow information data stream;
step 2.2: obtaining video key frames using a clustering-based key frame extraction method;
step 2.3: inputting the optical flow data stream obtained in step 2.1, i.e. the data stream obtained after separating object and background, together with the RGB image data stream of the key frames obtained in step 2.2, into the trained gesture recognition network and detecting whether a gesture appears, so as to judge whether the user is selecting the phone's barrier-free function or reading text with the electronic vision aid; entering step 3 if no gesture is recognized, otherwise repeating step 2.
In view of the operating principle of the electronic vision aid and the coherent operation scenario of the smartphone, the invention adopts video gesture recognition, which improves the real-time performance of recognition: when no gesture is detected, the user's current operation has finished and a snapshot can be taken; if a gesture is detected, the user is still operating, so the judgment must be repeated continuously. The relevant gestures include, but are not limited to, touching the volume or power keys, two-handed operation of a hand-held smartphone, and one-handed operation of a smartphone lying on a desktop.
In the present invention, the clustering-based key frame extraction method of step 2.2 generally presets a threshold, here 0.9.
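As an illustration of the motion-extraction front end of step 2, the following sketch separates moving regions from the background with OpenCV's dense Farneback optical flow. Only this front end is shown; the gesture recognition network and the clustering-based key-frame extraction are omitted, and the motion threshold of 2.0 is an assumed value, not one given by the invention.

    import cv2

    cap = cv2.VideoCapture(0)                  # camera of the vision aid
    ok, prev = cap.read()
    if not ok:
        raise SystemExit("no camera available")
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # dense optical flow between adjacent frames; moving objects (hands)
        # produce large flow magnitudes and can be separated from the background
        flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag, _ang = cv2.cartToPolar(flow[..., 0], flow[..., 1])
        moving_mask = mag > 2.0                # assumed motion threshold
        # moving_mask, together with the key frames, would feed the gesture network
        prev_gray = gray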
Step 3: capturing a frame of image and storing it locally, preprocessing the image, extracting text features with the trained classifier to obtain the text detection and recognition result, and saving the result to a file.
The step 3 comprises the following steps:
step 3.1: capturing a frame of image and storing it locally, and cropping the upper and lower boundaries of the image;
in step 3.1, the upper and lower boundaries of the image are each inset by 1/20 of the original length.
step 3.2: sequentially applying histogram equalization, median filtering and foreground isolated-point removal to the cropped image to complete the image preprocessing;
step 3.3: inputting the preprocessed image into the trained image feature extraction network, and extracting and fusing features of different dimensions of the image to obtain a feature map;
step 3.4: inputting the feature map into a convolution layer, generating text prediction results based on text boxes from small to large, and solving the problem of separating adjacent texts with the PSENet progressive scale expansion method to obtain the predicted text detection result of the input picture;
the step 3.4 comprises the following steps:
step 3.4.1: the feature map passes through a convolution layer to obtain n text prediction results of different sizes, from small to large: W_1, W_2, …, W_n;
step 3.4.2: the smallest prediction result W_1 is divided into different text areas by line or text block;
step 3.4.3: using a breadth-first algorithm, W_1 is expanded pixel by pixel to W_2, the newly obtained W_2 is expanded to W_3, and so on up to W_n, giving the final text detection result.
step 3.5: for the predicted text detection result, removing irrelevant text recognition boxes by a minimum rectangular area;
step 3.6: passing the input picture through the encoding network and the decoding network to obtain the final recognition result, and saving the result to a file.
In the invention, in step 3.1, images are generally stored with the timestamp as the file name, which makes it easy to confirm the time order. Cropping the upper and lower boundaries cuts away the status bar of the phone image and any comment box that may appear at the very bottom, so that the status bar and comment box are not recognized and broadcast repeatedly, which also facilitates the subsequent text operations.
In the invention, in step 3.2, the images collected by the camera may contain noise and weak contrast because of the illumination intensity, temperature and similar conditions of the scene. Histogram equalization gives the image better contrast and richer features, and median filtering removes useless information while keeping the edge information intact; after these operations the image has been enhanced and denoised.
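The preprocessing of steps 3.1 and 3.2 can be sketched with OpenCV as follows; using a morphological opening for the foreground isolated-point removal is an assumption, since the invention does not name the exact operator:

    import cv2

    def preprocess(img_bgr):
        h = img_bgr.shape[0]
        cropped = img_bgr[h // 20 : h - h // 20]    # step 3.1: inset top and bottom by 1/20
        gray = cv2.cvtColor(cropped, cv2.COLOR_BGR2GRAY)
        equalized = cv2.equalizeHist(gray)          # better contrast, richer features
        filtered = cv2.medianBlur(equalized, 3)     # denoise while keeping edges
        kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (3, 3))
        # remove isolated foreground points (assumed operator: morphological opening)
        return cv2.morphologyEx(filtered, cv2.MORPH_OPEN, kernel)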
In the invention, in step 3.4, the text boxes from small to large refer to the sizes of the detected text boxes, and n is greater than 0.
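The progressive scale expansion of steps 3.4.1 to 3.4.3 (the PSENet post-processing) can be sketched as follows, assuming the n prediction results are given as binary numpy masks ordered from the smallest kernel W_1 to the largest W_n; scipy's connected-component labelling stands in for the initial region split:

    from collections import deque
    import numpy as np
    from scipy import ndimage

    def progressive_scale_expansion(kernels):
        # kernels[0] is the smallest prediction W_1, kernels[-1] the largest W_n
        labels, _ = ndimage.label(kernels[0])   # step 3.4.2: split W_1 into text regions
        for bigger in kernels[1:]:              # step 3.4.3: expand into each larger map
            queue = deque(zip(*np.nonzero(labels)))
            while queue:                        # breadth-first, pixel-by-pixel growth
                y, x = queue.popleft()
                for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                    ny, nx = y + dy, x + dx
                    if (0 <= ny < labels.shape[0] and 0 <= nx < labels.shape[1]
                            and labels[ny, nx] == 0 and bigger[ny, nx]):
                        labels[ny, nx] = labels[y, x]   # first label to arrive wins
                        queue.append((ny, nx))
        return labels                           # final text detection regions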
In the invention, in step 3.5, the minimum rectangular area refers to the minimum size of the text boxes that can be included in text recognition for the usage scenario; text recognition boxes whose length and width fall below this minimum are irrelevant data and can be removed. Generally, the minimum-area rectangular box is 1/20 of the image length and 3/14 of the original image width, which filters out parts irrelevant to the text content, such as comments and related recommendations.
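A small sketch of the step 3.5 filter under the ratios just stated; reading "1/20 of the image length" as a minimum box height and "3/14 of the original width" as a minimum box width is an assumption about which dimension each ratio applies to:

    def filter_text_boxes(boxes, img_w, img_h):
        # boxes: list of (x, y, w, h); boxes below the minimum rectangle are irrelevant
        min_h = img_h / 20.0            # 1/20 of the image length (assumed: height)
        min_w = img_w * 3.0 / 14.0      # 3/14 of the original image width
        return [(x, y, w, h) for (x, y, w, h) in boxes if w >= min_w and h >= min_h]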
In the invention, everything before step 3.6 is the text detection part, which frames the areas containing text as rectangular boxes; step 3.6 then recognizes the content inside the rectangular boxes.
Step 4: judging from the first-frame identifier whether the text is the recognized text of the first frame of image; if so, outputting it to the voice playing module for playback and executing step 8; otherwise taking it as the recognized text of the current frame and executing step 5.
Step 5: performing similar-text recognition between the recognized text of the current frame and that of the previous frame; if they are not similar, executing step 6, otherwise executing step 7.
Said step 5 comprises the following steps:
step 5.1: performing word segmentation on the recognized content of the previous frame and of the current frame respectively, dividing each character string of a certain length into several parts to obtain a set of feature items, and removing from each segmentation the words and punctuation marks whose frequency of use exceeds a threshold;
step 5.2: transforming each obtained feature item into a signature value with a Hash algorithm to obtain a 128-bit Hash digit string;
step 5.3: calculating the weight of each digit string in the text representation vector with the TF-IDF similarity algorithm, then applying the weights to the signature data generated in step 5.2 and accumulating, obtaining a 128-bit Simhash value before dimension reduction;
in step 5.3, the weight of any feature item is calculated as follows:
step 5.3.1: calculating the TF weight TF_N of the feature item, representing the frequency with which the feature item appears in the text: TF_N = n/|l|, where l represents the set of all feature items, N is any feature item in l, and n is the number of times feature item N appears;
step 5.3.2: calculating the IDF weight, representing the frequency with which the feature item appears in the two texts: IDF_N = log(2/(n+1));
step 5.3.3: the weight of the feature item is then w_N = TF_N × IDF_N.
step 5.4: reducing the Simhash value obtained in step 5.3 to a binary number consisting of 0s and 1s to obtain the Simhash signature;
step 5.5: comparing the Hamming distance of the two text signatures to obtain the similarity of the two texts; if the similarity exceeds a set similarity threshold, the texts are similar.
In the invention, in step 5.1, a parallel segmentation method is used for word segmentation, so documents are segmented relatively quickly. Words and punctuation marks whose frequency of use exceeds a threshold are removed from each segmentation: words such as "we", "in" or "indeed" are used so often in texts that they would interfere with text similarity detection, so words frequently used in daily life must be removed. The threshold here can be set by a person skilled in the art.
In the invention, the overall process of step 5 is: compute the hash value of each word, e.g. {1, 0, …, 1}; convert every 0 to -1 and leave the 1s unchanged, giving {1, -1, …, 1}, so that the word is converted into a 128-bit numeric string; weight the different sequence values and add them, merging them into a numeric string such as {1, -2, 3, -4, 5, …, -128}; then reduce the dimension by sign, setting positive positions to 1 and non-positive positions to 0.
In the invention, n in step 5.5 is greater than or equal to 0.
In the present invention, in step 5.5, the discrimination threshold of the Hamming distance is generally set to 3.
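The Simhash comparison of steps 5.1 to 5.5 can be sketched self-contained as follows; MD5 serves as the 128-bit hash and the IDF factor is omitted, both simplifying assumptions, and a whitespace split stands in for the word segmentation:

    import hashlib
    from collections import Counter

    def simhash(words):
        tf = Counter(words)
        v = [0.0] * 128
        for word, count in tf.items():
            weight = count / len(words)          # TF weight of the feature item
            # 128-bit digit string of the feature item (step 5.2)
            h = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
            for i in range(128):                 # 1 stays 1, 0 becomes -1, then weight and add
                v[i] += weight if (h >> i) & 1 else -weight
        # dimension reduction by sign: positive -> 1, non-positive -> 0 (step 5.4)
        return sum(1 << i for i in range(128) if v[i] > 0)

    def hamming_distance(a, b):
        return bin(a ^ b).count("1")

    prev_sig = simhash("the previously recognized screen of text".split())
    curr_sig = simhash("the previously recognized screen of text and more".split())
    similar = hamming_distance(prev_sig, curr_sig) <= 3   # threshold of 3, as above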
Step 6: discarding the current frame and the recognized text containing its detection and recognition result, prompting the user by voice to slide up slightly, and returning to step 2.
In the invention, after the similarity detection of step 5, dissimilarity indicates that the user has probably slid the smartphone too fast, so that a section of text was missed between the two texts and the context is discontinuous. Step 6 therefore prompts the user to slide up; after a slight upward adjustment the two texts will certainly overlap to some extent, that is, a page jump is ruled out.
Step 7: removing the similar text content contained in the recognized text content of the current frame, and delivering the final file to the voice playing module.
The step 7 comprises the following steps:
step 7.1: dividing the current text into several word sets in units of paragraphs;
step 7.2: forming a mapping relation between the divided word sets and the text of the previous frame;
step 7.3: intercepting the paragraphs whose mapping proportion is greater than a threshold and storing them in a file;
step 7.4: deleting the text of the similar paragraphs from the current frame's recognized content with an existing command, and delivering the final text to the voice playing module.
In the invention, step 7.2 uses MD5 to map the paragraphs to be queried and the comparison text.
In the invention, the set of words having a mapping to the text is IN = {in_1, in_2, in_3, …, in_k}, and the set of words without a mapping is NIN = {nin_1, nin_2, nin_3, …, nin_k}, where in_k and nin_k denote the positions of the k-th word of the ordered word set in the ordered mapping set. In step 7.3, the mapping proportion is calculated as r = |IN| / (|IN| + |NIN|): the numerator accumulates the number of all mapped words, and the whole expression represents the proportion of words having a mapping relation among all words. Typically, the threshold of r is 0.8.
In the present invention, the existing command in step 7.4 refers to the comm command of the Linux system.
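The paragraph-level deduplication of step 7 can be sketched as follows; plain set membership replaces the MD5 mapping and the comm command (an assumption made for self-containment), while the 0.8 threshold for the mapping proportion r follows the description above:

    def remove_similar_paragraphs(curr_text, prev_text, threshold=0.8):
        prev_words = set(prev_text.split())
        kept = []
        for paragraph in curr_text.split("\n"):        # step 7.1: paragraph word sets
            words = paragraph.split()
            if not words:
                continue
            mapped = sum(1 for w in words if w in prev_words)  # step 7.2: mapping relation
            r = mapped / len(words)                    # proportion of mapped words
            if r <= threshold:                         # steps 7.3/7.4: drop similar paragraphs
                kept.append(paragraph)
        return "\n".join(kept)                         # final text for the voice module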
Step 8: if closing information for the electronic vision aid is received, closing the program; otherwise, when the current playback finishes, prompting the user by voice to continue sliding the page and returning to step 2.
The invention also relates to an electronic vision aid adopting the above smartphone text-assisted reading method. The electronic vision aid comprises a vision aid 1 body fitted with one or more cameras 2; a key function area (not shown in the figure; of conventional key structure) is arranged on the outside of the vision aid 1, and a controller cooperating with the cameras 2 and the key function area is arranged inside the vision aid 1:
the controller includes:
an image preprocessing unit for preprocessing the image acquired by the camera 2;
the storage unit is used for caching the acquired images and storing the content after character recognition into a file;
the gesture recognition unit is used for detecting a user operation gesture;
an OCR unit for character detection and character recognition;
the text similarity detection unit is used for detecting the similarity between the recognized text of the current frame and that of the previous frame;
and the voice playing unit is used for converting the processed final recognized text content into voice for playback.
In the invention, in the smartphone usage scenario, the electronic vision aid 1 achieves low power consumption by reducing the resolution of the captured image to 720p while still guaranteeing a clear shot; this prolongs the battery life of the electronic vision aid 1, reduces the amount of video stream data the system has to process, and improves real-time performance.
In the invention, the existing electronic vision aid 1 may be equipped with several cameras 2, for example a front camera and a rear camera 2; in a non-motion scene only one camera 2 is used to collect text images, for example the rear camera 2.
In the invention, the electronic vision aid 1 should be placed upright on a desktop at a suitable distance from the smartphone 3. The scope of application is suitable for, but not limited to, the smartphone 3; text in paper documents can also be recognized with the electronic vision aid 1.
In the invention, as shown in FIG. 2, a schematic diagram of the vision aid 1 in use, the camera 2 shoots the smartphone 3 to obtain the viewfinder frame 4 of the camera 2, and assisted reading is performed after the minimum-area rectangular box 5 filters out the parts of the smartphone 3 irrelevant to the text content, such as comments and related recommendations.
The method can recognize the user's gesture operations in real time using image processing and machine learning. Detection is fast and applicable to various complex environments, recognition accuracy is high for smartphones of different models, the pre-trained recognition network can recognize the various gestures used to operate a smartphone, and robustness is strong.
In the invention, the text detection module and the text similarity detection module are used in combination: when the smartphone does not support the barrier-free mode, text content detection and text similarity detection are performed separately and the text is broadcast through the voice playing module. Text detection removes irrelevant non-text content, and the processed recognized text content is handed to the similarity detection module, which guarantees the accuracy of the text content. Then, through similarity analysis, if the contents are similar, the repeated content of the previously recognized text contained in the currently recognized text is removed; if they are dissimilar, adjacent text contents are prevented from failing to connect because the user slid too far, and an upward adjustment is fed back until the contexts connect. In this way the visually impaired user can read the text accurately, repeated listening to similar text content and the resulting logical confusion of the voice playback after text recognition are avoided, the continuity of the text read by the visually impaired user is improved, and the problems of logically confused reading, incoherent content and excessive useless information mixed into the text content that visually impaired users meet when using a smartphone are solved.

Claims (7)

1. A smartphone text-assisted reading method for an electronic vision aid, characterized in that the method comprises the following steps:
step 1: starting the electronic vision aid and initializing the first-frame identifier;
step 2: detecting gestures with a trained video gesture recognition network; when no gesture is recognized, entering step 3, otherwise repeating step 2;
step 3: comprising the following steps:
step 3.1: capturing a frame of image and storing it locally, and cropping the upper and lower boundaries of the image;
step 3.2: sequentially applying histogram equalization, median filtering and foreground isolated-point removal to the cropped image to complete the image preprocessing;
step 3.3: inputting the preprocessed image into a trained image feature extraction network, and extracting and fusing features of different dimensions of the image to obtain a feature map;
step 3.4: inputting the feature map into a convolution layer, generating text prediction results based on text boxes from small to large, and solving the problem of separating adjacent texts with the PSENet progressive scale expansion method to obtain the predicted text detection result of the input picture, comprising the following steps:
step 3.4.1: the feature map passes through a convolution layer to obtain n text prediction results of different sizes, from small to large: W_1, W_2, …, W_n;
step 3.4.2: the smallest prediction result W_1 is divided into different text areas by line or text block;
step 3.4.3: using a breadth-first algorithm, W_1 is expanded pixel by pixel to W_2, the newly obtained W_2 is expanded to W_3, and so on up to W_n, giving the final text detection result;
step 3.5: for the predicted text detection result, removing irrelevant text recognition boxes by a minimum rectangular area;
step 3.6: passing the input picture through an encoding network and a decoding network to obtain the final recognition result, and saving the result to a file;
step 4: judging from the first-frame identifier whether the text is the recognized text of the first frame of image; if so, outputting it to the voice playing module for playback and executing step 8; otherwise taking it as the recognized text of the current frame and executing step 5;
step 5: performing similar-text recognition between the recognized text of the current frame and that of the previous frame; if they are not similar, executing step 6, otherwise executing step 7;
step 6: discarding the current frame and the recognized text containing its detection and recognition result, prompting the user by voice to slide up slightly, and returning to step 2;
step 7: comprising the following steps:
step 7.1: dividing the current text into several word sets in units of paragraphs;
step 7.2: forming a mapping relation between the divided word sets and the text of the previous frame;
step 7.3: intercepting the paragraphs whose mapping proportion is greater than a threshold and storing them in a file;
step 7.4: deleting the text of the similar paragraphs from the current frame's recognized content with an existing command, and delivering the final text to the voice playing module;
step 8: if closing information for the electronic vision aid is received, closing the program; otherwise, when the current playback finishes, prompting the user by voice to continue sliding the page and returning to step 2.
2. The smartphone text-assisted reading method for an electronic vision aid according to claim 1, characterized in that: in step 1, the first-frame identifier records whether a frame of image has already been stored: for the first frame the text is recognized and played directly, otherwise a similarity judgment is performed first.
3. The smartphone text-assisted reading method for an electronic vision aid according to claim 1, characterized in that step 2 comprises the following steps:
step 2.1: extracting features from the video shot by the camera using the optical flow method, detecting moving objects, and separating them from the background by comparing motion changes between adjacent frames to obtain an optical flow information data stream;
step 2.2: obtaining video key frames using a clustering-based key frame extraction method;
step 2.3: inputting the optical flow data stream obtained in step 2.1 and the key-frame image data stream obtained in step 2.2 into the trained gesture recognition network and detecting whether a gesture appears, so as to judge whether the user is selecting the phone's barrier-free function or reading text with the electronic vision aid; entering step 3 if no gesture is recognized, otherwise repeating step 2.
4. The smartphone text-assisted reading method for an electronic vision aid according to claim 1, characterized in that: in step 3.1, the upper and lower boundaries of the image are each inset by 1/20 of the original length.
5. The smartphone text-assisted reading method for an electronic vision aid according to claim 1, characterized in that step 5 comprises the following steps:
step 5.1: performing word segmentation on the recognized content of the previous frame and of the current frame respectively, dividing each character string of a certain length into several parts to obtain a set of feature items, and removing from each segmentation the words and punctuation marks whose frequency of use exceeds a threshold;
step 5.2: transforming each obtained feature item into a signature value with a Hash algorithm to obtain a 128-bit Hash digit string;
step 5.3: calculating the weight of each digit string in the text representation vector with the TF-IDF similarity algorithm, then applying the weights to the signature data generated in step 5.2 and accumulating, obtaining a 128-bit Simhash value before dimension reduction;
step 5.4: reducing the Simhash value obtained in step 5.3 to a binary number consisting of 0s and 1s to obtain the Simhash signature;
step 5.5: comparing the Hamming distance of the two text signatures to obtain the similarity of the two texts; if the similarity exceeds a set similarity threshold, the texts are similar.
6. The smartphone text-assisted reading method for an electronic vision aid according to claim 5, characterized in that in step 5.3 the weight of any feature item is calculated as follows:
step 5.3.1: calculating the TF weight TF_N of the feature item, representing the frequency with which the feature item appears in the text: TF_N = n/|l|, where l represents the set of all feature items, N is any feature item in l, and n is the number of times feature item N appears;
step 5.3.2: calculating the IDF weight, representing the frequency with which the feature item appears in the two texts: IDF_N = log(2/(n+1));
step 5.3.3: the weight of the feature item is then w_N = TF_N × IDF_N.
7. An electronic vision aid adopting the smartphone text-assisted reading method for an electronic vision aid according to any one of claims 1-6, characterized in that: the electronic vision aid comprises a vision aid body fitted with one or more cameras; a key function area is arranged on the outside of the vision aid, and a controller cooperating with the cameras and the key function area is arranged inside the vision aid:
the controller includes:
the image preprocessing unit is used for preprocessing the image acquired by the camera;
the storage unit is used for caching the acquired images and storing the content after character recognition into a file;
the gesture recognition unit is used for detecting a user operation gesture;
an OCR unit for character detection and character recognition;
the text similarity detection unit is used for detecting the similarity between the recognized text of the current frame and that of the previous frame;
and the voice playing unit is used for converting the processed final recognized text content into voice for playback.
CN202011198818.0A 2020-10-31 2020-10-31 Electronic vision aid and intelligent mobile phone text auxiliary reading method applicable to same Active CN112446297B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011198818.0A CN112446297B (en) 2020-10-31 2020-10-31 Electronic vision aid and intelligent mobile phone text auxiliary reading method applicable to same

Publications (2)

Publication Number Publication Date
CN112446297A CN112446297A (en) 2021-03-05
CN112446297B (en) 2024-03-26

Family

ID=74735532

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011198818.0A Active CN112446297B (en) 2020-10-31 2020-10-31 Electronic vision aid and intelligent mobile phone text auxiliary reading method applicable to same

Country Status (1)

Country Link
CN (1) CN112446297B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113011503B (en) * 2021-03-17 2021-11-23 彭黎文 Data evidence obtaining method of electronic equipment, storage medium and terminal

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102509479A (en) * 2011-10-08 2012-06-20 沈沾俊 Portable character recognition voice reader and method for reading characters
CN108628825A (en) * 2018-04-10 2018-10-09 平安科技(深圳)有限公司 Text message Similarity Match Method, device, computer equipment and storage medium
CN109377834A (en) * 2018-09-27 2019-02-22 成都快眼科技有限公司 A kind of text conversion method and system of helping blind people read
CN110162750A (en) * 2019-01-24 2019-08-23 腾讯科技(深圳)有限公司 Text similarity detection method, electronic equipment and computer readable storage medium
CN111399638A (en) * 2020-02-29 2020-07-10 浙江工业大学 Blind computer and intelligent mobile phone auxiliary control method adapted to same

Also Published As

Publication number Publication date
CN112446297A (en) 2021-03-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant