WO2020233378A1 - Layout analysis method, reading aid device, circuit and medium - Google Patents

Layout analysis method, reading aid device, circuit and medium

Info

Publication number
WO2020233378A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
text data
scan line
paragraph
image
Prior art date
Application number
PCT/CN2020/087877
Other languages
English (en)
French (fr)
Inventor
蔡海蛟
冯歆鹏
周骥
Original Assignee
上海肇观电子科技有限公司
Priority date
Filing date
Publication date
Application filed by 上海肇观电子科技有限公司
Publication of WO2020233378A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16 Sound input; Sound output
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/158 Segmentation of character regions using character size, text spacings or pitch estimation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/416 Extracting the logical structure, e.g. chapters, sections or page numbers; Identifying elements of the document, e.g. authors
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Definitions

  • The present disclosure relates to the field of data processing, and in particular to a layout analysis method, a reading aid device, an electronic device, and corresponding chip circuits and computer-readable storage media.
  • Layout analysis techniques in the related art mainly rely on the image data of the text or its semantic information, and use technologies such as image processing, clustering algorithms, or semantic analysis algorithms to divide the text in an image into multiple paragraphs. Such techniques usually involve complex algorithms and a large amount of calculation.
  • A layout analysis method is provided, including: obtaining coordinate information of one or more text lines in an image; generating a layout model corresponding to the image by setting text data in areas of a data structure corresponding to the coordinate information of the one or more text lines, the text data including data indicating that text exists; and scanning the generated layout model to read the text data in the layout model and, based on the relative positional relationships of the read text data within the layout model, dividing the layout model into paragraphs.
  • a chip circuit including: a circuit unit configured to perform the steps of the method described in the present disclosure.
  • A reading aid device is provided, including: a sensor configured to acquire the image; and the chip circuit described above, the chip circuit further comprising a circuit unit configured to perform text recognition on the image to obtain text data, and a circuit unit configured to convert the text data, paragraph by paragraph, into sound data according to the result of the paragraph division.
  • the reading aid device further includes a sound output device configured to output the sound data.
  • An electronic device is provided, including: a processor; and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to execute the method described in the present disclosure.
  • A computer-readable storage medium storing a program is provided, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to execute the methods described in the present disclosure.
  • FIG. 1 is a flowchart showing a layout analysis method according to an exemplary embodiment of the present disclosure
  • FIG. 2 is a schematic diagram showing an example of an image containing text lines and its corresponding layout model according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a flowchart illustrating an exemplary method of obtaining coordinate information of a text line according to an exemplary embodiment of the present disclosure
  • FIG. 4 is a flowchart illustrating an exemplary method of generating a layout model according to an exemplary embodiment of the present disclosure
  • FIG. 5 is a schematic diagram showing an example of an area corresponding to coordinate information of a text line in the data structure of a layout model according to an exemplary embodiment of the present disclosure
  • FIG. 6 is a flowchart illustrating an exemplary method of scanning a layout model for paragraph division according to an exemplary embodiment of the present disclosure
  • FIG. 7 is a schematic diagram showing an example of an exemplary layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure
  • FIG. 8 is a schematic diagram illustrating the calculation of the overlap ratio of two text data sequences according to an exemplary embodiment of the present disclosure
  • FIG. 9(a) and FIG. 9(b) are schematic diagrams showing examples of exemplary layout models for illustrating paragraph division according to an exemplary embodiment of the present disclosure
  • FIG. 10 is a schematic diagram for illustrating paragraph coordinate information update processing according to an exemplary embodiment of the present disclosure
  • FIG. 11 is a structural block diagram showing a reading aid device according to an exemplary embodiment of the present disclosure.
  • FIG. 12 is a structural block diagram showing an exemplary computing device that can be applied to an exemplary embodiment.
  • The use of terms such as "first" and "second" to describe various elements is not intended to limit the positional, temporal, or importance relationships of these elements; such terms are used only to distinguish one element from another.
  • The first element and the second element may refer to the same instance of the element and, in some cases depending on the context, may also refer to different instances.
  • In the image, "horizontal" refers to the direction substantially parallel to the image edge along which a text line extends (for example, with an included angle of less than 45 degrees), and "vertical" refers to the direction of the other image edge, perpendicular to "horizontal".
  • In the layout model, "horizontal" refers to the row direction of the data structure of the layout model, which corresponds to the "horizontal" direction of the image, and "vertical" refers to the column direction of the data structure, which corresponds to the "vertical" direction of the image.
  • The following description of the present disclosure is mainly based on the case where text lines extend substantially left to right relative to the reader (i.e., horizontal reading), but the technical solution of the present disclosure is not limited to this; it is also applicable to the case where text lines extend substantially up and down relative to the reader (i.e., vertical reading).
  • In the vertical reading case, the algorithm of the present disclosure still applies; the horizontal direction in the present disclosure may then mean a substantially up-and-down direction, and the vertical direction a substantially left-and-right direction.
  • Reading materials such as books or magazines usually have a certain typesetting; for example, their content is divided into different paragraphs (including, e.g., upper and lower sections and left and right columns).
  • People capture images of their field of view through vision, recognize the paragraphs in those images through the brain, and read the text in the paragraphs.
  • When a machine is used to "read" such materials, it must not only recognize the words in the image but also divide those words into paragraphs through some algorithm, so as to "read" the text in the correct paragraph order.
  • paragraph division refers to dividing text in an image or text data in a layout model into different paragraphs.
  • the upper and lower paragraph divisions can also be called subsections, and the left and right paragraph divisions can also be called column divisions.
  • The present disclosure provides a paragraph division method that avoids performing complex image processing directly on text images and requires no semantic analysis. Instead, it converts an image containing text into a layout model that simulates the text distribution in the image but has a simpler structure; the data in the layout model need not carry semantic content, only simple values indicating where text exists, and the positions of these data are then analyzed to divide paragraphs.
  • FIG. 1 is a flowchart showing a layout analysis method according to an exemplary embodiment of the present disclosure.
  • the layout analysis method may include the following steps: obtaining coordinate information of a text line (step S101), generating a layout model (step S103), and scanning the layout model for paragraph division (step S105).
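The three steps above can be sketched in simplified form as follows (a non-authoritative Python sketch; the function names, the stubbed text-line boxes, and the model dimensions are assumptions for illustration, not taken from the disclosure):

```python
# Hypothetical end-to-end sketch of steps S101/S103/S105 (names illustrative).

def get_text_line_boxes(image):
    # Step S101: would normally call an OCR engine and group characters into
    # lines; stubbed here with fixed boxes given as (x, y, width, height).
    return [(0, 0, 40, 2), (0, 3, 40, 2), (50, 0, 30, 2)]

def build_layout_model(boxes, width, height):
    # Step S103: set text data (1) in the region covered by each box,
    # leaving blank data (0) everywhere else.
    model = [[0] * width for _ in range(height)]
    for x, y, w, h in boxes:
        for row in range(y, min(y + h, height)):
            for col in range(x, min(x + w, width)):
                model[row][col] = 1
    return model

def analyze_layout(image, width=80, height=6):
    boxes = get_text_line_boxes(image)
    model = build_layout_model(boxes, width, height)
    # Step S105 (scanning and paragraph division) is elaborated later in the
    # description; here we simply return the constructed model.
    return model

model = analyze_layout(image=None)
```

The returned `model` plays the role of the layout model 203 in FIG. 2: a grid of 0/1 data whose layout mimics the text-line layout of the image.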
  • In step S101, the coordinate information of one or more text lines in the image is obtained.
  • Since the exemplary method of the present disclosure mainly uses the coordinate information of the text, rather than the original text image itself, for layout analysis, this step obtains the coordinate information of text lines from the image for subsequent processing.
  • the image may be electronic image data acquired by an image sensor.
  • the image sensor may be arranged on the user's wearable device or glasses, etc., so that the image may be an image of the layout of a reading (for example, a book or a magazine, etc.) held by the user captured by the image sensor.
  • the image may include text (which may include text, numbers, characters, punctuation marks, etc. of various countries and regions), pictures, and the like.
  • the image may be a pre-processed image, and the pre-processing may include, but is not limited to, tilt correction, blur removal, and the like, for example.
  • the image may be stored in a storage device or storage medium after being acquired by an image sensor, and read out for processing.
  • A text line refers to a continuous line of text, which may be, for example, a sequence of characters whose horizontal spacing between adjacent characters is less than a threshold spacing.
  • The adjacent-character spacing may be, for example, the distance between like reference points of adjacent characters, such as the distance, along the direction of the text line, between their upper-left corner coordinates, lower-right corner coordinates, or centroid coordinates.
  • If the spacing between adjacent characters is less than the threshold spacing, the characters can be considered continuous and are divided into the same text line; if the spacing is greater than the threshold spacing, the characters can be considered not continuous (for example, they may belong to a left and a right column, respectively) and are divided into different text lines.
  • The coordinate information of a text line may be the coordinate information of a rectangle containing the text line (for example, the smallest rectangle containing the text line, or the rectangle obtained by expanding that smallest rectangle upward, downward, leftward, and/or rightward by a certain multiple).
  • The coordinate information of a text line may include, for example, the coordinates of the four vertices of the rectangle, or the coordinates of any one vertex of the rectangle together with the height and length of the rectangle.
  • The definition of the coordinate information of a text line is not limited to this, as long as it can represent the spatial position and size occupied by the text line.
  • The coordinate information of text lines may be obtained from other machines (such as remote servers or cloud computing devices) or other applications (such as text recognition applications, e.g., optical character recognition (OCR)), or it may be obtained through text recognition processing in a local application.
  • FIG. 2 is a schematic diagram showing an example of an image containing text lines and its corresponding layout model according to an exemplary embodiment of the present disclosure, in which text lines TL1 to TL6 in the image 201 are shown, and dashed frames show the rectangles containing each text line in the image 201.
  • In step S103, a layout model corresponding to the image is generated by setting text data in the areas of a data structure that correspond to the coordinate information of the one or more text lines.
  • In this step, "text data" that are simpler than the text image itself are set in the areas corresponding to the text lines obtained in the previous step, so as to construct a layout model that simulates the text distribution in the image for subsequent processing.
  • The layout model referred to in the present disclosure is a model constructed to simulate the position distribution of text lines in an image, in which the data at each position has a correspondence or mapping relationship with the pixels at the corresponding position in the image.
  • The layout model is constructed by setting, at positions of the data structure corresponding to positions in the image, data representing the existence of text at those image positions.
  • the data structure may be a file in a memory (for example, a memory, a cache, etc.), or an image expressed by pixels, or a table or a data array.
  • the data structure is not limited to any specific data structure, as long as the data therein can simulate the text lines in the image.
  • the size of the data structure can be the same as the size of the image, or it can have a size scaled in proportion to the size of the image.
  • For example, for an image of 3840 x 2160 pixels, the data structure (and the corresponding layout model) can have the same size as the image (that is, 3840 x 2160 pixels or data items), can be scaled only in the horizontal direction (for example, 1920 x 2160 pixels or data items), can be scaled only in the vertical direction (for example, 3840 x 1080 pixels or data items), or can be scaled in both the horizontal and vertical directions (for example, 1920 x 1080, or 1280 x 1080, pixels or data items), and so on.
  • Whether the size of the data structure is the same as the image size or scaled in proportion to it, the data or pixels of the data structure can establish a correspondence or mapping relationship with the pixels of the image according to the locations of regions in the image.
  • The text data includes data indicating that text exists: it can indicate whether there is text in the area corresponding to the coordinate information, regardless of the semantics or content of the text.
  • blank data may be set in an area corresponding to the non-text area of the image in the data structure, and the blank data is data indicating that there is no text.
  • the text data may be "1", and the blank data may be "0", for example.
  • the text data is not limited to "0" and "1", and can also be any other data, as long as it can distinguish whether there are text or text lines in the area.
  • If the data structure is scaled relative to the image, the coordinates of the area corresponding to a text line's coordinate information in the data structure of the layout model are likewise scaled in proportion to the coordinates of the text line's area in the image.
  • If the size of the data structure of the layout model is smaller than the size of the image, multiple pixels in the image are mapped to one data item or pixel in the layout model according to a mapping rule. If a group of image pixels includes both pixels in a text line and pixels in a blank area, the mapping rule may specify that the group is mapped to text data, or it may specify that the group is mapped to blank data.
  • The mapping rule may specify, for example, that if the ratio of the number of text-line pixels to the number of blank-area pixels in the group is not less than a predetermined ratio, the group is mapped to text data, and otherwise to blank data.
  • The mapping rule may also specify, for example, that if N pixel rows are mapped to one data or pixel row in the layout model, one pixel row is extracted from every N pixel rows and mapped to one data or pixel row in the layout model.
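The ratio-based mapping rule described above can be sketched as follows (an illustrative Python sketch; the function names, the horizontal block grouping, and the default ratio of 1.0 are assumptions, not values from the disclosure):

```python
# Hypothetical sketch: map a block of N binary source pixels (1 = text-line
# pixel, 0 = blank-area pixel) to a single layout-model datum.

def map_block(block, ratio=1.0):
    """Return 1 (text data) if #text-pixels / #blank-pixels >= ratio."""
    text = sum(block)
    blank = len(block) - text
    if blank == 0:
        # All-text blocks trivially qualify; an empty block stays blank.
        return 1 if text > 0 else 0
    return 1 if text / blank >= ratio else 0

def downscale_row(pixel_row, factor, ratio=1.0):
    """Map every `factor` consecutive pixels of an image row to one datum."""
    return [map_block(pixel_row[i:i + factor], ratio)
            for i in range(0, len(pixel_row), factor)]

print(downscale_row([1, 0, 0, 0, 1, 1, 1, 0], factor=4))  # → [0, 1]
```

In the example, the first block has one text pixel against three blank pixels (ratio 1/3, below 1.0), so it becomes blank data; the second block has three against one, so it becomes text data.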
  • FIG. 2 shows the layout model 203 corresponding to the image 201. Text data ("1" in this example) are set in the regions R1 to R6 corresponding to the text lines TL1 to TL6, and blank data ("0" in this example) are set in the other, blank areas. The position layout of the text data in the layout model 203 closely simulates the position layout of the text lines in the image 201. In the layout model 203, the zoom ratio relative to the image 201 is such that one data row (pixel row) in the data structure corresponds exactly to one text line in the image 201.
  • FIG. 5 shows an example in which one character line in the image 501 is represented by two data lines (pixel lines) in the layout model 503.
  • the data structure in the layout model 203 can also use 5 or 10 data rows (pixel rows) to represent a text row.
  • In step S105, the generated layout model is scanned to read the text data in the layout model, and the layout model is divided into paragraphs based on the relative positional relationships of the read text data within the layout model.
  • the text data in the layout model is divided into paragraphs by scanning and reading the data in the layout model obtained in the previous step.
  • the scanning may be a scanning and reading of the data structure of the layout model data by data or pixel by pixel.
  • The scanning may be, for example, a progressive scan of the image or data array serving as the data structure.
  • a scan line may be, for example, a data line or a pixel line running through the layout model in the horizontal direction.
  • One text line may correspond to multiple scan lines, as shown in the two scan lines SL1 and SL2 corresponding to the text line TL1 in FIG. 5.
  • A scan line may also involve multiple different text lines separated in the horizontal direction, that is, it may contain multiple text data sequences.
  • In the example of FIG. 2, the scan line SL in the layout model 203 relates to the text lines TL1 and TL2, that is, it contains the corresponding text data sequences in the regions R1 and R2.
  • A text data sequence refers to a sequence of continuous text data (that is, with no blank data between text data), or a sequence of text data in which the number of blank data between adjacent text data is less than a threshold.
  • The threshold may be a few data items wide, for example 3 or 5 text data.
  • a typical text data sequence may be, for example, a continuous string of "1"s, as shown in FIGS. 2 and 5.
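The extraction of text data sequences from a scan line can be sketched as follows (an illustrative Python sketch; the function name, the interval representation, and the default gap threshold of 3 are assumptions):

```python
# Hypothetical sketch: split a scan line (list of 0/1 data) into text data
# sequences, where gaps of blank data narrower than `gap_threshold` do not
# break a sequence.

def text_data_sequences(scan_line, gap_threshold=3):
    """Return (start, end) index pairs (end exclusive) of the sequences."""
    sequences = []
    start = None       # start index of the sequence being built
    last_text = None   # index of the most recent text datum
    for i, d in enumerate(scan_line):
        if d == 1:
            if start is None:
                start = i
            elif i - last_text - 1 >= gap_threshold:
                # The blank gap is too wide: close the current sequence.
                sequences.append((start, last_text + 1))
                start = i
            last_text = i
    if start is not None:
        sequences.append((start, last_text + 1))
    return sequences

line = [1, 1, 0, 1, 0, 0, 0, 1, 1]
print(text_data_sequences(line, gap_threshold=3))  # → [(0, 4), (7, 9)]
```

Here the single blank at index 2 is absorbed into the first sequence, while the three blanks at indices 4 through 6 reach the threshold and separate the two sequences, analogous to STDS1 and STDS2 in FIG. 7.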
  • the value of the text data (and blank data) in the generated layout model can be read and the relative position relationship can be analyzed, so that the layout model can be divided into paragraphs based on this relative position relationship.
  • Because the layout model, which contains only simple data indicating whether a text line exists in a given area, is generated from the image containing text, the original image is simulated by a model carrying much less information. The layout analysis problem is thereby transformed from a complex image processing problem into a relatively simple position analysis problem, which can significantly reduce algorithmic complexity and the amount of calculation while essentially preserving layout analysis accuracy, reducing the computational load of layout analysis.
  • the size of the layout model is reduced relative to the image size, the amount of data to be processed can be further reduced, thereby further reducing the computational load when the computer analyzes layout problems.
  • FIG. 3 shows an exemplary embodiment of the process of obtaining text lines in step S101, FIG. 4 shows an exemplary embodiment of the process of generating a layout model in step S103, and FIG. 6 shows an exemplary embodiment of the process of scanning the layout model for paragraph division in step S105.
  • FIG. 3 is a flowchart showing an exemplary method for obtaining coordinate information of text lines according to an exemplary embodiment of the present disclosure, which can be used as an exemplary implementation of the aforementioned step S101; that is, step S101 may include the steps of the flowchart in FIG. 3.
  • In step S301, character recognition is performed on the image to obtain the coordinate information of each character.
  • Various character recognition technologies in related technologies such as optical character recognition (OCR) technology can be used in this step.
  • the coordinate information of the text may include, but is not limited to, the coordinates of the four vertices of a rectangle containing the text and/or the width and height information of the text, for example.
  • the text coordinates in any related technology can be used as the coordinate information here, as long as it can reflect the position of the text in the image and the area occupied by it.
  • Next, each character is processed in sequence based on its coordinate information to form text lines.
  • In step S303, it is determined whether the distance between the currently processed character and the previous character is less than the threshold distance. If it is not less than the threshold distance (step S303, "No"), it is determined that the current character belongs to a new text line (step S305); otherwise (step S303, "Yes"), the current character is divided into the text line of the previous character (step S309).
  • The threshold distance may be determined according to application requirements (for example, language, character type, etc.); for example, it may be set to a specific multiple of the average character width (for example, 1.2 to 3.5 times), or to a certain multiple (for example, 1.5 to 3.5 times) of the average distance between adjacent characters within the same paragraph.
  • the method for determining the threshold distance is not limited to this, as long as it can be used to distinguish whether adjacent characters belong to the same paragraph.
  • In the example of FIG. 2, the threshold spacing is set to, for example, 2.5 times the average width of the characters. Since the spacing between the adjacent characters "e" and "a" in "be" and "as" in the text line TL1 is smaller than the threshold spacing, they are divided into the same text line TL1. Since the distance between the "r" in "your" at the end of the text line TL1 and the "A" at the beginning of the text line TL2 is greater than the threshold spacing, they are divided into the different text lines TL1 and TL2, respectively.
  • After dividing the current character into the previous text line or a new text line, it is determined in step S311 whether there is a next character in the image. If there is a next character (step S311, "Yes"), the next character is taken as the current character and division into text lines continues from step S303; if there is no next character (step S311, "No"), all characters in the image have been divided into text lines.
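The loop of steps S303 through S311 can be sketched as follows (an illustrative Python sketch; the tuple representation of characters, the function name, and the 2.5x width factor picked from the ranges above are assumptions):

```python
# Hypothetical sketch of steps S303-S311: characters, already in reading
# order, are grouped into text lines whenever the gap to the previous
# character stays below a threshold distance.

def group_into_lines(chars, avg_width, factor=2.5):
    """chars: list of (x, y, width) tuples in reading order."""
    threshold = factor * avg_width   # e.g. 2.5x the average character width
    lines = []
    for ch in chars:
        if lines:
            prev = lines[-1][-1]
            gap = ch[0] - (prev[0] + prev[2])   # gap to the previous character
            if 0 <= gap < threshold:
                lines[-1].append(ch)            # step S309: same text line
                continue
        lines.append([ch])                      # step S305: new text line
    return lines

# Two characters close together, then a large jump (e.g. into another column):
chars = [(0, 0, 10), (12, 0, 10), (60, 0, 10)]
lines = group_into_lines(chars, avg_width=10)
print(len(lines))  # → 2
```

With a threshold of 25 (2.5 x 10), the 2-unit gap keeps the first two characters in one line, while the 38-unit gap starts a new line, mirroring the TL1/TL2 example above.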
  • After it is determined in step S305 that the current character belongs to a new text line, the previous character is the last character of the previous text line; therefore, the coordinate information of the rectangle containing the previous text line can be determined in step S307 and used as the coordinate information of that text line.
  • Alternatively, step S307 may be omitted; after all characters in the image have been divided into their corresponding text lines through steps S305, S309, and S311, the coordinate information of the rectangle containing each text line is determined in step S313 and used as the coordinate information of that text line.
  • each side of the rectangle is parallel to each side of the image, that is, has a horizontal direction and a vertical direction.
  • A text line in the image may have a certain inclination with respect to the horizontal side of the rectangle containing it (which is parallel to the horizontal side of the image).
  • If the inclination exceeds a threshold inclination (for example, 20 degrees or 30 degrees), the image can be preprocessed to correct the inclination of the text, and the steps described in FIG. 3 and the subsequent processing can be performed on the inclination-corrected image.
  • Alternatively, the user may be prompted to adjust the posture in which the reading material is held, so as to reduce the inclination in the image.
  • FIG. 4 is a flowchart showing an exemplary method of generating a layout model according to an exemplary embodiment of the present disclosure, which can be used as an exemplary implementation of the aforementioned step S103; that is, step S103 may include the steps of the flowchart in FIG. 4.
  • the obtained character lines are processed one by one starting from the first character line in the image.
  • In step S401, the coordinate information of the current text line is read.
  • In step S403, the area corresponding to the coordinate information of the current text line is determined.
  • In step S405, text data is set in the area of the data structure corresponding to the coordinate information of the current text line.
  • In step S407, it is judged whether there is a next text line. If there is a next text line (step S407, "Yes"), the next text line is taken as the current text line and processing continues from step S401; if there is no next text line (step S407, "No"), all text lines in the image have been modeled into the layout model.
  • The area in the data structure corresponding to the coordinate information of the one or more text lines may include the area in the data structure determined by the coordinate information of each text line.
  • Setting text data in the area corresponding to the coordinate information of each text line means setting text data in the rectangular data or pixel area at the corresponding position in the layout model.
  • text data “1” is set in the regions R1 to R6 determined based on the coordinate information of the text lines TL1 to TL6 to form the layout model 203 of the image 201.
  • Alternatively, the area corresponding to the coordinate information of the one or more text lines may include not only the area determined by each text line's coordinate information but also an area extending a specific distance from that area in the vertical direction (e.g., upward and/or downward).
  • Accordingly, generating the layout model may include a step of extending the area a specific distance from the text line's coordinate information in the vertical direction (for example, upward and/or downward).
  • The specific distance depends on the line spacing of adjacent text lines in the direction perpendicular to the text lines in the image (that is, the height of the blank space between an upper text line and the adjacent lower text line).
  • If extending in one direction only, the specific distance can, for example, cover all of the blank space between adjacent lines of the same paragraph, and may be, for example, 1 to 1.5 times the average line spacing of adjacent lines in the image. If extending both upward and downward at the same time, the specific distance can, for example, cover part of the blank space between adjacent lines, and may be, for example, 0.5 to 0.7 times the average line spacing, so that the expansions of the upper and lower text lines together cover the blank space between them.
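The downward extension described above can be sketched as follows (an illustrative Python sketch; the rectangle representation, the function name, and the choice of a one-sided extension of 1x the average line spacing are assumptions):

```python
# Hypothetical sketch: extend each text-line region downward so that the
# text data of vertically adjacent lines in the same paragraph merge in the
# layout model without intervening blank rows.

def extend_regions(boxes, avg_line_spacing, height):
    """boxes: (x, y, width, height) rectangles; returns extended copies."""
    extend = round(1.0 * avg_line_spacing)   # 0.5-0.7x if extending two-sided
    out = []
    for x, y, w, h in boxes:
        new_h = min(h + extend, height - y)  # clamp to the model's bottom edge
        out.append((x, y, w, new_h))
    return out

boxes = [(0, 0, 40, 2), (0, 4, 40, 2)]       # two blank rows between lines
print(extend_regions(boxes, avg_line_spacing=2, height=10))
# → [(0, 0, 40, 4), (0, 4, 40, 4)] — the first region now touches the second
```

This corresponds to the layout model 505 in FIG. 5, where region R1 includes the original area 513 plus the extension area 515 below it.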
  • FIG. 5 is a schematic diagram showing an example of an area corresponding to coordinate information of a character line in a data structure of a layout model according to an exemplary embodiment of the present disclosure.
  • TL1 and TL2 are two text lines in the image 501.
  • the area R1 corresponding to the coordinate information of the text line TL1 in the data structure of the layout model 503 includes the area 513 determined by the coordinate information of the text line TL1 in the data structure, and is similar for the text line TL2 and the area R2.
  • The area R1 corresponding to the coordinate information of the text line TL1 in the data structure of the layout model 505 includes not only the area 513 determined by the coordinate information of the text line TL1 but also the area 515 extending 2 pixel rows downward from that coordinate information; the same applies to the text line TL2 and the area R2.
  • In this way, the areas corresponding to the coordinate information of these text lines in the data structure of the layout model cover not only the text lines themselves but also the gaps between them, so that the text data of two vertically adjacent text lines of the same paragraph merge in the layout model without intervening blank data, which helps simplify the subsequent scanning algorithm.
  • However, this expansion is not necessary; blank data between vertically adjacent text lines in the same paragraph can also be handled in the subsequent scanning algorithm.
  • FIG. 6 is a flowchart showing an exemplary method of scanning a layout model for paragraph division according to an exemplary embodiment of the present disclosure, which can be used as an exemplary implementation of the aforementioned step S105; that is, step S105 may include the steps of the flowchart in FIG. 6.
  • FIG. 7 is a schematic diagram showing an example of an exemplary layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure, in which the text data in the layout model 701 is divided into paragraphs. In FIG. 7, "1" is used to represent text data, and possible blank data is omitted from the illustration.
  • In step S601, the current scan line is read, for example, the first scan line shown in FIG. 7.
  • In step S603, it is determined whether there is text data in the current scan line. If there is text data in the current scan line (step S603, "Yes"), the process goes to step S605; otherwise (step S603, "No"), it is determined whether there is a next scan line (step S613). If it is determined in step S613 that there is a next scan line (step S613, "Yes"), the next scan line is taken as the current scan line and the processing from step S601 continues. If it is determined in step S613 that there is no next scan line (step S613, "No"), the scanning of the layout model has ended. Since there are consecutive sequences of text data (i.e., "text data sequences") STDS1 and STDS2 in the first scan line in FIG. 7, it is determined that text data exists, and the process proceeds to step S605.
  • In step S605, for the current text data sequence in the current scan line (for example, the text data sequence STDS1 in the first scan line in FIG. 7), it is determined whether the adjacent previous scan line contains a text data sequence whose overlap rate with the current text data sequence in the horizontal direction is greater than a threshold overlap rate (rule (a)). If such a sequence exists (step S605, "Yes"), the text data sequence of the current scan line is divided into the paragraph to which that text data sequence of the adjacent previous scan line belongs (step S609); if not (step S605, "No"), it is determined that the text data sequence in the current scan line belongs to a new paragraph (step S607).
  • A "No" determination in step S605 means either that (1) there is no text data sequence at all in the adjacent previous scan line, or that (2) there is a text data sequence in the adjacent previous scan line but its overlap rate with the current text data sequence in the current scan line is not greater than the threshold overlap rate.
  • Situation (1) means that the adjacent previous scan line is a blank scan line without text data, so the current text data sequence in the current scan line is likely to represent the starting text data of a new paragraph. Situation (2) means that although there is text data in the adjacent previous scan line, that text data sequence overlaps little with the current text data sequence in the horizontal direction, so the current text data sequence probably does not belong to the paragraph of the text data sequence in the previous scan line and is likely to belong to a new paragraph (for example, another paragraph or another column).
  • The overlap of two text data sequences in the horizontal direction means that the projections of the two sequences onto the horizontal coordinate axis have a common part.
  • FIG. 8 illustrates the calculation of the overlap rate, showing the length of the overlap (for example, the number of text data items or the number of pixels), where L1 and L2 respectively represent the lengths of the two text data sequences (for example, the number of text data items or the number of pixels).
  • Although a concept and calculation method of the overlap rate are given here, it should be understood that the overlap rate is not limited to this; any measure may be used as long as it can express the degree of horizontal overlap of the two sequences.
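One plausible implementation of the overlap rate consistent with the description above divides the overlap length by the length of the shorter sequence; as the text notes, other measures expressing horizontal overlap are equally valid, so this exact formula is an assumption:

```python
def overlap_rate(seq_a, seq_b):
    """Horizontal overlap rate of two text data sequences.

    Each sequence is given as (start_x, end_x), its projection onto the
    horizontal axis. The rate here is the overlap length divided by the
    shorter sequence's length -- one plausible reading of Fig. 8; the
    disclosure allows any measure expressing horizontal overlap.
    """
    a1, a2 = seq_a
    b1, b2 = seq_b
    overlap = max(0, min(a2, b2) - max(a1, b1))  # common projection length
    shorter = min(a2 - a1, b2 - b1)
    return overlap / shorter if shorter > 0 else 0.0
```

Two identical spans yield a rate of 1; disjoint spans yield 0.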
  • the threshold overlap ratio can be arbitrarily predetermined according to specific application requirements. According to some embodiments, the threshold overlap ratio can be set to any value from 0.5 to 0.8, for example.
  • After the current text data sequence has been divided into the paragraph to which the text data sequence in the adjacent previous scan line belongs (step S609) or into a new paragraph (step S607), it is determined in step S611 whether there is a next text data sequence in the current scan line. If there is (step S611, "Yes"), the process returns to step S605 to process the next text data sequence in the current scan line. If there is not (step S611, "No"), all text data sequences in the current scan line have been processed, and step S613 is performed to determine whether there is a next scan line.
  • The processing then continues with the second scan line. Since the overlap rate between the text data sequence STDS3 in the second scan line and the text data sequence STDS1 in the adjacent previous scan line (the first scan line) is 1, which is greater than the threshold overlap rate (for example, 0.75), the text data sequence STDS3 is divided into the paragraph P1 to which STDS1 belongs. Similarly, the text data sequence STDS4 is divided into the paragraph P2 to which STDS2 belongs, and the text data sequence STDS5 is divided into the paragraph P1 to which STDS3 belongs.
  • For the text data sequence STDS7 in the sixth scan line, the adjacent previous scan line (the fifth scan line) contains no text data sequence, so the text data sequence STDS7 is divided into a new paragraph P4.
  • The text data sequence STDS8 in the seventh scan line is also divided into paragraph P4, because its overlap rate with the text data sequence STDS7 is greater than the threshold overlap rate.
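The scanning procedure of steps S601 through S613, applying rule (a) alone, can be sketched as follows. The layout model is simplified to a grid of 0/1 values, and all names, the sequence-extraction helper, and the 0.75 threshold are illustrative choices, not the disclosure's concrete implementation:

```python
def divide_paragraphs(layout, threshold=0.75):
    """Divide a layout model into paragraphs using rule (a) only.

    `layout` is a list of scan lines; each scan line is a list of 0/1 values
    (1 = text data). Returns, per scan line, the paragraph id of each text
    data sequence in that line.
    """
    def sequences(row):
        # (start, end) of each run of consecutive text data in a scan line
        runs, start = [], None
        for x, v in enumerate(row):
            if v and start is None:
                start = x
            elif not v and start is not None:
                runs.append((start, x))
                start = None
        if start is not None:
            runs.append((start, len(row)))
        return runs

    def overlap_rate(a, b):
        ov = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        shorter = min(a[1] - a[0], b[1] - b[0])
        return ov / shorter if shorter else 0.0

    next_pid = 0
    prev = []    # [(span, paragraph_id)] for the adjacent previous scan line
    result = []
    for row in layout:
        cur = []
        for span in sequences(row):
            pid = None
            for pspan, ppid in prev:
                if overlap_rate(span, pspan) > threshold:
                    pid = ppid          # rule (a): join the previous paragraph
                    break
            if pid is None:             # no qualifying overlap: new paragraph
                pid = next_pid
                next_pid += 1
            cur.append((span, pid))
        result.append([p for _, p in cur])
        prev = cur
    return result
```

On a miniature layout resembling FIG. 7 (two columns, a blank line, then a full-width line), the two columns become paragraphs 0 and 1 and the full-width line after the blank line starts paragraph 2.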
  • the text data may include data representing the height of a text line.
  • The data representing the height of a text line can be normalized based on a certain preset height (such as, but not limited to, a fraction of the average text height, for example one tenth) and can be rounded (for example, rounded to the nearest integer, rounded up, etc.).
  • For example, if the preset height is set to 1 pixel, the text data of a text line whose height is 10 pixels can be set to 10.
  • In this case, a judgment rule (rule (b)) can be added in step S605: if the difference between the value of the text data in the text data sequence in the current scan line and the value of the text data in the text data sequence in the adjacent previous scan line is greater than a threshold height difference, it is determined that the text data sequence of the current scan line belongs to a new paragraph.
  • In other words, in step S605 the condition "the difference between the value of the text data of the text data sequence in the current scan line and the value of the text data of the text data sequence in the adjacent previous scan line is not greater than the threshold height difference" can be used as a necessary condition for dividing the text data sequence of the current scan line into the paragraph to which the text data sequence in the adjacent previous scan line belongs.
  • The threshold height difference may be a preset number of pixels, such as 3 pixels or 5 pixels, or may be a ratio, for example a fraction of the height of the smaller of the two text lines being compared.
  • Additionally or alternatively, a judgment rule (rule (c)) can be added in step S605: if the overlap rates of the text data sequence in the current scan line with multiple text data sequences in the adjacent previous scan line in the horizontal direction are all greater than the threshold overlap rate, it is determined that the text data sequence in the current scan line belongs to a new paragraph.
  • FIG. 9(a) is a schematic diagram showing an example of an exemplary layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure, in which it is shown that the text data sequences STDS1 and STDS3 are divided into paragraph P1, and text The data sequences STDS2 and STDS4 are divided into paragraph P2.
  • Additionally or alternatively, a judgment rule (rule (d)) can be added in step S605: if the overlap rates of multiple text data sequences in the current scan line with the same text data sequence in the adjacent previous scan line in the horizontal direction are all greater than the threshold overlap rate, it is determined that the multiple text data sequences in the current scan line each belong to their own new paragraph.
  • FIG. 9(b) is a schematic diagram showing an example of an exemplary layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure, in which it is shown that the text data sequences STDS1 and STDS2 are divided into paragraph P1.
  • For the text data sequences STDS3 and STDS4 in the current scan line (that is, the third scan line), the adjacent previous scan line (that is, the second scan line) contains a text data sequence STDS2 whose overlap rate with each of them is greater than the threshold overlap rate, so under rule (a) alone they would be divided into the paragraph to which STDS2 belongs. However, considering the aforementioned rule (d), because the multiple text data sequences STDS3 and STDS4 in the third scan line each have a horizontal overlap rate with the same text data sequence STDS2 in the second scan line that is greater than the threshold overlap rate, rule (a) together with rule (d) divides the text data sequences STDS3 and STDS4 into new paragraphs P2 and P3, respectively.
  • With rule (c) and rule (d), if the typesetting form changes (for example, one scan line reflects non-column typesetting while an adjacent scan line reflects column-divided typesetting), the scan lines of different typesetting types can be regarded as belonging to different paragraphs.
  • Each of these rules, when they are used in combination, is a sufficient condition for dividing the current text data sequence into a new paragraph; that is, if any one rule holds, the current text data sequence is divided into a new paragraph. In other words, when these rules are used in combination, the current text data sequence is divided into the paragraph to which a text data sequence in the adjacent previous scan line belongs only when none of the combined rules holds.
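A sketch of how rules (a), (b), and (c) might be combined as sufficient conditions for starting a new paragraph is given below. Rule (d), which examines the whole current scan line rather than a single sequence, is omitted for brevity; the names, the tuple layout, and the overlap measure are assumptions, not the disclosure's concrete implementation:

```python
def starts_new_paragraph(cur_seq, prev_seqs, threshold=0.75, height_diff=3):
    """Apply rules (a)-(c): True if `cur_seq` begins a new paragraph.

    cur_seq is (start_x, end_x, height); prev_seqs lists the sequences of
    the adjacent previous scan line in the same form. Each rule alone is
    sufficient to start a new paragraph.
    """
    def rate(a, b):
        ov = max(0, min(a[1], b[1]) - max(a[0], b[0]))
        shorter = min(a[1] - a[0], b[1] - b[0])
        return ov / shorter if shorter else 0.0

    matches = [p for p in prev_seqs if rate(cur_seq, p) > threshold]
    if not matches:
        return True        # rule (a) fails: no sufficiently overlapping sequence
    if len(matches) > 1:
        return True        # rule (c): overlaps multiple previous sequences
    if abs(cur_seq[2] - matches[0][2]) > height_diff:
        return True        # rule (b): text-line height changes too much
    return False           # no rule holds: join the previous paragraph
```

A blank previous line, a height jump, or overlap with two previous sequences each triggers a new paragraph on its own.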
  • According to some embodiments, if it is determined that the text data sequence of the current scan line belongs to a new paragraph, the coordinate information of that text data sequence may be set as the coordinate information of the new paragraph.
  • the "paragraph coordinate information" is, for example, the coordinate information of the smallest rectangle that can contain all the text data sequences in the paragraph. For example, it can pass the upper left coordinates (X1, Y1), the upper right coordinates (X2, Y1), and the lower left coordinates (X1, Y2). ) And the lower right coordinates (X2, Y2).
  • the positive direction of the X coordinate axis is rightward, and the positive direction of the Y coordinate axis is downward.
  • The method of the present disclosure can also be implemented using coordinate systems with other orientations, as long as the positive and negative signs of the coordinates are adjusted according to the coordinate axis directions. If it is determined that the current text data sequence belongs to a new paragraph, the new paragraph currently contains only the current text data sequence, so the upper left coordinate coincides with the lower left coordinate and the upper right coordinate coincides with the lower right coordinate.
  • For example, if the current text data sequence has start coordinates (CX1, CY1) and end coordinates (CX2, CY1), the coordinate information of the new paragraph is: upper left coordinates (CX1, CY1), upper right coordinates (CX2, CY1), lower left coordinates (CX1, CY1), and lower right coordinates (CX2, CY1).
  • According to some embodiments, when it is determined in step S609 that the text data sequence of the current scan line is divided into the paragraph to which the text data sequence in the adjacent previous scan line belongs, the current coordinate information of the paragraph may be updated to the coordinate information of the smallest rectangle that can contain both the current paragraph and the text data sequence in the current scan line.
  • For example, if the current paragraph has upper left coordinates (X1, Y1), upper right coordinates (X2, Y1), lower left coordinates (X1, Y2), and lower right coordinates (X2, Y2), and the current text data sequence has start coordinates (CX1, CY1) and end coordinates (CX2, CY1), then the coordinate information of the paragraph updated to include the current text data sequence is: upper left coordinates (min(X1, CX1), Y1), upper right coordinates (max(X2, CX2), Y1), lower left coordinates (min(X1, CX1), CY1), and lower right coordinates (max(X2, CX2), CY1), where min denotes the minimum and max denotes the maximum.
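The min/max update above can be expressed directly. Here a paragraph is represented as a (left, top, right, bottom) tuple with the Y axis positive downward, which is an illustrative representation rather than the disclosure's data layout:

```python
def update_paragraph_rect(para, seq):
    """Grow a paragraph rectangle to include a newly joined sequence.

    para = (X1, Y1, X2, Y2) as (left, top, right, bottom); seq gives the
    sequence's start coordinate (CX1, CY1) and end x-coordinate CX2 (its end
    coordinate is (CX2, CY1), since a sequence lies on one scan line). The
    return value mirrors the min/max expressions in the text.
    """
    X1, Y1, X2, Y2 = para
    CX1, CY1, CX2 = seq
    # left/right widen via min/max; top stays; bottom drops to the new line
    return (min(X1, CX1), Y1, max(X2, CX2), CY1)
```

For a paragraph (2, 0, 8, 3) absorbing a sequence spanning x = 1..9 on scan line 4, the rectangle grows to (1, 0, 9, 4).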
  • Fig. 10 is a schematic diagram showing an example of an exemplary layout model for illustrating paragraph coordinate information update processing according to an exemplary embodiment of the present disclosure.
  • In FIG. 10, the text data sequence STDS3 is the current text data sequence, the third scan line is the current scan line, the second scan line is the adjacent previous scan line, and the paragraph P1 is the current paragraph; the coordinates of the paragraph P1 can be updated according to the coordinates of the text data sequence STDS3 in the manner described above.
  • Specifically, the current coordinate information of the paragraph P1 can be updated to the coordinate information of the smallest rectangle (i.e., the rectangle P1_UD) that can contain both the current paragraph P1 and the text data sequence STDS3 in the current scan line. Since CX1 < X1 and CX2 > X2 in this example, the updated paragraph P1 (i.e., the rectangle P1_UD) has the following coordinate information: upper left coordinates (CX1, Y1), upper right coordinates (CX2, Y1), lower left coordinates (CX1, CY1), and lower right coordinates (CX2, CY1).
  • Alternatively, instead of generating or updating a paragraph's coordinates each time a text data sequence is divided into it as described above, paragraph coordinate information may be generated for each paragraph after all text data sequences in the layout model have been divided into their corresponding paragraphs. In this case, the coordinate information of the smallest rectangle that can contain all the text data sequences in a paragraph is used as the coordinate information of the paragraph.
  • For example, if the i-th text data sequence in a paragraph has start coordinates (CX1i, CY1i) and end coordinates (CX2i, CY1i), the paragraph can have the following coordinate information: upper left coordinates (min(CX1i), min(CY1i)), upper right coordinates (max(CX2i), min(CY1i)), lower left coordinates (min(CX1i), max(CY1i)), and lower right coordinates (max(CX2i), max(CY1i)), where min and max are taken over all sequences i in the paragraph.
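The batch variant, computing each paragraph's rectangle once all of its sequences are known, reduces to the min/max expressions above. The tuple layout (each sequence as (CX1, CY1, CX2), i.e. its horizontal span and its scan line) is an illustrative assumption:

```python
def paragraph_rect(seqs):
    """Smallest rectangle containing all of a paragraph's sequences.

    Each sequence is (CX1, CY1, CX2): start x, scan line y, end x. Returns
    (left, top, right, bottom) following the min/max expressions in the text.
    """
    return (min(s[0] for s in seqs),   # min(CX1i)
            min(s[1] for s in seqs),   # min(CY1i)
            max(s[2] for s in seqs),   # max(CX2i)
            max(s[1] for s in seqs))   # max(CY1i)
```

Two sequences spanning x = 2..8 on line 0 and x = 1..9 on line 1 yield the rectangle (1, 0, 9, 1).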
  • Through the exemplary processing described above, the paragraph division of the layout model can be obtained with a simple algorithm and a small amount of computation.
  • According to some embodiments, the method of the present disclosure may further include: after completing the paragraph division of the layout model, mapping the coordinate information of each paragraph obtained by the paragraph division in the layout model to the image, so as to obtain the paragraph division in the image.
  • If the size of the layout model is the same as the size of the image, the coordinate information of a paragraph in the image is identical to its coordinate information in the layout model.
  • If the layout model is scaled relative to the image, the coordinate information of a paragraph in the image can be obtained by inversely scaling its coordinate information in the layout model.
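A minimal sketch of this inverse mapping, assuming the layout model was uniformly scaled relative to the image by hypothetical factors (scale_x, scale_y); with a model the same size as the image the coordinates carry over unchanged:

```python
def map_to_image(rect, scale_x=1.0, scale_y=1.0):
    """Map a paragraph rectangle from layout-model coordinates to image
    coordinates by inverting the model's scaling factors. With scale
    factors of 1 (equal sizes), the coordinates are unchanged.
    """
    left, top, right, bottom = rect
    return (left / scale_x, top / scale_y, right / scale_x, bottom / scale_y)
```

For a model downscaled to half the image size in each direction, a model rectangle (10, 20, 30, 40) maps to (20, 40, 60, 80) in the image.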
  • The exemplary layout analysis method according to the present disclosure has been described above with reference to the accompanying drawings. After the paragraph division of the image is obtained, subsequent processing can be performed.
  • For example, in combination with the results of text recognition performed on the image, the text in each paragraph can be converted, paragraph by paragraph, into audio data according to the paragraph division result. This can be used, for example, in applications related to audiobooks and in visual-disability assistance applications.
  • FIG. 11 is a structural block diagram showing a reading aid device according to an exemplary embodiment of the present disclosure.
  • The reading aid device 1100 includes: a sensor 1101 (which can be implemented, for example, as a camera, a video camera, etc.), configured to obtain the aforementioned image (for example, the image may be a static image or a video image, and may contain text); and a chip circuit 1103, configured with circuit units that perform the steps of any of the foregoing methods.
  • According to some embodiments, the chip circuit may further include a circuit unit configured to perform text recognition on the image to obtain text, and a circuit unit configured to convert the text paragraph by paragraph into sound data according to the paragraph division result.
  • The circuit unit configured to perform character recognition on the image to obtain text can use, for example, any character recognition (such as optical character recognition, OCR) software or circuit, and the circuit unit configured to convert the text in each paragraph into sound data according to the paragraph division result can use any text-to-speech conversion software or circuit.
  • These circuit units can be realized by, for example, an ASIC chip or an FPGA chip.
  • the reading aid device 1100 may also include a sound output device 1105 (for example, a speaker, earphones, etc.) configured to output the sound data (ie, voice data).
  • An aspect of the present disclosure may include an electronic device, which may include a processor; and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to execute the foregoing Any method.
  • According to some embodiments, the program may further include instructions that, when executed by the processor, convert the text in each paragraph into sound data according to the paragraph division result.
  • such an electronic device may be a reading aid device, for example.
  • this electronic device may be another device (such as a mobile phone, a computer, a server, etc.) that communicates with a reading aid device.
  • In the case where this electronic device is another device that communicates with the reading aid device, the reading aid device may send the captured image to the other device, the other device executes any of the foregoing methods, and then returns the processing results (such as layout analysis results, text recognition results, and/or voice data converted from the text) to the reading aid device, which performs subsequent processing (for example, playing the voice data to the user).
  • According to some embodiments, the reading aid device can be implemented as a wearable device, for example, as a device worn in the form of glasses, a head-mounted device (such as a helmet or a hat), a device worn on the ear, an accessory attached to glasses (such as an eyeglass frame or temples), an accessory attached to a hat, and so on.
  • With the reading aid device, the captured layout image is automatically divided into paragraphs according to the method of the foregoing embodiments, the text in the paragraphs is sequentially converted into sound in the order resulting from the paragraph division, and the sound is output through a speaker, headphones, or other sound output device for the user to listen to.
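The end-to-end pipeline just described (capture, recognize, divide into paragraphs, convert to sound paragraph by paragraph) can be sketched as follows. The callables `ocr`, `divide_paragraphs`, `tts`, and `play` are hypothetical placeholders supplied by the caller, not a concrete API from the disclosure:

```python
def read_aloud(image, ocr, divide_paragraphs, tts, play):
    """End-to-end sketch of the reading-aid pipeline described above.

    `ocr(image)` returns (word, (x, y)) pairs; `divide_paragraphs(image)`
    returns paragraph rectangles (left, top, right, bottom) in reading
    order; `tts` converts text to sound data; `play` outputs it. All four
    are caller-supplied placeholders.
    """
    words = ocr(image)                      # text recognition on the image
    paragraphs = divide_paragraphs(image)   # layout analysis (steps S101-S105)
    for left, top, right, bottom in paragraphs:
        # gather recognized words whose positions fall inside this paragraph
        text = " ".join(w for w, (x, y) in words
                        if left <= x <= right and top <= y <= bottom)
        play(tts(text))                     # paragraph-by-paragraph audio
```

With stub callables, two paragraph rectangles produce two audio chunks in paragraph order.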
  • According to an aspect of the present disclosure, a layout analysis method includes: obtaining coordinate information of one or more text lines in an image; generating a layout model corresponding to the image by setting text data in areas of a data structure corresponding to the coordinate information of the one or more text lines, where the text data includes data indicating the presence of text and is unrelated to the semantics of the text; and scanning the generated layout model scan line by scan line to read the text data in the layout model, and dividing the layout model into paragraphs based on the relative positional relationships of the read text data in the layout model, where a scan line is a data row running through the layout model in the horizontal direction.
  • According to an aspect of the present disclosure, a computer-implemented layout analysis method includes the following operations performed by a processor: obtaining coordinate information of one or more text lines in an image containing text content; generating a layout model corresponding to the image by setting text data in areas corresponding to the coordinate information of the one or more text lines, where the text data includes data indicating the presence of text; scanning the generated layout model to obtain the text data from the layout model; dividing the layout model into paragraphs based on the relative positional relationships of the obtained text data in the layout model; performing text recognition on the image to obtain text; and converting the text in each paragraph into sound data according to the paragraph division result.
  • a reading aid device including: a sensor configured to obtain an image containing text content; an integrated circuit including a first circuit unit configured to perform the following operations: obtaining The coordinate information of one or more text lines in the image; by setting text data in an area corresponding to the coordinate information of the one or more text lines in the data structure, a layout model corresponding to the image is generated, The text data includes data representing the existence of text; scanning the generated layout model to obtain the text data from the layout model; and based on the relative position of the obtained text data in the layout model Relationship, divide the layout model into paragraphs.
  • the integrated circuit further includes a second circuit unit configured to perform character recognition on the image to obtain characters.
  • According to some embodiments, the integrated circuit further includes a third circuit unit configured to convert the text paragraph by paragraph into sound data according to the paragraph division result.
  • The reading aid device further includes a sound output device configured to output the sound data.
  • According to an aspect of the present disclosure, a non-transitory computer-readable storage medium stores executable instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the following operations: obtaining coordinate information of one or more text lines in an image containing text content; generating a layout model corresponding to the image by setting text data in areas of a data structure corresponding to the coordinate information of the one or more text lines, where the text data includes data indicating the presence of text; scanning the generated layout model to obtain the text data from the layout model; dividing the layout model into paragraphs based on the relative positional relationships of the read text data in the layout model; performing text recognition on the image to obtain text; and converting the text in each paragraph into sound data according to the paragraph division result.
  • An aspect of the present disclosure may include a computer-readable storage medium storing a program, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to perform any of the foregoing methods.
  • a computing device 2000 will now be described, which is an example of a hardware device that can be applied to various aspects of the present disclosure.
  • The computing device 2000 can be any machine configured to perform processing and/or calculation, and can be, but is not limited to, a workstation, a server, a desktop computer, a laptop computer, a tablet computer, a personal digital assistant, a smart phone, a vehicle-mounted computer, a wearable device, or any combination thereof.
  • the aforementioned reading aids or electronic devices may also be implemented in whole or at least in part by the computing device 2000 or similar devices or systems.
  • the computing device 2000 may include elements connected to or in communication with the bus 2002 (possibly via one or more interfaces).
  • The computing device 2000 may include a bus 2002, one or more processors 2004 (which may be used to implement the processors or chip circuits included in the aforementioned reading aid device), one or more input devices 2006, and one or more output devices 2008.
  • the one or more processors 2004 may be any type of processor, and may include, but are not limited to, one or more general-purpose processors and/or one or more special-purpose processors (for example, special processing chips).
  • the input device 2006 may be any type of device capable of inputting information to the computing device 2000, and may include, but is not limited to, a sensor (such as the image capturing sensor described above), a mouse, a keyboard, a touch screen, a microphone, and/or a remote control.
  • The output device 2008 can be any type of device that can present information, and can include, but is not limited to, a display, a speaker (which can be used, for example, to output the aforementioned sound data), a video/audio output terminal, a vibrator, and/or a printer.
  • The computing device 2000 may also include, or be connected to, a non-transitory storage device 2010, which (for example, may be used to implement the computer-readable storage medium described above) may be any storage device that is non-transitory and can store data, and can include, but is not limited to, disk drives, optical storage devices, solid-state memory, floppy disks, flexible disks, hard disks, tapes or any other magnetic media, optical disks or any other optical media, ROM (read-only memory), RAM (random access memory), cache memory and/or any other memory chip or cartridge, and/or any other medium from which a computer can read data, instructions, and/or code.
  • the non-transitory storage device 2010 can be detached from the interface.
  • the non-transitory storage device 2010 may have data/programs (including instructions)/code for implementing the above-mentioned methods and steps.
  • the computing device 2000 may also include a communication device 2012.
  • The communication device 2012 may be any type of device or system that enables communication with external devices and/or networks, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
  • The computing device 2000 may also include a working memory 2014 (which can be used to implement the memory included in the aforementioned reading aid device), which can be any type of working memory that can store programs (including instructions) and/or data useful for the work of the processor 2004, and may include, but is not limited to, random access memory and/or read-only memory devices.
  • Software elements (programs) may be located in the working memory 2014, including but not limited to an operating system 2016, one or more application programs 2018, drivers, and/or other data and code. Instructions for performing the above-mentioned methods and steps may be included in the one or more application programs 2018.
  • The executable code or source code of the instructions of the software elements (programs) can be stored in a non-transitory computer-readable storage medium (such as the above-mentioned storage device 2010) and, when executed, can be stored in the working memory 2014 (possibly after being compiled and/or installed). The executable code or source code of the instructions of the software elements (programs) can also be downloaded from a remote location.
  • For example, the working memory 2014 may store the program code for executing the flowcharts of the present disclosure and/or images containing text content to be recognized, where the application programs 2018 may include optical character recognition applications (such as Adobe), voice conversion applications, editable word processing applications, and the like provided by third parties.
  • In this case, the input device 2006 may be a sensor for acquiring an image containing text content; the stored or acquired image containing text content can be processed by the OCR application into an output result containing text; the output device 2008 is, for example, a speaker or earphone for voice broadcast; and the processor 2004 is configured to execute, according to the program code in the working memory 2014, the method steps according to the various aspects of the present disclosure.
  • The components of the computing device 2000 may be distributed over a network. For example, one processor may be used to perform some processing while another, remote processor performs other processing. Other components of the computing device 2000 may be similarly distributed. In this way, the computing device 2000 can be understood as a distributed computing system that performs processing in multiple locations.


Abstract

This application discloses a layout analysis method, a reading aid device, a circuit, and a medium. The layout analysis method includes: obtaining coordinate information of one or more text lines in an image; generating a layout model corresponding to the image by setting text data in areas of a data structure corresponding to the coordinate information of the one or more text lines, the text data including data indicating the presence of text; and scanning the generated layout model to read the text data in the layout model, and dividing the layout model into paragraphs based on the relative positional relationships of the read text data in the layout model.

Description

Layout analysis method, reading aid device, circuit, and medium
Technical Field
The present disclosure relates to the field of data processing, and in particular to a layout analysis method, a reading aid device, an electronic device, and corresponding chip circuits and computer-readable storage media.
Background
In the related art, there are techniques for performing layout analysis on an image, for example, dividing the text in an image into paragraphs to obtain multiple paragraphs and using the obtained paragraphs for subsequent processing. Such layout techniques can be used in applications such as the generation of electronic books and the generation of audiobooks. Layout analysis techniques in the related art mainly rely on the image data of the text or the semantic information of the text, using techniques such as image processing, clustering algorithms, or semantic analysis algorithms to divide the text in an image into multiple paragraphs. Such techniques usually involve relatively complex algorithms and a large amount of computation.
The methods described in this section are not necessarily methods that have been previously conceived or adopted. Unless otherwise indicated, it should not be assumed that any method described in this section is considered prior art merely because it is included in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered to have been recognized in any prior art.
Summary
According to one aspect of the present disclosure, a layout analysis method is provided, including: obtaining coordinate information of one or more text lines in an image; generating a layout model corresponding to the image by setting text data in areas of a data structure corresponding to the coordinate information of the one or more text lines, the text data including data indicating the presence of text; and scanning the generated layout model to read the text data in the layout model, and dividing the layout model into paragraphs based on the relative positional relationships of the read text data in the layout model.
According to another aspect of the present disclosure, a chip circuit is provided, including circuit units configured to perform the steps of the method described in the present disclosure.
According to another aspect of the present disclosure, a reading aid device is provided, including: a sensor configured to acquire the image; and the aforementioned chip circuit, which further includes a circuit unit configured to perform text recognition on the image to obtain text data, and a circuit unit configured to convert the text data paragraph by paragraph into sound data according to the paragraph division result. The reading aid device further includes a sound output device configured to output the sound data.
According to another aspect of the present disclosure, an electronic device is provided, including: a processor; and a memory storing a program, the program including instructions that, when executed by the processor, cause the processor to perform the method described in the present disclosure.
According to another aspect of the present disclosure, a computer-readable storage medium storing a program is provided, the program including instructions that, when executed by a processor of an electronic device, cause the electronic device to perform the method described in the present disclosure.
Further features and advantages of the present disclosure will become apparent from the exemplary embodiments described below with reference to the accompanying drawings.
Brief Description of the Drawings
The accompanying drawings illustrate embodiments by way of example, form a part of the specification, and together with the textual description serve to explain exemplary implementations of the embodiments. The embodiments shown are for illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference signs denote similar but not necessarily identical elements.
Fig. 1 is a flowchart illustrating a layout analysis method according to an exemplary embodiment of the present disclosure;
Fig. 2 is a schematic diagram illustrating an example of an image containing text lines and its corresponding layout model according to an exemplary embodiment of the present disclosure;
Fig. 3 is a flowchart illustrating an exemplary method of obtaining coordinate information of text lines according to an exemplary embodiment of the present disclosure;
Fig. 4 is a flowchart illustrating an exemplary method of generating a layout model according to an exemplary embodiment of the present disclosure;
Fig. 5 is a schematic diagram illustrating examples of regions in the data structure of a layout model that correspond to the coordinate information of text lines according to an exemplary embodiment of the present disclosure;
Fig. 6 is a flowchart illustrating an exemplary method of scanning a layout model to perform paragraph division according to an exemplary embodiment of the present disclosure;
Fig. 7 is a schematic diagram illustrating an example of a layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure;
Fig. 8 is a schematic diagram for illustrating the calculation of the overlap ratio of two text-data sequences according to an exemplary embodiment of the present disclosure;
Figs. 9(a) and 9(b) are schematic diagrams illustrating examples of layout models for illustrating paragraph division according to exemplary embodiments of the present disclosure;
Fig. 10 is a schematic diagram for illustrating the paragraph coordinate information updating process according to an exemplary embodiment of the present disclosure;
Fig. 11 is a structural block diagram illustrating a reading assistance device according to an exemplary embodiment of the present disclosure;
Fig. 12 is a structural block diagram illustrating an exemplary computing device applicable to the exemplary embodiments.
Detailed Description
In the present disclosure, unless otherwise stated, the use of the terms "first", "second", etc. to describe various elements is not intended to define the positional, temporal, or importance relationships of these elements; such terms are merely used to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, while in some cases, based on the context of the description, they may refer to different instances.
The terms used in describing the various examples in the present disclosure are for the purpose of describing particular examples only and are not intended to be limiting. Unless the context clearly indicates otherwise, if the number of an element is not specifically limited, that element may be one or more. Furthermore, the term "and/or" as used in the present disclosure covers any one of, and all possible combinations of, the listed items.
In the present disclosure, with respect to an image, "horizontal" refers to the direction of the image edge that is substantially parallel to the text lines (for example, at an angle of less than 45 degrees), while "vertical" refers to the direction of the other image edge, perpendicular to "horizontal". With respect to the layout model, "horizontal" refers to the row direction of the data structure of the layout model, which corresponds to the "horizontal" direction of the image, while "vertical" refers to the column direction of the data structure of the layout model, which corresponds to the "vertical" direction of the image.
The following description of the present disclosure is mainly based on the case in which text lines extend substantially left to right relative to the reader (i.e., horizontally typeset reading material), but the technical solution of the present disclosure is not limited thereto; it is also applicable to the case in which text lines extend substantially top to bottom relative to the reader (i.e., vertically typeset reading material), that is, the algorithms of the present disclosure also apply to vertically typeset material. Where text lines extend substantially top to bottom, the horizontal direction in the present disclosure may mean the substantially top-to-bottom direction, and the vertical direction may mean the substantially left-to-right direction. In other words, terms such as "horizontal" and "vertical" in the present disclosure have no absolute meaning; it suffices that they denote two mutually perpendicular directions. Where text lines run substantially top to bottom, the "top-bottom" and "left-right" directions in the present disclosure may simply be interchanged.
Reading material such as books or magazines usually has a certain layout; for example, the content is divided into different paragraphs (including, for example, vertical paragraph breaks and left-right column breaks). When reading such material, people visually capture the image in their field of view, identify the paragraphs in the image with their brain, and read the text in the paragraphs. However, if a machine is to "read" such material, it must not only perform character recognition on the text in the image but also divide that text into paragraphs by means of an algorithm, so that the text can be "read" in the correct paragraph order. Such paragraph division techniques may be used, for example, in applications that convert paper books into electronic books, or in applications that convert the text in an image into a sound signal and output that signal. In the present disclosure, "paragraph division" means dividing the text in an image, or the text data in a layout model, into different paragraphs. Vertical paragraph division may also be called segmentation into paragraphs, and left-right paragraph division may also be called division into columns.
The present disclosure provides a paragraph division method that avoids complex image processing performed directly on the text image and requires no semantic analysis. Instead, the image containing text is converted into a layout model that mimics the distribution of text in the image but has a simpler structure; the data contained in the layout model may, for example, contain no semantic content but only simple data representing where text is present. Positional analysis is then performed on the data in this layout model to carry out paragraph division. Exemplary embodiments of the layout analysis method of the present disclosure are further described below with reference to the accompanying drawings.
Fig. 1 is a flowchart illustrating a layout analysis method according to an exemplary embodiment of the present disclosure. As shown in Fig. 1, the layout analysis method may include, for example, the following steps: obtaining coordinate information of text lines (step S101), generating a layout model (step S103), and scanning the layout model to perform paragraph division (step S105).
In step S101, coordinate information of one or more text lines in an image is obtained.
Since the exemplary method of the present disclosure performs layout analysis mainly using the coordinate information of the text rather than the original image of the text itself, in this step the coordinate information of text lines is obtained from the image for use in subsequent processing.
The image may be electronic image data acquired by an image sensor. According to some embodiments, the image sensor may be provided on an article such as a user's wearable device or eyeglasses, so that the image may be an image, captured by the image sensor, of the layout of reading material (such as a book or magazine) held by the user. The image may contain text (which may include the characters of various countries and regions, numerals, symbols, punctuation marks, etc.), pictures, and other content. According to some embodiments, the image may be a preprocessed image; the preprocessing may include, for example but without limitation, tilt correction, blur removal, and the like. According to some implementations, the image may be stored in a storage device or storage medium after being acquired by the image sensor, and read out for processing.
A text line is a continuous line of text, which may be, for example, a sequence of characters whose horizontal spacing between adjacent characters is smaller than a threshold spacing. The spacing between adjacent characters may be, for example, the distance between equivalent coordinates of the adjacent characters, such as the distance in the direction of the text line between their top-left corner coordinates, between their bottom-right corner coordinates, or between their centroid coordinates. According to some embodiments, if the spacing between adjacent characters is not greater than the threshold spacing, the adjacent characters may be considered continuous and thus assigned to the same text line; if the spacing is greater than the threshold spacing, the adjacent characters may be considered discontinuous (for example, they may belong to left and right columns, respectively) and thus assigned to different text lines.
According to some embodiments, the coordinate information of a text line may be the coordinate information of a rectangle containing the text line (for example, the smallest rectangle containing the text line, or a rectangle obtained by dilating the smallest rectangle containing the text line upward, downward, leftward, and/or rightward by a certain factor). The coordinate information of a text line may include, for example, the coordinate information of the four vertices of the rectangle; it may also include the coordinate information of any one vertex of the rectangle together with the height and length of the rectangle. However, the definition of the coordinate information of a text line is not limited thereto, as long as it can represent the spatial position and size occupied by the text line.
According to some embodiments, the coordinate information of text lines may be obtained, for example, from another machine (such as a remote server or cloud computing device) or from another application (such as a character recognition application like optical character recognition, OCR), but it may also be obtained by character recognition processing in a local application.
Fig. 2 is a schematic diagram illustrating an example of an image containing text lines and its corresponding layout model according to an exemplary embodiment of the present disclosure, in which text lines TL1 to TL6 in an image 201 are shown, and the rectangles containing the respective text lines in the image 201 are shown with dashed boxes.
In step S103, a layout model corresponding to the image is generated by setting text data in regions of a data structure that correspond to the coordinate information of the one or more text lines.
In this step, "text data" simpler than the text image itself is set in the regions corresponding to the text lines obtained in the preceding step, thereby constructing a layout model that mimics the distribution of text in the image, for use in subsequent processing.
The layout model referred to in the present disclosure is a model constructed to mimic the positional distribution of the text lines in the image, in which the data at each position corresponds to and maps onto the pixels at the corresponding position in the image. The layout model is constructed by setting, at positions in the data structure, data indicating that text is present at the corresponding positions in the image.
According to some embodiments, the data structure may be a file in a memory (e.g., RAM or a cache), an image expressed in pixels, a table, or a data array. The data structure is not limited to any specific data structure, as long as the data therein can mimic the text lines in the image. The size of the data structure may be the same as the image size, or may be scaled proportionally relative to the image size. For example, if the image has a pixel size of 3840 x 2160, the data structure (and the corresponding layout model) may have the same size as the image (i.e., 3840 x 2160 pixels or data elements), may be scaled only in the horizontal direction (e.g., 1920 x 2160 pixels or data elements), may be scaled only in the vertical direction (e.g., 3840 x 1080 pixels or data elements), or may be scaled in both the horizontal and vertical directions (e.g., 1920 x 1080 pixels or data elements, or 1280 x 1080 pixels or data elements), and so on. Whether the size of the data structure is the same as the image size or is scaled proportionally relative to it, the data or pixels of the data structure can be placed in correspondence or mapping with the pixels of the image according to region positions in the image.
According to some embodiments, the text data includes data indicating the presence of text; it may indicate whether text is present in the region corresponding to the coordinate information of the text, independently of the semantics or content of the text. In addition, blank data may be set in regions of the data structure corresponding to non-text regions of the image, the blank data being data indicating the absence of text. According to some embodiments, the text data may, for example, be "1" and the blank data may, for example, be "0". However, the text data is not limited to "0" and "1" and may be any other data, as long as it can distinguish whether text or a text line is present in the region.
According to some embodiments, when the size of the layout model is scaled relative to the image size, the coordinates of the regions in the data structure of the layout model that correspond to the coordinate information of the text lines may likewise be scaled proportionally relative to the coordinates of the text-line regions of the image. When the size of the data structure of the layout model is smaller than the image size, a plurality of pixels in the image are mapped to one data element or pixel in the layout model according to a mapping rule. If the plurality of pixels in the image include both pixels in a text line and pixels in a blank region, the mapping rule may, for example, specify that the plurality of pixels be mapped to text data, or may specify that they be mapped to blank data. As an alternative, the mapping rule may, for example, specify that if the ratio of the number of text-line pixels to the number of blank-region pixels among the plurality of pixels is not less than a predetermined ratio, the plurality of pixels are mapped to text data, and otherwise to blank data. As another alternative, the mapping rule may, for example, specify that if N pixel rows are mapped to one data or pixel row of the layout model, one pixel row is extracted from every N pixel rows and mapped to one data or pixel row of the layout model.
The example shown in Fig. 2 shows a layout model 203 corresponding to the image 201. It can be seen that text data ("1" in this example) is set in the regions R1 to R6 of the layout model 203 corresponding to the text lines TL1 to TL6, while blank data ("0" in this example) is set in the other, blank regions. It can be seen that the positional arrangement of the text data in the layout model 203 mimics the positional arrangement of the text lines in the image 201 well. In the layout model 203, the scaling relative to the image 201 is such that one data row (pixel row) of the data structure of the layout model 203 corresponds exactly to one text line of the image 201. It should be understood, however, that in many embodiments, with a different scaling factor or with no scaling at all, one text line of the image 201 may be represented by a plurality of data rows (pixel rows) of the layout model 203. For example, Fig. 5 shows an example in which one text line of an image 501 is represented by two data rows (pixel rows) of a layout model 503. As another example, if the text height in the image 201 is, say, 10 pixels, the data structure of the layout model 203 may also represent one text line with 5 or 10 data rows (pixel rows).
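The mapping from text-line rectangles in image coordinates to a (possibly down-scaled) grid of text/blank data can be sketched as follows. This is a minimal illustration only: the list-of-lists grid, the single integer `scale` factor, and the rule that any grid cell touched by a rectangle becomes text data are illustrative assumptions, not the specific mapping rule mandated by the disclosure.

```python
def build_layout_model(img_w, img_h, line_rects, scale=4):
    """Build a layout model mimicking the text-line distribution of an image.

    The model is a 2-D grid of 0/1 values whose size is the image size
    divided by `scale`; a cell holds 1 ("text data") when it is touched by
    a text-line rectangle and 0 ("blank data") otherwise. `line_rects` are
    (x1, y1, x2, y2) rectangles in image coordinates.
    """
    w, h = img_w // scale, img_h // scale
    model = [[0] * w for _ in range(h)]
    for x1, y1, x2, y2 in line_rects:
        # ceil-divide the far edge so a partially covered cell counts as text
        for row in range(y1 // scale, min(h, -(-y2 // scale))):
            for col in range(x1 // scale, min(w, -(-x2 // scale))):
                model[row][col] = 1
    return model
```

For a 16 x 8 image with one line rectangle (0, 0, 8, 4) and `scale=4`, this yields a 4 x 2 grid whose first two cells of the top row are 1.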
In step S105, the generated layout model is scanned to read the text data in the layout model, and the layout model is divided into paragraphs on the basis of the relative positional relationships of the read text data within the layout model.
In this step, the data of the layout model obtained in the preceding step is scanned and read, and the text data in the layout model is divided into paragraphs.
According to some embodiments, the scanning may be a data-by-data or pixel-by-pixel scan-and-read of the data structure of the layout model. For example, when the layout model is another image or a data array corresponding to the image, the scanning may be a row-by-row scan of that other image or data array. A scan line may, for example, be a data row or pixel row that traverses the layout model in the horizontal direction. One text line may correspond to multiple scan lines, as shown by the two scan lines SL1 and SL2 corresponding to the text line TL1 in Fig. 5. A scan line may also involve multiple different text lines spaced apart in the horizontal direction, i.e., it may contain multiple text-data sequences. In the example of Fig. 2, the scan line SL in the layout model 203 involves the text lines TL1 and TL2, i.e., it contains the corresponding text-data sequences in the regions R1 and R2. Here, a text-data sequence means a sequence of continuous text data (i.e., with no blank data between the text data), or a sequence of text data in which the amount of blank data between adjacent text data is less than a threshold; this threshold may, for example, be a few text-data elements, such as 3 or 5. Where text data is represented by "1", a typical text-data sequence may be a run of consecutive "1"s, as shown in Figs. 2 and 5.
Through the above scanning, the values of the text data (and blank data) in the generated layout model can be read and their relative positional relationships analyzed, so that the layout model can be divided into paragraphs on the basis of these relative positional relationships.
According to the method shown in Fig. 1, a layout model containing simple data indicating whether a text line is present in a region is generated from the image containing text (text lines); that is, the original image is mimicked by a layout model containing simpler information. The layout analysis problem is thereby transformed from a complex image processing problem into a relatively simple positional analysis problem, so that the algorithmic complexity and computational load can be significantly reduced while essentially preserving the accuracy of the layout analysis, lightening the computational burden on a computer analyzing the layout. Moreover, when the size of the layout model is reduced relative to the image size, the amount of data to be processed is further reduced, further lightening that burden.
The layout analysis method of the present disclosure has been described above with reference to Figs. 1 and 2. Exemplary implementations of steps S101, S103, and S105, as well as further embodiments, are described in more detail below with reference to Figs. 3 to 10: Fig. 3 shows an exemplary embodiment of the processing for obtaining text lines in step S101, Fig. 4 shows an exemplary embodiment of the processing for generating the layout model in step S103, and Fig. 6 shows an exemplary embodiment of the processing in step S105 for scanning the layout model to obtain the paragraph division. It should be noted that the various definitions, embodiments, implementations, and examples described above with reference to Figs. 1-2 are also applicable to, and may be combined with, the exemplary embodiments described below.
Fig. 3 is a flowchart illustrating an exemplary method of obtaining coordinate information of text lines according to an exemplary embodiment of the present disclosure, which may serve as an exemplary implementation of the aforementioned step S101; that is, step S101 may include the steps of the flowchart of Fig. 3.
In step S301, character recognition is performed on the image to obtain coordinate information of each character. Various character recognition techniques of the related art, such as optical character recognition (OCR), can be used in this step. The coordinate information of a character may include, for example but without limitation, the coordinates of the four vertices of a rectangle containing the character and/or the width and height of the character. Any character coordinates of the related art may serve as the coordinate information here, as long as they reflect the position of the character in the image and the region it occupies.
In the steps following step S301, the characters are processed one by one on the basis of their coordinate information, starting from the first character in the image, so as to obtain text lines.
In step S303, it is determined whether the spacing between the character currently being processed and the preceding character is smaller than a threshold spacing. If it is not smaller than the threshold spacing (step S303, "No"), the current character is determined to belong to a new text line (step S305); otherwise (step S303, "Yes"), the current character is assigned to the text line to which the preceding character belongs (step S309).
According to some embodiments, the threshold spacing may be determined, for example, according to the requirements of the application (e.g., language, character type) and may be set, for example, to a specific multiple of the average character width (e.g., 1.2 to 3.5 times), or to a specific multiple of the average adjacent-character spacing within a previously determined paragraph (e.g., 1.5 to 3.5 times). The method of determining the threshold spacing is not limited thereto, however, as long as it can be used to distinguish whether adjacent characters belong to the same paragraph.
In the example of Fig. 2, the threshold spacing is set, for example, to 2.5 times the average character width. Since the spacing between the adjacent characters "e" and "a" of "be" and "as" in text line TL1 is smaller than the threshold spacing, they are assigned to the same text line TL1. Since the spacing between the "r" of "your" at the end of text line TL1 and the "A" at the beginning of text line TL2 is greater than the threshold spacing, they are assigned to the different text lines TL1 and TL2, respectively.
After the current character is assigned to the preceding text line or to a new text line, it is determined in step S311 whether there is a next character in the image. If there is a next character (step S311, "Yes"), that next character is taken as the current character, and its assignment to a text line continues with the processing beginning at step S303; if there is no next character (step S311, "No"), all the characters in the image have been assigned to text lines.
According to some embodiments, once it is determined in step S305 that the current character belongs to a new text line, the preceding character is the last character of the preceding text line; thus, for example, the coordinate information of the rectangle containing the preceding text line may be taken as the coordinate information of that text line in step S307. According to other embodiments, step S307 may be omitted; instead, after all the characters in the image have been assigned to their text lines through steps S305, S309, and S311, the coordinate information of the rectangle containing each text line is determined in step S313 as the coordinate information of that text line. According to some embodiments, the sides of the rectangle are respectively parallel to the sides of the image, i.e., have horizontal and vertical directions.
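The character-to-line grouping of steps S303/S305/S309 and the per-line bounding rectangle of step S307/S313 can be sketched as follows. This is an illustrative sketch under stated assumptions: the `(x, y, w, h)` box format, the use of the left-edge x as the "equivalent coordinate", and the 2.5x average-width factor (taken from the Fig. 2 example) are assumptions, not a prescribed interface.

```python
def group_into_lines(boxes, threshold_ratio=2.5):
    """Group per-character boxes (x, y, w, h), given in reading order, into
    text lines. Two adjacent characters stay in the same line when the
    distance between their equivalent coordinates (here: left-edge x) is
    below threshold_ratio times the average character width; otherwise a
    new line starts. Returns the smallest enclosing rectangle
    (x1, y1, x2, y2) of each line.
    """
    avg_w = sum(w for _, _, w, _ in boxes) / len(boxes)
    threshold = threshold_ratio * avg_w
    lines = [[boxes[0]]]
    for prev, cur in zip(boxes, boxes[1:]):
        if abs(cur[0] - prev[0]) < threshold:
            lines[-1].append(cur)  # spacing below threshold: same line
        else:
            lines.append([cur])    # spacing at/above threshold: new line
    return [
        (min(x for x, _, _, _ in ln), min(y for _, y, _, _ in ln),
         max(x + w for x, _, w, _ in ln), max(y + h for _, y, _, h in ln))
        for ln in lines
    ]
```

A wrapped line (the next character's x jumping far back to the left margin) produces a large coordinate distance and therefore also opens a new text line, matching the behaviour described above.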
Character recognition algorithms of the related art (e.g., optical character recognition, OCR, algorithms) include techniques that can recognize each character in an image and obtain its coordinate information, and can also determine the rectangle containing a text line and its coordinate information. The relevant techniques of the various character recognition algorithms can all be applied here. The technique of the present disclosure can thus make full use of the results obtainable from character recognition algorithms of the related art, improving algorithmic efficiency.
In some cases, a text line in the image has a certain inclination relative to the horizontal side of the rectangle containing it (which is parallel to the horizontal side of the image). When this inclination is smaller than a threshold inclination (e.g., 20 or 30 degrees), it has no substantial effect on the paragraph division result. When the inclination is greater than or equal to the threshold inclination, the image may be preprocessed to correct the tilt of the text, and the steps described with respect to Fig. 3 and the subsequent processing may be performed on the tilt-corrected image. According to other embodiments, when the inclination is greater than or equal to the threshold inclination, the user may instead be prompted to adjust the way the reading material is held so as to reduce the tilt of the image.
Fig. 4 is a flowchart illustrating an exemplary method of generating a layout model according to an exemplary embodiment of the present disclosure, which may serve as an exemplary implementation of the aforementioned step S103; that is, step S103 may include the steps of the flowchart of Fig. 4.
After the coordinate information of the text lines has been obtained, for example according to step S101 or the flowchart of Fig. 3, the obtained text lines are processed one by one, starting from the first text line in the image.
In step S401, the coordinate information of the current text line is read. In step S403, the region corresponding to the coordinate information of the current text line is determined. In step S405, text data is set in the region of the data structure corresponding to the coordinate information of the current text line. In step S407, it is determined whether there is a next text line; if there is a next text line (step S407, "Yes"), that next text line is taken as the current text line and its processing continues from step S401; if there is no next text line (step S407, "No"), all the text lines of the image have been modeled into the layout model.
According to some embodiments, the regions of the data structure corresponding to the coordinate information of the one or more text lines may include the regions of the data structure determined by the coordinate information of the respective text lines. Where the coordinate information of a text line in the image is the coordinate information of a certain rectangle, setting text data in the region of the data structure corresponding to the coordinate information of that text line means setting text data in the data region or pixel region of the rectangle at the corresponding position in the layout model. In the example of Fig. 2, text data "1" is set in the regions R1 to R6 determined on the basis of the coordinate information of the text lines TL1 to TL6, forming the layout model 203 of the image 201.
According to some embodiments, the regions corresponding to the coordinate information of the one or more text lines include not only the regions determined by the coordinate information of the respective text lines but also regions extended from that coordinate information by a specific distance in the vertical direction (e.g., upward and/or downward). In other words, a step of extending the coordinate information of the text line by a specific distance in the vertical direction (e.g., upward and/or downward) may be included before step S403. According to some embodiments, the specific distance depends on the line spacing of text lines in the image that are adjacent in the direction perpendicular to the text lines (i.e., the height of the blank space between an upper text line and the adjacent lower text line). If extension is performed only upward or only downward, the specific distance may, for example, cover all the blank space between upper and lower text lines of the same paragraph and may, for example, be 1 to 1.5 times the average line spacing of adjacent text lines in the image. If extension is performed both upward and downward, the specific distance may, for example, cover part of the blank space between upper and lower text lines of the same paragraph and may, for example, be 0.5 to 0.7 times the average line spacing of adjacent text lines in the image, so that the extensions of the upper and lower text lines together cover the blank space between them.
Fig. 5 is a schematic diagram illustrating examples of regions in the data structure of a layout model that correspond to the coordinate information of text lines according to an exemplary embodiment of the present disclosure. In the example shown in Fig. 5, TL1 and TL2 are two text lines in an image 501. According to some embodiments, the region R1 in the data structure of a layout model 503 corresponding to the coordinate information of text line TL1 includes the region 513 determined by the coordinate information of TL1, and similarly for text line TL2 and region R2. According to other embodiments, the region R1 in the data structure of a layout model 505 corresponding to the coordinate information of text line TL1 includes not only the region 513 determined by the coordinate information of TL1 but also a region 515 extended downward by 2 pixel rows from that coordinate information, and similarly for text line TL2 and region R2.
It can be seen that, through the above extension, if two vertically adjacent text lines belong to the same paragraph, the regions of the data structure of the layout model corresponding to the coordinate information of these text lines can cover not only the text lines themselves but also the inter-line blank space between them, so that in the layout model no blank data lies between the text data corresponding to two vertically adjacent text lines of the same paragraph; they merge into one, which helps simplify the subsequent scanning algorithm. This extension is not essential, however; the blank data between two vertically adjacent text lines of the same paragraph can instead be handled in the subsequent scanning algorithm.
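The downward extension described above can be sketched in one small helper. The 1.2 factor is merely one value from the 1x-1.5x range suggested in the text, and the integer rectangle format is an assumption for illustration:

```python
def expand_down(rect, avg_line_gap, factor=1.2):
    """Extend a text-line rectangle (x1, y1, x2, y2) downward by `factor`
    times the average inter-line gap, so that in the layout model the text
    data of consecutive lines of one paragraph merges into a single region.
    """
    x1, y1, x2, y2 = rect
    return (x1, y1, x2, y2 + int(factor * avg_line_gap))
```

Applying `expand_down` to each rectangle before it is written into the layout model removes the blank rows between same-paragraph lines, at the cost of possibly bridging paragraphs whose vertical gap is smaller than the extension distance.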
Fig. 6 is a flowchart illustrating an exemplary method of scanning a layout model to perform paragraph division according to an exemplary embodiment of the present disclosure, which may serve as an exemplary implementation of the aforementioned step S105; that is, step S105 may include the steps of the flowchart of Fig. 6.
According to Fig. 6, after the layout model has been generated, for example according to step S103 or the flowchart of Fig. 4, the data or pixels of the layout model are scanned row by row. The flow of Fig. 6 will be described with reference to Fig. 7, which is a schematic diagram illustrating an example of a layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure, in which the text data of a layout model 701 is divided into paragraphs. In Fig. 7, "1" denotes text data, and the blank data that may be present is omitted from the illustration.
In step S601, the current scan line is read, for example the first scan line shown in Fig. 7. In step S603, it is determined whether text data is present in the current scan line. If text data is present in the current scan line (step S603, "Yes"), the flow proceeds to step S605; otherwise (step S603, "No"), it is determined whether there is a next scan line (step S613). If it is determined in step S613 that there is a next scan line (step S613, "Yes"), that next scan line is taken as the current scan line and processing continues from step S601. If it is determined in step S613 that there is no next scan line (step S613, "No"), the scan of the layout model is determined to have ended. Since the first scan line in Fig. 7 contains sequences of continuous text data (i.e., "text-data sequences") STDS1 and STDS2, it is determined that text data is present, and the flow proceeds to step S605.
In step S605, for the current text-data sequence in the current scan line (e.g., the text-data sequence STDS1 in the first scan line of Fig. 7), it is determined whether the adjacent preceding scan line contains a text-data sequence whose horizontal overlap ratio with the text-data sequence in the current scan line is greater than a threshold overlap ratio (rule (a)). If so (step S605, "Yes"), the text-data sequence of the current scan line is assigned to the paragraph to which the text-data sequence of the adjacent preceding scan line belongs (step S609); if not (step S605, "No"), the text-data sequence in the current scan line is determined to belong to a new paragraph (step S607).
In particular, determining in step S605 that the adjacent preceding scan line contains no text-data sequence whose overlap ratio with the text-data sequence in the current scan line is greater than the threshold overlap ratio amounts to either (1) determining that the adjacent preceding scan line contains no text-data sequence at all, or (2) determining that the adjacent preceding scan line contains a text-data sequence but its overlap ratio with the current text-data sequence in the current scan line is not greater than the threshold overlap ratio. Case (1) means that the adjacent preceding scan line is a blank scan line with no text data, so the current text-data sequence in the current scan line is very likely the starting text data of a new paragraph. Case (2) means that although the adjacent preceding scan line contains text data, that text-data sequence overlaps little with the current text-data sequence in the horizontal direction, so the current text-data sequence is very likely not part of the paragraph to which the text-data sequence of the preceding scan line belongs and is thus very likely part of a new paragraph (e.g., another paragraph or another column). In the present disclosure, two text-data sequences overlapping in the horizontal direction means that the projections of the two text-data sequences onto the horizontal coordinate axis have a common part.
According to some embodiments, the overlap ratio of two text-data sequences may be defined as OVR = max(OVL/L1, OVL/L2), where max denotes the greater of the bracketed quantities, OVL denotes the length of the horizontal overlap of the two text-data sequences (e.g., a number of text-data elements or pixels), and L1 and L2 denote the lengths of the two text-data sequences, respectively (e.g., numbers of text-data elements or pixels). Fig. 8 is a schematic diagram illustrating the calculation of the overlap ratio of two text-data sequences. In the example of Fig. 8, OVL = 12, L1 = 20, and L2 = 17, so the overlap ratio of the two text-data sequences is OVR = max(OVL/L1, OVL/L2) = 12/17. Although a definition and a calculation method of the overlap ratio are given here, it should be understood that they are not limited thereto, as long as the overlap ratio can express the horizontal overlap between two columns.
The threshold overlap ratio may be predetermined arbitrarily according to the requirements of the specific application. According to some implementations, the threshold overlap ratio may, for example, be set to any value from 0.5 to 0.8.
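The overlap-ratio formula OVR = max(OVL/L1, OVL/L2) can be written directly, representing each text-data sequence by its horizontal extent; the (start, end)-with-end-exclusive representation is an assumption for illustration:

```python
def overlap_ratio(a, b):
    """OVR = max(OVL/L1, OVL/L2) for two text-data sequences a and b given
    as (start, end) horizontal extents (end exclusive). OVL is the length
    of the common part of their projections on the horizontal axis."""
    ovl = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return max(ovl / (a[1] - a[0]), ovl / (b[1] - b[0]))
```

With L1 = 20, L2 = 17, and OVL = 12 as in the Fig. 8 example, this yields 12/17.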
After it has been determined in steps S607 and S609 whether the current scan line's sequence is assigned to the paragraph of the text-data sequence in the adjacent preceding scan line or to a new paragraph, it is determined in step S611 whether there is a further text-data sequence in the current scan line. If there is a further text-data sequence in the current scan line (step S611, "Yes"), the flow returns to step S605 to continue processing the next text-data sequence of the current scan line. If there is no further text-data sequence in the current scan line (step S611, "No"), all the text-data sequences of the current scan line have been processed, and the flow proceeds to step S613 to determine whether there is a next scan line.
In the example of Fig. 7, since the first scan line has no adjacent preceding scan line, the text-data sequence STDS1 is determined to belong to a new paragraph P1. Since the first scan line contains a further text-data sequence STDS2 in addition to STDS1, STDS2 is processed next and, similarly, is determined to belong to a new paragraph P2.
Since the first scan line contains no further text-data sequence after STDS2, processing continues with the second scan line. Since the overlap ratio between the text-data sequence STDS3 in the second scan line and the text-data sequence STDS1 in the adjacent preceding scan line, i.e., the first scan line, is 1, which is greater than the threshold overlap ratio (e.g., 0.75), STDS3 is assigned to the paragraph P1 to which STDS1 belongs. Similarly, STDS4 is assigned to the paragraph P2 to which STDS2 belongs, and STDS5 is assigned to the paragraph P1 to which STDS3 belongs.
Although the adjacent preceding scan line of the text-data sequence STDS6 in the fourth scan line, i.e., the third scan line, contains the text-data sequence STDS5, the overlap ratio between these two text-data sequences is 0, so STDS6 is assigned to a new paragraph P3.
The adjacent preceding scan line of the text-data sequence STDS7 in the sixth scan line, i.e., the fifth scan line, contains no text-data sequence, so STDS7 is assigned to a new paragraph P4. The text-data sequence STDS8 in the seventh scan line is also assigned to paragraph P4 because its overlap ratio with STDS7 is greater than the threshold overlap ratio.
As described above, according to the exemplary scanning method of the flowchart of Fig. 6, the layout model of the example of Fig. 7 is divided into paragraphs P1 to P4.
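The scan loop of steps S601-S613 with rule (a) alone can be sketched as follows. This is an illustrative sketch, not the complete method: it omits rules (b)-(d) described below, represents the model as a list of 0/1 rows, and uses a 0.75 threshold as in the Fig. 7 walkthrough.

```python
def overlap_ratio(a, b):
    """OVR = max(OVL/L1, OVL/L2) for (start, end) extents, end exclusive."""
    ovl = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    return max(ovl / (a[1] - a[0]), ovl / (b[1] - b[0]))

def runs_of_ones(row):
    """(start, end) extents of the text-data sequences in one scan line."""
    runs, start = [], None
    for i, v in enumerate(list(row) + [0]):  # sentinel 0 closes a final run
        if v and start is None:
            start = i
        elif not v and start is not None:
            runs.append((start, i))
            start = None
    return runs

def segment(model, thr=0.75):
    """Scan the layout model row by row (rule (a) only): each text-data
    sequence joins the paragraph of a sufficiently overlapping sequence in
    the adjacent preceding scan line, otherwise it opens a new paragraph.
    Returns {paragraph_id: [(row, start, end), ...]}."""
    paragraphs, prev, next_id = {}, [], 0
    for r, row in enumerate(model):
        cur = []
        for run in runs_of_ones(row):
            pid = next((p for q, p in prev if overlap_ratio(run, q) > thr), None)
            if pid is None:          # blank previous line, or too little overlap
                pid, next_id = next_id, next_id + 1
            paragraphs.setdefault(pid, []).append((r, run[0], run[1]))
            cur.append((run, pid))
        prev = cur
    return paragraphs
```

On a two-column model followed by a blank row and a full-width row, the two columns become two paragraphs and the full-width row a third, mirroring the Fig. 7 behaviour.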
According to some embodiments, the text data may contain data representing the height of the text line. The data representing the height of the text line may be normalized, for example, with respect to a preset height (for example but without limitation, a fraction of the average text height, such as one tenth) and may be rounded (e.g., rounded half up or rounded up). For example, if the preset height is set to 1 pixel, the text data of a text line whose height is 10 pixels may be set to 10. According to such embodiments, a judgment rule (rule (b)) may be added in step S605: if the difference between the value of the text data of the text-data sequence in the current scan line and the value of the text data of the text-data sequence in the adjacent preceding scan line is greater than a threshold height difference, the text-data sequence of the current scan line is determined to belong to a new paragraph. In other words, the condition "the difference between the value of the text data of the text-data sequence in the current scan line and the value of the text data of the text-data sequence in the adjacent preceding scan line is not greater than the threshold height difference" may be made, in step S605, a necessary condition for assigning the text-data sequence of the current scan line to the paragraph to which the text-data sequence of the adjacent preceding scan line belongs. The threshold height difference may be a preset number of pixels, e.g., 3 or 5 pixels, or may be a ratio, e.g., a fraction of the height of the smaller of the two text lines being compared.
According to some embodiments, a judgment rule (rule (c)) may be added in step S605: if the horizontal overlap ratios of the text-data sequence in the current scan line with a plurality of text-data sequences in the adjacent preceding scan line are all greater than the threshold overlap ratio, the text-data sequence in the current scan line is determined to belong to a new paragraph.
Fig. 9(a) is a schematic diagram illustrating an example of a layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure, in which the text-data sequences STDS1 and STDS3 have been assigned to paragraph P1, and the text-data sequences STDS2 and STDS4 to paragraph P2. For the text-data sequence STDS5 in the third scan line, although its adjacent preceding scan line, i.e., the second scan line, contains the text-data sequences STDS3 and STDS4 whose overlap ratios exceed the threshold overlap ratio, if the aforementioned rule (c) is taken into account, then because the horizontal overlap ratios of STDS5 with the plurality of text-data sequences STDS3 and STDS4 in the second scan line are all greater than the threshold overlap ratio, STDS5 is assigned to a new paragraph P3 under the combined effect of rules (a) and (c).
According to some embodiments, a judgment rule (rule (d)) may be added in step S605: if a plurality of text-data sequences in the current scan line all have horizontal overlap ratios, with the same text-data sequence in the adjacent preceding scan line, greater than the threshold, the plurality of text-data sequences in the current scan line are determined to belong to respective new paragraphs.
Fig. 9(b) is a schematic diagram illustrating an example of a layout model for illustrating paragraph division according to an exemplary embodiment of the present disclosure, in which the text-data sequences STDS1 and STDS2 have been assigned to paragraph P1. For the text-data sequences STDS3 and STDS4 in the current scan line, i.e., the third scan line, although their adjacent preceding scan line, i.e., the second scan line, contains the text-data sequence STDS2 whose overlap ratio exceeds the threshold overlap ratio, if the aforementioned rule (d) is taken into account, then because the horizontal overlap ratios of the plurality of text-data sequences STDS3 and STDS4 in the third scan line with STDS2 in the second scan line are all greater than the threshold overlap ratio, STDS3 and STDS4 are assigned to respective new paragraphs P2 and P3 under the combined effect of rules (a) and (d).
According to rules (c) and (d), if the typesetting form changes (for example, one scan line reflects un-columned typesetting while an adjacent scan line reflects columned typesetting), the text-data sequences in scan lines of different typesetting forms can be regarded as belonging to different paragraphs.
Note that if any two or more of the aforementioned rules (a) to (d) are used in combination, each of the combined rules is a sufficient condition for assigning the current text-data sequence to a new paragraph; that is, if any one rule holds, the current text-data sequence is assigned to a new paragraph. In other words, when these rules are used in combination, the current text-data sequence is assigned to the paragraph of the text-data sequence in the adjacent preceding scan line only when none of the combined rules holds. By means of any one or more of rules (b) to (d), text that is close together but may in fact belong to different paragraphs in certain application scenarios can be distinguished.
According to some implementations, when it is determined in step S607 that the text-data sequence of the current scan line belongs to a new paragraph, the coordinate information of that text-data sequence of the current scan line may be set as the coordinate information of the new paragraph. The "coordinate information of a paragraph" is, for example, the coordinate information of the smallest rectangle that can contain all the text-data sequences of the paragraph, and may be represented, for example, by a top-left coordinate (X1, Y1), a top-right coordinate (X2, Y1), a bottom-left coordinate (X1, Y2), and a bottom-right coordinate (X2, Y2). For example, the X axis may be assumed to have rightward as its positive direction and the Y axis downward as its positive direction. The method of the present disclosure can, however, also be implemented with coordinate systems of other orientations; the signs of the coordinates need only be adjusted according to the axis directions. If the current text-data sequence is determined to belong to a new paragraph, the new paragraph currently contains only the current text-data sequence, its top-left coordinate coincides with its bottom-left coordinate, and its top-right coordinate coincides with its bottom-right coordinate. If the start (e.g., left) coordinate and end (e.g., right) coordinate of the current text-data sequence are, for example, (CX1, CY1) and (CX2, CY1), respectively, the coordinate information of the new paragraph is: top-left (CX1, CY1), top-right (CX2, CY1), bottom-left (CX1, CY1), and bottom-right (CX2, CY1).
According to some implementations, when it is determined in step S609 that the text-data sequence of the current scan line is to be assigned to the paragraph to which the text-data sequence of the adjacent preceding scan line belongs, the current coordinate information of the paragraph may be updated on the basis of the coordinate information of the smallest rectangle that can contain both the current paragraph and the text-data sequence of the current scan line. According to some embodiments, assuming that the current paragraph has top-left coordinate (X1, Y1), top-right coordinate (X2, Y1), bottom-left coordinate (X1, Y2), and bottom-right coordinate (X2, Y2), and that the current text-data sequence has start coordinate (CX1, CY1) and end coordinate (CX2, CY1), the coordinate information of the paragraph updated to include the current text-data sequence is: top-left (min(X1, CX1), Y1), top-right (max(X2, CX2), Y1), bottom-left (min(X1, CX1), CY1), and bottom-right (max(X2, CX2), CY1), where min denotes taking the minimum and max taking the maximum.
Fig. 10 is a schematic diagram illustrating an example of a layout model for illustrating the paragraph coordinate information updating process according to an exemplary embodiment of the present disclosure. In the example of Fig. 10, it is determined that the text-data sequence STDS3 (the current text-data sequence) of the third scan line (the current scan line) is to be assigned to the paragraph P1 (the current paragraph) to which the text-data sequence STDS2 of the second scan line (the adjacent preceding scan line) belongs. In this case, since paragraph P1 now includes the text-data sequence STDS3, the coordinates of P1 can be updated from the coordinates of STDS3 in the manner described above. More particularly, the current coordinate information of paragraph P1 may be updated to the coordinate information of the smallest rectangle (i.e., the rectangle P1_UD) that can contain both the current paragraph P1 and the text-data sequence STDS3 of the current scan line. Since CX1 < X1 and CX2 > X2 in this example, the updated paragraph P1 (i.e., the rectangle P1_UD) has the following coordinate information: top-left (CX1, Y1), top-right (CX2, Y1), bottom-left (CX1, CY1), and bottom-right (CX2, CY1).
According to some implementations, instead of generating or updating the paragraph coordinates each time a text-data sequence is assigned to a paragraph as above, the coordinate information of each paragraph may be generated after all the text-data sequences of the layout model have been assigned to their paragraphs. In this case, the coordinate information of the smallest rectangle that can contain all the text-data sequences of a paragraph is taken as the coordinate information of that paragraph. Assuming that the text-data sequences of the paragraph have start coordinates (CX1i, CY1i) and end coordinates (CX2i, CY1i), where i denotes the i-th text-data sequence of the paragraph, the paragraph may have the following coordinate information: top-left (min(CX1i), min(CY1i)), top-right (max(CX2i), min(CY1i)), bottom-left (min(CX1i), max(CY1i)), and bottom-right (max(CX2i), max(CY1i)).
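The after-the-fact variant just described, taking the smallest rectangle over all of a paragraph's text-data sequences, reduces to four min/max reductions. This sketch assumes each sequence is recorded as a (row, start, end) tuple in model coordinates:

```python
def paragraph_bbox(runs):
    """Smallest rectangle enclosing all text-data sequences of one
    paragraph, each run given as (row, start, end). Returns
    (x1, y1, x2, y2) = (min start, min row, max end, max row), which is
    equivalent to the four corner coordinates listed above."""
    return (min(s for _, s, _ in runs), min(r for r, _, _ in runs),
            max(e for _, _, e in runs), max(r for r, _, _ in runs))
```

When the layout model was built at a reduced scale, multiplying the resulting corners by the scale factor maps the paragraph rectangle back into image coordinates.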
According to the exemplary method of the flowchart of Fig. 6, the paragraph division of the layout model is obtained upon completion of the scan of the layout model; the algorithm is relatively simple and the computational load relatively small.
Although not shown in the drawings, the method of the present disclosure may further include: after the paragraph division of the layout model is completed, mapping the coordinate information of the paragraphs obtained by dividing the layout model into paragraphs back into the image, so as to obtain the paragraph division in the image. When the size of the layout model is the same as that of the image, the coordinate information of the paragraphs in the image coincides with that of the paragraphs in the layout model. When the size of the layout model is scaled relative to that of the image, the coordinate information of the paragraphs in the image is obtained by applying the corresponding inverse scaling to the coordinate information of the paragraphs in the layout model.
Exemplary methods of layout analysis according to the present disclosure have been described above with reference to the drawings. After the layout analysis, subsequent processing may be performed; for example, in combination with the character recognition results, the text recognized in each paragraph may be converted, paragraph by paragraph, into sound data in accordance with the paragraph division result, which can be used, for example, in applications related to audiobooks and in assistive applications for the visually impaired.
One aspect of the present disclosure may include a reading assistance device. Fig. 11 is a structural block diagram illustrating a reading assistance device according to an exemplary embodiment of the present disclosure. As shown in Fig. 11, the reading assistance device 1100 includes: a sensor 1101 (which may be implemented, for example, as a camera or the like) configured to acquire the aforementioned image (the image may, for example, be a static image or a video image and may contain text); and a chip circuit 1103 comprising circuit units configured to perform the steps of any of the aforementioned methods. The chip circuit may further include a circuit unit configured to perform character recognition on the image to obtain text, and a circuit unit configured to convert the text of each paragraph into sound data in accordance with the paragraph division result. The circuit unit configured to perform character recognition on the image to obtain text may, for example, use any character recognition (e.g., optical character recognition, OCR) software or circuit, and the circuit unit configured to convert the text of each paragraph into sound data in accordance with the paragraph division result may, for example, use any text-to-speech software or circuit. These circuit units may be implemented, for example, by an ASIC chip or an FPGA chip. The reading assistance device 1100 may further include a sound output device 1105 (e.g., a speaker, earphones, etc.) configured to output the sound data (i.e., voice data).
One aspect of the present disclosure may include an electronic device, which may include a processor and a memory storing a program, the program comprising instructions which, when executed by the processor, cause the processor to perform any of the aforementioned methods. According to some embodiments, the program may further include instructions which, when executed by the processor, convert the text of each paragraph into sound data in accordance with the paragraph division result. According to some embodiments, such an electronic device may, for example, be a reading assistance device. According to some embodiments, such an electronic device may be another device (e.g., a mobile phone, computer, or server) that communicates with a reading assistance device. Where the electronic device is another device communicating with a reading assistance device, the reading assistance device may send the captured image to that other device, which performs any of the aforementioned methods and returns the processing results of the method (e.g., the layout analysis result, the character recognition result, and/or the sound data converted from the text) to the reading assistance device, which then performs the subsequent processing (e.g., playing the sound data to the user).
According to some implementations, the reading assistance device may be implemented as a wearable device, for example a device worn in the form of eyeglasses, a head-mounted device (e.g., a helmet or hat), a device wearable on the ear, an accessory attachable to eyeglasses (e.g., to the frame or temples), an accessory attachable to a hat, and so on.
With this reading assistance device, a visually impaired user can "read" conventional reading material (e.g., books, magazines) in a reading posture similar to that of a normally sighted reader. During the "reading" process, the reading assistance device automatically divides the captured layout image into paragraphs according to the methods of the foregoing embodiments, converts the text of the paragraphs into sound in the order given by the paragraph division, and outputs it through an output device such as a speaker or earphones for the user to listen to.
According to the present disclosure, a layout analysis method may be provided, comprising: obtaining coordinate information of one or more text lines in an image; generating a layout model corresponding to the image by setting text data in regions of a data structure that correspond to the coordinate information of the one or more text lines, the text data comprising data indicating the presence of text and being independent of the semantics of the text; and scanning the generated layout model scan line by scan line to read the text data in the layout model, and dividing the layout model into paragraphs on the basis of the relative positional relationships of the read text data within the layout model, the scan lines being data rows that traverse the layout model in the horizontal direction.
According to the present disclosure, a computer-implemented layout analysis method may be provided, comprising the following operations performed by a processor: obtaining coordinate information of one or more text lines in an image containing text content; generating a layout model corresponding to the image by setting text data in regions of a data structure that correspond to the coordinate information of the one or more text lines, the text data comprising data indicating the presence of text; scanning the generated layout model to obtain the text data from the layout model; dividing the layout model into paragraphs on the basis of the relative positional relationships of the obtained text data within the layout model; performing character recognition on the image to obtain text; and converting the text of each paragraph into sound data in accordance with the paragraph division result. According to the present disclosure, a reading assistance device may also be provided, comprising: a sensor configured to acquire an image containing text content; and an integrated circuit comprising a first circuit unit configured to perform the following operations: obtaining coordinate information of one or more text lines in the image; generating a layout model corresponding to the image by setting text data in regions of a data structure that correspond to the coordinate information of the one or more text lines, the text data comprising data indicating the presence of text; scanning the generated layout model to obtain the text data from the layout model; and dividing the layout model into paragraphs on the basis of the relative positional relationships of the obtained text data within the layout model. The integrated circuit further comprises a second circuit unit configured to perform character recognition on the image to obtain text, and a third circuit unit configured to convert the text of each paragraph into sound data in accordance with the paragraph division result. The reading assistance device further comprises a sound output device configured to output the sound data.
According to the present disclosure, a non-transitory computer-readable storage medium storing executable instructions may also be provided, the executable instructions, when executed by a processor of an electronic device, causing the electronic device to perform the following operations: obtaining coordinate information of one or more text lines in an image containing text content; generating a layout model corresponding to the image by setting text data in regions of a data structure that correspond to the coordinate information of the one or more text lines, the text data comprising data indicating the presence of text; scanning the generated layout model to obtain the text data from the layout model; dividing the layout model into paragraphs on the basis of the relative positional relationships of the read text data within the layout model; performing character recognition on the image to obtain text; and converting the text of each paragraph into sound data in accordance with the paragraph division result.
One aspect of the present disclosure may include a computer-readable storage medium storing a program, the program comprising instructions which, when executed by a processor of an electronic device, cause the electronic device to perform any of the aforementioned methods. With reference to Fig. 12, a computing device 2000 will now be described, which is an example of a hardware device applicable to aspects of the present disclosure. The computing device 2000 may be any machine configured to perform processing and/or computation, and may be, but is not limited to, a workstation, server, desktop computer, laptop computer, tablet computer, personal digital assistant, smartphone, in-vehicle computer, wearable device, or any combination thereof. According to some implementations, the aforementioned reading assistance device or electronic device may also be implemented, wholly or at least in part, by the computing device 2000 or a similar device or system.
The computing device 2000 may include elements connected to, or in communication with, a bus 2002 (possibly via one or more interfaces). For example, the computing device 2000 may include the bus 2002, one or more processors 2004 (which may be used to implement the processor or chip circuit contained in the aforementioned reading assistance device), one or more input devices 2006, and one or more output devices 2008. The one or more processors 2004 may be any type of processor and may include, but are not limited to, one or more general-purpose processors and/or one or more dedicated processors (e.g., special processing chips). The input devices 2006 may be any type of device capable of inputting information to the computing device 2000 and may include, but are not limited to, a sensor (e.g., the image-acquiring sensor described above), a mouse, a keyboard, a touchscreen, a microphone, and/or a remote control. The output devices 2008 may be any type of device capable of presenting information and may include, but are not limited to, a display, a speaker (which may, for example, be used to implement the output device for the sound data described above), a video/audio output terminal, a vibrator, and/or a printer. The computing device 2000 may also include, or be connected to, a non-transitory storage device 2010 (which may, for example, be used to implement the computer-readable storage medium described above), which may be any storage device that is non-transitory and capable of storing data, and may include, but is not limited to, a disk drive, an optical storage device, solid-state memory, a floppy disk, a flexible disk, a hard disk, magnetic tape, or any other magnetic medium; an optical disc or any other optical medium; ROM (read-only memory), RAM (random access memory), cache memory, and/or any other memory chip or cartridge; and/or any other medium from which a computer can read data, instructions, and/or code. The non-transitory storage device 2010 may be detachable from an interface. The non-transitory storage device 2010 may hold data/programs (including instructions)/code for implementing the methods and steps described above. The computing device 2000 may also include a communication device 2012, which may be any type of device or system enabling communication with external devices and/or with a network, and may include, but is not limited to, a modem, a network card, an infrared communication device, a wireless communication device, and/or a chipset, such as a Bluetooth device, an 802.11 device, a WiFi device, a WiMax device, a cellular communication device, and/or the like.
The computing device 2000 may also include a working memory 2014 (which may be used to implement the memory contained in the aforementioned reading assistance device), which may be any type of working memory capable of storing programs (including instructions) and/or data useful for the operation of the processor 2004, and may include, but is not limited to, random access memory and/or read-only memory devices.
Software elements (programs) may reside in the working memory 2014, including but not limited to an operating system 2016, one or more application programs 2018, drivers, and/or other data and code. Instructions for performing the methods and steps described above may be included in the one or more application programs 2018. Executable code or source code of the instructions of the software elements (programs) may be stored in a non-transitory computer-readable storage medium (such as the storage device 2010 described above) and, upon execution, may be loaded into the working memory 2014 (and possibly compiled and/or installed). Executable code or source code of the instructions of the software elements (programs) may also be downloaded from a remote location.
When the computing device 2000 shown in Fig. 12 is applied to an implementation of the present disclosure, the memory 2014 may store program code for executing the flowcharts of the present disclosure and/or an image containing text content to be recognized, and the applications 2018 may include optical character recognition applications provided by third parties (e.g., Adobe), voice conversion applications, editable word-processing applications, and the like. The input device 2006 may be a sensor for acquiring an image containing text content. The stored image containing text content, or the acquired image, may be processed by the OCR application into output results containing text; the output device 2008 is, for example, a speaker or earphones for voice broadcast; and the processor 2004 is configured to execute the method steps according to aspects of the present disclosure based on the program code in the memory 2014.
It should also be understood that various modifications may be made according to specific requirements. For example, custom hardware may also be used, and/or specific elements (such as the aforementioned chip circuit) may be implemented in hardware, software, firmware, middleware, microcode, hardware description languages, or any combination thereof. For example, some or all of the disclosed methods and devices (such as the circuit units of the aforementioned chip circuit) may be implemented by programming hardware (e.g., programmable logic circuits including field-programmable gate arrays (FPGAs) and/or programmable logic arrays (PLAs)) in an assembly language or a hardware programming language (such as VERILOG, VHDL, or C++) using the logic and algorithms according to the present disclosure.
It should also be understood that the components of the computing device 2000 may be distributed over a network. For example, one processor may be used to perform some processing while other processing is performed by another processor remote from it. Other components of the computing device 2000 may be similarly distributed. In this way, the computing device 2000 can be construed as a distributed computing system that performs processing at multiple locations.
Although embodiments or examples of the present disclosure have been described with reference to the drawings, it should be understood that the methods, systems, and devices described above are merely exemplary embodiments or examples, and the scope of the present invention is not limited by these embodiments or examples but is defined only by the granted claims and their equivalents. Various elements of the embodiments or examples may be omitted or replaced by equivalent elements. Furthermore, the steps may be performed in an order different from that described in the present disclosure. Further, the various elements of the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced by equivalent elements appearing after the present disclosure.

Claims (20)

  1. A layout analysis method, comprising:
    obtaining coordinate information of one or more text lines in an image;
    generating a layout model corresponding to the image by setting text data in regions of a data structure that correspond to the coordinate information of the one or more text lines, the text data comprising data indicating the presence of text; and
    scanning the generated layout model to read the text data in the layout model, and dividing the layout model into paragraphs on the basis of the relative positional relationships of the read text data within the layout model.
  2. The layout analysis method according to claim 1, wherein obtaining coordinate information of one or more text lines in an image comprises:
    performing character recognition on the image to obtain coordinate information of each character;
    taking a sequence of characters whose adjacent-character spacing is smaller than a threshold spacing as a text line; and
    obtaining coordinate information of respective rectangles containing the respective text lines as the respective coordinate information of the one or more text lines.
  3. The layout analysis method according to claim 1, wherein the regions corresponding to the coordinate information of the one or more text lines comprise: regions determined by the coordinate information of the respective text lines, and regions extended by a specific distance in the vertical direction from the coordinate information of those text lines.
  4. The layout analysis method according to claim 1, wherein generating the layout model corresponding to the image further comprises: setting blank data in regions of the data structure corresponding to non-text regions of the image, the blank data being data indicating the absence of text.
  5. The layout analysis method according to claim 4, wherein the text data is "1" and the blank data is "0".
  6. The layout analysis method according to claim 1, wherein dividing the layout model into paragraphs comprises: if the adjacent preceding scan line contains no text-data sequence whose horizontal overlap ratio with a text-data sequence in the current scan line is greater than a threshold overlap ratio, determining that the text-data sequence in the current scan line belongs to a new paragraph.
  7. The layout analysis method according to claim 1,
    wherein the text data contains data representing the height of a text line, and
    wherein dividing the layout model into paragraphs comprises: if the difference between the value of the text data of a text-data sequence in the current scan line and the value of the text data of a text-data sequence in the adjacent preceding scan line is greater than a threshold height difference, determining that the text-data sequence in the current scan line belongs to a new paragraph.
  8. The layout analysis method according to claim 1, wherein dividing the layout model into paragraphs comprises: if the horizontal overlap ratios of a text-data sequence in the current scan line with a plurality of text-data sequences in the adjacent preceding scan line are all greater than a threshold overlap ratio, determining that the text-data sequence in the current scan line belongs to a new paragraph.
  9. The layout analysis method according to claim 1, wherein dividing the layout model into paragraphs comprises: if a plurality of text-data sequences in the current scan line all have horizontal overlap ratios, with the same text-data sequence in the adjacent preceding scan line, greater than a threshold overlap ratio, determining that the plurality of text-data sequences in the current scan line belong to respective new paragraphs.
  10. The layout analysis method according to any one of claims 6-9, wherein determining that the text-data sequence in the current scan line belongs to a new paragraph comprises: setting the coordinate information of that text-data sequence in the current scan line as the coordinate information of the new paragraph.
  11. The layout analysis method according to claim 1, wherein, in the process of dividing the layout model into paragraphs, a necessary condition for assigning a text-data sequence in the current scan line to the paragraph to which a text-data sequence in the adjacent preceding scan line belongs comprises: the horizontal overlap ratio between the text-data sequence in the current scan line and the text-data sequence in the adjacent preceding scan line is greater than a threshold overlap ratio.
  12. The layout analysis method according to claim 1,
    wherein the text data contains data representing the height of a text line, and
    wherein, in the process of dividing the layout model into paragraphs, a necessary condition for assigning a text-data sequence in the current scan line to the paragraph to which a text-data sequence in the adjacent preceding scan line belongs comprises: the difference between the value of the text data of the text-data sequence in the current scan line and the value of the text data of the text-data sequence in the adjacent preceding scan line is not greater than a threshold height difference.
  13. The layout analysis method according to claim 11 or 12, wherein assigning the text-data sequence in the current scan line to the paragraph to which the text-data sequence in the adjacent preceding scan line belongs comprises: updating the current coordinate information of the paragraph on the basis of the coordinate information of the smallest rectangle that can contain both the current paragraph and the text-data sequence in the current scan line.
  14. The layout analysis method according to claim 11 or 12, wherein dividing the layout model into paragraphs further comprises: determining the coordinate information of a paragraph on the basis of the coordinate information of the smallest rectangle that can contain all the text-data sequences of that paragraph.
  15. The layout analysis method according to claim 1, further comprising: mapping the coordinate information of the paragraphs obtained by dividing the layout model into paragraphs into the image, so as to obtain the paragraph division in the image.
  16. A chip circuit, comprising:
    circuit units configured to perform the steps of the method according to any one of claims 1-15.
  17. A reading assistance device, comprising:
    a sensor configured to acquire an image; and
    the chip circuit according to claim 16, the chip circuit further comprising: a circuit unit configured to perform character recognition on the image to obtain text; and a circuit unit configured to convert the text of each paragraph into sound data in accordance with a paragraph division result;
    the reading assistance device further comprising a sound output device configured to output the sound data.
  18. An electronic device, comprising:
    a processor; and
    a memory storing a program, the program comprising instructions which, when executed by the processor, cause the processor to perform the method according to any one of claims 1-15.
  19. The electronic device according to claim 18, wherein the program further comprises instructions which, when executed by the processor, convert the text of each paragraph into sound data in accordance with a paragraph division result.
  20. A computer-readable storage medium storing a program, the program comprising instructions which, when executed by a processor of an electronic device, cause the electronic device to perform the method according to any one of claims 1-15.
PCT/CN2020/087877 2019-05-17 2020-04-29 Layout analysis method, reading assistance device, circuit, and medium WO2020233378A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910408950.0 2019-05-17
CN201910408950.0A CN109934210B (zh) 2019-05-17 2019-05-17 Layout analysis method, reading assistance device, circuit, and medium

Publications (1)

Publication Number Publication Date
WO2020233378A1 true WO2020233378A1 (zh) 2020-11-26

Family

ID=66991467

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/087877 WO2020233378A1 (zh) 2019-05-17 2020-04-29 版面分析方法、阅读辅助设备、电路和介质

Country Status (5)

Country Link
US (1) US10467466B1 (zh)
EP (1) EP3739505A1 (zh)
JP (1) JP6713141B1 (zh)
CN (1) CN109934210B (zh)
WO (1) WO2020233378A1 (zh)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6810892B2 * 2017-06-05 2021-01-13 Kyocera Document Solutions Inc. Image processing device
US11386636B2 2019-04-04 2022-07-12 Datalogic Usa, Inc. Image preprocessing for optical character recognition
CN109934210B (zh) * 2019-05-17 2019-08-09 NextVPU (Shanghai) Co., Ltd. Layout analysis method, reading assistance device, circuit, and medium
CN111126394A (zh) * 2019-12-25 2020-05-08 NextVPU (Shanghai) Co., Ltd. Character recognition method, reading assistance device, circuit, and medium
CN111062365B (zh) * 2019-12-30 2023-05-26 NextVPU (Shanghai) Co., Ltd. Method, device, chip circuit, and computer-readable storage medium for recognizing text with mixed typesetting
US11776286B2 2020-02-11 2023-10-03 NextVPU (Shanghai) Co., Ltd. Image text broadcasting
CN110991455B (zh) * 2020-02-11 2023-05-05 NextVPU (Shanghai) Co., Ltd. Image text broadcasting method and device, electronic circuit, and storage medium
CN113836971B (zh) * 2020-06-23 2023-12-29 China Life Asset Management Company Limited Visual information reproduction method, system, and storage medium for recognized image-type scanned documents
US11367296B2 2020-07-13 2022-06-21 NextVPU (Shanghai) Co., Ltd. Layout analysis
CN111832476A (zh) * 2020-07-13 2020-10-27 NextVPU (Shanghai) Co., Ltd. Layout analysis method, reading assistance device, circuit, and medium
CN113177532B (zh) * 2021-05-27 2024-04-05 Ping An Life Insurance Company of China, Ltd. Method, apparatus, device, and medium for recognizing paragraph boundaries of text in an image
TWI826293B (zh) * 2023-03-22 2023-12-11 Acer Inc. Method for automatically adjusting a video-conference layout and electronic device applying the same

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH06203020A (ja) * 1992-12-29 1994-07-22 Hitachi, Ltd. Text format recognition and generation method and apparatus
CN103577818A (zh) * 2012-08-07 2014-02-12 Beijing Baidu Netcom Science and Technology Co., Ltd. Method and apparatus for recognizing text in images
CN105512100A (zh) * 2015-12-01 2016-04-20 Peking University Layout analysis method and apparatus
CN109697414A (zh) * 2018-12-13 2019-04-30 Beijing Kingsoft Digital Entertainment Co., Ltd. Text positioning method and apparatus
CN109934210A (zh) * 2019-05-17 2019-06-25 NextVPU (Shanghai) Co., Ltd. Layout analysis method, reading assistance device, circuit, and medium

Family Cites Families (29)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3683923B2 * 1994-11-17 2005-08-17 Canon Inc. Method for ordering character regions
JPH096901A (ja) * 1995-06-22 1997-01-10 Oki Electric Ind Co Ltd Document reading device
US7031553B2 (en) * 2000-09-22 2006-04-18 Sri International Method and apparatus for recognizing text in an image sequence of scene imagery
US7668718B2 (en) * 2001-07-17 2010-02-23 Custom Speech Usa, Inc. Synchronized pattern recognition source data processed by manual or automatic means for creation of shared speaker-dependent speech user profile
US6768816B2 (en) * 2002-02-13 2004-07-27 Convey Corporation Method and system for interactive ground-truthing of document images
US7392472B2 (en) * 2002-04-25 2008-06-24 Microsoft Corporation Layout analysis
DE60330484D1 (de) * 2002-08-07 2010-01-21 Panasonic Corp Character recognition processing device, character recognition processing method, and mobile terminal
JP4466564B2 (ja) * 2003-09-08 2010-05-26 NEC Corp. Document creation and browsing device, document creation and browsing robot, and document creation and browsing program
JP3848319B2 (ja) * 2003-11-11 2006-11-22 Canon Inc. Information processing method and information processing apparatus
US7627142B2 (en) * 2004-04-02 2009-12-01 K-Nfb Reading Technology, Inc. Gesture processing with low resolution images with high resolution processing for optical character recognition for a reading machine
US9460346B2 (en) * 2004-04-19 2016-10-04 Google Inc. Handheld device for capturing text from both a document printed on paper and a document displayed on a dynamic display device
JP4227569B2 (ja) * 2004-07-07 2009-02-18 Canon Inc. Image processing system, control method for image processing apparatus, program, and recording medium
US7675641B2 (en) * 2004-10-28 2010-03-09 Lexmark International, Inc. Method and device for converting scanned text to audio data via connection lines and lookup tables
US7769772B2 (en) * 2005-08-23 2010-08-03 Ricoh Co., Ltd. Mixed media reality brokerage network with layout-independent recognition
KR100576370B1 (ko) * 2005-09-13 2006-05-03 Dream To Reality Co., Ltd. Apparatus for automatic content optimization on a portable display device
US20070124142A1 (en) * 2005-11-25 2007-05-31 Mukherjee Santosh K Voice enabled knowledge system
JP2007213176A (ja) * 2006-02-08 2007-08-23 Sony Corp Information processing apparatus and method, and program
US20070292026A1 (en) * 2006-05-31 2007-12-20 Leon Reznik Electronic magnification device
US8144361B2 (en) * 2008-03-18 2012-03-27 Konica Minolta Laboratory U.S.A., Inc. Creation and placement of two-dimensional barcode stamps on printed documents for storing authentication information
JP5321109B2 (ja) * 2009-02-13 2013-10-23 Fuji Xerox Co., Ltd. Information processing apparatus and information processing program
JP5663866B2 (ja) * 2009-08-20 2015-02-04 Fuji Xerox Co., Ltd. Information processing apparatus and information processing program
EP2490213A1 (en) * 2011-02-19 2012-08-22 beyo GmbH Method for converting character text messages to audio files with respective titles for their selection and reading aloud with mobile devices
CN102890826B (zh) * 2011-08-12 2015-09-09 Beijing Duokan Technology Co., Ltd. Method for re-typesetting scanned documents
US9911361B2 (en) * 2013-03-10 2018-03-06 OrCam Technologies, Ltd. Apparatus and method for analyzing images
US9466009B2 (en) * 2013-12-09 2016-10-11 Nant Holdings Ip. Llc Feature density object classification, systems and methods
CN106250830B (zh) 2016-07-22 2019-05-24 Zhejiang University Structured analysis and processing method for digital books
CN106484669B (zh) * 2016-10-14 2019-04-16 Dalian University of Technology Automatic typesetting method for classified-advertisement newspapers
US10127673B1 * 2016-12-16 2018-11-13 Workday, Inc. Word bounding box detection
CN106951400A (zh) * 2017-02-06 2017-07-14 北京因果树网络科技有限公司 Method and apparatus for extracting information from PDF files


Also Published As

Publication number Publication date
JP2020191056A (ja) 2020-11-26
JP6713141B1 (ja) 2020-06-24
CN109934210A (zh) 2019-06-25
US10467466B1 (en) 2019-11-05
CN109934210B (zh) 2019-08-09
EP3739505A1 (en) 2020-11-18

Similar Documents

Publication Publication Date Title
WO2020233378A1 (zh) Layout analysis method, reading assistance device, circuit, and medium
JP7132654B2 (ja) Layout analysis method, reading assistance device, circuit, and medium
US20220237812A1 Item display method, apparatus, and device, and storage medium
CN105184249A (zh) Method and apparatus for face image processing
WO2020233379A1 (zh) Layout analysis method, reading assistance device, circuit, and medium
US10296207B2 Capture of handwriting strokes
US11526272B2 Systems and methods for interactive image caricaturing by an electronic device
WO2020244074A1 (zh) Expression interaction method and apparatus, computer device, and readable storage medium
WO2020248346A1 (zh) Detection of text
CN111126394A (zh) Character recognition method, reading assistance device, circuit, and medium
US11270485B2 Automatic positioning of textual content within digital images
WO2022121842A1 (zh) Text image rectification method and apparatus, device, and medium
CN108111747A (zh) Image processing method, terminal device, and computer-readable medium
CN109376618B (zh) Image processing method and apparatus, and electronic device
US10796187B1 Detection of texts
US9524656B2 Sign language image input method and device
WO2022121843A1 (zh) Text image rectification method and apparatus, device, and medium
US11367296B2 Layout analysis
US11776286B2 Image text broadcasting
CN110969161B (zh) Image processing method, circuit, device for assisting the visually impaired, electronic device, and medium
CN112348069A (zh) Data augmentation method and apparatus, computer-readable storage medium, and terminal device
CN112488909A (zh) Multi-face image processing method, apparatus, device, and storage medium
US11417070B2 Augmented and virtual reality object creation
JP6399221B2 (ja) Presentation support device, presentation support method, and presentation support program
JP2009245165A (ja) Face recognition device, face recognition program, and face recognition method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20810242

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20810242

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 25.07.2022)
