WO2024131115A1 - Text sorting method and apparatus, electronic device, and storage medium - Google Patents

Text sorting method and apparatus, electronic device, and storage medium

Info

Publication number
WO2024131115A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
block
order
generate
reading order
Prior art date
Application number
PCT/CN2023/115049
Other languages
English (en)
French (fr)
Inventor
李晓川
郭振华
李仁刚
赵雅倩
范宝余
Original Assignee
苏州元脑智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州元脑智能科技有限公司
Publication of WO2024131115A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/16 Image preprocessing
    • G06V30/162 Quantising the image signal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/19 Recognition using electronic means
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to the technical field of character detection, and in particular to a text sorting method, a text sorting device, an electronic device and a non-volatile readable storage medium.
  • OCR Optical Character Recognition
  • the first is the position-based sorting algorithm.
  • This type of algorithm has low complexity and can achieve fast sorting, but this type of method has low accuracy and is more sensitive to the standardization of the text layout. For some special text images (such as ancient texts), the sorting accuracy is poor.
  • the other is a semantic-based sorting method. This type of method determines the order of text by the coherence between specific characters. It has high accuracy, but this type of algorithm is also limited by the degree of image standardization. For example, for some pages that read from right to left, reading each text line in the conventional way will reduce the accuracy of text sorting, resulting in overall disorder in the sorting.
  • embodiments of the present application are proposed to provide a text sorting method, a text sorting device, an electronic device and a non-volatile readable storage medium that overcome the above problems or at least partially solve the above problems.
  • the present application embodiment discloses a text sorting method, comprising:
  • the text image is recognized to obtain a text box set, the text box set including a plurality of text boxes;
  • Sort text based on the order of text blocks and the reading order within blocks.
  • clustering a plurality of text boxes to generate a text block set includes:
  • generating a text box mask includes:
  • dilating the text box mask includes:
  • determining the text block according to the overlap degree includes:
  • the first connected domain of the connected domain sequence is determined to be a text block.
  • before clustering the multiple text boxes to generate the text block set, the method further includes:
  • rotating the text image based on the offset angle includes:
  • determining the order of text blocks according to the text block set includes:
  • determining a target text block from a set of text blocks includes:
  • the first text block of the text block set after reverse reordering is determined as the target text block.
  • determining the text direction of the target text block includes:
  • the text direction is determined to be horizontal
  • the text orientation is determined to be vertical.
  • determining the reading order according to the text direction and the target text block includes:
  • the preset fluency scoring network is used to determine the probability of text fluency according to the horizontal target character string;
  • the reading order is determined to be left-right order
  • the reading order is determined to be right-left order.
  • determining the reading order according to the text direction and the target text block includes:
  • the preset start prediction scoring network is used to determine the left start probability according to the left vertical text column, and determine the right start probability according to the right vertical text column;
  • the reading order is determined to be left-right order
  • the reading order is determined to be right-left order.
  • the order of text blocks is determined in combination with the text direction and the reading order, including:
  • the reading order is applied cyclically along the text direction to generate the text block order.
  • determining the reading order within the block includes:
  • the text box is position-encoded to generate a text box position encoding vector
  • the preset coding model is used to output a coding feature group according to the input feature group;
  • the reading order within a block is determined based on the start probability within the block.
  • merging the text block sequence, the text box position encoding vector, and the text feature to generate a merged text feature includes:
  • determining the reading order within a block according to the start probability within the block includes:
  • the text boxes are sorted in reverse order based on the intra-block start probability to generate the intra-block reading order.
  • text sorting is performed based on the order of text blocks and the reading order within the blocks, including:
  • the present application also discloses a text sorting device, including:
  • a recognition module used for recognizing the text image when a text image input is detected, and obtaining a text box set, wherein the text box set includes a plurality of text boxes;
  • a clustering module used for clustering multiple text boxes to generate a text block set, where the text block set includes multiple text blocks;
  • a first order determination module is used to determine the order of text blocks according to the text block set
  • the second order determination module is used to determine the reading order within any text block
  • the sorting module is used to sort text according to the order of text blocks and the reading order within the blocks.
  • An embodiment of the present application also discloses an electronic device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor. When the computer program is executed by the processor, the steps of the above text sorting method are implemented.
  • the present application also discloses a non-volatile readable storage medium, wherein the non-volatile readable storage medium stores a computer program which, when executed by a processor, implements the steps of the above text sorting method.
  • the embodiment of the present application recognizes the text image when a text image input is detected, obtains a text box set, and the text box set includes multiple text boxes; clusters the multiple text boxes to generate a text block set, and the text block set includes multiple text blocks; determines the order of text blocks according to the text block set; determines the reading order within the block for any text block; and sorts the text according to the text block order and the reading order within the block.
  • the text boxes are clustered to generate a text block set, and the text block order is determined for that set, so that semantics and layout are combined to obtain the layout order of the text; the reading order within each block is then determined at the finer granularity of the individual text boxes, which improves the accuracy of text sorting; and the text block order and the reading order within the block are combined to obtain the overall reading order of the text, thereby improving the accuracy of text sorting.
  • FIG1 is a flowchart of a text sorting method embodiment of the present application.
  • FIG2 is a flowchart of another text sorting method embodiment of the present application.
  • FIG3 is a schematic diagram of text box recognition of the present application.
  • FIG4 is a schematic diagram of a rotated text image of the present application.
  • FIG5 is a schematic diagram of determining the order of text blocks of the present application.
  • FIG6 is a schematic diagram of steps of an example of a text sorting method of the present application.
  • FIG7 is a schematic diagram of text box clustering in an example of a text sorting method of the present application.
  • FIG8 is a schematic diagram of determining the order of text blocks in an example of a text sorting method of the present application.
  • FIG9 is a schematic diagram of determining the reading order within a block in an example of a text sorting method of the present application.
  • FIG10 is a structural block diagram of an embodiment of a text sorting apparatus of the present application.
  • FIG11 is a structural block diagram of an electronic device provided in an embodiment of the present application.
  • FIG. 12 is a structural block diagram of a non-volatile readable storage medium provided in an embodiment of the present application.
  • the text sorting method may specifically include the following steps:
  • Step 101 when a text image input is detected, the text image is recognized to obtain a text box set, the text box set including a plurality of text boxes;
  • a text image is an image with text.
  • text box recognition and prediction can be performed on the text image to obtain multiple text boxes, which form a text box set.
  • Step 102 clustering multiple text boxes to generate a text block set, where the text block set includes multiple text blocks;
  • All the obtained text boxes are clustered into several different regions according to their layout positions in the text.
  • a single region is a text block; multiple text blocks form a text block set.
  • text boxes are clustered into headers, body paragraphs and other text blocks.
  • Step 103 determining the order of the text blocks according to the text block set
  • the reading order of text blocks, that is, the text block order, is determined according to the layout relationship between the text blocks in the text block set.
  • Step 104 for any text block, determine the reading order within the block
  • the reading order between the text boxes in any text block is determined, which is the intra-block reading order.
  • Step 105 sorting the text according to the order of the text blocks and the reading order within the blocks.
  • the text blocks of the text are sorted based on the text block order, and then in each text block, the text boxes are sorted based on the reading order within the block, thereby completing the sorting of the text.
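The final combination in Step 105 can be sketched as follows. This is a minimal illustration, not the application's implementation: the function name `sort_text`, the block identifiers and the dictionary layout are assumptions made for the example.

```python
from typing import Dict, List


def sort_text(block_order: List[str],
              intra_block_order: Dict[str, List[str]]) -> List[str]:
    """Concatenate text boxes block by block.

    block_order: text block ids in reading order (result of Step 103).
    intra_block_order: for each block id, its text boxes in reading
    order (result of Step 104).
    """
    ordered: List[str] = []
    for block_id in block_order:
        ordered.extend(intra_block_order[block_id])
    return ordered


# Hypothetical data: one title block and two body blocks.
block_order = ["title", "body_1", "body_2"]
intra = {"body_1": ["b1", "b2"], "title": ["t1"], "body_2": ["c1", "c2"]}
result = sort_text(block_order, intra)
```

Here `result` lists the title box first, then the boxes of each body block in their intra-block order.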
  • the embodiment of the present application recognizes the text image when a text image input is detected, obtains a text box set, and the text box set includes multiple text boxes; clusters the multiple text boxes to generate a text block set, and the text block set includes multiple text blocks; determines the order of text blocks according to the text block set; determines the reading order within the block for any text block; and sorts the text according to the text block order and the reading order within the block.
  • the text boxes are clustered to generate a text block set, and the text block order is determined for that set, so that semantics and layout are combined to obtain the layout order of the text; the reading order within each block is then determined at the finer granularity of the individual text boxes, which improves the accuracy of text sorting; and the text block order and the reading order within the block are combined to obtain the overall reading order of the text, thereby improving the accuracy of text sorting.
  • the text sorting method may specifically include the following steps:
  • Step 201 when a text image input is detected, the text image is recognized to obtain a text box set, the text box set including a plurality of text boxes;
  • the text image can be input into a text box detector, which predicts multiple text boxes for the text image; the multiple text boxes form a text box set.
  • the text box prediction method can use a text detection algorithm such as PixelLink, CRAFT or PSENet.
  • Step 202 calculating the offset angles corresponding to the multiple text boxes
  • the offset angle corresponding to the recognized text box can be calculated, where the offset angle is the angle formed by the lower edge of the text box and the horizontal line.
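The offset angle described above can be sketched as the angle of the lower edge relative to the horizontal. The function name `offset_angle` and the representation of the lower edge by its two end points are assumptions for illustration; in image coordinates with the y axis pointing down, the sign of the angle is flipped relative to the usual mathematical convention.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def offset_angle(bottom_left: Point, bottom_right: Point) -> float:
    """Angle in degrees between a box's lower edge and the horizontal line."""
    dx = bottom_right[0] - bottom_left[0]
    dy = bottom_right[1] - bottom_left[1]
    return math.degrees(math.atan2(dy, dx))


# Hypothetical lower edges: one tilted, one perfectly horizontal.
angle = offset_angle((0.0, 0.0), (10.0, 10.0))
flat = offset_angle((0.0, 5.0), (10.0, 5.0))
```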
  • Step 203 rotating the text image based on the offset angle
  • the text image is rotated based on the offset angle, and the angle of the entire text image is corrected.
  • the tilted text direction in the text image is rotated to a horizontal direction or a direction close to the horizontal direction, so that the text direction in the text image is more standardized.
  • the step of rotating the text image based on the offset angle may specifically include the following sub-steps:
  • Sub-step S2031 collecting the offset angles to generate a first offset array
  • the offset angles of all text boxes can be aggregated to generate a first offset array.
  • the offset angles of all text boxes are 2, 3, 3, 2, 2, 2, 3, 9, 3, 3; the first offset array is [2, 3, 3, 2, 2, 2, 3, 9, 3, 3].
  • Sub-step S2032 determining the discrete values in the first offset array, and deleting the discrete values from the first offset array to generate a second offset array;
  • Discrete values, that is, values whose distribution obviously differs from the overall distribution of the array, are determined in the first offset array and deleted from it; screening the offset angles in this way prevents large error values from affecting the accuracy of the offset angle.
  • the second offset array is generated.
  • for example, if the first offset array is [2, 3, 3, 2, 2, 2, 3, 9, 3, 3], the value 9 is a discrete value and is deleted.
  • the element average of the second offset array, that is, the arithmetic mean of its elements, may then be calculated.
  • the calculated element average is 2.875.
  • Sub-step S2034 rotating the text image using the element average value.
  • the text image is rotated by using the element average value, so that the text direction in the text image is standardized. As shown in FIG4 , after the text image is rotated, the text direction is close to horizontal, which is more in line with the standard.
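The screening and averaging in sub-steps S2031 to S2034 can be sketched as below. The text does not say how discrete values are detected, so the rule used here (drop angles farther than `max_dev` degrees from the median) and the names `filter_discrete_values` and `max_dev` are assumptions. The rotation itself would be done with an image library and is not shown.

```python
from statistics import mean, median
from typing import List


def filter_discrete_values(angles: List[float],
                           max_dev: float = 3.0) -> List[float]:
    """Keep only angles within max_dev degrees of the median (assumed rule)."""
    m = median(angles)
    return [a for a in angles if abs(a - m) <= max_dev]


first_offset_array = [2, 3, 3, 2, 2, 2, 3, 9, 3, 3]
second_offset_array = filter_discrete_values(first_offset_array)
element_average = mean(second_offset_array)  # angle used to rotate the image
```

With the ten angles listed in the example, only 9 is removed, leaving nine values whose mean is 23/9, about 2.56; this differs from the 2.875 quoted in the example.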
  • Step 204 clustering the multiple text boxes to generate a text block set, where the text block set includes multiple text blocks;
  • All the identified text boxes are clustered into multiple different areas, namely text blocks; and text blocks are used to form a text block set.
  • all text boxes can be clustered into two headers and three body paragraphs, a total of five regions, namely, five text blocks.
  • the step of clustering multiple text boxes to generate a text block set may specifically include the following sub-steps:
  • Sub-step S2041 generating a text box mask for any text box
  • the step of generating a text box mask includes:
  • Sub-step S20411 for any text box, generating a corresponding binary image
  • a corresponding binary image can be generated based on the text box, that is, a binary image is initialized, and the pixel values where the text box exists are set to 1, and the rest of the background are set to 0.
  • Sub-step S20412 determining that the binary image is a text box mask.
  • the generated binary image is determined as a text box mask, so that in subsequent recognition, binarization processing can be performed to improve processing efficiency.
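A text box mask of this kind can be sketched with plain nested lists. The sketch assumes an axis-aligned box given as `(x0, y0, x1, y1)` with exclusive upper bounds; the function name `text_box_mask` is hypothetical.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1); x1 and y1 are exclusive


def text_box_mask(box: Box, width: int, height: int) -> List[List[int]]:
    """Binary image: 1 where the text box lies, 0 for the background."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0
             for x in range(width)]
            for y in range(height)]


# A 2-pixel-wide, 1-pixel-high box inside a 4 x 3 image.
mask = text_box_mask((1, 1, 3, 2), width=4, height=3)
```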
  • Sub-step S2042 dilating the text box mask
  • after the text box mask is obtained, it is dilated so as to improve the processing accuracy.
  • the steps of dilating the text box mask include:
  • Sub-step S20421 performing a dilation operation on the text box mask based on a preset convolution kernel.
  • the text box mask is expanded according to the size of the preset convolution kernel to obtain the expanded text box mask; wherein the size of the preset convolution kernel can be set according to actual needs, and the embodiment of the present application does not make any specific limitation on this.
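The dilation operation can be sketched without any image library. The square all-ones kernel and the default `kernel_size` of 3 are assumptions; the text leaves the kernel size open.

```python
from typing import List

Mask = List[List[int]]


def dilate(mask: Mask, kernel_size: int = 3) -> Mask:
    """Binary dilation with a square kernel_size x kernel_size kernel of ones."""
    h, w = len(mask), len(mask[0])
    r = kernel_size // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                # Every foreground pixel switches on its whole neighbourhood.
                for ny in range(max(0, y - r), min(h, y + r + 1)):
                    for nx in range(max(0, x - r), min(w, x + r + 1)):
                        out[ny][nx] = 1
    return out


mask = [[0, 0, 0, 0, 0],
        [0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0]]
dilated = dilate(mask, kernel_size=3)
```

A larger kernel merges text boxes that are farther apart, so the kernel size controls how coarse the resulting text blocks are.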
  • the dilated text box mask is divided into a plurality of mutually independent connected domains, which are used to determine the text block to which each text box belongs.
  • the dilated binary image can be divided into connected domains according to its pixels.
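Dividing a binary image into connected domains can be sketched with a breadth-first flood fill. The choice of 4-connectivity and the function name `label_connected_domains` are assumptions for illustration.

```python
from collections import deque
from typing import List, Tuple


def label_connected_domains(mask: List[List[int]]) -> Tuple[List[List[int]], int]:
    """Label 4-connected foreground regions 1..n; background stays 0."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not labels[sy][sx]:
                count += 1
                labels[sy][sx] = count
                queue = deque([(sy, sx)])
                while queue:
                    y, x = queue.popleft()
                    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                        if (0 <= ny < h and 0 <= nx < w
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels, count


mask = [[1, 1, 0, 0],
        [0, 0, 0, 1],
        [0, 0, 0, 1]]
labels, n = label_connected_domains(mask)
```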
  • Sub-step S2044 calculating the overlap between the text box and the connected domain
  • the overlap between the text box and each connected domain is calculated.
  • the formula for calculating the overlap between the text box and each connected domain is as follows:
  • box represents the current text box
  • area represents the current connected domain
  • count(*) represents the number of pixels in *.
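The formula itself is not reproduced in this text. One reading that is consistent with the symbols defined above is count(box ∩ area) / count(box), that is, the fraction of the text box's pixels that fall inside the connected domain. The sketch below implements that assumed reading only; the original formula may differ.

```python
from typing import List

Mask = List[List[int]]


def overlap(box_mask: Mask, area_mask: Mask) -> float:
    """Assumed overlap: count(box AND area) / count(box)."""
    intersection = sum(b & a
                       for box_row, area_row in zip(box_mask, area_mask)
                       for b, a in zip(box_row, area_row))
    box_pixels = sum(sum(row) for row in box_mask)
    return intersection / box_pixels if box_pixels else 0.0


box_mask = [[1, 1, 0],
            [1, 1, 0]]
area_mask = [[0, 1, 1],
             [0, 1, 1]]
degree = overlap(box_mask, area_mask)  # 2 shared pixels out of 4 box pixels
```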
  • Sub-step S2045 determining the text block according to the overlap degree
  • the text block of the text box is determined.
  • the steps of determining the text block according to the overlap degree include:
  • Sub-step S20451 sorting the connected domains in reverse order according to the overlap degree to generate a connected domain sequence
  • the connected domains can be sorted in reverse order according to the size of the overlap, that is, the connected domains can be sorted in descending order of the overlap; thus generating a connected domain sequence.
  • Sub-step S20452 determining that the first connected domain of the connected domain sequence is a text block.
  • the connected domain represents the position of the text where the text block is located.
  • the first connected domain in the connected domain sequence that is, the connected domain with the highest overlap, can be determined as the text block to which the text box belongs, that is, the text block is determined.
  • Sub-step S2046 combining the text blocks to generate a text block set.
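Sub-steps S20451 to S2046 can be sketched as follows, assuming the overlap degrees have already been computed per text box. The names `assign_text_block` and `build_text_block_set` and the dictionary layout are hypothetical.

```python
from typing import Dict, List


def assign_text_block(overlaps: Dict[int, float]) -> int:
    """Sort connected domains by overlap, descending, and take the first."""
    sequence = sorted(overlaps, key=overlaps.get, reverse=True)
    return sequence[0]


def build_text_block_set(
        box_overlaps: Dict[str, Dict[int, float]]) -> Dict[int, List[str]]:
    """Group text boxes by the connected domain each one overlaps most."""
    blocks: Dict[int, List[str]] = {}
    for box_id, overlaps in box_overlaps.items():
        blocks.setdefault(assign_text_block(overlaps), []).append(box_id)
    return blocks


# Hypothetical overlap degrees of three boxes with connected domains 1 and 2.
box_overlaps = {
    "box_a": {1: 0.9, 2: 0.1},
    "box_b": {1: 0.8, 2: 0.0},
    "box_c": {1: 0.0, 2: 1.0},
}
text_block_set = build_text_block_set(box_overlaps)
```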
  • Step 205 determining the order of the text blocks according to the text block set
  • the order of the text blocks is determined according to the relationship between the text blocks in the text block set, so as to obtain the layout order used when the text is sorted, such as vertical text left and right; vertical text right and left; horizontal text left and right; horizontal text right and left.
  • vertical text is read from top to bottom by default, so other overly special text forms (such as reading from bottom to top) can be ignored.
  • the step of determining the order of text blocks according to the text block set may include the following sub-steps:
  • Sub-step S2051 determining a target text block from the text block set
  • a target text block can be selected to represent the entire text block set, so that processing this single text block stands in for processing the whole set, thereby improving processing efficiency.
  • the step of determining the target text block from the text block set includes:
  • Sub-step S20511 for any text block, determine the number of text boxes
  • the number of text boxes in the text block can be calculated.
  • Sub-step S20512 re-sorting the text blocks in the text block set in reverse order based on the number of text boxes;
  • the text blocks in the text block set are reordered in reverse order based on the number of text boxes, so that the text blocks with more text boxes are placed at the front of the text block set, and vice versa.
  • Sub-step S20513 determining the first text block of the text block set after reverse reordering as the target text block.
  • the first text block of the text block set after reverse reordering that is, the text block with the largest number of text boxes is determined as the target text block.
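Selecting the target text block can be sketched as a descending sort by text box count. The function name and the sample data are hypothetical.

```python
from typing import Dict, List


def select_target_block(text_block_set: Dict[str, List[str]]) -> str:
    """Return the id of the text block that contains the most text boxes."""
    reordered = sorted(text_block_set.items(),
                       key=lambda item: len(item[1]),
                       reverse=True)
    return reordered[0][0]


text_block_set = {
    "header": ["h1"],
    "body": ["b1", "b2", "b3"],
    "footer": ["f1", "f2"],
}
target = select_target_block(text_block_set)
```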
  • Sub-step S2052 determining the text direction of the target text block
  • the step of determining the text direction of the target text block includes:
  • Sub-step S20521 calculating the width and height of the text box in the target text block
  • the width and height of all text boxes in the target text block can be calculated.
  • the text direction is determined by judging the relationship between the width and the height.
  • the average width and average height of all text boxes can be used for comparison.
  • Sub-step S20522 when the width is greater than the height, determining that the text direction is horizontal;
  • when the width of the text box is greater than the height of the text box, that is, the text in the text block is arranged horizontally, it can be determined that the text direction is horizontal.
  • Sub-step S20523 when the width is not greater than the height, determine that the text direction is vertical.
  • when the width of the text box is not greater than the height of the text box, that is, the text in the text block is arranged vertically, it can be determined that the text direction is vertical.
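The direction test can be sketched using the average width and average height mentioned above. Representing each text box as a `(width, height)` pair is an assumption for the example.

```python
from statistics import mean
from typing import List, Tuple


def text_direction(boxes: List[Tuple[float, float]]) -> str:
    """boxes holds (width, height) pairs of the target block's text boxes."""
    avg_width = mean(w for w, _ in boxes)
    avg_height = mean(h for _, h in boxes)
    return "horizontal" if avg_width > avg_height else "vertical"


wide_boxes = [(120, 20), (100, 22)]   # long, flat lines of text
tall_boxes = [(18, 150), (20, 140)]   # narrow, tall columns of text
direction_wide = text_direction(wide_boxes)
direction_tall = text_direction(tall_boxes)
```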
  • the reading order is determined based on the text direction and the character order of the target text block.
  • when the text direction is horizontal, the step of determining the reading order according to the text direction and the target text block includes:
  • Sub-step S20531 determining a horizontal target character string of the target text block, where the horizontal target character string is the first horizontal line character string of the target text block;
  • the first horizontal line of the target text block that is, the top line of the target text block, can be determined as the horizontal target string.
  • Sub-step S20532 inputting the horizontal target character string into a preset fluency scoring network in order from left to right, the preset fluency scoring network is used to determine the text fluency probability according to the horizontal target character string;
  • the horizontal target string is input into the preset fluency scoring network character by character from left to right.
  • the preset fluency scoring network determines the text fluency probability according to the character relationship of the horizontal target string, and outputs the text fluency probability.
  • the preset fluency scoring network is obtained by fine-tuning the BERT model based on the text fluency training set, and can output the text fluency probability at the corresponding position of the [CLS] (field name) field outputted by it.
  • the text fluency probability can be represented by a decimal in [0, 1] or a percentage.
  • Sub-step S20533 reading the text fluency probability
  • the text fluency probability is read from a specific field of the output information of the preset fluency scoring network.
  • Sub-step S20534 when the text fluency probability is greater than a preset fluency threshold, determining the reading order to be left-right order;
  • when the text fluency probability is greater than the preset fluency threshold, the reading order can be determined to be left-right order, that is, the text is read from left to right.
  • the preset fluency threshold can be determined according to actual conditions, and the embodiment of the present application does not specifically limit this.
  • for example, the preset fluency threshold is 0.5.
  • Sub-step S20535 when the text fluency probability is not greater than the preset fluency threshold, determining the reading order to be right-left order.
  • the reading order can be determined to be right-left order, that is, the text is read from right to left.
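The threshold decision for horizontal text can be sketched as below. The fine-tuned BERT scorer is replaced by a toy stand-in (`toy_scorer`), which is purely hypothetical; only the comparison against the threshold follows the text.

```python
from typing import Callable


def horizontal_reading_order(first_line: str,
                             fluency_scorer: Callable[[str], float],
                             threshold: float = 0.5) -> str:
    """Score the top line as read left to right and compare to the threshold."""
    probability = fluency_scorer(first_line)
    return "left-right" if probability > threshold else "right-left"


def toy_scorer(text: str) -> float:
    """Stand-in for the preset scoring network; not a real model."""
    return 0.9 if text == "the quick brown fox" else 0.1


order_ltr = horizontal_reading_order("the quick brown fox", toy_scorer)
order_rtl = horizontal_reading_order("xof nworb kciuq eht", toy_scorer)
```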
  • when the text direction is vertical, the step of determining the reading order according to the text direction and the target text block includes:
  • Sub-step S20536 determining the left vertical text column and the right vertical text column of the target text block
  • the left vertical text column and the right vertical text column of the target text block can be determined, that is, the leftmost column of strings in the target text block can be the left vertical text column; the rightmost column of strings in the target text block can be the right vertical text column.
  • Sub-step S20537 inputting the left vertical text column and the right vertical text column into a preset start prediction scoring network, the preset start prediction scoring network is used to determine the left start probability according to the left vertical text column, and determine the right start probability according to the right vertical text column;
  • the left vertical text column and the right vertical text column are input into the preset start prediction scoring network.
  • the preset start prediction scoring network determines the probability that reading starts from the left according to the left vertical text column, that is, the left start probability; similarly, it determines the probability that reading starts from the right according to the right vertical text column, that is, the right start probability. The left start probability and the right start probability are output in separate output information.
  • the preset start prediction scoring network is obtained by fine-tuning the BERT model based on the text start prediction scoring network training set, and the left start probability or the right start probability can be output at the corresponding position of the [CLS] field of its output.
  • the left start probability and the right start probability can be represented by decimals in [0, 1] or by percentages.
  • Sub-step S20538 reading the left start probability and the right start probability
  • the left start probability is read from the feature field of the output information of the left vertical text column recognition by the preset start prediction scoring network
  • the right start probability is read from the feature field of the output information of the right vertical text column recognition by the preset start prediction scoring network.
  • Sub-step S20539 when the left-side start probability is greater than the right-side start probability, determining the reading order to be the left-right order;
  • the left-start probability and the right-start probability can be compared.
  • when the left-start probability is greater than the right-start probability, that is, the text starts reading from the left, it can be determined that the reading order is left-right order, that is, reading from left to right.
  • Sub-step S205310 when the right-start probability is greater than the left-start probability, the reading order is determined to be right-left order.
  • the reading order is right-left order, that is, reading from right to left.
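The comparison for vertical text can be sketched in the same way. The start prediction network is again replaced by a hypothetical toy scorer, and resolving a tie as right-left is an assumption, since the text does not cover equal probabilities.

```python
from typing import Callable


def vertical_reading_order(left_column: str,
                           right_column: str,
                           start_scorer: Callable[[str], float]) -> str:
    """Compare the start probabilities of the outermost columns."""
    left_start = start_scorer(left_column)
    right_start = start_scorer(right_column)
    # Assumption: equal probabilities fall through to right-left.
    return "left-right" if left_start > right_start else "right-left"


def toy_start_scorer(column: str) -> float:
    """Stand-in for the preset scoring network; not a real model."""
    return 0.8 if column.startswith("Chapter") else 0.2


order_a = vertical_reading_order("Chapter one", "the end", toy_start_scorer)
order_b = vertical_reading_order("the end", "Chapter one", toy_start_scorer)
```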
  • Sub-step S2054 determining the order of text blocks in combination with the text direction and the reading order.
  • first, it is determined whether the text of the text block runs horizontally or vertically. Then, for that direction, the reading order is determined to be left-right or right-left, so that the text block order is one of: vertical text left and right; vertical text right and left; horizontal text left and right; or horizontal text right and left.
  • the steps of determining the order of text blocks in combination with the text direction and the reading order include:
  • Sub-step S20541 cyclically applying the reading order along the text direction to generate the text block order.
  • the text blocks in the same layer are traversed in the reading order to determine the order within each layer, so that the ordering of all text blocks satisfies both the text direction and the reading order.
  • the horizontal and vertical coordinates respectively represent the coordinates of the center points of different text blocks in the image, and the serial numbers 1 and 2 represent the priority.
  • in the "horizontal" + "left->right" mode, the text blocks are first arranged according to the size of the vertical coordinate (text blocks whose vertical coordinates differ by less than the set threshold are considered to have the same vertical coordinate).
  • text blocks that cannot be ordered by their vertical coordinates are sorted from small to large according to the horizontal coordinate.
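The "horizontal" + "left->right" ordering can be sketched as a row grouping followed by a sort inside each row. Comparing each block against the first block of the current row, the default `y_threshold` and all names are assumptions for illustration.

```python
from typing import Dict, List, Tuple


def order_blocks_horizontal_ltr(centers: Dict[str, Tuple[float, float]],
                                y_threshold: float = 10.0) -> List[str]:
    """Order blocks top to bottom, then left to right within a row.

    centers maps a block id to the (x, y) centre of that text block.
    """
    by_y = sorted(centers.items(), key=lambda item: item[1][1])
    rows: List[Tuple[float, List[Tuple[float, str]]]] = []
    for block_id, (x, y) in by_y:
        if rows and abs(y - rows[-1][0]) < y_threshold:
            rows[-1][1].append((x, block_id))   # same row as the previous block
        else:
            rows.append((y, [(x, block_id)]))   # start a new row
    order: List[str] = []
    for _, row in rows:
        order.extend(block_id for _, block_id in sorted(row))
    return order


centers = {"A": (300, 52), "B": (100, 50), "C": (100, 200), "D": (300, 205)}
block_order = order_blocks_horizontal_ltr(centers)
```

The other three modes would follow by swapping the roles of the two coordinates or reversing the sort within a row.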
  • Step 206 for any text block, determine the reading order within the block
  • the step of determining the reading order within the block includes the following sub-steps:
  • Sub-step S2061 for any text block, position encoding of the text box is performed to generate a text box position encoding vector
  • the encoding method adopts the encoding form used by ViLBERT (Vision-and-Language BERT, a visual language pre-training model): sine and cosine trigonometric functions encode the odd and even dimensions of the feature respectively to generate an encoding vector. An additional encoding is added for the text block order: the four modes from the previous step (vertical text left and right, vertical text right and left, horizontal text left and right, and horizontal text right and left) are set to 0, 1, 2, and 3 respectively, each is initialized to an embedding vector, and the embedding vector is added to the encoding vector to generate the text box position encoding vector.
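A minimal sketch of such a position encoding follows. Several details are assumptions rather than statements of the application: sine is placed on even indices and cosine on odd ones, the frequency base 10000 follows the common Transformer convention, a single scalar `position` stands in for the text box position, and the mode embeddings are randomly initialized where a real model would learn them.

```python
import math
import random
from typing import List

MODES = {
    "vertical-left-right": 0,
    "vertical-right-left": 1,
    "horizontal-left-right": 2,
    "horizontal-right-left": 3,
}


def sinusoidal_encoding(position: float, dim: int) -> List[float]:
    """Sine on even indices, cosine on odd indices (assumed assignment)."""
    encoding = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        value = position * freq
        encoding.append(math.sin(value) if i % 2 == 0 else math.cos(value))
    return encoding


def make_mode_embeddings(dim: int, seed: int = 0) -> List[List[float]]:
    """One randomly initialized embedding vector per text block order mode."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in MODES]


def text_box_position_vector(position: float, mode: str,
                             mode_embeddings: List[List[float]]) -> List[float]:
    """Add the mode embedding to the sinusoidal encoding, element by element."""
    dim = len(mode_embeddings[0])
    encoding = sinusoidal_encoding(position, dim)
    embedding = mode_embeddings[MODES[mode]]
    return [e + m for e, m in zip(encoding, embedding)]


embeddings = make_mode_embeddings(dim=8)
vector = text_box_position_vector(3.0, "horizontal-left-right", embeddings)
base = sinusoidal_encoding(0.0, 4)
```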
  • Sub-step S2062 obtaining text features corresponding to the text box
  • the text recognition result corresponding to each text box can be obtained, and the text features of the text recognition result can be extracted therefrom.
  • the BERT model can be used to extract text features from the text recognition results.
  • Those skilled in the art can also choose other methods to extract text features according to actual conditions, which is not limited to this.
  • Sub-step S2063 merging the text block sequence, the text box position encoding vector and the text feature to generate a merged text feature
  • the text block sequence, text box position encoding vector and text feature are combined to generate a combined text feature, which is the position feature of the text box.
  • the steps of merging the text block sequence, the text box position encoding vector and the text feature to generate the merged text feature include:
  • Sub-step S20631 mapping text features to text block order
  • the text features can be first mapped in the fully connected layer of the model, and the text features can be mapped to the text block order to ensure that the mapped text features and the text box position encoding vector have the same size after mapping.
  • Sub-step S20632 adding the mapped text features to the text box position encoding vector to generate a merged text feature.
  • the mapped text features are then added to the text box position encoding vector to generate the merged text features.
  • Sub-step S2064 stacking and merging text features to generate an input feature group
  • the merged text features can be stacked to generate an input feature group. If the merged text feature size is d-dimensional, the merged text features are stacked to obtain an input feature group of [N, d], where N is the number of text boxes in the block.
  • Sub-step S2065 inputting the input feature group to a preset coding model, the preset coding model is used to output a coding feature group according to the input feature group;
  • the input feature group is input into the preset coding model, which outputs, according to the input feature group, a coding feature group reflecting the likelihood that each text box starts the block; the coding feature group characterizes the encoded position features of the text boxes.
  • the preset coding model can be obtained by fine-tuning the BERT model based on the input feature group training set.
  • Sub-step S2066 obtaining a coding feature group, and determining a starting probability in a block corresponding to the coding feature group;
  • a coding feature group is obtained from the information output by the preset coding model, and the starting probability in the block corresponding to the coding feature is determined to determine the probability that the text box in the text block is located at the starting position.
  • Sub-step S2067 determining the reading order within the block according to the start probability within the block.
  • the text box at the starting position is determined, so that other text boxes are sequentially generated in the reading order within the block.
  • the step of determining the reading order within a block according to the starting probability within the block includes:
  • Sub-step S20671 sorting the text boxes in reverse order based on the start probability within the block to generate a reading order within the block.
  • the text boxes in the text block can be sorted in reverse order of their start probability within the block, that is, text boxes with a high start probability are placed first and those with a low start probability last, to generate the reading order within the block.
  • Step 207 sorting the text according to the text block order and the reading order within the block.
  • the step of sorting text according to the text block order and the reading order within the block may include the following sub-steps:
  • Sub-step S2071 sorting the text blocks using the text block sequence
  • the text blocks in the text set may be sorted using the text block order.
  • Sub-step S2072 sorting the text boxes using the intra-block reading order to complete text sorting.
  • the text boxes within each text block are sorted according to the reading order within the block to complete the text sorting.
  • the embodiment of the present application is to identify the text image when a text image input is detected, obtain a text box set, the text box set includes multiple text boxes; cluster the multiple text boxes to generate a text block set, the text block set includes multiple text blocks; calculate the offset angles corresponding to the multiple text boxes; rotate the text image based on the offset angles; determine the order of the text blocks according to the text block set; determine the reading order within the block for any text block; and sort the text according to the text block order and the reading order within the block.
  • Rotating the text image standardizes the direction of the text blocks, which eases subsequent processing and improves the accuracy of direction judgment and therefore of sorting. Clustering the text boxes into a text block set and determining the text block order for that set combines semantics with layout to obtain the order of the text layout. Sorting inside each text block determines the reading order within the block; because this sorting works at text-box granularity, it improves the precision of text sorting. Combining the text block order with the reading order within the block gives the overall reading order of the text, which further improves the accuracy of text sorting.
  • An image with text is input into a text box detector for text box prediction to obtain several detected text boxes, and the text boxes are combined to generate a text box set.
  • These text boxes are clustered to form several different regions to generate text blocks, which are then combined to form a text block set.
  • For text box clustering, refer to Figure 7: text box masks are generated for the text boxes in the text box set and dilated; the dilated masks are divided into connected domains; the text boxes are classified according to their overlap with the connected domains; and the resulting text box clusters are output as text blocks.
  • the text image is angle-corrected to make the text in the text image more standardized and more suitable for subsequent calculations.
  • the obtained text block set is processed in two ways:
  • Text block pattern prediction Analyze the reading order of all text blocks and output the text block sorting result, i.e., the text block order.
  • For text block mode prediction, refer to Figure 8: a text block is selected from the text block set as the target text block. The direction of the text boxes in the target text block is judged to determine whether the text is horizontal or vertical. When it is horizontal, the first line of the horizontal text is input into the coherence scoring network to determine the left-right order. When it is vertical, the leftmost text column and the rightmost text column are input into the start prediction network to determine the left-right order.
  • Intra-block text sorting, i.e., the intra-block sorting algorithm in FIG. 6: predicting the reading order of the text boxes within each text block and outputting the intra-block text box sorting result.
  • The text box set can be processed with reference to FIG. 9: each text box is position-encoded to obtain the text box position code, and text features are extracted from the text in the text box. The text box position code is then merged with the text features to generate merged text features. The merged text features are scored, and the text boxes in the block are sorted according to the scores.
  • the text sorting device may specifically include the following modules:
  • the recognition module 1001 is used to recognize the text image when a text image input is detected, and obtain a text box set, where the text box set includes a plurality of text boxes;
  • a clustering module 1002 is used to cluster multiple text boxes to generate a text block set, where the text block set includes multiple text blocks;
  • the second order determination module 1004 is used to determine the reading order within any text block
  • the sorting module 1005 is used to sort the text according to the order of text blocks and the reading order within the blocks.
  • the clustering module 1002 includes:
  • a text box mask generation submodule is used to generate a text box mask for any text box
  • the expansion submodule is used to expand the text box mask
  • a division submodule is used to divide the expanded text box mask into multiple connected domains
  • the overlap calculation submodule is used to calculate the overlap between the text box and the connected domain
  • a text block determination submodule is used to determine a text block based on the degree of overlap
  • the text block set generation submodule is used to combine text blocks to generate a text block set.
  • the text box mask generation submodule includes:
  • a binary image generating unit used for generating a corresponding binary image for any text box
  • the text box mask generating unit is used to determine that the binary image is a text box mask.
  • the expansion submodule includes:
  • the dilation unit is used to perform a dilation operation on the text box mask based on a preset convolution kernel.
  • the text block determination submodule includes:
  • a connected domain sequence generating unit used for sorting the connected domains in reverse order according to the overlap degree to generate a connected domain sequence
  • the text block determination unit is used to determine the first connected domain of the connected domain sequence as a text block.
  • the device further includes:
  • An offset angle calculation module is used to calculate the offset angles corresponding to multiple text boxes
  • the rotation module is used to rotate the text image based on the offset angle.
  • the rotation module includes:
  • a first offset array generation submodule used for collecting offset angles and generating a first offset array
  • a second offset array generating submodule used for determining discrete values in the first offset array, and deleting the discrete values from the first offset array to generate a second offset array;
  • An element average value calculation submodule used to calculate the average value of elements in the second offset array
  • the rotation submodule is used to rotate the text image by the element average value.
  • the first order determination module 1003 includes:
  • a target text block determination submodule is used to determine a target text block from a text block set
  • a text direction determination submodule used to determine the text direction of the target text block
  • a reading order determination submodule for determining a reading order according to a text direction and a target text block
  • the text block order determination submodule is used to determine the text block order by combining the text direction and the reading order.
  • the target text block determination submodule includes:
  • a text box quantity determination unit used for determining the number of text boxes for any text block
  • a reordering unit used for reordering the text blocks in the text block set in reverse order based on the number of text boxes;
  • the target text block determination unit is used to determine the first text block of the text block set after reverse reordering as the target text block.
  • the text direction determination submodule includes:
  • Height and width calculation unit used to calculate the width and height of the text box in the target text block
  • a first text direction determining unit configured to determine that the text direction is horizontal when the width is greater than the height
  • the second text direction determining unit is used to determine that the text direction is vertical when the width is not greater than the height.
  • the reading order determination submodule when the text direction is horizontal, includes:
  • a target character string determination unit used to determine a horizontal target character string of a target text block, wherein the horizontal target character string is the first horizontal line character string of the target text block;
  • a first input unit is used to input the horizontal target character string into a preset coherence scoring network in order from left to right, and the preset coherence scoring network is used to determine the text coherence probability according to the horizontal target character string;
  • a first reading unit used for reading the text coherence probability
  • a first reading order determination unit configured to determine the reading order as left-right order when the probability of text coherence is greater than a preset coherence threshold
  • the second reading order determining unit is used to determine the reading order as right-to-left order when the text coherence probability is not greater than a preset coherence threshold.
  • the reading order determination submodule when the text direction is vertical, includes:
  • a text column determination unit used to determine a left vertical text column and a right vertical text column of a target text block
  • a second input unit used for inputting the left vertical text column and the right vertical text column into a preset start prediction scoring network, where the preset start prediction scoring network is used to determine the left start probability according to the left vertical text column, and determine the right start probability according to the right vertical text column;
  • a second reading unit used for reading the left start probability and the right start probability
  • a third reading order determining unit configured to determine the reading order as a left-right order when the left-start probability is greater than the right-start probability
  • the fourth reading order determining unit is used to determine the reading order as right-left order when the right-start probability is greater than the left-start probability.
  • the text block order determination submodule includes:
  • the circular sorting unit is used to generate the text block sequence according to the text direction and circular reading order.
  • the second sequence determination module 1004 includes:
  • the encoding submodule is used to perform position encoding on the text box for any text block and generate a text box position encoding vector
  • a text feature acquisition submodule is used to obtain text features corresponding to the text box
  • a merging submodule is used to merge the text block sequence, text box position encoding vector and text features to generate merged text features
  • the stacking submodule is used to stack and merge text features to generate an input feature group
  • An input feature group input submodule is used to input an input feature group to a preset coding model, and the preset coding model is used to output a coding feature group according to the input feature group;
  • the submodule for determining the starting probability within a block is used to obtain a coding feature group and determine the starting probability within a block corresponding to the coding feature group;
  • the submodule for determining the reading order within a block is used to determine the reading order within a block according to the starting probability within the block.
  • the merging submodule includes:
  • a mapping unit used for mapping text features to text block sequences
  • the adding unit is used to add the mapped text features to the text box position encoding vector to generate a merged text feature.
  • the intra-block reading order determination submodule includes:
  • the text box sorting unit is used to sort the text boxes in reverse order based on the start probability within the block to generate a reading order within the block.
  • the sorting module 1005 includes:
  • a first text sorting submodule is used to sort the text blocks using the text block order
  • the second text sorting submodule is used to sort the text boxes according to the reading order within the block to complete the text sorting.
  • the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.
  • an embodiment of the present application further provides an electronic device, including:
  • a processor 1101 and a non-volatile readable storage medium 1102, where the storage medium stores a computer program executable by the processor 1101.
  • the processor 1101 executes the computer program to perform the text sorting method of any one of the embodiments of the present application.
  • the specific implementation method and technical effect are similar to those of the method embodiment, and will not be repeated here.
  • the memory may include a random access memory (RAM) or a non-volatile memory, such as at least one disk memory.
  • a storage device is located remotely from the processor.
  • processors can be general-purpose processors, including central processing units (CPU), network processors (NP), etc.; they can also be digital signal processors (DSP), application specific integrated circuits (ASIC), field programmable gate arrays (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components.
  • the embodiment of the present application further provides a non-volatile readable storage medium 1201, on which a computer program is stored, and when the computer program is executed by a processor, the text sorting method of any one of the embodiments of the present application is executed.
  • the specific implementation method and technical effect are similar to those of the method embodiment, and will not be repeated here.
  • the embodiments of the present application can be provided as methods, devices, or computer program products. Therefore, the present application can adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment in combination with software and hardware. Moreover, the present application can adopt the form of a computer program product implemented on one or more computer-usable non-volatile readable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
  • each process and/or box in the flowchart and/or block diagram, and the combination of the process and/or box in the flowchart and/or block diagram can be realized by computer program instructions.
  • These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor or other programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce a device for realizing the function specified in one process or multiple processes in the flowchart and/or one box or multiple boxes in the block diagram.
  • These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing terminal device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured product including an instruction device that implements the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.
  • These computer program instructions can also be loaded onto a computer or other programmable data processing terminal device so that a series of operating steps are executed on the computer or other programmable terminal device to produce computer-implemented processing, so that the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more processes in the flowchart and/or one or more boxes in the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Character Input (AREA)

Abstract

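Embodiments of the present application provide a text sorting method and device, an electronic device, and a non-volatile readable storage medium. The text sorting method includes: when a text image input is detected, recognizing the text image to obtain a text box set, the text box set including a plurality of text boxes; clustering the plurality of text boxes to generate a text block set, the text block set including a plurality of text blocks; determining a text block order according to the text block set; for any text block, determining an intra-block reading order; and sorting the text according to the text block order and the intra-block reading order. The embodiments of the present application can improve the precision of text sorting and thereby its accuracy.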

Description

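A text sorting method, device, electronic device and storage medium
Cross-Reference to Related Applications
This application claims priority to Chinese patent application No. 202211654994.X, filed with the China Patent Office on December 22, 2022 and entitled "A text sorting method, device, electronic device and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the technical field of character detection, and in particular to a text sorting method, a text sorting device, an electronic device, and a non-volatile readable storage medium.
Background
Research on optical character recognition (OCR) falls within computer vision and artificial intelligence. In recent years, with the rapid development of multimodal cognition, collaborative reasoning across multiple modalities has become one of the mainstream directions of AI (Artificial Intelligence), and OCR, as a technology that can extend an image into image plus text, has been widely applied. Recently, fields such as text VQA (question answering about the text in an image) have become increasingly popular, and techniques for extracting text from images have again become a research hotspot. Current OCR research is mainly divided into two directions: the first is text detection, which aims to box all the text (i.e., character strings) in an image; the second is character recognition, which aims to recognize the characters present in the boxed regions. However, algorithms for ordering the recognized text have rarely been studied. For text-rich images, the content of the text can be recognized effectively only if the recognized text is arranged in the correct order.
Existing approaches to the text sorting task fall into two categories. The first is position-based sorting. These algorithms have low complexity and sort quickly, but their precision is low, they are sensitive to how regular the text layout is, and they sort some special text images (such as classical texts) poorly. The other is semantics-based sorting, which judges the order of the text from the coherence between specific characters and is more precise, but it is also limited by how regular the image is. For example, for pages that are read from right to left, reading each text line in the conventional way lowers the sorting precision and scrambles the overall order.
Summary
In view of the above problems, embodiments of the present application are proposed to provide a text sorting method, a text sorting device, an electronic device, and a non-volatile readable storage medium that overcome the above problems or at least partially solve them.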
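An embodiment of the present application discloses a text sorting method, including:
when a text image input is detected, recognizing the text image to obtain a text box set, the text box set including a plurality of text boxes;
clustering the plurality of text boxes to generate a text block set, the text block set including a plurality of text blocks;
determining a text block order according to the text block set;
for any text block, determining an intra-block reading order;
sorting the text according to the text block order and the intra-block reading order.
In some embodiments, clustering the plurality of text boxes to generate the text block set includes:
for any text box, generating a text box mask;
dilating the text box mask;
dividing the dilated text box mask into a plurality of connected domains;
calculating the overlap between the text box and the connected domains;
determining a text block according to the overlap;
combining the text blocks to generate the text block set.
In some embodiments, generating a text box mask for any text box includes:
for any text box, generating a corresponding binary image;
determining the binary image to be the text box mask.
In some embodiments, dilating the text box mask includes:
performing a dilation operation on the text box mask based on a preset convolution kernel.
In some embodiments, determining a text block according to the overlap includes:
sorting the connected domains in reverse order of overlap to generate a connected domain sequence;
determining the first connected domain in the connected domain sequence to be the text block.
In some embodiments, before clustering the plurality of text boxes to generate the text block set, the method further includes:
calculating offset angles corresponding to the plurality of text boxes;
rotating the text image based on the offset angles.
In some embodiments, rotating the text image based on the offset angles includes:
collecting the offset angles to generate a first offset array;
determining outliers in the first offset array and deleting them from the first offset array to generate a second offset array;
calculating the average value of the elements in the second offset array;
rotating the text image by the element average value.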
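In some embodiments, determining the text block order according to the text block set includes:
determining a target text block from the text block set;
determining the text direction of the target text block;
determining a reading order according to the text direction and the target text block;
determining the text block order by combining the text direction and the reading order.
In some embodiments, determining the target text block from the text block set includes:
for any text block, determining the number of text boxes;
reordering the text blocks in the text block set in reverse order based on the number of text boxes;
determining the first text block of the reordered text block set to be the target text block.
In some embodiments, determining the text direction of the target text block includes:
calculating the width and height of the text boxes in the target text block;
when the width is greater than the height, determining that the text direction is horizontal;
when the width is not greater than the height, determining that the text direction is vertical.
In some embodiments, when the text direction is horizontal, determining the reading order according to the text direction and the target text block includes:
determining a horizontal target character string of the target text block, the horizontal target character string being the first horizontal line of characters of the target text block;
inputting the horizontal target character string, in left-to-right order, into a preset coherence scoring network, the preset coherence scoring network being used to determine a text coherence probability according to the horizontal target character string;
reading the text coherence probability;
when the text coherence probability is greater than a preset coherence threshold, determining that the reading order is left-to-right;
when the text coherence probability is not greater than the preset coherence threshold, determining that the reading order is right-to-left.
In some embodiments, when the text direction is vertical, determining the reading order according to the text direction and the target text block includes:
determining a left vertical text column and a right vertical text column of the target text block;
inputting the left vertical text column and the right vertical text column into a preset start prediction scoring network, the preset start prediction scoring network being used to determine a left start probability according to the left vertical text column and a right start probability according to the right vertical text column;
reading the left start probability and the right start probability;
when the left start probability is greater than the right start probability, determining that the reading order is left-to-right;
when the right start probability is greater than the left start probability, determining that the reading order is right-to-left.
In some embodiments, determining the text block order by combining the text direction and the reading order includes:
cycling through the reading order along the text direction to generate the text block order.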
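In some embodiments, determining the intra-block reading order for any text block includes:
for any text block, performing position encoding on the text boxes to generate text box position encoding vectors;
obtaining the text features corresponding to the text boxes;
merging the text block order, the text box position encoding vector, and the text features to generate a merged text feature;
stacking the merged text features to generate an input feature group;
inputting the input feature group into a preset coding model, the preset coding model being used to output a coding feature group according to the input feature group;
obtaining the coding feature group and determining the intra-block start probability corresponding to the coding feature group;
determining the intra-block reading order according to the intra-block start probability.
In some embodiments, merging the text block order, the text box position encoding vector, and the text features to generate the merged text feature includes:
mapping the text features to the text block order;
adding the mapped text features to the text box position encoding vector to generate the merged text feature.
In some embodiments, determining the intra-block reading order according to the intra-block start probability includes:
sorting the text boxes in reverse order based on the intra-block start probability to generate the intra-block reading order.
In some embodiments, sorting the text according to the text block order and the intra-block reading order includes:
sorting the text blocks using the text block order;
sorting the text boxes using the intra-block reading order to complete the text sorting.
An embodiment of the present application further discloses a text sorting device, including:
a recognition module, configured to recognize the text image when a text image input is detected, to obtain a text box set, the text box set including a plurality of text boxes;
a clustering module, configured to cluster the plurality of text boxes to generate a text block set, the text block set including a plurality of text blocks;
a first order determination module, configured to determine a text block order according to the text block set;
a second order determination module, configured to determine, for any text block, an intra-block reading order;
a sorting module, configured to sort the text according to the text block order and the intra-block reading order.
An embodiment of the present application further discloses an electronic device, including a processor, a memory, and a computer program stored in the memory and executable on the processor, where the computer program, when executed by the processor, implements the steps of the text sorting method described above.
An embodiment of the present application further discloses a non-volatile readable storage medium storing a computer program, where the computer program, when executed by a processor, implements the steps of the text sorting method described above.
The embodiments of the present application have the following advantages:
In the embodiments of the present application, when a text image input is detected, the text image is recognized to obtain a text box set including a plurality of text boxes; the plurality of text boxes are clustered to generate a text block set including a plurality of text blocks; a text block order is determined according to the text block set; an intra-block reading order is determined for any text block; and the text is sorted according to the text block order and the intra-block reading order. Clustering the text boxes into a text block set and determining the text block order for that set combines semantics with layout to obtain the order of the text layout. Sorting inside each text block then determines the intra-block reading order; because this sorting works at text-box granularity, it improves the precision of text sorting. Combining the text block order with the intra-block reading order gives the overall reading order of the text, which improves the accuracy of text sorting.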
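Brief Description of the Drawings
FIG. 1 is a flowchart of the steps of an embodiment of a text sorting method of the present application;
FIG. 2 is a flowchart of the steps of another embodiment of a text sorting method of the present application;
FIG. 3 is a schematic diagram of text box recognition of the present application;
FIG. 4 is a schematic diagram of a rotated text image of the present application;
FIG. 5 is a schematic diagram of text block order determination of the present application;
FIG. 6 is a schematic diagram of the steps of an example of a text sorting method of the present application;
FIG. 7 is a schematic diagram of text box clustering in an example of a text sorting method of the present application;
FIG. 8 is a schematic diagram of text block order determination in an example of a text sorting method of the present application;
FIG. 9 is a schematic diagram of intra-block reading order determination in an example of a text sorting method of the present application;
FIG. 10 is a structural block diagram of an embodiment of a text sorting device of the present application;
FIG. 11 is a structural block diagram of an electronic device provided by an embodiment of the present application;
FIG. 12 is a structural block diagram of a non-volatile readable storage medium provided by an embodiment of the present application.
Detailed Description
To make the above objects, features, and advantages of the present application clearer and easier to understand, the present application is described in further detail below with reference to the accompanying drawings and specific embodiments.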
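Referring to FIG. 1, a flowchart of the steps of an embodiment of a text sorting method of the present application is shown. The text sorting method may specifically include the following steps:
Step 101: when a text image input is detected, recognize the text image to obtain a text box set, the text box set including a plurality of text boxes;
A text image is an image that contains text. When a text image input is detected, text box recognition and prediction can be performed on the text image to obtain a plurality of text boxes, which form the text box set.
Step 102: cluster the plurality of text boxes to generate a text block set, the text block set including a plurality of text blocks;
All the obtained text boxes are clustered: according to their layout positions in the text, the text boxes are aggregated into several different regions, each of which is a text block, and the text blocks form the text block set. For example, the text boxes may be clustered into text blocks such as headers and body paragraphs.
Step 103: determine the text block order according to the text block set;
The reading order between the text blocks, i.e., the text block order, is determined according to the layout relationship between the text blocks in the text block set.
Step 104: for any text block, determine the intra-block reading order;
After the reading order between the text blocks has been determined, for any text block, the reading order between its text boxes, i.e., the intra-block reading order, is determined from the relationships between the text boxes in that block. This is repeated for all text blocks, yielding the intra-block reading order for every text block.
Step 105: sort the text according to the text block order and the intra-block reading order.
Once the text block order and the intra-block reading orders have been obtained, the text blocks are sorted according to the text block order, and within each text block the text boxes are sorted according to the intra-block reading order, which completes the sorting of the text.
In the embodiments of the present application, when a text image input is detected, the text image is recognized to obtain a text box set including a plurality of text boxes; the plurality of text boxes are clustered to generate a text block set including a plurality of text blocks; a text block order is determined according to the text block set; an intra-block reading order is determined for any text block; and the text is sorted according to the text block order and the intra-block reading order. Clustering the text boxes into a text block set and determining the text block order for that set combines semantics with layout to obtain the order of the text layout. Sorting inside each text block then determines the intra-block reading order; because this sorting works at text-box granularity, it improves the precision of text sorting. Combining the text block order with the intra-block reading order gives the overall reading order of the text, which improves the accuracy of text sorting.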
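Referring to FIG. 2, a flowchart of the steps of another embodiment of a text sorting method of the present application is shown. The text sorting method may specifically include the following steps:
Step 201: when a text image input is detected, recognize the text image to obtain a text box set, the text box set including a plurality of text boxes;
When a text image input is detected, the text image can be input into a text box detector, which performs text box prediction on the text image to obtain a plurality of text boxes; these form the text box set. The text box prediction may use methods such as PixelLink, CRAFT, or PSENet (all recognition algorithms). As shown in FIG. 3, the image with text on the left is input into the text box detector to obtain a plurality of rectangular text boxes, which form the text box set.
Step 202: calculate the offset angles corresponding to the plurality of text boxes;
Because the text image may be tilted, the offset angle of each recognized text box can be calculated to facilitate the subsequent processing of text boxes and text blocks. The offset angle is the angle between the lower edge of the text box and the horizontal.
Step 203: rotate the text image based on the offset angles;
The text image is rotated based on the offset angles so that the angle of the whole image is corrected: tilted text in the image is rotated to horizontal or nearly horizontal, which makes the text direction in the image more regular.
In an optional embodiment of the present application, the step of rotating the text image based on the offset angles may specifically include the following sub-steps:
Sub-step S2031: collect the offset angles to generate a first offset array;
After the offset angles of all text boxes have been obtained, they can be collected into a first offset array. For example, if the offset angles of all the text boxes are 2, 3, 3, 2, 2, 2, 3, 9, 3, and 3, the first offset array is [2,3,3,2,2,2,3,9,3,3].
Sub-step S2032: determine the outliers in the first offset array and delete them from the first offset array to generate a second offset array;
Outliers, i.e., values that differ markedly from the overall distribution, are identified in the first offset array and deleted from it. This filters the offset angles so that large erroneous values do not affect their accuracy. After the outliers have been removed from the first offset array, the second offset array is generated.
Continuing the above example, the first offset array is [2,3,3,2,2,2,3,9,3,3]; the outlier can be determined to be 9, and deleting 9 from the first offset array gives the second offset array [2,3,3,2,2,2,3,3,3].
Sub-step S2033: calculate the average value of the elements in the second offset array;
So that the direction of the text in the rotated text image meets the requirements overall, the element average value of the second offset array, i.e., the mean of its element values, can be calculated.
Continuing the above example, for the second offset array [2,3,3,2,2,2,3,3,3], the element average value is 23/9, approximately 2.56.
Sub-step S2034: rotate the text image by the element average value.
The text image is rotated by the element average value so that the text direction in the image is regular. Referring to FIG. 4, the text in the rotated text image is close to horizontal and more regular.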
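Sub-steps S2031 to S2033 can be sketched as follows. The description does not say how outliers are identified, so the z-score rule and the `z_max` threshold are assumptions made for illustration, as are the function names.

```python
from statistics import mean, pstdev

def drop_outliers(angles, z_max=2.0):
    """Return the angles whose z-score is at most z_max (assumed outlier rule)."""
    mu = mean(angles)
    sigma = pstdev(angles)
    if sigma == 0:  # all angles identical, nothing to remove
        return list(angles)
    return [a for a in angles if abs(a - mu) / sigma <= z_max]

def rotation_angle(angles, z_max=2.0):
    """Mean of the offset angles that remain after outlier removal."""
    return mean(drop_outliers(angles, z_max))

first_offsets = [2, 3, 3, 2, 2, 2, 3, 9, 3, 3]   # first offset array
second_offsets = drop_outliers(first_offsets)     # 9 is removed
angle = rotation_angle(first_offsets)             # mean of the nine remaining values
```

With the example array, the value 9 is dropped and the remaining nine angles average 23/9. A median-based rule would serve equally well when many boxes are badly detected.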
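Step 204: cluster the plurality of text boxes to generate a text block set, the text block set including a plurality of text blocks;
All the recognized text boxes are clustered so that they are aggregated into several different regions, i.e., text blocks; the text blocks form the text block set.
As shown in FIG. 4, all the text boxes can be clustered into 2 headers and 3 body paragraphs, i.e., 5 regions or 5 text blocks in total.
In an optional embodiment of the present application, the step of clustering the plurality of text boxes to generate the text block set may specifically include the following sub-steps:
Sub-step S2041: for any text box, generate a text box mask;
For any text box, the corresponding text box mask is generated; this step is repeated to generate a text box mask for every text box.
Specifically, the step of generating a text box mask for any text box includes:
Sub-step S20411: for any text box, generate a corresponding binary image;
To generate the text box mask, a corresponding binary image can be generated from the text box: a binary image is initialized, the pixels covered by the text box are set to 1, and the remaining background is set to 0.
Sub-step S20412: determine the binary image to be the text box mask.
The generated binary image is taken as the text box mask, so that subsequent recognition can operate on binarized data, which improves processing efficiency.
Sub-step S2042: dilate the text box mask;
After the text box mask has been obtained, it is dilated, which can improve the precision of processing.
Specifically, the step of dilating the text box mask includes:
Sub-step S20421: perform a dilation operation on the text box mask based on a preset convolution kernel.
The dilation operation is performed on the text box mask according to the size of the preset convolution kernel to obtain the dilated text box mask. The size of the preset convolution kernel can be set according to actual requirements and is not specifically limited in the embodiments of the present application.
Sub-step S2043: divide the dilated text box mask into a plurality of connected domains;
The dilated text box mask is divided into a plurality of mutually independent connected domains in order to determine the text block to which each text box belongs. Specifically, the dilated binary image can be divided into connected domains pixel by pixel.
Sub-step S2044: calculate the overlap between the text box and the connected domains;
After the connected domains have been obtained, the overlap between the text box and each connected domain is calculated. The overlap between the text box and each connected domain is calculated as follows:
where box denotes the current text box, area denotes the current connected domain, and count(*) denotes the number of pixels in *.
Sub-step S2045: determine the text block according to the overlap;
The text block of the text box is determined according to the magnitude of the overlap.
Specifically, the step of determining the text block according to the overlap includes:
Sub-step S20451: sort the connected domains in reverse order of overlap to generate a connected domain sequence;
The connected domains can be sorted in reverse order of overlap, i.e., from the largest overlap to the smallest, to generate the connected domain sequence.
Sub-step S20452: determine the first connected domain in the connected domain sequence to be the text block.
A connected domain represents the position of a text block in the text. The first connected domain in the sequence, i.e., the one with the highest overlap, can be determined to be the text block to which the text box belongs, which identifies the text block.
Sub-step S2046: combine the text blocks to generate the text block set.
All the text blocks are combined into one set to generate the text block set.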
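The clustering of step 204 can be illustrated with a small pure-Python sketch. Several choices here are assumptions rather than details from the description: boxes are axis-aligned tuples `(x0, y0, x1, y1)` with exclusive upper bounds, the kernel is a square of half-width `k`, connectivity is 4-neighbour, the dilated masks are merged into one image before labelling, and the overlap is taken as count(box intersected with area) / count(box), which is one form consistent with the symbols box, area and count(*).

```python
from collections import deque

def make_mask(box, height, width):
    """Binary image: 1 inside the box, 0 for the background."""
    x0, y0, x1, y1 = box
    return [[1 if (x0 <= x < x1 and y0 <= y < y1) else 0 for x in range(width)]
            for y in range(height)]

def dilate(mask, k):
    """Dilate a binary mask with a (2k+1) x (2k+1) square kernel."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - k), min(h, y + k + 1)):
                    for xx in range(max(0, x - k), min(w, x + k + 1)):
                        out[yy][xx] = 1
    return out

def label_components(mask):
    """4-connected labelling; label 0 is background, labels start at 1."""
    h, w = len(mask), len(mask[0])
    labels = [[0] * w for _ in range(h)]
    count = 0
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not labels[y][x]:
                count += 1
                labels[y][x] = count
                queue = deque([(y, x)])
                while queue:
                    cy, cx = queue.popleft()
                    for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                        if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = count
                            queue.append((ny, nx))
    return labels

def cluster_boxes(boxes, height, width, k=1):
    """Group box indices by the connected domain they overlap most."""
    union = [[0] * width for _ in range(height)]
    for box in boxes:
        dilated = dilate(make_mask(box, height, width), k)
        for y in range(height):
            for x in range(width):
                union[y][x] |= dilated[y][x]
    labels = label_components(union)
    blocks = {}
    for i, (x0, y0, x1, y1) in enumerate(boxes):
        box_pixels = (x1 - x0) * (y1 - y0)
        hits = {}
        for y in range(y0, y1):
            for x in range(x0, x1):
                hits[labels[y][x]] = hits.get(labels[y][x], 0) + 1
        # Assumed overlap: pixels shared with the domain / pixels in the box.
        ranked = sorted(hits, key=lambda lab: hits[lab] / box_pixels, reverse=True)
        blocks.setdefault(ranked[0], []).append(i)
    return list(blocks.values())

boxes = [(1, 1, 5, 2), (1, 3, 5, 4), (8, 7, 11, 8)]
clusters = cluster_boxes(boxes, height=10, width=12, k=1)
```

In this toy layout the first two boxes are one pixel row apart, so dilation joins them into one connected domain, while the third box stays separate. A production version would normally use an image library for dilation and labelling.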
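Step 205: determine the text block order according to the text block set;
The text block order is determined from the relationships between the text blocks in the text block set, giving the layout order of the text for sorting: vertical text left-to-right, vertical text right-to-left, horizontal text left-to-right, or horizontal text right-to-left. To keep the method broadly applicable, and since vertical text is read from top to bottom by default, other highly unusual text forms (such as text read from bottom to top) need not be considered.
In an optional embodiment of the present application, the step of determining the text block order according to the text block set may include the following sub-steps:
Sub-step S2051: determine a target text block from the text block set;
A target text block can be selected from the text block set to represent the whole set, so that processing a single text block stands in for processing the text block set, which improves processing efficiency.
Specifically, the step of determining the target text block from the text block set includes:
Sub-step S20511: for any text block, determine the number of text boxes;
For each text block in the text block set, the number of text boxes in that block can be counted.
Sub-step S20512: reorder the text blocks in the text block set in reverse order based on the number of text boxes;
The text blocks in the text block set are then reordered in reverse order of their number of text boxes, so that blocks with many text boxes are placed at the front of the set and blocks with few at the back.
Sub-step S20513: determine the first text block of the reordered text block set to be the target text block.
The first text block of the reordered text block set, i.e., the text block with the most text boxes, is determined to be the target text block.
Sub-step S2052: determine the text direction of the target text block;
The text direction of the target text block is determined in order to establish whether the text is read horizontally or vertically.
Specifically, the step of determining the text direction of the target text block includes:
Sub-step S20521: calculate the width and height of the text boxes in the target text block;
The width and height of all text boxes in the target text block can be calculated, and the text direction is determined from the relationship between width and height. To avoid large values when the widths and heights of all text boxes are combined, the average width and average height of all text boxes can be compared instead.
Sub-step S20522: when the width is greater than the height, determine that the text direction is horizontal;
When the width of the text boxes is greater than their height, the characters of the text block are arranged horizontally, and the text direction can be determined to be horizontal.
Sub-step S20523: when the width is not greater than the height, determine that the text direction is vertical.
When the width of the text boxes is not greater than their height, the characters of the text block are arranged vertically, and the text direction can be determined to be vertical.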
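Sub-steps S2051 and S2052 reduce to a sort and a comparison, sketched here. Representing a text block as a list of `(width, height)` pairs is an assumption made for the example, as are the function names.

```python
def pick_target_block(blocks):
    """Reorder blocks by number of text boxes, largest first, and take the first."""
    return sorted(blocks, key=len, reverse=True)[0]

def text_direction(block):
    """Compare the average box width with the average box height."""
    mean_w = sum(w for w, _ in block) / len(block)
    mean_h = sum(h for _, h in block) / len(block)
    return "horizontal" if mean_w > mean_h else "vertical"

blocks = [
    [(120, 20)],                          # header with one box
    [(200, 18), (180, 18), (190, 20)],    # body paragraph with three boxes
    [(20, 150), (22, 140)],               # two tall, narrow boxes
]
target = pick_target_block(blocks)
direction = text_direction(target)
```

Because Python's sort is stable, blocks with the same number of boxes keep their original relative order, so the earliest such block becomes the target.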
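Sub-step S2053: determine the reading order according to the text direction and the target text block;
After the text direction has been obtained, the reading order is determined based on the text direction and the character ordering of the target text block.
Specifically, when the text direction is horizontal, the step of determining the reading order according to the text direction and the target text block includes:
Sub-step S20531: determine the horizontal target character string of the target text block, the horizontal target character string being the first horizontal line of characters of the target text block;
When the text direction is horizontal, i.e., the text runs horizontally, the first horizontal line of characters of the target text block, i.e., the topmost line of the target text block, can be determined to be the horizontal target character string.
Sub-step S20532: input the horizontal target character string, in left-to-right order, into the preset coherence scoring network, the preset coherence scoring network being used to determine the text coherence probability according to the horizontal target character string;
The horizontal target character string is then input character by character, from left to right, into the preset coherence scoring network, which determines the text coherence probability from the relationships between the characters of the string and outputs that probability. The preset coherence scoring network is obtained by fine-tuning a BERT model on a text coherence training set, and it can output the text coherence probability at the position corresponding to the [CLS] field (a field name) of its output. The text coherence probability can be expressed as a decimal in [0,1] or as a percentage.
Sub-step S20533: read the text coherence probability;
The text coherence probability is read from the specific field of the output of the preset coherence scoring network.
Sub-step S20534: when the text coherence probability is greater than the preset coherence threshold, determine that the reading order is left-to-right;
When the text coherence probability is greater than the preset coherence threshold, the left-to-right order in which the string was input into the preset coherence scoring network is sufficiently coherent, so the reading order can be determined to be left-to-right, i.e., the text is read from left to right.
The preset coherence threshold can be determined according to the actual situation and is not specifically limited in the embodiments of the present application. In one example of the present application, the preset coherence threshold is 0.5.
Sub-step S20535: when the text coherence probability is not greater than the preset coherence threshold, determine that the reading order is right-to-left.
When the text coherence probability is not greater than the preset coherence threshold, the left-to-right input order is not sufficiently coherent, so the reading order can be determined to be right-to-left, i.e., the text is read from right to left.
Specifically, when the text direction is vertical, the step of determining the reading order according to the text direction and the target text block includes:
Sub-step S20536: determine the left vertical text column and the right vertical text column of the target text block;
When the text direction is vertical, the side on which the text starts, left or right, must be identified in order to determine the reading order. The left vertical text column and the right vertical text column of the target text block can be determined: the leftmost column of characters of the target text block is the left vertical text column, and the rightmost column is the right vertical text column.
Sub-step S20537: input the left vertical text column and the right vertical text column into the preset start prediction scoring network, the preset start prediction scoring network being used to determine the left start probability according to the left vertical text column and the right start probability according to the right vertical text column;
The left and right vertical text columns are input into the preset start prediction scoring network, which determines from the left vertical text column the probability that reading starts on the left, i.e., the left start probability, and likewise determines from the right vertical text column the probability that reading starts on the right, i.e., the right start probability. The two probabilities are output in separate output information. The preset start prediction scoring network is obtained by fine-tuning a BERT model on a training set for text start prediction scoring, and it can output the left start probability or the right start probability at the position corresponding to the [CLS] field of its output. Both probabilities can be expressed as decimals in [0,1] or as percentages.
Sub-step S20538: read the left start probability and the right start probability;
The left start probability is read from the specific field of the output that the preset start prediction scoring network produces for the left vertical text column, and the right start probability is read from the specific field of the output it produces for the right vertical text column.
Sub-step S20539: when the left start probability is greater than the right start probability, determine that the reading order is left-to-right;
Because the text can only be read starting from the left or starting from the right, the left start probability and the right start probability differ. The two can be compared: when the left start probability is greater than the right start probability, the text is read starting from the left, and the reading order can be determined to be left-to-right.
Sub-step S205310: when the right start probability is greater than the left start probability, determine that the reading order is right-to-left.
When the right start probability is greater than the left start probability, the text is read starting from the right, and the reading order can be determined to be right-to-left.
Sub-step S2054: determine the text block order by combining the text direction and the reading order.
The horizontal or vertical direction of the text blocks is then established from the text direction, and on that basis, combined with the left or right reading order, the text block order is determined to be vertical text left-to-right, vertical text right-to-left, horizontal text left-to-right, or horizontal text right-to-left.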
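The decision logic of sub-steps S2053 and S2054 can be sketched with the scoring networks passed in as plain callables. The fine-tuned BERT scorers themselves are not reproduced; `coherence_score` and `start_score` below are hypothetical stand-ins that return a probability in [0, 1], and the mode numbering 0 to 3 follows the four classes used later for the position encoding.

```python
# Mode classes: vertical L->R = 0, vertical R->L = 1, horizontal L->R = 2, horizontal R->L = 3.
MODES = {
    ("vertical", "ltr"): 0,
    ("vertical", "rtl"): 1,
    ("horizontal", "ltr"): 2,
    ("horizontal", "rtl"): 3,
}

def horizontal_order(first_line, coherence_score, threshold=0.5):
    """Left-to-right if the first line, fed left to right, scores above the threshold."""
    return "ltr" if coherence_score(first_line) > threshold else "rtl"

def vertical_order(left_column, right_column, start_score):
    """Left-to-right if the left column is the more likely starting column.

    A tie is treated as right-to-left here; the description assumes the
    two probabilities differ, so this tie rule is an assumption.
    """
    return "ltr" if start_score(left_column) > start_score(right_column) else "rtl"

# Toy scorers used only to exercise the logic (not real models).
toy_coherence = lambda s: 0.9 if s == "hello world" else 0.1
toy_start = lambda col: 0.8 if col.startswith("Chapter") else 0.2

order_a = horizontal_order("hello world", toy_coherence)
order_b = horizontal_order("dlrow olleh", toy_coherence)
order_c = vertical_order("Chapter one", "the end", toy_start)
mode = MODES[("horizontal", order_a)]
```

Passing the scorers as arguments keeps the threshold comparison testable without loading any model.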
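Specifically, the step of determining the text block order by combining the text direction and the reading order includes:
Sub-step S20541: cycle through the reading order along the text direction to generate the text block order.
Along the text direction, the text blocks in the same layer are ordered according to the reading order, layer by layer, so that all text blocks satisfy both the text direction and the reading order. For example, referring to FIG. 5, the horizontal and vertical coordinates give the position of the center point of each text block in the image, and the serial numbers 1 and 2 indicate priority. In the "horizontal" + "left->right" mode, the text blocks are first arranged by vertical coordinate (text blocks whose vertical coordinates differ by less than a set threshold are treated as having the same vertical coordinate). Text blocks that cannot be ordered by vertical coordinate are then sorted by horizontal coordinate from small to large.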
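For the horizontal modes, the layered ordering can be sketched as below. Several details are assumptions: the vertical coordinate grows downward, a block joins the current layer when its vertical coordinate is within `y_threshold` of the previous block in that layer, and the right-to-left case simply reverses the horizontal sort.

```python
def order_blocks(centers, y_threshold=10.0, left_to_right=True):
    """Return block indices in reading order for horizontal text.

    centers: list of (x, y) center points, one per text block.
    """
    by_y = sorted(range(len(centers)), key=lambda i: centers[i][1])
    layers = []
    for i in by_y:
        # Same layer if the vertical gap to the previous block is below the threshold.
        if layers and abs(centers[i][1] - centers[layers[-1][-1]][1]) < y_threshold:
            layers[-1].append(i)
        else:
            layers.append([i])
    order = []
    for layer in layers:
        order.extend(sorted(layer, key=lambda i: centers[i][0],
                            reverse=not left_to_right))
    return order

centers = [(300, 52), (100, 50), (100, 200), (300, 205), (200, 400)]
ltr = order_blocks(centers)
rtl = order_blocks(centers, left_to_right=False)
```

Blocks 0 and 1 differ by only 2 in the vertical coordinate, so they form one layer and are ordered by horizontal coordinate. For vertical text the roles of the two coordinates would presumably be swapped.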
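Step 206: for any text block, determine the intra-block reading order;
When determining the order between text blocks, the order between all text boxes inside each text block, i.e., the intra-block reading order, can also be determined.
In an optional embodiment of the present application, the step of determining the intra-block reading order for any text block includes the following sub-steps:
Sub-step S2061: for any text block, perform position encoding on the text boxes to generate text box position encoding vectors;
Position encoding is performed on all text boxes in the text block to generate the text box position encoding vectors. Specifically, the encoding follows the form used by ViLBERT (Vision-and-Language BERT, a vision-language pre-training model): sine and cosine functions encode the odd and even positions of the feature, respectively, to generate an encoding vector. A further encoding based on the text block order is added: the four modes from the previous step (vertical text left-to-right, vertical text right-to-left, horizontal text left-to-right, and horizontal text right-to-left) are assigned the classes 0, 1, 2, and 3, each class is initialized as an embedding vector, and the embedding vector is added to the encoding vector to generate the text box position encoding vector.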
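A rough sketch of sub-step S2061 follows. The description does not state which quantities are encoded or how large the vectors are, so encoding the four box coordinates with `d / 4` dimensions each, the base 10000, and the small random initialization of the mode embeddings are all assumptions. Counting positions from 1, odd positions (indices 0, 2, ...) use sine and even positions use cosine.

```python
import math
import random

def sinusoid(value, dim):
    """Encode one scalar into `dim` numbers with alternating sine and cosine."""
    out = []
    for i in range(dim):
        freq = 1.0 / (10000 ** (2 * (i // 2) / dim))
        out.append(math.sin(value * freq) if i % 2 == 0 else math.cos(value * freq))
    return out

def init_mode_embeddings(d, seed=0):
    """One embedding vector per text block order mode (classes 0 to 3)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(d)] for _ in range(4)]

def box_position_encoding(box, mode, mode_embeddings, d=16):
    """Sinusoidal encoding of (x0, y0, x1, y1) plus the mode embedding."""
    if d % 4 != 0:
        raise ValueError("d must be divisible by 4")
    encoding = []
    for coordinate in box:
        encoding.extend(sinusoid(coordinate, d // 4))
    return [e + m for e, m in zip(encoding, mode_embeddings[mode])]

embeddings = init_mode_embeddings(16)
vector = box_position_encoding((10, 20, 110, 40), mode=2, mode_embeddings=embeddings)
```

In a trained model the mode embeddings would be learned parameters rather than fixed random vectors.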
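Sub-step S2062: obtain the text features corresponding to the text boxes;
The text recognition result corresponding to each text box can be obtained, and the text feature of the recognition result can be extracted from it. A BERT model may be used to extract text features from the recognition result; those skilled in the art may also choose other ways of extracting text features according to the actual situation, which is not limited here.
Sub-step S2063: merge the text block order, the text box position encoding vector, and the text feature to generate a merged text feature;
The text block order, the text box position encoding vector, and the text feature are merged to generate the merged text feature, which represents the position feature of the text box.
Specifically, the step of merging the text block order, the text box position encoding vector, and the text feature to generate the merged text feature includes:
Sub-step S20631: map the text feature to the text block order;
To merge the text box position encoding vector and the text feature, the text feature can first be mapped in a fully connected layer of the model, mapping the text feature to the text block order, so that the mapped text feature has the same size as the text box position encoding vector.
Sub-step S20632: add the mapped text feature to the text box position encoding vector to generate the merged text feature.
The mapped text feature is then added to the text box position encoding vector to generate the merged text feature.
Sub-step S2064: stack the merged text features to generate an input feature group;
After the merged text feature of each text box is obtained, the merged text features can be stacked to generate the input feature group. For example, if each merged text feature has d dimensions, stacking the merged text features yields an input feature group of shape [N,d], where N is the number of text boxes in the block.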
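The map, add, and stack steps can be sketched with plain lists. The names `linear` and `build_input_group` are hypothetical, and the weight matrix and bias stand in for the parameters of the fully connected layer, which would be learned in practice.

```python
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """Fully connected layer: one output per weight row, plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def build_input_group(text_feats: List[Vector], pos_vecs: List[Vector],
                      weight: List[Vector], bias: Vector) -> List[Vector]:
    """Map each text feature to d dimensions, add its position vector, and stack.

    Returns a list of N rows of length d, that is, an [N, d] input feature group.
    """
    group = []
    for feat, pos in zip(text_feats, pos_vecs):
        mapped = linear(feat, weight, bias)
        if len(mapped) != len(pos):
            raise ValueError("mapped feature and position vector differ in size")
        group.append([m + p for m, p in zip(mapped, pos)])
    return group
```

For example, with three-dimensional text features and d = 2, the weight matrix has two rows of three entries each.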
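Sub-step S2065: input the input feature group into a preset encoding model, where the preset encoding model is configured to output an encoded feature group according to the input feature group;
The input feature group is input into the preset encoding model, which outputs, according to the input feature group, an encoded feature group from which the likelihood of being the in-block start can be determined; the encoded feature group represents the encoded position features of the text boxes. The preset encoding model may be obtained by fine-tuning a BERT model on a training set of input feature groups.
Sub-step S2066: obtain the encoded feature group and determine the in-block start probabilities corresponding to the encoded feature group;
The encoded feature group is obtained from the information output by the preset encoding model, and the in-block start probability corresponding to each encoded feature is determined, that is, the probability that a text box is at the start position within the text block.
Sub-step S2067: determine the in-block reading order according to the in-block start probabilities.
According to the in-block start probabilities, the text box at the start position is determined, and the other text boxes follow in turn to generate the in-block reading order.
Specifically, the step of determining the in-block reading order according to the in-block start probabilities includes:
Sub-step S20671: sort the text boxes in descending order of in-block start probability to generate the in-block reading order.
The text boxes in the text block can be sorted in descending order of in-block start probability, that is, text boxes with a higher start probability are placed earlier and those with a lower start probability later, to generate the in-block reading order.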
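The descending sort by start probability can be sketched as below. The function name is hypothetical, and the start probabilities are taken as given; how they are derived from the encoded feature group is not reproduced here.

```python
from typing import List, Sequence


def in_block_order(start_probs: Sequence[float]) -> List[int]:
    """Return text box indices sorted by in-block start probability, highest first.

    Python's sort is stable, so boxes with equal probability keep their
    original relative order (a choice of this sketch, not of the text).
    """
    return sorted(range(len(start_probs)),
                  key=lambda i: start_probs[i], reverse=True)
```

The returned indices refer to positions in the block's list of text boxes.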
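Step 207: sort the text according to the text block order and the in-block reading order.
Combining the text block order with the in-block reading order of each block gives the overall reading order of the text, so that the text can be sorted.
In an optional embodiment of this application, the step of sorting the text according to the text block order and the in-block reading order may include the following sub-steps:
Sub-step S2071: sort the text blocks using the text block order;
For text sorting, the text blocks in the text block set can be sorted using the text block order.
Sub-step S2072: sort the text boxes using the in-block reading order to complete the text sorting.
After the text block order is determined, the text boxes in each text block are sorted according to the in-block reading order, which completes the text sorting.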
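Combining the two levels of ordering amounts to a nested traversal, sketched below. The function name and the data layout (lists of recognized strings, with orders given as index lists) are assumptions made for illustration.

```python
from typing import List


def sort_text(blocks: List[List[str]],
              block_order: List[int],
              in_block_orders: List[List[int]]) -> List[str]:
    """Flatten text boxes into the overall reading order.

    blocks[b][k] is the recognized text of box k in block b;
    block_order lists block indices in reading order;
    in_block_orders[b] lists the box indices of block b in reading order.
    """
    result = []
    for b in block_order:              # outer level: text block order
        for k in in_block_orders[b]:   # inner level: in-block reading order
            result.append(blocks[b][k])
    return result
```

Joining the returned strings yields the text in its overall reading order.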
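In the embodiments of this application, when a text image input is detected, the text image is recognized to obtain a text box set that includes multiple text boxes; the multiple text boxes are clustered to generate a text block set that includes multiple text blocks; offset angles corresponding to the multiple text boxes are calculated; the text image is rotated based on the offset angles; a text block order is determined according to the text block set; an in-block reading order is determined for any text block; and the text is sorted according to the text block order and the in-block reading order. Rotating the text image to correct its offset regularizes the direction of the text blocks, which eases subsequent processing and improves the accuracy of direction determination and hence of sorting. Clustering the text boxes into a text block set and determining the text block order for that set combines semantics with layout to obtain the order of the text layout. Determining the in-block reading order sorts the content of each text block at the granularity of text boxes, which improves the precision of the text sorting. Combining the text block order with the in-block reading order yields the overall reading order of the text, which further improves the accuracy of the text sorting.
To help those skilled in the art better understand the embodiments of this application, the embodiments are described below by way of an example:
Referring to FIG. 6, a schematic diagram of the steps of an example of a text sorting method of this application is shown;
An image containing text is input into a text box detector for text box prediction, yielding a number of detected text boxes, which are combined to generate the text box set.
These text boxes are clustered so that they aggregate into several different regions, generating text blocks, which together form the text block set.
For text box clustering, refer to FIG. 7: text box masks are generated for the text boxes in the text box set and the masks are dilated; the dilated mask is divided into connected domains; the text boxes are classified based on their overlap with the connected domains; and the output text box clustering result is the text blocks.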
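The clustering flow of FIG. 7 can be sketched in pure Python on a small pixel grid. This is an illustration under several assumptions: boxes are axis-aligned, non-empty, half-open rectangles `(x0, y0, x1, y1)` inside the image; dilation uses a square kernel of side `2 * radius + 1`, which is equivalent to painting each box enlarged by `radius`; connected domains use 4-connectivity; and all names are hypothetical.

```python
from collections import deque
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), half-open


def cluster_boxes(boxes: List[Box], width: int, height: int,
                  radius: int = 1) -> List[int]:
    """Return, for each box, the label of the connected domain it overlaps most."""
    # 1. Dilated binary mask: paint every box enlarged by `radius`.
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(0, y0 - radius), min(height, y1 + radius)):
            for x in range(max(0, x0 - radius), min(width, x1 + radius)):
                mask[y][x] = 1
    # 2. Label connected domains with a breadth-first flood fill.
    labels = [[0] * width for _ in range(height)]
    count = 0
    for sy in range(height):
        for sx in range(width):
            if mask[sy][sx] and not labels[sy][sx]:
                count += 1
                labels[sy][sx] = count
                queue = deque([(sx, sy)])
                while queue:
                    x, y = queue.popleft()
                    for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                        if (0 <= nx < width and 0 <= ny < height
                                and mask[ny][nx] and not labels[ny][nx]):
                            labels[ny][nx] = count
                            queue.append((nx, ny))
    # 3. Assign each box to the domain with the largest pixel overlap.
    assignment = []
    for x0, y0, x1, y1 in boxes:
        overlap = {}
        for y in range(y0, y1):
            for x in range(x0, x1):
                overlap[labels[y][x]] = overlap.get(labels[y][x], 0) + 1
        assignment.append(max(overlap, key=overlap.get))
    return assignment
```

Boxes that receive the same label form one text block. On real images an image-processing library would normally replace the hand-written dilation and labeling.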
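After the text blocks are obtained, angle correction is applied to the text image so that the text in the image is more regular and better suited to subsequent computation.
Two kinds of processing are performed on the resulting text block set:
(1) Text block mode prediction: analyze the reading order of all text blocks and output the text block sorting result, that is, the text block order.
For text block mode prediction, refer to FIG. 8: a text block is selected from the text block set to determine the target text block. The direction of the text boxes in the target text block is judged to obtain the horizontal/vertical result. When the direction is horizontal, based on the horizontal text box type, the first line of the horizontal text is input into the fluency scoring network to determine the left/right order result. When the direction is vertical, based on the vertical text box type, the leftmost text column and the rightmost text column are input into the start prediction network to determine the left/right order result.
(2) In-block text sorting (the in-block sorting algorithm in FIG. 6): predict the reading order of the text boxes inside each text block and output the in-block text box sorting result.
For in-block text sorting, refer to FIG. 9: the text box set is processed; the text boxes are encoded to obtain text box position encodings, and text features are extracted from the text inside the text boxes. The text box position encodings and the text features are then merged to generate merged text features. The merged text features are recognized so as to score the order of the text within the set, and the text boxes in the block are sorted by score.
Finally, the text box orderings are integrated to obtain the sorting result, and the sorting result is output.
It should be noted that, for simplicity of description, the method embodiments are expressed as a series of action combinations, but those skilled in the art should know that the embodiments of this application are not limited by the described order of actions, because according to the embodiments of this application some steps may be performed in other orders or simultaneously. In addition, those skilled in the art should also know that the embodiments described in the specification are all preferred embodiments, and the actions involved are not necessarily required by the embodiments of this application.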
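Referring to FIG. 10, a structural block diagram of an embodiment of a text sorting apparatus of this application is shown. The text sorting apparatus may specifically include the following modules:
a recognition module 1001, configured to recognize a text image when a text image input is detected, to obtain a text box set, the text box set including multiple text boxes;
a clustering module 1002, configured to cluster the multiple text boxes to generate a text block set, the text block set including multiple text blocks;
a first order determination module 1003, configured to determine a text block order according to the text block set;
a second order determination module 1004, configured to determine an in-block reading order for any text block;
a sorting module 1005, configured to sort the text according to the text block order and the in-block reading order.
In an optional embodiment of this application, the clustering module 1002 includes:
a text box mask generation submodule, configured to generate a text box mask for any text box;
a dilation submodule, configured to dilate the text box mask;
a division submodule, configured to divide the dilated text box mask into multiple connected domains;
an overlap degree calculation submodule, configured to calculate the overlap degree between the text box and the connected domains;
a text block determination submodule, configured to determine the text block according to the overlap degree;
a text block set generation submodule, configured to combine the text blocks to generate the text block set.
In an optional embodiment of this application, the text box mask generation submodule includes:
a binary image generation unit, configured to generate a corresponding binary image for any text box;
a text box mask generation unit, configured to determine the binary image as the text box mask.
In an optional embodiment of this application, the dilation submodule includes:
a dilation unit, configured to perform a dilation operation on the text box mask based on a preset convolution kernel.
In an optional embodiment of this application, the text block determination submodule includes:
a connected domain sequence generation unit, configured to sort the connected domains in descending order of overlap degree to generate a connected domain sequence;
a text block determination unit, configured to determine the first connected domain in the connected domain sequence as the text block.
In an optional embodiment of this application, the apparatus further includes:
an offset angle calculation module, configured to calculate offset angles corresponding to the multiple text boxes;
a rotation module, configured to rotate the text image based on the offset angles.
In an optional embodiment of this application, the rotation module includes:
a first offset array generation submodule, configured to collect the offset angles to generate a first offset array;
a second offset array generation submodule, configured to determine the outliers in the first offset array and delete the outliers from the first offset array to generate a second offset array;
an element average calculation submodule, configured to calculate the average of the elements in the second offset array;
a rotation submodule, configured to rotate the text image using the element average.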
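The computation performed by the rotation module's submodules (collect the angles, drop outliers, average the rest) can be sketched as follows. The outlier rule used here, a fixed maximum deviation from the median, is only an assumption for illustration, as are the function name and the `max_dev` parameter; this passage does not fix a particular rule.

```python
from statistics import mean, median
from typing import Sequence


def correction_angle(angles: Sequence[float], max_dev: float = 5.0) -> float:
    """Average the per-box offset angles after discarding outliers.

    `angles` plays the role of the first offset array; the filtered list
    plays the role of the second offset array.
    """
    if not angles:
        raise ValueError("no offset angles given")
    center = median(angles)
    # Assumed outlier rule: keep angles within max_dev degrees of the median.
    kept = [a for a in angles if abs(a - center) <= max_dev]
    # If every angle was rejected, fall back to the median itself.
    return mean(kept) if kept else center
```

The returned value would then be handed to whatever image rotation routine is in use; the sign convention of the rotation depends on that routine.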
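In an optional embodiment of this application, the first order determination module 1003 includes:
a target text block determination submodule, configured to determine a target text block from the text block set;
a text direction determination submodule, configured to determine the text direction of the target text block;
a reading order determination submodule, configured to determine the reading order according to the text direction and the target text block;
a text block order determination submodule, configured to combine the text direction and the reading order to determine the text block order.
In an optional embodiment of this application, the target text block determination submodule includes:
a text box quantity determination unit, configured to determine the number of text boxes for any text block;
a re-sorting unit, configured to re-sort the text blocks in the text block set in descending order of the number of text boxes;
a target text block determination unit, configured to determine the first text block of the re-sorted text block set as the target text block.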
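Selecting the target text block reduces to picking the block with the most text boxes, sketched below. The function name and the representation of a block as a list of boxes are hypothetical.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def pick_target_block(block_set: Sequence[List[T]]) -> List[T]:
    """Return the text block that contains the most text boxes.

    Mirrors the described units: count boxes per block, sort the blocks in
    descending order of that count, and take the first one.
    """
    if not block_set:
        raise ValueError("empty text block set")
    ranked = sorted(block_set, key=len, reverse=True)
    return ranked[0]
```

Because the sort is stable, the earliest block wins when several blocks contain the same number of boxes; that tie rule is a property of this sketch only.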
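In an optional embodiment of this application, the text direction determination submodule includes:
a height and width calculation unit, configured to calculate the width and the height of the text boxes in the target text block;
a first text direction determination unit, configured to determine that the text direction is horizontal when the width is greater than the height;
a second text direction determination unit, configured to determine that the text direction is vertical when the width is not greater than the height.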
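The width-versus-height test can be sketched as below. The units above do not say how the widths and heights of several text boxes are aggregated, so comparing the summed width with the summed height is an assumption of this sketch, as are the names.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def text_direction(boxes: List[Box]) -> str:
    """Classify the target text block as 'horizontal' or 'vertical'."""
    if not boxes:
        raise ValueError("target text block has no text boxes")
    total_width = sum(x1 - x0 for x0, _, x1, _ in boxes)
    total_height = sum(y1 - y0 for _, y0, _, y1 in boxes)
    # Width greater than height means horizontal; otherwise vertical.
    return "horizontal" if total_width > total_height else "vertical"
```

A square aggregate (width equal to height) is classified as vertical, matching "not greater than".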
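In an optional embodiment of this application, when the text direction is horizontal, the reading order determination submodule includes:
a target string determination unit, configured to determine the horizontal target string of the target text block, the horizontal target string being the first horizontal line of character strings of the target text block;
a first input unit, configured to input the horizontal target string into the preset fluency scoring network in left-to-right order, the preset fluency scoring network being configured to determine the text fluency probability according to the horizontal target string;
a first reading unit, configured to read the text fluency probability;
a first reading order determination unit, configured to determine that the reading order is the left-to-right order when the text fluency probability is greater than the preset fluency threshold;
a second reading order determination unit, configured to determine that the reading order is the right-to-left order when the text fluency probability is not greater than the preset fluency threshold.
In an optional embodiment of this application, when the text direction is vertical, the reading order determination submodule includes:
a text column determination unit, configured to determine the left vertical text column and the right vertical text column of the target text block;
a second input unit, configured to input the left vertical text column and the right vertical text column into the preset start prediction scoring network, the preset start prediction scoring network being configured to determine the left start probability according to the left vertical text column and the right start probability according to the right vertical text column;
a second reading unit, configured to read the left start probability and the right start probability;
a third reading order determination unit, configured to determine that the reading order is the left-to-right order when the left start probability is greater than the right start probability;
a fourth reading order determination unit, configured to determine that the reading order is the right-to-left order when the right start probability is greater than the left start probability.
In an optional embodiment of this application, the text block order determination submodule includes:
an iterative sorting unit, configured to iterate the reading order following the text direction to generate the text block order.
In an optional embodiment of this application, the second order determination module 1004 includes:
an encoding submodule, configured to perform position encoding on the text boxes for any text block to generate text box position encoding vectors;
a text feature acquisition submodule, configured to obtain the text features corresponding to the text boxes;
a merging submodule, configured to merge the text block order, the text box position encoding vector, and the text feature to generate a merged text feature;
a stacking submodule, configured to stack the merged text features to generate an input feature group;
an input feature group input submodule, configured to input the input feature group into the preset encoding model, the preset encoding model being configured to output an encoded feature group according to the input feature group;
an in-block start probability determination submodule, configured to obtain the encoded feature group and determine the in-block start probabilities corresponding to the encoded feature group;
an in-block reading order determination submodule, configured to determine the in-block reading order according to the in-block start probabilities.
In an optional embodiment of this application, the merging submodule includes:
a mapping unit, configured to map the text feature to the text block order;
an addition unit, configured to add the mapped text feature to the text box position encoding vector to generate the merged text feature.
In an optional embodiment of this application, the in-block reading order determination submodule includes:
a text box sorting unit, configured to sort the text boxes in descending order of in-block start probability to generate the in-block reading order.
In an optional embodiment of this application, the sorting module 1005 includes:
a first text sorting submodule, configured to sort the text blocks using the text block order;
a second text sorting submodule, configured to sort the text boxes using the in-block reading order to complete the text sorting.
Because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.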
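Referring to FIG. 11, an embodiment of this application further provides an electronic device, including:
a processor 1101 and a non-volatile readable storage medium 1102, where the non-volatile readable storage medium 1102 stores a computer program executable by the processor 1101, and when the electronic device runs, the processor 1101 executes the computer program to perform the text sorting method of any embodiment of this application. The specific implementation and technical effects are similar to those of the method embodiments and are not repeated here.
The memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
Referring to FIG. 12, an embodiment of this application further provides a non-volatile readable storage medium 1201, on which a computer program is stored; when run by a processor, the computer program performs the text sorting method of any embodiment of this application. The specific implementation and technical effects are similar to those of the method embodiments and are not repeated here.
The embodiments in this specification are described in a progressive manner; each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another.
Those skilled in the art should understand that the embodiments of this application may be provided as a method, an apparatus, or a computer program product. Therefore, the embodiments of this application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the embodiments of this application may take the form of a computer program product implemented on one or more computer-usable non-volatile readable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.
The embodiments of this application are described with reference to flowcharts and/or block diagrams of the method, the terminal device (system), and the computer program product according to the embodiments of this application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing terminal device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing terminal device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing terminal device to operate in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing terminal device, so that a series of operational steps are performed on the computer or other programmable terminal device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable terminal device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of this application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of this application.
Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or terminal device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or terminal device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article, or terminal device that includes the element.
The text sorting method, apparatus, electronic device, and non-volatile readable storage medium provided by this application have been described in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the description of the above embodiments is only intended to help understand the method of this application and its core idea. At the same time, for those of ordinary skill in the art, there will be changes in the specific implementations and the scope of application based on the idea of this application. In summary, the content of this specification should not be construed as limiting this application.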

Claims (20)

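  1. A text sorting method, characterized by comprising:
    when a text image input is detected, recognizing the text image to obtain a text box set, the text box set comprising multiple text boxes;
    clustering the multiple text boxes to generate a text block set, the text block set comprising multiple text blocks;
    determining a text block order according to the text block set;
    for any of the text blocks, determining an in-block reading order;
    sorting text according to the text block order and the in-block reading order.
  2. The method according to claim 1, characterized in that the clustering the multiple text boxes to generate a text block set comprises:
    for any of the text boxes, generating a text box mask;
    dilating the text box mask;
    dividing the dilated text box mask into multiple connected domains;
    calculating an overlap degree between the text box and the connected domains;
    determining the text block according to the overlap degree;
    combining the text blocks to generate the text block set.
  3. The method according to claim 2, characterized in that the generating a text box mask for any of the text boxes comprises:
    for any of the text boxes, generating a corresponding binary image;
    determining the binary image as the text box mask.
  4. The method according to claim 2, characterized in that the dilating the text box mask comprises:
    performing a dilation operation on the text box mask based on a preset convolution kernel.
  5. The method according to claim 2, characterized in that the determining the text block according to the overlap degree comprises:
    sorting the connected domains in descending order of the overlap degree to generate a connected domain sequence;
    determining the first connected domain in the connected domain sequence as the text block.
  6. The method according to claim 1, characterized in that, before the clustering the multiple text boxes to generate a text block set, the method further comprises:
    calculating offset angles corresponding to the multiple text boxes;
    rotating the text image based on the offset angles.
  7. The method according to claim 6, characterized in that the rotating the text image based on the offset angles comprises:
    collecting the offset angles to generate a first offset array;
    determining outliers in the first offset array and deleting the outliers from the first offset array to generate a second offset array;
    calculating an average of the elements in the second offset array;
    rotating the text image using the element average.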
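  8. The method according to claim 1, characterized in that the determining a text block order according to the text block set comprises:
    determining a target text block from the text block set;
    determining a text direction of the target text block;
    determining a reading order according to the text direction and the target text block;
    combining the text direction and the reading order to determine the text block order.
  9. The method according to claim 8, characterized in that the determining a target text block from the text block set comprises:
    for any of the text blocks, determining a number of text boxes;
    re-sorting the text blocks in the text block set in descending order of the number of text boxes;
    determining the first text block of the re-sorted text block set as the target text block.
  10. The method according to claim 8, characterized in that the determining a text direction of the target text block comprises:
    calculating a width and a height of the text boxes in the target text block;
    when the width is greater than the height, determining that the text direction is horizontal;
    when the width is not greater than the height, determining that the text direction is vertical.
  11. The method according to claim 10, characterized in that, when the text direction is horizontal, the determining a reading order according to the text direction and the target text block comprises:
    determining a horizontal target string of the target text block, the horizontal target string being the first horizontal line of character strings of the target text block;
    inputting the horizontal target string into a preset fluency scoring network in left-to-right order, the preset fluency scoring network being configured to determine a text fluency probability according to the horizontal target string;
    reading the text fluency probability;
    when the text fluency probability is greater than a preset fluency threshold, determining that the reading order is a left-to-right order;
    when the text fluency probability is not greater than the preset fluency threshold, determining that the reading order is a right-to-left order.
  12. The method according to claim 10, characterized in that, when the text direction is vertical, the determining a reading order according to the text direction and the target text block comprises:
    determining a left vertical text column and a right vertical text column of the target text block;
    inputting the left vertical text column and the right vertical text column into a preset start prediction scoring network, the preset start prediction scoring network being configured to determine a left start probability according to the left vertical text column and a right start probability according to the right vertical text column;
    reading the left start probability and the right start probability;
    when the left start probability is greater than the right start probability, determining that the reading order is a left-to-right order;
    when the right start probability is greater than the left start probability, determining that the reading order is a right-to-left order.
  13. The method according to claim 8, characterized in that the combining the text direction and the reading order to determine the text block order comprises:
    following the text direction, iterating the reading order to generate the text block order.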
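  14. The method according to claim 1, characterized in that the determining an in-block reading order for any of the text blocks comprises:
    for any of the text blocks, performing position encoding on the text boxes to generate text box position encoding vectors;
    obtaining text features corresponding to the text boxes;
    merging the text block order, the text box position encoding vector, and the text feature to generate a merged text feature;
    stacking the merged text features to generate an input feature group;
    inputting the input feature group into a preset encoding model, the preset encoding model being configured to output an encoded feature group according to the input feature group;
    obtaining the encoded feature group and determining in-block start probabilities corresponding to the encoded feature group;
    determining the in-block reading order according to the in-block start probabilities.
  15. The method according to claim 14, characterized in that the merging the text block order, the text box position encoding vector, and the text feature to generate a merged text feature comprises:
    mapping the text feature to the text block order;
    adding the mapped text feature to the text box position encoding vector to generate the merged text feature.
  16. The method according to claim 14, characterized in that the determining the in-block reading order according to the in-block start probabilities comprises:
    sorting the text boxes in descending order of the in-block start probabilities to generate the in-block reading order.
  17. The method according to claim 1, characterized in that the sorting text according to the text block order and the in-block reading order comprises:
    sorting the text blocks using the text block order;
    sorting the text boxes using the in-block reading order to complete the text sorting.
  18. A text sorting apparatus, characterized by comprising:
    a recognition module, configured to recognize a text image when a text image input is detected, to obtain a text box set, the text box set comprising multiple text boxes;
    a clustering module, configured to cluster the multiple text boxes to generate a text block set, the text block set comprising multiple text blocks;
    a first order determination module, configured to determine a text block order according to the text block set;
    a second order determination module, configured to determine an in-block reading order for any of the text blocks;
    a sorting module, configured to sort text according to the text block order and the in-block reading order.
  19. An electronic device, characterized by comprising a processor, a memory, and a computer program stored in the memory and executable on the processor, wherein the computer program, when executed by the processor, implements the steps of the text sorting method according to any one of claims 1 to 17.
  20. A non-volatile readable storage medium, characterized in that a computer program is stored on the non-volatile readable storage medium, and the computer program, when executed by a processor, implements the steps of the text sorting method according to any one of claims 1 to 17.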
PCT/CN2023/115049 2022-12-22 2023-08-25 一种文本排序方法、装置、电子设备和存储介质 WO2024131115A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211654994.XA CN115641573B (zh) 2022-12-22 2022-12-22 一种文本排序方法、装置、电子设备和存储介质
CN202211654994.X 2022-12-22

Publications (1)

Publication Number Publication Date
WO2024131115A1 true WO2024131115A1 (zh) 2024-06-27

Family

ID=84948153

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/115049 WO2024131115A1 (zh) 2022-12-22 2023-08-25 一种文本排序方法、装置、电子设备和存储介质

Country Status (2)

Country Link
CN (1) CN115641573B (zh)
WO (1) WO2024131115A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115641573B (zh) * 2022-12-22 2023-07-14 苏州浪潮智能科技有限公司 一种文本排序方法、装置、电子设备和存储介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104268127A (zh) * 2014-09-22 2015-01-07 同方知网(北京)技术有限公司 一种电子档版式文件阅读顺序分析的方法
US10970458B1 (en) * 2020-06-25 2021-04-06 Adobe Inc. Logical grouping of exported text blocks
CN113591433A (zh) * 2021-02-20 2021-11-02 腾讯科技(深圳)有限公司 一种文本排版方法、装置、存储介质及计算机设备
CN114821590A (zh) * 2022-04-25 2022-07-29 中国平安人寿保险股份有限公司 文档信息提取方法、装置、设备及介质
CN115641573A (zh) * 2022-12-22 2023-01-24 苏州浪潮智能科技有限公司 一种文本排序方法、装置、电子设备和存储介质

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108334805B (zh) * 2017-03-08 2020-04-03 腾讯科技(深圳)有限公司 检测文档阅读顺序的方法和装置
CN111680491B (zh) * 2020-05-27 2024-02-02 北京字跳网络技术有限公司 文档信息的抽取方法、装置和电子设备
CN114332889A (zh) * 2021-08-26 2022-04-12 腾讯科技(深圳)有限公司 文本图像的文本框排序方法和文本图像的文本框排序装置

Also Published As

Publication number Publication date
CN115641573B (zh) 2023-07-14
CN115641573A (zh) 2023-01-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23905294

Country of ref document: EP

Kind code of ref document: A1