CN114399782B - Text image processing method, apparatus, device, storage medium, and program product - Google Patents


Info

Publication number
CN114399782B
CN114399782B (application number CN202210056559.0A)
Authority
CN
China
Prior art keywords
paragraph
text
image
text image
box
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210056559.0A
Other languages
Chinese (zh)
Other versions
CN114399782A (en)
Inventor
曹真
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202210056559.0A
Publication of CN114399782A
Application granted
Publication of CN114399782B
Active legal status
Anticipated expiration legal status


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiments of the present application provide a text image processing method, apparatus, device, storage medium and program product, which can be applied to various fields and scenarios such as intelligent platforms and artificial intelligence technology. The method comprises the following steps: extracting features from a text image to be processed to obtain image features of the text image to be processed; determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image features; and determining a paragraph box label map of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map. The text paragraphs in the target paragraph boxes included in the paragraph box label map differ from one another, and each target paragraph box is close to each boundary position of its corresponding text paragraph. According to the embodiments of the present application, text paragraphs can be divided more accurately and better text paragraph outlines obtained, thereby achieving a better text paragraph box detection effect.

Description

Text image processing method, apparatus, device, storage medium, and program product
Technical Field
The present application relates to the field of artificial intelligence, and in particular to a text image processing method, a text image processing apparatus, a computer device, a computer readable storage medium and a computer program product.
Background
In existing schemes for processing text in a text image, the text is generally processed based on text lines. In a translation scenario, for example, optical character recognition (Optical Character Recognition, OCR) technology performs text detection and recognition at line granularity; that is, the text in the text image is segmented into text lines and translated line by line. However, line-based processing treats each line in isolation and ignores the relationship between lines: the information at the end of one line and the information at the beginning of the next often form a single complete expression, so the text information of a single line is incomplete. Translating by text line therefore loses information, leading to inaccurate translations that are difficult to understand. As a result, current approaches tend instead to process the text in a text image based on text paragraphs. How to detect text paragraphs in text images is a problem to be solved and a current research hotspot.
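The line-versus-paragraph problem described above can be sketched in a few lines of Python (the OCR line strings here are invented for illustration): translating each OCR line on its own splits one complete expression, while reflowing the lines of a paragraph into a single unit keeps it intact.

```python
# Toy illustration of line-granularity vs paragraph-granularity processing.
# The line contents are invented; a real pipeline would get them from OCR.
ocr_lines = [
    "Machine learning is a branch of",
    "artificial intelligence that studies",
    "how computers learn from data.",
]

# Line granularity: each fragment lacks the context of its neighbors.
for line in ocr_lines:
    print("line-granularity unit:", line)

# Paragraph granularity: reflow the lines into one complete expression first,
# so a downstream step (e.g. translation) sees the whole sentence.
paragraph = " ".join(line.strip() for line in ocr_lines)
print("paragraph-granularity unit:", paragraph)
```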
Disclosure of Invention
The embodiments of the present application provide a text image processing method, a text image processing apparatus, a computer device, a computer-readable storage medium and a computer program product, which can divide text paragraphs more accurately and obtain better text paragraph outlines, thereby achieving a better text paragraph box detection effect.
In one aspect, an embodiment of the present application provides a text image processing method, where the method includes:
extracting features of the text image to be processed to obtain image features of the text image to be processed;
determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image features; the style segmentation map is used for indicating the boundary positions between text paragraphs of different styles in the text image to be processed;
determining a paragraph box label map of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map; the text paragraphs in the target paragraph boxes included in the paragraph box label map differ from one another, and each target paragraph box is close to each boundary position of its corresponding text paragraph.
In another aspect, an embodiment of the present application provides a text image processing apparatus, including:
the extraction unit is used for extracting the characteristics of the text image to be processed to obtain the image characteristics of the text image to be processed;
the processing unit is used for determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image characteristics; the style segmentation map is used for indicating the boundary positions between text paragraphs with different styles in the text image to be processed;
The processing unit is further used for determining a paragraph box label map of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map; the text paragraphs in the target paragraph boxes included in the paragraph box label map differ from one another, and each target paragraph box is close to each boundary position of its corresponding text paragraph.
In yet another aspect, embodiments of the present application provide a computer device comprising a processor, an output interface, a communication interface and a memory, the processor, the communication interface and the memory being connected with each other, wherein the memory stores executable program code and the processor is configured to call the executable program code to implement the text image processing method provided by the embodiments of the present application. Accordingly, the embodiments of the present application also provide a computer-readable storage medium storing instructions which, when run on a computer, cause the computer to implement the text image processing method provided by the embodiments of the present application.
Accordingly, embodiments of the present application also provide a computer program product comprising a computer program or computer instructions which, when executed by a processor, implement the steps of the text image processing method provided by the embodiments of the present application.
According to the embodiments of the present application, features are extracted from a text image to be processed to obtain image features of the text image; a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image are determined based on the image features; and finally a paragraph box label map of the text image is determined based on the paragraph probability map, the paragraph boundary map and the style segmentation map, where the text paragraphs in the target paragraph boxes included in the paragraph box label map differ from one another and each target paragraph box is close to each boundary position of its corresponding text paragraph. In this manner, on the one hand, the text in the text image is divided into text paragraphs in combination with the style segmentation map, so texts of different styles can be divided into different text paragraphs; this better matches the actual division of text paragraphs and improves the accuracy of text paragraph division. On the other hand, each target paragraph box in the paragraph box label map is close to each boundary position of its corresponding text paragraph and is usually a polygon with more than four vertices; compared with a rectangular text paragraph box, useless background information can be effectively reduced, and the paragraph outline indicated by the paragraph box is finer and more accurate, so that when the text paragraphs in the text image are processed, the recognition area is reduced and the processing efficiency is improved.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application or in the prior art, the drawings required for describing the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description show only some embodiments of the present application, and that a person skilled in the art may obtain other drawings from these drawings without inventive effort.
FIG. 1 is a schematic diagram of a text image processing system according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a text image processing method according to an embodiment of the present application;
fig. 3 is a schematic diagram of a comparison between a text image to be processed and a style segmentation map according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a comparison of different paragraph box labels of the same text according to an embodiment of the present application;
FIG. 5 is a comparative schematic diagram of different paragraph divisions of the same text provided by embodiments of the present application;
FIG. 6 is a schematic diagram illustrating a comparison of text paragraph boxes formed from the same text based on different coordinate points according to an embodiment of the present application;
fig. 7 is a flowchart of another text image processing method according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a network system of a text image processing model according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a text paragraph box optimization process provided by an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating a text paragraph box detection model according to an embodiment of the present application;
fig. 11 is a schematic structural diagram of a text image processing apparatus according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
For a better understanding of embodiments of the present application, some terms related to embodiments of the present application are described below:
text paragraph box detection of text images: belonging to one of the tasks of image document analysis (Document Image Analysis, DIA) including, for example, image document classification, document layout analysis, form detection, etc.
LayoutParser: an image document layout analysis tool built on Detectron2, the object detection framework from Facebook (Meta). It provides pre-trained models based on open-source datasets to simplify document image analysis (DIA) research. It is easy to install and can be used to build flexible and accurate pipelines for processing documents with complex structures. The core library of LayoutParser provides a set of simple and intuitive interfaces with which deep learning (DL) models can easily be labeled and trained on specific document image datasets to implement layout detection, character recognition and many other document processing tasks.
LayoutLM and LayoutLM2: simple and efficient pre-trained models for document image understanding tasks proposed by Microsoft, which combine image information and text information. They perform joint text-and-layout pre-training on large-scale unlabeled document datasets and achieve leading results on multiple downstream document understanding tasks.
In order to implement text paragraph box detection in a text image, the embodiments of the present application consider several ways, including:
Mode one: analyzing the image document layout based on LayoutParser to obtain a paragraph analysis result of the image document.
Mode two: analyzing the image document layout based on LayoutLM and LayoutLM2 to obtain a paragraph analysis result of the image document.
The problems with the above several approaches are:
in the mode 1, the data set and the model which are caused by the open source of the layoutParser are based on single scene data, such as tables, academic papers, news newspaper structures and the like, and most frames are based on 4-point coordinates, so that the actual complex service scene cannot be satisfied.
In mode two, the first step executed by LayoutLM and LayoutLM2 requires extracting token-level text and the corresponding bounding box information, so the data preprocessing cost is relatively high.
The embodiments of the present application provide a text image processing method, including a method for detecting text paragraph boxes in a text image: a paragraph probability map, a paragraph boundary map and a style segmentation map are obtained from the image features, and a paragraph box label map is obtained based on these three maps. The style segmentation map reflects the boundary lines between texts of different styles; that is, the texts of adjacent target paragraph boxes in the paragraph box label map obtained based on the style segmentation map are texts of different styles, and each target paragraph box in the paragraph box label map better matches the actual division of text paragraph boxes, thereby improving the accuracy of text paragraph division. The text image processing method provided by the embodiments of the present application may be implemented based on artificial intelligence (Artificial Intelligence, AI) technology. AI refers to the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. AI technology is a comprehensive discipline covering a wide range of fields; the text image processing method provided in the embodiments of the present application mainly relates to machine learning (Machine Learning, ML) technology and computer vision (Computer Vision, CV) technology within AI. Machine learning typically includes techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning and the like.
Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, among others.
In a possible embodiment, the text image processing method provided in the embodiments of the present application may further be implemented based on cloud technology and/or blockchain technology. In particular, it may involve one or more of cloud storage, cloud database and big data in cloud technology. For example, the data required to perform the text image processing method (e.g., text image data) may be acquired from a cloud database. For another example, the data required to perform the text image processing method may be stored in blocks on a blockchain; the data resulting from performing the text image processing method (e.g., image features, paragraph probability maps, paragraph boundary maps, paragraph binary maps) may likewise be stored in blocks on the blockchain; in addition, the data processing device performing the text image processing method may be a node device in a blockchain network.
The text image processing method provided by the embodiments of the present application may be applied to the network architecture shown in fig. 1. The data processing device 10 shown in fig. 1 may be a server or a terminal with a data (e.g., text image) processing function. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms. The terminal may be, but is not limited to, a smartphone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, a smart home appliance, a vehicle-mounted terminal, and the like. The text database 11 shown in fig. 1 may be a local database of the data processing device 10 or a cloud database accessible to the data processing device 10, and contains various text image data. The text image processing method provided in the embodiments of the present application may be executed by the data processing device 10, specifically:
The text image to be processed is acquired from the text database 11, and features are then extracted from it to obtain the image features of the text image to be processed; a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image are determined based on the image features; and finally a paragraph box label map of the text image is determined based on the paragraph probability map, the paragraph boundary map and the style segmentation map, where the text paragraphs in the target paragraph boxes included in the paragraph box label map differ from one another and each target paragraph box is close to each boundary position of its corresponding text paragraph. In this manner, on the one hand, the text in the text image is divided into text paragraphs in combination with the style segmentation map, so texts of different styles can be divided into different text paragraphs; this better matches the actual division of text paragraphs and improves the accuracy of text paragraph division. On the other hand, each target paragraph box in the paragraph box label map is close to each boundary position of its corresponding text paragraph and is usually a polygon with more than four vertices; compared with a rectangular text paragraph box, useless background information can be effectively reduced, and the paragraph outline indicated by the paragraph box is finer and more accurate, so that when the text paragraphs in the text image are processed, the recognition area is reduced and the processing efficiency is improved.
The text image processing method provided by the embodiment of the application is briefly described above, and a specific implementation manner of the text image processing method is described in detail below.
The solution provided by the embodiments of the present application relates to artificial intelligence technologies such as computer vision and machine learning, and is specifically described by the following embodiments:
referring to fig. 2, fig. 2 is a flowchart of a text image processing method according to an embodiment of the present application. The text image processing method described in the embodiments of the present application may be performed by the data processing apparatus 10 shown in fig. 1, and the method may include the steps of:
s201, extracting features of the text image to be processed to obtain image features of the text image to be processed.
In the embodiments of the present application, a text image is an image that includes text content. It may be an image photographed or captured of an object displaying text, an image containing text decoration (such as a meme image containing text), or a document image in which a document has been converted into an image format (such as PNG, JPG or TIFF). It can be appreciated that the image features in the embodiments of the present application may be feature maps or feature vectors; the following description takes the feature map as an example.
In an embodiment, a feature extraction network of a text image processing model is called to extract features from the text image to be processed, obtaining the image features of the text image. The feature extraction network may include a convolutional neural network (Convolutional Neural Network, CNN), which performs operations such as sampling, fusion, convolution and concatenation on the text image to obtain the image features.
In an embodiment, the text image to be processed may be obtained by a terminal device through a tool it is configured with (such as a camera), obtained by the terminal device from other devices in the network over the Internet (such as a meme image containing text, a screenshot containing text, or an image containing text), or obtained directly from a local storage device (such as a USB flash drive).
In one embodiment, the text in the text image to be processed may be in the official language of any country (e.g., Chinese or English), and the image may contain one or more kinds of text. For example, if the text image to be processed is a photo, the text contained in the photo may be Chinese only, or may include multiple kinds of text such as Chinese, English and German.
S202, determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image features.
In the embodiments of the present application, the paragraph probability map indicates, for each pixel in the text image to be processed, a first probability value representing the probability that the pixel belongs to a text paragraph region; based on the paragraph probability map, the text paragraph region corresponding to each text paragraph in the text image can be roughly determined. The paragraph boundary map indicates, for each pixel, a second probability value representing the probability that the pixel belongs to the boundary of a text paragraph region; based on the paragraph boundary map, the boundary of the text paragraph region corresponding to each text paragraph can be roughly determined. The style segmentation map indicates, for each pixel, a third probability value representing whether the pixel lies on a boundary between texts of different styles; based on the style segmentation map, the boundary positions between texts of different styles in the text image can be roughly determined.
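A minimal sketch of how three such per-pixel maps might be produced from a shared feature map, assuming for illustration a 1x1 linear head with a sigmoid per map; the weights, biases and single-channel features below are invented, not taken from the patent.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def per_pixel_head(features, weight, bias):
    """A 1x1 linear layer plus sigmoid applied independently at every pixel."""
    return [[sigmoid(weight * v + bias) for v in row] for row in features]

# Toy 2x3 shared feature map (a single channel, for brevity).
shared = [[0.2, 1.5, 0.3],
          [1.4, 1.6, 0.1]]

prob_map     = per_pixel_head(shared, weight=3.0,  bias=-2.0)  # first probability values
boundary_map = per_pixel_head(shared, weight=-3.0, bias=1.0)   # second probability values
style_map    = per_pixel_head(shared, weight=1.0,  bias=-1.0)  # third probability values
```

Each map then holds one probability in [0, 1] per pixel of the input, matching the three per-pixel quantities described above.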
The style segmentation map is used to indicate the boundary positions between text paragraphs of different styles in the text image to be processed, so the boundaries between texts of different styles can be marked, which in turn facilitates segmenting the text paragraphs in the text image. For example, as shown in fig. 3, the left image is the text image to be processed, in which the text fonts and font sizes have different styles. The right image is the style segmentation map, which marks the boundaries between texts of different styles in the text image (i.e., represents the paragraph boundaries of texts of different styles).
Illustratively, the text image to be processed in fig. 3 is processed, and suppose that a paragraph label map (a) obtained without the style segmentation map and a paragraph label map (b) obtained with the style segmentation map are as shown in fig. 4. When the text in the two maps is translated into English, the translation obtained from paragraph label map (a) is "Artificial intelligence machine learning natural language processing"; that is, the title "artificial intelligence" and the body text "machine learning" and "natural language processing" are translated together, and the semantics are confused. The translations obtained from paragraph label map (b) are "Artificial intelligence" and "Machine learning natural language processing"; that is, the title and the body text are translated separately, and the semantics are clear. This example shows that dividing the text in the text image into different text paragraphs in combination with the style segmentation map better matches the actual division of text paragraphs and improves the accuracy of text paragraph division.
S203, determining a paragraph box label map of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map.
In this embodiment, the paragraph box label map is a map in which all the text in the text image to be processed is marked into paragraphs according to different styles (such as title and body text); subsequent processing of the text image to be processed is based on this paragraph box label map.
The text paragraphs in the target paragraph boxes included in the paragraph box label map differ from one another, and each target paragraph box is close to each boundary position of its corresponding text paragraph. "Close" means that the distance between the target paragraph box and each boundary position of the corresponding text paragraph falls within a preset interval; this interval ensures both that the target paragraph box completely includes the information of the corresponding text paragraph and that the box contains little useless background, i.e., the paragraph outline indicated by the target paragraph box is finer and more accurate, so the detection effect on the target paragraph box is better. The boundary positions include the left and right boundaries of all text lines in the text paragraph, the upper boundary of the first text line, the lower boundary of the last text line, the lower boundary of the part of the penultimate text line that is staggered from the last text line, and the portions connecting these boundaries to the left and right boundaries of the text lines. For example, as shown in fig. 4, the left image is a text line label map and the right image is a paragraph label map. "1" in the right image represents the left and right boundaries of all text lines in the left image; "2" represents the connecting portions of the left and right boundaries of all the text lines; "3" represents the lower boundary of the part of the penultimate text line that is staggered from the last text line; "4" represents the upper boundary of the first text line; "5" represents the lower boundary of the last text line.
In an embodiment, the paragraph box label map of the text image to be processed is determined based on the paragraph probability map, the paragraph boundary map and the style segmentation map as follows:
determining a paragraph binary image of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map; and performing paragraph box labeling processing on the paragraph binary image to determine the paragraph box label map of the text image to be processed. A text background pixel in the text image to be processed is represented by the value 255 and rendered in a specific color, generally white; a non-text background pixel is represented by the value 0 and rendered in a different color, generally black. Therefore, when the paragraph binary image is further processed, the set properties of the paragraph binary image depend only on the positions of pixels whose values are 0 or 255 and do not involve multi-level pixel values, so the processing is simple, the amount of data to be processed and compressed is small, and processing efficiency is improved.
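The binarization described above can be sketched as follows (an illustrative example only, not the claimed implementation; the 0.5 threshold and the use of a plain probability map as the sole input are assumptions of this sketch):

```python
# Illustrative sketch: turning a paragraph probability map into a 0/255
# paragraph binary image with a fixed threshold (0.5 is an assumed value).

def binarize(prob_map, threshold=0.5):
    # Pixels above the threshold become text background (255), others 0,
    # so later steps depend only on pixel positions, not multi-level values.
    return [[255 if p > threshold else 0 for p in row] for row in prob_map]

prob_map = [[0.9, 0.8, 0.1],
            [0.7, 0.6, 0.2]]
binary = binarize(prob_map)
# binary == [[255, 255, 0], [255, 255, 0]]
```

In the described method the paragraph boundary map and style segmentation map would also influence which pixels become 255, for example by forcing boundary pixels between paragraphs of different styles to 0 so that adjacent paragraphs remain separated.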
It should be noted that the text in adjacent regions represented by text background pixels in the paragraph binary image differs in style. The paragraph binary image is determined based on the paragraph probability map, the paragraph boundary map and the style segmentation map, where the style segmentation map indicates the boundary positions between text paragraphs of different styles in the text image to be processed; that is, adjacent text paragraphs in the paragraph binary image determined based on the style segmentation map have different styles.
In one embodiment, the step of determining the paragraph box label map of the text image to be processed includes:
(1) Determining an initial text paragraph box based on the region boundary of a target region in the paragraph binary image, where the target region is any one of the regions formed by pixels whose values equal a set value in the paragraph binary image.
(2) And performing erosion processing on the initial text paragraph box to obtain an eroded text paragraph box.
The erosion processing operates on the region boundary of the initial text paragraph box. The process is as follows: define a convolution kernel B, which may be of any shape and size and has a separately defined reference point (anchor point); typically, kernel B is a square or disk with a reference point and may be referred to as a template or mask; convolve kernel B with the initial text paragraph box, computing the minimum pixel value within the area covered by kernel B; assign that minimum value to the pixel specified by the reference point; finally, the eroded initial text paragraph box is obtained. In this way, part of the noise in the initial text paragraph box can be eliminated (smoothing its region boundary), and separate image elements can be segmented (the boundaries of adjacent initial text paragraph boxes are separated so that each box is independent of the others).
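The erosion processing described above can be sketched in pure Python as follows (an illustrative stand-in for a library routine such as OpenCV's erode; the 3×3 square kernel with a centre anchor and the keep-borders policy are assumptions of this sketch):

```python
# Minimal sketch of morphological erosion with a 3x3 square kernel whose
# reference point is its centre; border pixels are kept unchanged here,
# which is one of several possible border policies (an assumption).

def erode(img):
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            # assign the minimum value under the kernel to the anchor pixel
            out[y][x] = min(img[y + dy][x + dx]
                            for dy in (-1, 0, 1) for dx in (-1, 0, 1))
    return out

img = [[0,   0,   0,   0],
       [0, 255, 255,   0],
       [0, 255, 255,   0],
       [0,   0,   0,   0]]
# A lone 2x2 white block is removed entirely: every interior 3x3
# neighbourhood contains a 0, so erosion eliminates this small noise blob.
```

This is exactly why erosion both smooths region boundaries and separates touching image elements: any pixel whose kernel neighbourhood touches the background is pulled down to the background value.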
(3) And performing polygon approximation processing on the eroded text paragraph box to obtain a target text paragraph box.
The target text paragraph box is formed based on N point coordinates, wherein N is an integer greater than or equal to 4. For example, as shown in FIG. 5, the left diagram is a text paragraph box based on a rectangle, and the right diagram is a target text paragraph box based on N-point coordinates. As can be seen from fig. 5, the target text paragraph box based on the N-point coordinates is closer to the text than the text paragraph box based on the rectangle, and the background information is less, that is, the paragraph outline indicated by the text paragraph box based on the N-point coordinates is finer and more accurate, so that when the text paragraph in the text image is processed, the recognition area is reduced, and the processing efficiency is improved.
The polygon approximation processing adjusts the coordinate points of the eroded text paragraph box so that the processed box fits the text more closely. As shown in fig. 6, the left rectangle-based text paragraph box is subjected to polygon approximation processing to obtain the right target text paragraph box based on N-point coordinates, where N is greater than or equal to 4. As can be seen from fig. 6, after polygon approximation of the rectangle-based text paragraph box, the resulting target text paragraph box based on N-point coordinates contains less background information, and the paragraph outline it indicates is finer and more accurate.
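The polygon approximation described above can be sketched with the Douglas-Peucker algorithm, which library routines such as OpenCV's approxPolyDP also implement (an illustrative example; here the distance tolerance eps plays the role of the approximation parameter — a smaller eps keeps more vertices and hence a finer outline):

```python
import math

def _perp_dist(p, a, b):
    # perpendicular distance from point p to the line through a and b
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    seg = math.hypot(dx, dy)
    if seg == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dx * (ay - py) - dy * (ax - px)) / seg

def approx_poly(points, eps):
    # Douglas-Peucker: drop vertices closer than eps to the current chord
    if len(points) < 3:
        return list(points)
    dmax, idx = 0.0, 0
    for i in range(1, len(points) - 1):
        d = _perp_dist(points[i], points[0], points[-1])
        if d > dmax:
            dmax, idx = d, i
    if dmax > eps:
        left = approx_poly(points[:idx + 1], eps)
        right = approx_poly(points[idx:], eps)
        return left[:-1] + right
    return [points[0], points[-1]]

# A near-collinear vertex is dropped; the corner vertex is kept:
simplified = approx_poly([(0, 0), (1, 0.05), (2, 0), (2, 2)], eps=0.2)
# simplified == [(0, 0), (2, 0), (2, 2)]
```

Iteratively reducing eps, as described later for the training procedure, makes the approximated contour track the eroded boundary ever more tightly.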
(4) And determining a paragraph box mark graph of the text image to be processed based on each target text paragraph box.
The above-described method may be performed, for example, by a terminal having an image translation function. Based on the paragraph box label map of the text image to be processed, the text information in the text image is translated according to the content of each target text paragraph box.
According to the embodiments of the present application, feature extraction is performed on the text image to be processed to obtain image features of the text image to be processed; a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed are determined based on the image features; and finally, a paragraph box label map of the text image to be processed is determined based on the paragraph probability map, the paragraph boundary map and the style segmentation map, where the text paragraphs in the target paragraph boxes of the paragraph box label map differ from one another and each target paragraph box closely fits each boundary position of its corresponding text paragraph. With this approach, on the one hand, the text in the text image is divided into text paragraphs in combination with the style segmentation map, and texts of different styles can be divided into different text paragraphs, which better matches the actual division of text paragraphs and improves the accuracy of text paragraph division; on the other hand, each target paragraph box in the paragraph box label map closely fits each boundary position of the corresponding text paragraph and is usually a polygon with more than four vertices, so compared with a rectangular text paragraph box, useless background information can be effectively reduced and the paragraph outline indicated by the paragraph box is finer and more accurate; therefore, when the text paragraphs in the text image are processed, the recognition area is reduced and processing efficiency is improved.
Referring to fig. 7, fig. 7 is a flowchart of another text image processing method according to an embodiment of the present application. The method may comprise the steps of:
and S701, extracting features of the text image to be processed to obtain image features of the text image to be processed.
The specific implementation of step S701 refers to the description related to step S201 in the foregoing embodiment, and is not repeated here.
S702, calling a prediction network of a target text image processing model to process image characteristics, and determining a paragraph probability map, a paragraph boundary map and a style segmentation map of a text image to be processed.
The prediction network comprises a first sub-network, a second sub-network and a third sub-network which are connected in parallel, wherein the input ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the output end of the feature extraction network of the target text image processing model, the output ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the input end of the paragraph segmentation network of the target text image processing model, the first sub-network is used for determining a paragraph probability graph, the second sub-network is used for determining a paragraph boundary graph, and the third sub-network is used for determining a style segmentation graph. For example, as shown in fig. 8, the text image processing model includes a feature extraction network, a paragraph segmentation network, and a prediction network. Wherein the prediction network comprises a first sub-network, a second sub-network and a third sub-network connected in parallel. The input ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the output ends of the feature extraction network of the target text image processing model. The output ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the input ends of the paragraph segmentation network of the target text image processing model. The first sub-network is used for determining a paragraph probability map, the second sub-network is used for determining a paragraph boundary map, and the third sub-network is used for determining a style segmentation map.
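The three parallel sub-networks can be sketched as three 1×1-convolution-style heads applied to the same shared features (a toy example with hand-picked weights, not the trained model; real heads would typically be small convolutional stacks):

```python
import math

def head(features, w, b):
    # A 1x1 convolution followed by a sigmoid, applied per pixel to an
    # H x W feature map whose pixels are C-dimensional lists.
    def sig(z):
        return 1.0 / (1.0 + math.exp(-z))
    return [[sig(sum(wc * fc for wc, fc in zip(w, px)) + b) for px in row]
            for row in features]

# Shared image features from the feature extraction network (toy values):
features = [[[0.5, -1.0], [2.0, 0.0]]]   # 1 x 2 spatial map, 2 channels

prob_map     = head(features, w=[1.0, 0.5], b=0.0)   # first sub-network
boundary_map = head(features, w=[-0.5, 1.0], b=0.0)  # second sub-network
style_map    = head(features, w=[0.2, 0.2], b=0.1)   # third sub-network
# The three heads run in parallel on the same features, and their three
# output maps feed the paragraph segmentation network together.
```

The key architectural point is that the heads share one feature extractor and differ only in their output targets, matching the parallel wiring described above.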
In one embodiment, a training data set is obtained, wherein the training data set comprises a plurality of groups of training sample pairs, each group of training sample pairs comprises a sample text image and a sample paragraph frame mark graph corresponding to the sample text image, and each sample paragraph frame in the sample paragraph frame mark graph is close to each boundary position of a corresponding text paragraph; training the initial text image processing model by using a training data set to obtain a trained text image processing model; the test loss value of the trained text image processing model is smaller than the set loss value; and determining the trained text image processing model as a target text image processing model.
In one embodiment, as shown in FIG. 8, the text image processing model may include a feature extraction network, a prediction network, and a paragraph segmentation network. And optimally training the networks by using a training data set to obtain a text image processing model. The text image processing model training step can comprise the following steps:
(1) And acquiring a training data set, wherein the training data set comprises a plurality of groups of training sample pairs, each group of training sample pairs comprises a sample text image and a sample paragraph frame mark graph corresponding to the sample text image, and each sample paragraph frame in the sample paragraph frame mark graph is close to each boundary position of a corresponding text paragraph.
It will be appreciated that each sample paragraph box in the sample paragraph box label map is manually labeled. Background information inside a paragraph box has a large impact on subsequent text paragraph detection; tight manual labeling therefore reduces the influence of background information and improves the effect of text paragraph box detection.
(2) And calling a feature extraction network of the initial text image processing model, and extracting features of the sample text image to obtain image features.
(3) And calling the prediction network of the initial text image processing model to process the image features to obtain a paragraph probability map, a paragraph boundary map and a style segmentation map.
In the embodiment of the application, inputting the image characteristics into a first sub-network of a prediction network to obtain a paragraph probability map; inputting the image characteristics into a second sub-network of the prediction network to obtain a paragraph boundary map; and inputting the image characteristics into a third sub-network of the prediction network to obtain a style segmentation map.
(4) And calling the paragraph segmentation network of the initial text image processing model to process the paragraph probability map, the paragraph boundary map and the style segmentation map to obtain a paragraph label map.
In one embodiment, the paragraph segmentation network of the initial text image processing model is called to process the paragraph probability map, the paragraph boundary map and the style segmentation map to obtain a paragraph binary image; after erosion processing is performed on the paragraph binary image, polygon approximation processing is performed to obtain a paragraph label map.
During the polygon approximation processing, the following cases arise:
In case 1, when the overlap ratio between the area of the paragraph box label map obtained after polygon approximation and the area of the paragraph box label map obtained after erosion processing is greater than or equal to a preset value, the paragraph box label map obtained after polygon approximation is determined as the target paragraph box label map. The erosion processing may be performed using a library function. For example, the erosion processing may be written as:
kernel = np.ones((3, 3), dtype=np.uint8)
erosion = cv2.erode(img, kernel, iterations=1)
ss = np.hstack((img, erosion))
cv_show(ss)
In case 2, when the overlap ratio between the area of the paragraph box label map obtained after polygon approximation and the area of the paragraph box label map obtained after erosion processing is smaller than the preset value, the approximation parameter value of the approximation function used in polygon approximation is iteratively reduced, and polygon approximation continues to be performed on the eroded paragraph box label map until the overlap ratio is greater than or equal to the preset value, whereupon the target paragraph box label map is determined. In this way, each target paragraph box in the target paragraph box label map closely fits each boundary position of the corresponding text paragraph; compared with a rectangular text paragraph box, useless background information can be effectively reduced and the paragraph outline indicated by the paragraph box is finer and more accurate, so that when the text paragraphs in the text image are processed, the recognition area is reduced and processing efficiency is improved.
In an embodiment, the step of obtaining the target text image processing model based on the training data set may further comprise:
Step 1: for any sample paragraph box in a reference paragraph box label map, obtain the region area A and region perimeter L of that sample paragraph box, where the reference paragraph box label map is any sample paragraph box label map in the training data set; determine a scaling distance D based on the scaling ratio r, the area A and the perimeter L, where D = A·(1 − r²)/L; and scale (shrink) the sample paragraph box by the scaling distance, obtaining a scaled reference paragraph box label map from the scaled sample paragraph boxes. In this way, the text paragraph boxes are contracted inwards, the distance between adjacent text paragraph boxes is increased, and different text paragraph boxes can be segmented more easily.
And step 2, training the initial text image processing model by utilizing each text image included in the training data set and the scaled and adjusted reference paragraph frame mark graph corresponding to each text image to obtain a trained text image processing model (target text image processing model).
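The scaling distance of step 1 above can be sketched as follows (illustrative values; the 100×20 rectangular box and r = 0.9 are assumptions of this example):

```python
# Sketch of the shrink distance D = A * (1 - r^2) / L used to contract a
# sample paragraph box inwards before training.

def shrink_distance(area, perimeter, r):
    return area * (1.0 - r * r) / perimeter

# A 100 x 20 rectangular paragraph box with scaling ratio r = 0.9:
A, L = 100 * 20, 2 * (100 + 20)
D = shrink_distance(A, L, 0.9)
# D = 2000 * 0.19 / 240, about 1.58 pixels of inward contraction;
# a larger r gives a smaller D, i.e. a milder contraction.
```

Offsetting each polygon edge inwards by D (for example with a polygon-clipping library) then yields the shrunken box used as the training label.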
It can be understood that, by comparing each scaled sample paragraph box with the corresponding sample paragraph box in the reference paragraph box label map, the difference value corresponding to each scaled sample paragraph box is obtained, from which the loss functions of the corresponding intermediate maps in training (the paragraph probability map, paragraph boundary map, style segmentation map and paragraph binary image) are derived.
The loss function Ls of the paragraph probability map and the loss function Lb of the paragraph binary image are:
Ls = Lb = Σ_{i∈Sl} [ y_i·log(x_i) + (1 − y_i)·log(1 − x_i) ]
The loss function Lt of the paragraph boundary map is:
Lt = Σ_{i∈Rd} | y_i* − x_i* |
The loss function Lg of the style segmentation map is:
Lg = Σ_{i∈Rd} | y_i* − x_i* |
where Sl denotes all pixels inside a sample paragraph box, Rd denotes all pixels inside the corresponding scaled sample box, x_i denotes the predicted value at pixel i, y_i denotes the ground-truth value at pixel i, and i ranges over the pixels.
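These per-map losses can be sketched on toy pixel sets as follows (illustrative values only; note that the cross-entropy formula is printed without an explicit minus sign, so the sum is non-positive and in practice its negative is minimized):

```python
import math

# Sketch of the per-map losses: a binary cross-entropy sum over pixels of a
# sample paragraph box (Ls, Lb) and an L1 sum over pixels of the scaled box
# (Lt, Lg). The pixel values below are toy assumptions.

def bce_sum(y_true, x_pred):
    return sum(y * math.log(x) + (1 - y) * math.log(1 - x)
               for y, x in zip(y_true, x_pred))

def l1_sum(y_true, x_pred):
    return sum(abs(y - x) for y, x in zip(y_true, x_pred))

Ls = Lb = bce_sum([1, 0, 1], [0.9, 0.2, 0.8])   # paragraph probability / binary
Lt = l1_sum([1.0, 0.0], [0.7, 0.1])             # paragraph boundary map
Lg = l1_sum([1.0, 1.0], [0.6, 0.9])             # style segmentation map
# Lt == 0.4, Lg == 0.5; Ls is a (negative) log-likelihood.
```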
In one embodiment, the test loss value (i.e., the test loss function) is determined based on weight parameters and the loss values respectively corresponding to the paragraph binary image, the paragraph probability map, the paragraph boundary map and the style segmentation map obtained by the text image processing model during testing, where the weight parameter corresponding to the paragraph boundary map is determined based on the scaling ratio. The test loss value L is:
L = Ls + α·Lb + β·Lt + γ·Lg
where Ls is the loss function of the paragraph probability map, Lb is the loss function of the paragraph binary image, Lt is the loss function of the paragraph boundary map, Lg is the loss function of the style segmentation map, and α, β and γ are the weight values of the paragraph binary image, the paragraph boundary map and the style segmentation map respectively.
To compensate for the influence of changes in the scaling ratio r (for example, increasing r reduces the distance between adjacent text paragraph boxes and increases the difficulty of paragraph segmentation), the weight value β of the paragraph boundary map can be adjusted accordingly so that β is directly proportional to r, i.e., inversely proportional to the distance between adjacent text paragraph boxes; thus, when r increases, increasing the weight of the paragraph boundary map enables the text image processing model to segment text paragraphs better.
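The combined test loss and the proportionality between r and β can be sketched as follows (the base weights and per-map loss values are illustrative assumptions, not values from the embodiments):

```python
# Sketch of the combined test loss L = Ls + a*Lb + b*Lt + c*Lg, with the
# boundary-map weight made proportional to the scaling ratio r.

def total_loss(Ls, Lb, Lt, Lg, r, alpha=1.0, beta_base=10.0, gamma=1.0):
    beta = beta_base * r          # beta grows with r to offset harder splits
    return Ls + alpha * Lb + beta * Lt + gamma * Lg

small_r = total_loss(0.5, 0.5, 0.2, 0.3, r=0.4)
large_r = total_loss(0.5, 0.5, 0.2, 0.3, r=0.9)
# With identical per-map losses, a larger r puts more weight on the
# paragraph boundary term, so large_r > small_r.
```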
S703, calling the paragraph segmentation network of the target text image processing model, and determining the paragraph box label map of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map.
Illustratively, in determining the paragraph box label map, the text image processing model in the embodiments of the present application performs a series of optimizations on the text paragraph box, from 4-point coordinates (rectangles), to convex hull coordinates, to fine contour coordinates. Useless background information in the text paragraph box is gradually reduced, and the paragraph outline indicated by the box becomes finer and more accurate, so that when the text paragraphs in the text image are processed, the recognition area is reduced and processing efficiency is improved. For example, as shown in fig. 9, the text paragraph box of the left diagram uses 4-point coordinates (rectangles), the text paragraph box of the middle diagram uses convex hull coordinates, and the text paragraph box of the right diagram uses fine contour coordinates. As can be seen from fig. 9, the text paragraph boxes using fine contour coordinates contain hardly any useless background information.
Illustratively, the text image processing model para-DBNet in the embodiments of the present application detects text paragraphs better than a text image processing model based on the differentiable binarization network (Differentiable Binarization Net, DBNet). For example, as shown in fig. 10, the upper diagram is the text image processing model para-DBNet in an embodiment of the present application, and the lower diagram is the DBNet-based text image processing model. In model structure, compared with DBNet, para-DBNet adds a style segmentation map, which indicates the boundary positions between text paragraphs of different styles in the text image to be processed. In post-processing logic, compared with DBNet, para-DBNet adds an erosion processing operation and a polygon approximation processing operation, where the erosion processing yields smoother text paragraph box boundaries and the polygon approximation yields text paragraph boxes that fit the text more closely. In addition, the contour characteristics of the resulting text paragraph boxes are optimized: compared with the 4-point-coordinate (rectangular) text paragraph boxes obtained by DBNet, para-DBNet obtains text paragraph boxes with finer contour coordinates, that is, the obtained paragraph boxes closely fit the boundary positions of the corresponding text paragraphs, so useless background information is effectively reduced, the recognition area is reduced when text paragraphs in a text image are processed, and processing efficiency is improved. The method is also applicable to various complex practical application scenarios (such as forming complex polygonal text paragraph boxes). Meanwhile, the scaling ratio r, the weight value β of the paragraph boundary map, and the approximation parameters of the approximation function in the polygon approximation processing are improved during training: the value of the scaling ratio r can be increased, which reduces the distance between adjacent text paragraph boxes and increases the difficulty of paragraph segmentation, and to compensate for this the weight value β of the paragraph boundary map loss can be increased so that para-DBNet segments text paragraphs better; the approximation parameters of the approximation function can be reduced according to the actual situation, so that the text paragraph boxes obtained by polygon approximation better match the actual outlines of the text paragraphs. That is, relative to DBNet, para-DBNet optimizes the original model's scaling ratio, loss weights, text paragraph contour characteristics, post-processing logic and model structure. Table 1 below compares the text paragraph box detection effect of para-DBNet and the DBNet-based model on a custom data set. As can be seen from Table 1, the DBNet-based model achieves a paragraph box detection effect (F1) of 0.7321, a precision of 0.7721 and a recall of 0.6961, while the text image processing model para-DBNet in the embodiments of the present application achieves an F1 of 0.8372, a precision of 0.8817 and a recall of 0.7970. On the custom data set, para-DBNet outperforms the DBNet-based text image processing model on these text paragraph box detection metrics.
TABLE 1
Model                Precision   Recall   F1
DBNet-based model    0.7721      0.6961   0.7321
para-DBNet           0.8817      0.7970   0.8372
For example, the text detection algorithm based on DBNet in the embodiments of the present application may be replaced by other algorithms. For example, the effect of text detection by algorithm a or modified algorithm B is better than that of the DBNet-based text detection algorithm, and algorithm a or algorithm B may be used instead of the DBNet-based text detection algorithm.
In the embodiments of the present application, the initial text image processing model is trained using the training data set, and the corresponding parameters are adjusted during training to obtain the target text image processing model. The text image to be processed is then processed by the trained text image processing model to obtain a paragraph box label map. In the paragraph box label map obtained using the target text image processing model, each target paragraph box contains less useless background information than paragraph box label maps obtained in other ways, and adjacent target paragraph boxes contain text paragraphs of different styles.
Referring to fig. 11, a schematic structural diagram of a text image processing apparatus according to an embodiment of the present application is shown. The text image processing apparatus includes:
an extracting unit 1101, configured to perform feature extraction on a text image to be processed, so as to obtain image features of the text image to be processed;
A processing unit 1102, configured to determine a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image features; the style segmentation map is used for indicating the boundary positions between text paragraphs with different styles in the text image to be processed;
the processing unit 1102 is further configured to determine a paragraph box label graph of the text image to be processed based on the paragraph probability graph, the paragraph boundary graph and the style segmentation graph; the text paragraphs in each target paragraph box included in the paragraph box mark graph are different, and the target paragraph box is close to each boundary position of the corresponding text paragraph.
In an embodiment, the processing unit 1102 is configured to, when determining a paragraph box label map of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map, specifically: determining a paragraph binary image of the text image to be processed based on the paragraph probability image, the paragraph boundary image and the style segmentation image; and carrying out paragraph frame marking processing on the paragraph binary image, and determining a paragraph frame marking image of the text image to be processed.
In an embodiment, the processing unit 1102 is configured to, when performing paragraph box labeling processing on the paragraph binary image to determine the paragraph box label map of the text image to be processed, specifically: determine an initial text paragraph box based on the region boundary of a target region in the paragraph binary image, where the target region is any one of the regions formed by pixels whose values equal a set value in the paragraph binary image; perform erosion processing on the initial text paragraph box to obtain an eroded text paragraph box; perform polygon approximation on the eroded text paragraph box to obtain a target text paragraph box; and determine the paragraph box label map of the text image to be processed based on each target text paragraph box.
In an embodiment, the processing unit 1102 is specifically configured to, when determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image feature: invoking a prediction network of a target text image processing model to process the image characteristics, and determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed; the prediction network comprises a first sub-network, a second sub-network and a third sub-network which are connected in parallel, wherein the input ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the output end of the feature extraction network of the target text image processing model, the output ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the input end of the paragraph segmentation network of the target text image processing model, the first sub-network is used for determining the paragraph probability map, the second sub-network is used for determining the paragraph boundary map, and the third sub-network is used for determining the style segmentation map.
Optionally, the processing unit 1102 is further configured to: acquiring a training data set, wherein the training data set comprises a plurality of groups of training sample pairs, each group of training sample pairs comprises a sample text image and a sample paragraph frame mark graph corresponding to the sample text image, and each sample paragraph frame in the sample paragraph frame mark graph is close to each boundary position of a corresponding text paragraph; training the initial text image processing model by using the training data set to obtain a trained text image processing model; the test loss value of the trained text image processing model is smaller than the set loss value; and determining the trained text image processing model as the target text image processing model.
Optionally, the processing unit 1102 is further configured to: for any sample paragraph frame in a reference paragraph frame marker graph, acquiring the area and the area perimeter of the area of the any sample paragraph frame; the reference paragraph box label graph is any sample paragraph box label graph in the training data set; determining a scaling distance based on the scaling ratio, the area and the area perimeter; performing scaling adjustment on any sample paragraph frame based on the scaling distance, and obtaining a scaled reference paragraph frame mark graph based on each sample paragraph frame after scaling adjustment; the training of the initial text image processing model by using the training data set to obtain a trained text image processing model comprises the following steps: and training the initial text image processing model by utilizing each text sample image included in the training data set and the scaled and adjusted reference paragraph frame mark graph corresponding to each text sample image to obtain a trained text image processing model.
In an embodiment, the test loss value is determined based on a weight parameter and a test loss value respectively corresponding to a paragraph binary image, a paragraph probability image, a paragraph boundary image and a style segmentation image, which are obtained by processing a text image processing model in a test process; and the weight parameter corresponding to the paragraph boundary graph is determined based on the scaling ratio.
It may be understood that the functions of each functional unit of the text image processing apparatus in the embodiments of the present application may be specifically implemented according to the method in the embodiments of the method, and the specific implementation process may refer to the relevant description of the embodiments of the method, which is not repeated herein.
Referring to fig. 12, fig. 12 is a schematic structural diagram of a computer device according to an embodiment of the present application. The computer device described in the embodiment of the present application includes: a processor 1201, a communication interface 1202 and a memory 1203. The processor 1201, the communication interface 1202 and the memory 1203 may be connected by a bus or in other ways; in this embodiment, a bus connection is taken as an example. The processor 1201 (or CPU (Central Processing Unit)) is the computing core and control core of the computer device, which can parse various instructions in the computer device and process various data of the computer device. For example, the CPU can parse a power-on/off instruction sent by a user to the computer device and control the computer device to perform the power-on/off operation; as another example, the CPU can transmit various types of interaction data between internal structures of the computer device, and so on. The communication interface 1202 may optionally include a standard wired interface or a wireless interface (e.g., Wi-Fi or a mobile communication interface), and is controlled by the processor 1201 to transmit and receive data; the communication interface 1202 may also be used for internal communication of the computer device. The memory 1203 is a storage device in the computer device, used for storing programs and data. It will be appreciated that the memory 1203 here may include both built-in memory of the computer device and extended memory supported by the computer device. The memory 1203 provides storage space that stores the operating system of the computer device, which may include, but is not limited to: an Android system, an iOS system, a Windows Phone system, etc.; these are not limiting in this application.
In the present embodiment, the processor 1201 performs the following operations by executing executable program code in the memory 1203:
extracting features of a text image to be processed to obtain image features of the text image to be processed;
determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image features; the style segmentation map is used for indicating the boundary positions between text paragraphs with different styles in the text image to be processed;
determining a paragraph box mark graph of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map; each target paragraph box included in the paragraph box mark graph contains a different text paragraph, and each target paragraph box is close to each boundary position of its corresponding text paragraph.
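The description above does not spell out how the three predicted maps are fused before paragraph boxes are extracted. The following is a minimal sketch of one plausible fusion, under the assumption that each map is a per-pixel score in [0, 1] and that a fixed threshold separates the classes; the function name and threshold are hypothetical, not taken from the patent.

```python
import numpy as np

def paragraph_binary_map(prob_map, boundary_map, style_map, thresh=0.5):
    """Hypothetical fusion of the three predicted maps into one paragraph
    binary image: a pixel is kept as paragraph foreground when its
    paragraph probability is high, it does not lie on a boundary between
    adjacent paragraphs, and it does not lie on a style-change boundary."""
    inside = prob_map > thresh            # likely paragraph pixels
    on_boundary = boundary_map > thresh   # pixels between adjacent paragraphs
    on_style_edge = style_map > thresh    # pixels between differently styled paragraphs
    return (inside & ~on_boundary & ~on_style_edge).astype(np.uint8)

# Toy 1 x 6 "image": two paragraphs separated by one boundary column.
prob = np.array([[0.9, 0.9, 0.8, 0.8, 0.9, 0.9]])
boundary = np.array([[0.0, 0.0, 0.9, 0.0, 0.0, 0.0]])
style = np.zeros((1, 6))
print(paragraph_binary_map(prob, boundary, style).tolist())  # [[1, 1, 0, 1, 1, 1]]
```

Subtracting the boundary pixels is what keeps two vertically adjacent paragraphs from merging into a single connected region in the binary image.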
In an embodiment, the processor 1201, when configured to determine a paragraph box mark graph of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map, is specifically configured to: determine a paragraph binary image of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map; and perform paragraph box marking processing on the paragraph binary image to determine the paragraph box mark graph of the text image to be processed. In an embodiment, the processor 1201, when configured to perform the paragraph box marking processing on the paragraph binary image and determine the paragraph box mark graph of the text image to be processed, is specifically configured to: determine an initial text paragraph box based on the region boundary of a target region in the paragraph binary image, wherein the target region is any one of the regions formed by pixel points whose pixel values are a set value in the paragraph binary image; perform erosion processing on the initial text paragraph box to obtain an eroded text paragraph box; perform polygon approximation on the eroded text paragraph box to obtain a target text paragraph box; and determine the paragraph box mark graph of the text image to be processed based on each target text paragraph box.
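The erosion step above is standard morphological erosion. A plain-NumPy sketch of it is shown below; in practice a library routine (e.g., OpenCV's `cv2.erode`, and `cv2.approxPolyDP` for the subsequent Douglas–Peucker polygon approximation) would typically be used, but the patent does not name a specific library, so this illustration makes no such claim.

```python
import numpy as np

def binary_erode(mask, k=3):
    """Morphological erosion with a k x k all-ones structuring element,
    in plain NumPy: a pixel stays 1 only if its entire k x k
    neighbourhood is 1 (the image is zero-padded at the border)."""
    pad = k // 2
    padded = np.pad(mask, pad, mode="constant", constant_values=0)
    out = np.ones_like(mask)
    for dy in range(k):
        for dx in range(k):
            out &= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return out

region = np.ones((5, 5), dtype=np.int64)  # a solid 5 x 5 paragraph region
eroded = binary_erode(region)
print(int(eroded.sum()))  # 9: only the inner 3 x 3 block survives
```

Eroding the region before polygon approximation shrinks the box slightly inward, which helps keep the final target paragraph box from touching or overlapping neighbouring paragraphs.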
In an embodiment, the processor 1201, when configured to determine a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image features, is specifically configured to: invoke a prediction network of a target text image processing model to process the image features, and determine the paragraph probability map, the paragraph boundary map and the style segmentation map of the text image to be processed; the prediction network comprises a first sub-network, a second sub-network and a third sub-network which are connected in parallel, wherein the input ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the output end of the feature extraction network of the target text image processing model, the output ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the input end of the paragraph segmentation network of the target text image processing model, the first sub-network is used for determining the paragraph probability map, the second sub-network is used for determining the paragraph boundary map, and the third sub-network is used for determining the style segmentation map.
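The parallel-head arrangement can be illustrated schematically: one shared feature tensor feeds three independent heads, each producing one map. The sketch below stands in for the real sub-networks with a single per-pixel linear map plus sigmoid; all weights and shapes here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def head(features, weights):
    """Stand-in for one prediction sub-network: a 1x1 convolution
    (per-pixel linear map over the channel axis) followed by a sigmoid,
    producing a single H x W score map."""
    logits = np.tensordot(features, weights, axes=([0], [0]))
    return 1.0 / (1.0 + np.exp(-logits))

# Shared features from the (hypothetical) feature extraction network: C x H x W.
features = rng.standard_normal((8, 4, 4))

# Three parallel heads consume the SAME features, as described above.
w_prob, w_boundary, w_style = (rng.standard_normal(8) for _ in range(3))
prob_map = head(features, w_prob)          # paragraph probability map
boundary_map = head(features, w_boundary)  # paragraph boundary map
style_map = head(features, w_style)        # style segmentation map
print(prob_map.shape, boundary_map.shape, style_map.shape)  # each is (4, 4)
```

Because the heads share one backbone, the three maps are spatially aligned pixel-for-pixel, which is what allows them to be combined directly into the paragraph binary image downstream.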
In an embodiment, the processor 1201 is further configured to: acquire a training data set, wherein the training data set comprises a plurality of groups of training sample pairs, each group of training sample pairs comprises a sample text image and a sample paragraph box mark graph corresponding to the sample text image, and each sample paragraph box in the sample paragraph box mark graph is close to each boundary position of a corresponding text paragraph; train the initial text image processing model by using the training data set to obtain a trained text image processing model, wherein the test loss value of the trained text image processing model is smaller than a set loss value; and determine the trained text image processing model as the target text image processing model.
In an embodiment, the processor 1201 is further configured to: for any sample paragraph box in a reference paragraph box mark graph, acquire the region area and the region perimeter of that sample paragraph box, the reference paragraph box mark graph being any sample paragraph box mark graph in the training data set; determine a scaling distance based on a scaling ratio, the region area and the region perimeter; perform scaling adjustment on that sample paragraph box based on the scaling distance, and obtain a scaled reference paragraph box mark graph based on each scaled sample paragraph box. The training of the initial text image processing model by using the training data set to obtain a trained text image processing model includes: training the initial text image processing model by using each sample text image included in the training data set and the scaled reference paragraph box mark graph corresponding to each sample text image, to obtain the trained text image processing model.
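The patent does not give the exact formula that maps the scaling ratio, region area and region perimeter to a scaling distance. A common choice in segmentation-based text detectors (for example, the label-shrinking rule used by DB-style methods) is D = A(1 − r²)/L; the sketch below uses that formula as an assumption, not as the patent's definitive rule.

```python
def scaling_distance(area, perimeter, ratio):
    """Hypothetical scaling (shrink) distance, following the common
    segmentation-detector formula
        D = area * (1 - ratio**2) / perimeter
    so that larger boxes are shrunk by a larger absolute offset while
    the relative shrink stays comparable across box sizes."""
    return area * (1.0 - ratio ** 2) / perimeter

# A 100 x 40 sample paragraph box with shrink ratio 0.4:
d = scaling_distance(area=100 * 40, perimeter=2 * (100 + 40), ratio=0.4)
print(round(d, 2))  # 12.0
```

Offsetting each sample paragraph box inward by this distance yields the scaled label boxes used for training, which keeps adjacent ground-truth paragraphs separable in the supervision signal.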
In an embodiment, the test loss value is determined based on weight parameters and sub-loss values respectively corresponding to the paragraph binary image, the paragraph probability map, the paragraph boundary map and the style segmentation map obtained by the text image processing model during testing; and the weight parameter corresponding to the paragraph boundary map is determined based on the scaling ratio.
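The combined loss described above is a weighted sum of the four per-map sub-losses. The sketch below shows that combination; the specific weight values, and the choice of 1/ratio for the boundary-map weight, are illustrative assumptions (the text only states that this weight depends on the scaling ratio).

```python
def total_test_loss(sub_losses, weights):
    """Hypothetical weighted combination of the per-map sub-losses
    (paragraph binary image, probability map, boundary map, style
    segmentation map) into a single test loss value."""
    assert sub_losses.keys() == weights.keys()
    return sum(weights[k] * sub_losses[k] for k in sub_losses)

ratio = 0.4
# Assumed weights; only the boundary weight's dependence on the
# scaling ratio is suggested by the text.
weights = {"binary": 1.0, "prob": 1.0, "boundary": 1.0 / ratio, "style": 1.0}
sub_losses = {"binary": 0.2, "prob": 0.1, "boundary": 0.08, "style": 0.05}
print(round(total_test_loss(sub_losses, weights), 3))  # 0.55
```

Up-weighting the boundary sub-loss compensates for boundary pixels being far scarcer than paragraph-interior pixels, so the model is not dominated by the easy interior predictions.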
In a specific implementation, the processor 1201, the communication interface 1202 and the memory 1203 described in the embodiments of the present application may perform the text image processing method provided in the embodiments of the present application, and may also perform the implementations described for the text image processing apparatus provided in the embodiments of the present application, which are not described herein again.
The embodiments of the present application also provide a computer-readable storage medium having instructions stored therein, which when run on a computer, cause the computer to perform the text image processing method according to the embodiments of the present application.
Embodiments of the present application also provide a computer program product comprising instructions which, when run on a computer, cause the computer to perform a text image processing method as described in embodiments of the present application.
It should be noted that the specific embodiments of the present application involve data such as the text image to be processed, the text image processing model and its training data set. When the above embodiments of the present application are applied to specific products or technologies, permission or consent of the related parties is required, and the collection, use and processing of the related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.
It should be noted that, for simplicity of description, the foregoing method embodiments are all expressed as a series of combined actions, but those skilled in the art should understand that the present application is not limited by the described order of actions, as some steps may be performed in another order or simultaneously according to the present application. Further, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present application.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware. The program may be stored in a computer-readable storage medium, and the storage medium may include: a flash disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk, and the like.
The foregoing disclosure is only illustrative of some embodiments of the present application and is of course not to be construed as limiting the scope of the appended claims; therefore, all equivalent changes that fall within the meaning and scope of the claims are intended to be embraced therein.

Claims (9)

1. A text image processing method, the method comprising:
extracting features of a text image to be processed to obtain image features of the text image to be processed;
determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image features; the style segmentation map is used for indicating the boundary positions between text paragraphs with different styles in the text image to be processed;
determining a paragraph binary image of the text image to be processed based on the paragraph probability image, the paragraph boundary image and the style segmentation image;
determining an initial text paragraph box based on the region boundary of a target region in the paragraph binary image; wherein the target region is any one of the regions formed by pixel points whose pixel values are a set value in the paragraph binary image;
performing erosion processing on the initial text paragraph box to obtain an eroded text paragraph box;
performing polygon approximation on the eroded text paragraph box to obtain a target text paragraph box;
determining a paragraph box mark graph of the text image to be processed based on each target text paragraph box; wherein each target paragraph box included in the paragraph box mark graph contains a different text paragraph, and each target paragraph box is close to each boundary position of its corresponding text paragraph.
2. The method of claim 1, wherein the determining a paragraph probability map, a paragraph boundary map, and a style segmentation map of the text image to be processed based on the image features comprises:
invoking a prediction network of a target text image processing model to process the image characteristics, and determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed;
the prediction network comprises a first sub-network, a second sub-network and a third sub-network which are connected in parallel, wherein the input ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the output end of the feature extraction network of the target text image processing model, the output ends of the first sub-network, the second sub-network and the third sub-network are respectively connected with the input end of the paragraph segmentation network of the target text image processing model, the first sub-network is used for determining the paragraph probability map, the second sub-network is used for determining the paragraph boundary map, and the third sub-network is used for determining the style segmentation map.
3. The method of claim 2, wherein the method further comprises:
acquiring a training data set, wherein the training data set comprises a plurality of groups of training sample pairs, each group of training sample pairs comprises a sample text image and a sample paragraph box mark graph corresponding to the sample text image, and each sample paragraph box in the sample paragraph box mark graph is close to each boundary position of a corresponding text paragraph;
training the initial text image processing model by using the training data set to obtain a trained text image processing model; the test loss value of the trained text image processing model is smaller than the set loss value;
and determining the trained text image processing model as the target text image processing model.
4. A method as claimed in claim 3, wherein the method further comprises:
for any sample paragraph box in a reference paragraph box mark graph, acquiring the region area and the region perimeter of that sample paragraph box; the reference paragraph box mark graph being any sample paragraph box mark graph in the training data set;
determining a scaling distance based on a scaling ratio, the region area and the region perimeter;
performing scaling adjustment on that sample paragraph box based on the scaling distance, and obtaining a scaled reference paragraph box mark graph based on each scaled sample paragraph box;
wherein the training of the initial text image processing model by using the training data set to obtain a trained text image processing model comprises:
training the initial text image processing model by using each sample text image included in the training data set and the scaled reference paragraph box mark graph corresponding to each sample text image, to obtain the trained text image processing model.
5. The method of claim 4, wherein the test loss value is determined based on weight parameters and sub-loss values respectively corresponding to the paragraph binary image, the paragraph probability map, the paragraph boundary map and the style segmentation map obtained by the text image processing model during testing; and the weight parameter corresponding to the paragraph boundary map is determined based on the scaling ratio.
6. A text image processing apparatus, characterized in that the apparatus comprises:
the extraction unit is used for extracting the characteristics of the text image to be processed to obtain the image characteristics of the text image to be processed;
the processing unit is used for determining a paragraph probability map, a paragraph boundary map and a style segmentation map of the text image to be processed based on the image characteristics; the style segmentation map is used for indicating the boundary positions between text paragraphs with different styles in the text image to be processed;
The processing unit is further configured to: determine a paragraph binary image of the text image to be processed based on the paragraph probability map, the paragraph boundary map and the style segmentation map; determine an initial text paragraph box based on the region boundary of a target region in the paragraph binary image, wherein the target region is any one of the regions formed by pixel points whose pixel values are a set value in the paragraph binary image; perform erosion processing on the initial text paragraph box to obtain an eroded text paragraph box; perform polygon approximation on the eroded text paragraph box to obtain a target text paragraph box; and determine a paragraph box mark graph of the text image to be processed based on each target text paragraph box, wherein each target paragraph box included in the paragraph box mark graph contains a different text paragraph, and each target paragraph box is close to each boundary position of its corresponding text paragraph.
7. A computer device, comprising: a processor, a communication interface and a memory, the processor, the communication interface and the memory being interconnected, wherein the memory stores executable program code, and the processor is configured to invoke the executable program code to perform the text image processing method according to any one of claims 1 to 5.
8. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program which, when run on a computer, causes the computer to implement the text image processing method according to any one of claims 1 to 5.
9. A computer program product, characterized in that the computer program product comprises a computer program or computer instructions which, when executed by a processor, implements the text image processing method according to any of claims 1-5.
CN202210056559.0A 2022-01-18 2022-01-18 Text image processing method, apparatus, device, storage medium, and program product Active CN114399782B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210056559.0A CN114399782B (en) 2022-01-18 2022-01-18 Text image processing method, apparatus, device, storage medium, and program product


Publications (2)

Publication Number Publication Date
CN114399782A CN114399782A (en) 2022-04-26
CN114399782B true CN114399782B (en) 2024-03-22

Family

ID=81230244

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210056559.0A Active CN114399782B (en) 2022-01-18 2022-01-18 Text image processing method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN114399782B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222368A (en) * 2018-11-26 2020-06-02 北京金山办公软件股份有限公司 Method and device for identifying document paragraph and electronic equipment
CN111274239A (en) * 2019-12-30 2020-06-12 安徽知学科技有限公司 Test paper structuralization processing method, device and equipment
CN112016551A (en) * 2020-10-23 2020-12-01 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and computer storage medium
CN112990203A (en) * 2021-05-11 2021-06-18 北京世纪好未来教育科技有限公司 Target detection method and device, electronic equipment and storage medium
WO2021218322A1 (en) * 2020-04-30 2021-11-04 深圳壹账通智能科技有限公司 Paragraph search method and apparatus, and electronic device and storage medium



Similar Documents

Publication Publication Date Title
CN111027563A (en) Text detection method, device and recognition system
Wilkinson et al. Neural Ctrl-F: segmentation-free query-by-string word spotting in handwritten manuscript collections
CA3139085A1 (en) Representative document hierarchy generation
CN114596566B (en) Text recognition method and related device
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
US11544495B2 (en) Attributionally robust training for weakly supervised localization and segmentation
CN113762269B (en) Chinese character OCR recognition method, system and medium based on neural network
CN111488732B (en) Method, system and related equipment for detecting deformed keywords
CN114638960A (en) Model training method, image description generation method and device, equipment and medium
CN111507330A (en) Exercise recognition method and device, electronic equipment and storage medium
CN112990175B (en) Method, device, computer equipment and storage medium for recognizing handwritten Chinese characters
CN115393872B (en) Method, device and equipment for training text classification model and storage medium
CN113033269B (en) Data processing method and device
WO2024041032A1 (en) Method and device for generating editable document based on non-editable graphics-text image
CN112269872A (en) Resume analysis method and device, electronic equipment and computer storage medium
CN113780276A (en) Text detection and identification method and system combined with text classification
CN111881900B (en) Corpus generation method, corpus translation model training method, corpus translation model translation method, corpus translation device, corpus translation equipment and corpus translation medium
CN116361502B (en) Image retrieval method, device, computer equipment and storage medium
KR20210116371A (en) Image processing method, device, electronic equipment, computer readable storage medium and computer program
KR102043693B1 (en) Machine learning based document management system
CN114399782B (en) Text image processing method, apparatus, device, storage medium, and program product
CN114579796B (en) Machine reading understanding method and device
CN112395834B (en) Brain graph generation method, device and equipment based on picture input and storage medium
CN115034177A (en) Presentation file conversion method, device, equipment and storage medium
CN114818627A (en) Form information extraction method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant