CN118334634A - Image text detection method, system, equipment and storage medium - Google Patents
- Publication number: CN118334634A (application CN202410085002.9A)
- Authority: CN (China)
- Prior art keywords: contour, text, module, iteration, feature map
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses an image text detection method, system, device, and storage medium, in which progressive learning is introduced to achieve more accurate and efficient text detection. By combining contour transformation with progressive learning, the method can adaptively handle oblique, curved, and irregularly shaped text. The addition of progressive learning also allows the method to self-optimize and adjust when processing different scenes and text types, greatly reducing the need for manual intervention and improving the stability and reliability of detection. Overall, the invention provides a more universal, efficient, and adaptive image text detection solution for users and researchers.
Description
Technical Field
The present invention relates to the field of image text detection technologies, and in particular, to an image text detection method, system, device, and storage medium.
Background
With the rapid development of information technology, research in image processing and computer vision has steadily deepened. In text detection and recognition in particular, these techniques have been widely applied in a variety of contexts, such as invoice verification, autonomous driving, and unmanned aerial vehicle inspection. Although much research has been devoted to the problem, many challenges remain in practical use.
Traditional text detection methods are mainly based on techniques such as image segmentation, edge detection, and pattern matching. Although these methods can achieve good results in some controlled scenes, their performance tends to be unsatisfactory under complex backgrounds, low resolution, or low contrast between text and background colors. Furthermore, these conventional methods typically require a large amount of manual parameter adjustment to accommodate different scenarios and applications.
In recent years, with the rise of deep learning, many neural-network-based text detection methods have been proposed. These methods can generally learn to extract image features automatically and exhibit good generalization across various complex scenarios. However, even these advanced methods often have difficulty handling text boundaries, details, and shapes. For example, many methods struggle to detect text accurately when it is similar in color to the background or unevenly distributed. Likewise, existing methods often perform poorly on the fine contours of text, which is unacceptable in some high-precision applications. Specifically, current solutions mainly suffer from the following problems:
1) Traditional image text detection methods were designed primarily to handle horizontal, regular text. Such designs often make them inadequate for detecting oblique, curved, or irregularly shaped text. In complex scenes in particular, such as book pages with handwritten annotations, document images of historical records, or billboards in urban streets, the detection capability of these methods is often greatly limited. In practical applications, it is therefore difficult for these techniques to meet the requirements of diversified and dynamic text detection.
2) Most of the prior art processes complex image scenes, such as nested, overlapping or background noisy scenes, with relatively large computational effort, resulting in inefficient processing. Particularly, in the occasions needing real-time processing, such as live caption extraction, street view navigation and the like, the methods often have difficulty in meeting the requirement of real-time performance. Moreover, these methods often suffer from performance bottlenecks in the face of large-scale, high-resolution, or large amounts of dynamically changing image data, severely impacting user experience and application versatility.
3) Many conventional approaches lack efficient adaptive learning capabilities. In practice, operators often spend a great deal of time making manual parameter adjustments and model fine-tuning whenever facing new text types or different scenes. This operation not only increases the technological use threshold, but also greatly reduces the stability and reliability of the method. For example, when the detection system is deployed in a new environment, the conventional technology is likely to be unable to be directly applied due to the differences in language, culture and writing style, and a great deal of pre-adjustment work is required.
Disclosure of Invention
The invention aims to provide an image character detection method, an image character detection system, an image character detection device and a storage medium, which can more accurately and efficiently detect character information in an image, especially in the situations of low contrast between characters and a background or uneven character distribution.
The invention aims at realizing the following technical scheme:
an image text detection method, comprising:
step 1, extracting features of an original image to obtain a multi-scale feature map;
step 2, performing preliminary detection on the text region by utilizing the multi-scale feature map to obtain an initial contour of the text region;
step 3, using the multi-scale feature map and the initial contour of the text region, continuously iterating and optimizing the contour shape with a progressive learning mechanism, wherein each iteration refines and adjusts the contour on the basis of the previous iteration, finally obtaining a contour that covers each piece of text.
An image text detection system comprising:
the feature extraction module is used for extracting features of the original image to obtain a multi-scale feature map;
The outline initialization module is used for carrying out preliminary detection on the text area by utilizing the multi-scale feature map to obtain an initial outline of the text area;
The progressive contour optimization module is used for utilizing the multi-scale feature map and the initial contour of the text region, using a progressive learning mechanism to continuously iterate and optimize the contour shape, refining and adjusting the contour on the basis of the previous iteration every time, and finally iterating to obtain the contour capable of covering each text.
A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium storing a computer program which, when executed by a processor, implements the method described above.
According to the technical scheme provided by the invention, progressive learning is introduced to achieve more accurate and efficient text detection. By combining contour transformation with progressive learning, the invention can adaptively handle oblique, curved, and irregularly shaped text. The addition of progressive learning also allows the method to self-optimize and adjust when processing different scenes and text types, greatly reducing the need for manual intervention and improving the stability and reliability of detection. Overall, the invention provides a more universal, efficient, and adaptive image text detection solution for users and researchers.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of an image text detection method according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of a frame of an image text detection method according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an image text detection system according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to fall within the scope of the invention.
The terms that may be used herein will first be described as follows:
The terms "comprises," "comprising," "includes," "including," "has," "having" or other similar referents are to be construed to cover a non-exclusive inclusion. For example: including a particular feature (e.g., a starting material, component, ingredient, carrier, formulation, material, dimension, part, means, mechanism, apparatus, step, procedure, method, reaction condition, processing condition, parameter, algorithm, signal, data, product or article of manufacture, etc.), should be construed as including not only a particular feature but also other features known in the art that are not explicitly recited.
The term "consisting of … …" is meant to exclude any technical feature element not explicitly listed. If such term is used in a claim, the term will cause the claim to be closed, such that it does not include technical features other than those specifically listed, except for conventional impurities associated therewith. If the term is intended to appear in only a clause of a claim, it is intended to limit only the elements explicitly recited in that clause, and the elements recited in other clauses are not excluded from the overall claim.
Aiming at the problems of incomplete detection or unmatched detection results possibly occurring when the existing image and text detection method processes complex background, small characters or text dense areas, the invention aims to provide an image and text detection scheme based on progressive contour transformation. The scheme can more accurately and efficiently detect the text information in the image, especially in the situations of low contrast between the text and the background or uneven text distribution.
The following describes the image text detection scheme provided by the invention in detail. Anything not described in detail in the embodiments of the present invention belongs to the prior art known to those skilled in the art. Any specific conditions not noted in the examples of the present invention follow conditions conventional in the art or suggested by the manufacturer.
Example 1
The embodiment of the invention provides an image text detection method, which mainly comprises the following steps as shown in fig. 1:
and step 1, extracting features of the original image to obtain a multi-scale feature map.
In the embodiment of the invention, the input image can be processed through the stacked convolution layer, the pooling layer and the normalization layer to obtain the multi-scale feature map.
And step 2, performing preliminary detection on the text region by using the multi-scale feature map to obtain an initial contour of the text region.
In the embodiment of the invention, a circumscribed rectangular detection frame of the text region is obtained from the multi-scale feature map using any existing object detector; an octagon is then initialized inside the circumscribed rectangular detection frame, with its vertices at the 1/4 and 3/4 positions of the rectangle's sides, yielding a polygonal contour fitted to the text shape that serves as the initial contour of the text region.
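As a concrete illustration of the octagon initialization above, the following minimal numpy sketch (not taken from the patent; the function name and the clockwise vertex ordering are our own assumptions) places the eight vertices at the 1/4 and 3/4 positions of each side of a circumscribed rectangle:

```python
import numpy as np

def init_octagon(x0, y0, x1, y1):
    """Initialize an octagonal contour inside an axis-aligned box.

    Vertices are placed at the 1/4 and 3/4 positions of each side of
    the rectangle, as described for the contour initialization step.
    Returns an (8, 2) array of (x, y) vertices in clockwise order.
    """
    w, h = x1 - x0, y1 - y0
    return np.array([
        (x0 + 0.25 * w, y0), (x0 + 0.75 * w, y0),  # top side
        (x1, y0 + 0.25 * h), (x1, y0 + 0.75 * h),  # right side
        (x0 + 0.75 * w, y1), (x0 + 0.25 * w, y1),  # bottom side
        (x0, y0 + 0.75 * h), (x0, y0 + 0.25 * h),  # left side
    ], dtype=np.float64)

# a 100x40 detection box yields an octagon hugging the box edges
octagon = init_octagon(0.0, 0.0, 100.0, 40.0)
```

The resulting polygon is a better starting point than the raw rectangle because its vertices can later move independently toward the true text boundary.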
And step 3, using the multi-scale feature map and the initial contour of the text region, continuously iterating and optimizing the contour shape with a progressive learning mechanism, wherein each iteration refines and adjusts the contour on the basis of the previous iteration, finally obtaining a contour that covers each piece of text.
This step is an iterative process; the contour output by the last iteration is the contour covering each piece of text. The k-th iteration proceeds as follows:

1) Feature sampling is performed on the contour C_{k-1} from the (k-1)-th iteration using the multi-scale feature map to obtain the vertex features f_{k-1}, expressed as:

f_{k-1} = Sample(F, C_{k-1})

where Sample(·) is a sampling function based on bilinear interpolation, and C_{k-1} is the initial contour of the text region when k = 1.
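The Sample(·) step can be sketched as plain bilinear interpolation over a feature map. This is a hedged illustration only: the (H, W, C) tensor layout and the clamping at the image border are assumptions not fixed by the description above.

```python
import numpy as np

def bilinear_sample(feature_map, points):
    """Sample a feature map at fractional (x, y) vertex positions.

    feature_map: (H, W, C) array; points: (N, 2) array of (x, y).
    Returns an (N, C) array of interpolated vertex features.
    """
    H, W, _ = feature_map.shape
    x = np.clip(points[:, 0], 0, W - 1)
    y = np.clip(points[:, 1], 0, H - 1)
    x0 = np.floor(x).astype(int); x1 = np.minimum(x0 + 1, W - 1)
    y0 = np.floor(y).astype(int); y1 = np.minimum(y0 + 1, H - 1)
    wx, wy = (x - x0)[:, None], (y - y0)[:, None]
    # blend the four surrounding grid cells by their fractional weights
    top = (1 - wx) * feature_map[y0, x0] + wx * feature_map[y0, x1]
    bot = (1 - wx) * feature_map[y1, x0] + wx * feature_map[y1, x1]
    return (1 - wy) * top + wy * bot

# tiny 2x2 single-channel map whose value equals x + y, so the
# interpolated value at the cell center is exactly 1.0
F = np.array([[[0.0], [1.0]], [[1.0], [2.0]]])
feats = bilinear_sample(F, np.array([[0.5, 0.5]]))
```

Because contour vertices have sub-pixel coordinates, interpolation (rather than rounding) keeps the sampled features consistent with the continuous vertex positions.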
2) The vertex features are aggregated to obtain the aggregate features g_{k-1}, expressed as:

g_{k-1} = CircConv(f_{k-1})

where CircConv(·) is a circular convolution function used for feature aggregation.
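CircConv(·) can be illustrated as a 1-D convolution that wraps around the closed vertex sequence, so the first and last vertices are neighbors. The 3-tap averaging kernel below is a hypothetical stand-in for the learned kernel:

```python
import numpy as np

def circ_conv(vertex_feats, kernel):
    """1-D circular convolution along the contour's vertex dimension.

    Because a contour is closed, the vertex sequence is treated as
    cyclic: each output aggregates a vertex's neighbors with wrap-around.
    vertex_feats: (N, C); kernel: 1-D aggregation weights of odd length.
    """
    r = len(kernel) // 2
    out = np.zeros_like(vertex_feats)
    for offset, w in zip(range(-r, r + 1), kernel):
        # np.roll implements the cyclic neighborhood indexing
        out += w * np.roll(vertex_feats, -offset, axis=0)
    return out

# 3-tap averaging kernel over an 8-vertex, 1-channel contour
feats = np.arange(8, dtype=np.float64).reshape(8, 1)
agg = circ_conv(feats, np.array([1 / 3, 1 / 3, 1 / 3]))
```

With this kernel, each aggregated vertex feature is the mean of itself and its two cyclic neighbors, which is the context-mixing effect the aggregation step relies on.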
3) The displacement vector ΔC_{k-1} of the contour is predicted from the aggregate features g_{k-1}, expressed as:

ΔC_{k-1} = Updater(g_{k-1})

where Updater is a displacement prediction module formed by stacking convolution layers and rectified linear units.
4) The contour C_{k-1} of the (k-1)-th iteration is updated with the displacement vector ΔC_{k-1} to obtain the contour C_k of the k-th iteration, expressed as:

C_k = C_{k-1} + ΔC_{k-1}
In the above scheme provided by the embodiment of the invention, step 2 is realized by the contour initialization module and step 3 by the progressive contour optimization module; the two modules are trained in the following manner.

For the contour initialization module, the initialization loss function L_init is calculated from the distance between the initial contour C_0 output by the module and the actual contour C_gt of the text region, expressed as:

L_init = ||C_0 - C_gt||

where ||·|| denotes the L1 norm.
For the progressive contour optimization module, the distance between the contour of each iteration and the actual contour C_gt is used to calculate the iteration loss function L_evolve, expressed as:

L_evolve = Σ_{i=1}^{K} ||C_i - C_gt||

where C_i is the contour of the i-th iteration and K is the number of iterations.
The total loss function L_total is:

L_total = λ_init·L_init + λ_evolve·L_evolve

where λ_init and λ_evolve are two weight factors.
The contour initialization module and the progressive contour optimization module are then trained with the total loss function.
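The loss terms above can be sketched as follows. The unweighted sum over iterations in L_evolve is an assumption where the description leaves the per-iteration weighting open, and the default weight factors are hypothetical:

```python
import numpy as np

def l1_dist(c_pred, c_gt):
    """L1 distance between corresponding vertices of two contours."""
    return np.abs(c_pred - c_gt).sum()

def total_loss(c0, contours, c_gt, lam_init=1.0, lam_evolve=1.0):
    """Combined loss: initialization term plus per-iteration terms.

    contours: list of the K contours produced by the iterations.
    NOTE: the equal weighting of iterations is an assumption.
    """
    l_init = l1_dist(c0, c_gt)                            # L_init
    l_evolve = sum(l1_dist(c, c_gt) for c in contours)    # L_evolve
    return lam_init * l_init + lam_evolve * l_evolve      # L_total

c_gt = np.zeros((8, 2))
c0 = np.ones((8, 2))            # init term: 8 * 2 * 1.0 = 16
iters = [np.full((8, 2), 0.5)]  # evolve term: 8 * 2 * 0.5 = 8
loss = total_loss(c0, iters, c_gt)
```

Supervising every intermediate contour (not just the last one) is what pushes each iteration toward a meaningful refinement step.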
In the embodiment of the present invention, the step serial numbers are only used to identify different steps, and do not represent the sequence of the steps, and the specific sequence of the steps can be embodied by specific technical content.
The scheme provided by the embodiment of the invention has the advantage that characters in various forms can be comprehensively and accurately detected. Firstly, not only can the traditional transverse characters be effectively detected, but also the characters with oblique, bending and irregular shapes can be accurately detected. This provides a solid technical support for processing modern and complex text image scenes such as billboards, handwriting annotations, document images and the like. Secondly, the invention combines the powerful technology of deep learning, and can perform self-learning and optimization on the detection method. This means that the detection effect of the method will gradually increase with the increase of the use time and the accumulation of the data amount, and can adapt to more scenes and cope with more complicated character forms. This adaptive learning capability is difficult to achieve in many conventional approaches. In addition, high efficiency can be maintained when processing a large amount of image data. The three modules of feature extraction, contour initialization and progressive contour optimization are combined, so that the accuracy of detection is ensured while the calculation redundancy is reduced. This balance is difficult to achieve in many prior art methods. In short, the image text detection method of the invention is an innovative and beneficial solution to the text detection field by virtue of the comprehensive detection capability, self-learning advantage and high-efficiency performance.
In order to more clearly demonstrate the technical scheme and the technical effects provided by the invention, the method provided by the embodiment of the invention is described in detail below by using specific embodiments.
1. The scheme is introduced in whole.
The core idea of the invention is to accurately identify the characters in the image by utilizing advanced algorithm of deep learning and combining progressive contour optimization strategy. As shown in fig. 2, it mainly comprises the following three main modules:
1) And the feature extraction module is used for: extracting features of the original image, and outputting the obtained features in two paths, wherein one path of the features is output to the progressive profile optimizing module, and the other path of the features is output to the profile initializing module.
2) Profile initialization module: in the module, based on the extracted feature map, preliminary text detection is performed, and a preliminary contour including a text region is generated. This process involves not only the detection of text regions, but also the further refinement of the detected rectangular box into a polygonal outline that more closely fits the actual text shape. These polygonal contours are initial contours, which lay the foundation for progressive contour tuning.
3) Progressive profile tuning module: the module, as a core part of the invention, accepts as input the features from the feature extraction module and the initial profile of the profile initialization module. By introducing a progressive learning mechanism, the module continuously iterates and optimizes the contour shape, and each iteration refines and adjusts the contour on the basis of the previous iteration so as to finally obtain the contour prediction capable of accurately covering each character. The iterative learning method can continuously optimize the detection effect, approach the word boundary and remarkably improve the accuracy and the robustness of word detection.
The invention realizes the high-precision detection of the characters in the image through the three mutually-cooperated modules, and particularly in complex scenes which are difficult to deal with by the traditional method, the invention has excellent performance and practical value.
2. The specific scheme is introduced.
1. And the characteristic extraction module.
The input to the feature extraction module is the original image, for example an RGB image I ∈ R^{H×W×3}, where 3 represents the three RGB channels and H and W are the height and width of the image, respectively.
The feature extraction module may be a feature extractor comprising a series of convolution layers (Conv), pooling layers (Pool), and normalization layers (Norm); passing the input image through these layers yields the multi-scale feature map F.
The deep features can help effectively capture text information in an image, strengthen the contrast between text and background and provide powerful feature support for subsequent text detection.
2. And a contour initializing module.
The contour initialization module mainly comprises two parts: a detector, responsible for detecting rectangular frames of the text regions, and a contour refiner, which refines the rectangular frames into polygonal contours closely fitting the actual text shapes.
The detector predicts a rectangular detection frame for each text region and can be realized with any published object detection algorithm. The contour refiner then initializes an octagon inside each rectangular detection frame as the initial contour of the text region, with each vertex at the 1/4 and 3/4 positions of the sides of the rectangular frame, thereby providing the necessary starting point for subsequent contour refinement.
3. And a progressive profile optimizing module.
In the k-th iteration of the progressive contour optimization module, the contour C_{k-1} undergoes a fine-tuning step to fit the text outline in the image more accurately. The process first performs feature sampling on C_{k-1} from the multi-scale feature map F to obtain the associated vertex features f_{k-1}. This step is achieved by bilinear sampling, ensuring that the features extracted from the multi-scale feature map match the vertex positions of the contour. Next, a feature aggregation operation is performed: the sampled vertex features f_{k-1} are aggregated into the aggregate features g_{k-1} through the circular convolution CircConv, so as to obtain richer context information and detail features. Feature aggregation strengthens the feature representation of the contour vertices, providing a solid basis for further deformation of the contour. The deformation and adjustment of the contour are completed by the updater module in the progressive contour optimization module. Specifically, a hyperbolic tangent function converts the initial contour features g_0 into an initial hidden state h_0, providing a starting point for vertex coordinate updates in the iterative process. The subsequent hidden states h_1 to h_{k-1} are updated step by step by a GRU (gated recurrent unit) based on the previous state and the current contour information; the GRU receives the aggregate features g_{k-1} and computes a displacement vector ΔC_{k-1}, which indicates how to adjust the current contour vertices to move closer to the actual text edges. The iterative update module may employ a gated recurrent unit or another learnable structure to gradually approach the optimal contour. Finally, the updater module applies ΔC_{k-1} to the current contour C_{k-1} by element-wise addition to obtain the updated contour C_k. The overall process can be summarized by the following formulas:
f_{k-1} = Sample(F, C_{k-1})

g_{k-1} = CircConv(f_{k-1})

ΔC_{k-1} = Updater(g_{k-1})

C_k = C_{k-1} + ΔC_{k-1}
In the above formulas, the displacement prediction module computes the displacement vector ΔC_{k-1} of the contour, and the contour deformation module applies ΔC_{k-1} to obtain the updated contour C_k; the displacement prediction module and the contour deformation module together form the Updater module.
The above process is repeated until the set number of iterations K is reached, and the final contour, i.e., the image text detection result, is obtained.
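The repeated refinement can be sketched as the loop below. The sample, aggregate, and updater callables are toy stand-ins for the learned Sample, CircConv, and GRU-based Updater modules; only the loop structure mirrors the process described above.

```python
import numpy as np

def refine_contour(contour, sample, aggregate, updater, K=3):
    """Progressive contour optimization loop (structure only).

    Each iteration repeats the four formulas:
      f = Sample(F, C);  g = CircConv(f);  dC = Updater(g);  C = C + dC.
    The callables stand in for the learned modules.
    Returns the contour after every iteration (C_0 .. C_K).
    """
    history = [contour]
    for _ in range(K):
        f = sample(contour)             # vertex features from the feature map
        g = aggregate(f)                # circular aggregation along the contour
        contour = contour + updater(g)  # apply the predicted displacements
        history.append(contour)
    return history

# toy stand-ins: features are the coordinates themselves, aggregation is
# the identity, and the "updater" moves each vertex halfway to the origin
sample = lambda c: c
aggregate = lambda f: f
updater = lambda g: -0.5 * g

start = np.array([[8.0, 0.0], [0.0, 8.0], [-8.0, 0.0], [0.0, -8.0]])
contours = refine_contour(start, sample, aggregate, updater, K=3)
```

With these stand-ins, each iteration halves the contour's size, so the sequence visibly converges toward a target; in the real module, the learned displacements instead pull the vertices toward the text edges.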
3. Training protocols.
In the embodiment of the invention, the optimization training is mainly performed aiming at the contour initialization module and the progressive contour optimization module.
For the contour initialization module, the initialization loss function L_init computes the distance between the predicted initial contour C_0 and the actual contour C_gt:

L_init = ||C_0 - C_gt||

where ||·|| denotes the L1 norm.
For the progressive contour optimization module, the iteration loss function L_evolve computes the distance between the predicted contours and the actual contour:

L_evolve = Σ_{i=1}^{K} ||C_i - C_gt||

where C_i is the contour of the i-th iteration.
Finally, the total loss function L_total is:

L_total = λ_init·L_init + λ_evolve·L_evolve

where λ_init and λ_evolve are two weight factors balancing the initialization and iteration losses.
From the description of the above embodiments, it will be apparent to those skilled in the art that the above embodiments may be implemented in software, or may be implemented by means of software plus a necessary general hardware platform. With such understanding, the technical solutions of the foregoing embodiments may be embodied in a software product, where the software product may be stored in a nonvolatile storage medium (may be a CD-ROM, a U-disk, a mobile hard disk, etc.), and include several instructions for causing a computer device (may be a personal computer, a server, or a network device, etc.) to perform the methods of the embodiments of the present invention.
Example two
The invention also provides an image text detection system, which is mainly used for realizing the method provided by the previous embodiment, as shown in fig. 3, and mainly comprises:
the feature extraction module is used for extracting features of the original image to obtain a multi-scale feature map;
The outline initialization module is used for carrying out preliminary detection on the text area by utilizing the multi-scale feature map to obtain an initial outline of the text area;
The progressive contour optimization module is used for utilizing the multi-scale feature map and the initial contour of the text region, using a progressive learning mechanism to continuously iterate and optimize the contour shape, refining and adjusting the contour on the basis of the previous iteration every time, and finally iterating to obtain the contour capable of covering each text.
In view of the above, the details of the main processing of each module have been described in the previous embodiments, and will not be described in detail.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the system is divided into different functional modules to perform all or part of the functions described above.
Example III
The present invention also provides a processing apparatus, as shown in fig. 4, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, the processor, the memory, the input device and the output device are connected through buses.
In the embodiment of the invention, the specific types of the memory, the input device and the output device are not limited; for example:
The input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
The output device may be a display terminal;
the memory may be random access memory (Random Access Memory, RAM) or non-volatile memory (non-volatile memory), such as disk memory.
Example IV
The invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium according to the embodiment of the present invention may be provided as a computer readable storage medium in the aforementioned processing apparatus, for example, as a memory in the processing apparatus. The readable storage medium may be any of various media capable of storing a program code, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
The foregoing is only a preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions easily contemplated by those skilled in the art within the scope of the present invention should be included in the scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.
Claims (10)
1. An image text detection method is characterized by comprising the following steps:
step 1, extracting features of an original image to obtain a multi-scale feature map;
step 2, performing preliminary detection on the text region by utilizing the multi-scale feature map to obtain an initial contour of the text region;
step 3, using the multi-scale feature map and the initial contour of the text region, iteratively optimizing the contour shape with a progressive learning mechanism, wherein each iteration refines and adjusts the contour obtained in the previous iteration, finally yielding, through iteration, a contour that covers each text instance.
2. The method for detecting image text according to claim 1, wherein the feature extraction of the original image to obtain a multi-scale feature map comprises:
And processing the input image through the stacked convolution layer, the pooling layer and the normalization layer to obtain a multi-scale feature map.
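As a minimal illustration of the stacked backbone of claim 2 (a sketch only: the real convolution and normalization layers are learned, and are replaced here by plain pooling), the following builds a multi-scale pyramid from a single-channel image:

```python
import numpy as np

def max_pool2(x):
    """2x2 max pooling with stride 2 (H and W assumed even)."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def multiscale_pyramid(img, levels=3):
    """Stand-in for the stacked conv/pool/norm backbone: returns
    feature maps at progressively coarser scales."""
    maps = [img.astype(float)]
    for _ in range(levels - 1):
        maps.append(max_pool2(maps[-1]))
    return maps

pyr = multiscale_pyramid(np.arange(64.0).reshape(8, 8))
print([m.shape for m in pyr])  # [(8, 8), (4, 4), (2, 2)]
```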
3. The method for detecting image text according to claim 1, wherein the preliminary detection of text regions by using the multi-scale feature map includes:
obtaining a circumscribed rectangular detection frame of the text region from the multi-scale feature map using any target detector;
initializing an octagon inside the circumscribed rectangular detection frame, the vertices of the octagon being located at the 1/4 and 3/4 points of the sides of the rectangle, thereby obtaining a polygonal contour that fits the shape of the text, which is taken as the initial contour of the text region.
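The octagon construction of claim 3 can be sketched directly. `init_octagon` below is a hypothetical helper (not from the patent) that places the eight vertices at the 1/4 and 3/4 points of each side of a box `(x0, y0, x1, y1)`:

```python
def init_octagon(x0, y0, x1, y1):
    """Eight contour vertices at the 1/4 and 3/4 points of each side
    of the circumscribed rectangle, listed clockwise from the top side."""
    w, h = x1 - x0, y1 - y0
    return [
        (x0 + 0.25 * w, y0), (x0 + 0.75 * w, y0),  # top side
        (x1, y0 + 0.25 * h), (x1, y0 + 0.75 * h),  # right side
        (x0 + 0.75 * w, y1), (x0 + 0.25 * w, y1),  # bottom side
        (x0, y0 + 0.75 * h), (x0, y0 + 0.25 * h),  # left side
    ]

print(init_octagon(0, 0, 4, 8))
```

Compared with the rectangle itself, this octagon cuts the four corners, so the initial contour already lies closer to the body of the text.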
4. The image text detection method according to claim 1, wherein the process at the k-th iteration is as follows:
performing feature sampling on the (k-1)-th iteration contour C_{k-1} from the multi-scale feature map to obtain vertex features f_{k-1};
aggregating the vertex features to obtain an aggregated feature g_{k-1};
predicting a displacement vector ΔC_{k-1} of the contour using the aggregated feature g_{k-1};
updating the (k-1)-th contour C_{k-1} with the displacement vector ΔC_{k-1} to obtain the k-th iteration contour C_k.
5. The method for detecting image text according to claim 4, wherein the feature sampling is expressed as:
f_{k-1} = Sample(F, C_{k-1})
wherein Sample(·) is the sampling function, F is the multi-scale feature map, and when k = 1, C_{k-1} is the initial contour of the text region;
aggregating the vertex features is expressed as:
g_{k-1} = CircConv(f_{k-1})
wherein CircConv(·) is a circular convolution function used for feature aggregation.
6. The method for detecting image text according to claim 4, wherein the displacement vector of the contour is computed as:
ΔC_{k-1} = Updater(g_{k-1})
wherein Updater(·) is a displacement prediction module formed by stacking convolution layers and rectified linear units;
updating the (k-1)-th contour C_{k-1} with the displacement vector ΔC_{k-1} is expressed as:
C_k = C_{k-1} + ΔC_{k-1}.
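The iteration of claims 4-6 can be sketched in a few lines. This is an illustrative toy, not the patented implementation: the learned Updater network is replaced by a fixed rule, and bilinear feature sampling by nearest-neighbour lookup.

```python
import numpy as np

def sample(feature_map, contour):
    """f_{k-1} = Sample(F, C_{k-1}): nearest-neighbour lookup of one
    feature value per contour vertex (bilinear sampling in practice)."""
    h, w = feature_map.shape
    xs = np.clip(np.round(contour[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(contour[:, 1]).astype(int), 0, h - 1)
    return feature_map[ys, xs]

def circ_conv(f, kernel=(0.25, 0.5, 0.25)):
    """g_{k-1} = CircConv(f_{k-1}): circular 1-D convolution over the
    closed sequence of vertex features."""
    return sum(k * np.roll(f, j - 1) for j, k in enumerate(kernel))

def evolve(contour, feature_map, steps=3, lr=0.1):
    """C_k = C_{k-1} + ΔC_{k-1}; the learned Updater is replaced here by
    a toy rule that shifts each vertex in proportion to g_{k-1}."""
    C = np.asarray(contour, dtype=float)
    for _ in range(steps):
        g = circ_conv(sample(feature_map, C))
        C = C + lr * g[:, None]  # same toy offset applied to x and y
    return C
```

The circular convolution is what lets each vertex see its neighbours on the closed contour; a plain 1-D convolution would break the loop at the first and last vertex.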
7. The image text detection method according to claim 1, wherein step 2 is implemented by a contour initialization module and step 3 is implemented by a progressive contour optimization module, the two modules being trained as follows:
for the contour initialization module, an initialization loss function L_init is calculated using the distance between the initial contour C_0 output by the contour initialization module and the actual contour C_gt of the text region, expressed as:
L_init = ||C_0 - C_gt||
wherein ||·|| denotes the L1 norm;
for the progressive contour optimization module, the distance between the contour of each iteration and the actual contour C_gt is used to calculate an iteration loss function L_evolve, expressed as:
L_evolve = (1/K) Σ_{i=1}^{K} ||C_i - C_gt||
wherein C_i is the contour of the i-th iteration and K is the number of iterations;
the total loss function L_total is:
L_total = λ_init · L_init + λ_evolve · L_evolve
wherein λ_init and λ_evolve are two weighting factors;
the contour initialization module and the progressive contour optimization module are trained using the total loss function.
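The loss of claim 7 is straightforward to compute once the contours are vertex arrays. A minimal sketch, assuming L_evolve is averaged over the K refinement iterations (the exact normalization is not fixed by the claim):

```python
import numpy as np

def l1(a, b):
    """||A - B||: L1 distance between two vertex arrays."""
    return float(np.abs(np.asarray(a) - np.asarray(b)).sum())

def total_loss(C0, C_iters, C_gt, lam_init=1.0, lam_evolve=1.0):
    """L_total = lam_init * L_init + lam_evolve * L_evolve, where
    L_evolve is taken here as the mean per-iteration distance (assumption)."""
    L_init = l1(C0, C_gt)  # initialization loss on the first contour
    L_evolve = sum(l1(Ci, C_gt) for Ci in C_iters) / len(C_iters)
    return lam_init * L_init + lam_evolve * L_evolve
```

Because later iterations should sit closer to C_gt, their terms shrink as training progresses, so the evolve term naturally emphasizes whichever refinement step is still inaccurate.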
8. An image text detection system, comprising:
the feature extraction module is used for extracting features of the original image to obtain a multi-scale feature map;
The outline initialization module is used for carrying out preliminary detection on the text area by utilizing the multi-scale feature map to obtain an initial outline of the text area;
the progressive contour optimization module is used for iteratively optimizing the contour shape with a progressive learning mechanism, using the multi-scale feature map and the initial contour of the text region, wherein each iteration refines and adjusts the contour obtained in the previous iteration, finally yielding, through iteration, a contour that covers each text instance.
9. A processing apparatus, comprising: one or more processors; a memory for storing one or more programs;
Wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-7.
10. A readable storage medium storing a computer program, characterized in that the method according to any one of claims 1-7 is implemented when the computer program is executed by a processor.
Publications (1)
Publication Number | Publication Date |
---|---|
CN118334634A true CN118334634A (en) | 2024-07-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108288088B (en) | Scene text detection method based on end-to-end full convolution neural network | |
CN102282572B (en) | Method and system for representing image patches | |
US8391613B2 (en) | Statistical online character recognition | |
CN110163239B (en) | Weak supervision image semantic segmentation method based on super-pixel and conditional random field | |
CN109903331B (en) | Convolutional neural network target detection method based on RGB-D camera | |
CN109086777B (en) | Saliency map refining method based on global pixel characteristics | |
CN108830279B (en) | Image feature extraction and matching method | |
CN111738055B (en) | Multi-category text detection system and bill form detection method based on same | |
CN111242221B (en) | Image matching method, system and storage medium based on image matching | |
CN110969129A (en) | End-to-end tax bill text detection and identification method | |
CN111583279A (en) | Super-pixel image segmentation method based on PCBA | |
WO2012070474A1 (en) | Object or form information expression method | |
CN113971809A (en) | Text recognition method and device based on deep learning and storage medium | |
CN111161300B (en) | Niblack image segmentation method based on improved Otsu method | |
CN114283431B (en) | Text detection method based on differentiable binarization | |
CN115147932A (en) | Static gesture recognition method and system based on deep learning | |
CN114862925A (en) | Image registration method, device and system based on SIFT and storage medium | |
CN114943754A (en) | Image registration method, system and storage medium based on SIFT | |
CN108647605B (en) | Human eye gaze point extraction method combining global color and local structural features | |
CN112633070A (en) | High-resolution remote sensing image building extraction method and system | |
CN111724428A (en) | Depth map sampling and reconstructing method based on-map signal model | |
CN118334634A (en) | Image text detection method, system, equipment and storage medium | |
CN110570450A (en) | Target tracking method based on cascade context-aware framework | |
CN113515661B (en) | Image retrieval method based on filtering depth convolution characteristics | |
CN112836594B (en) | Three-dimensional hand gesture estimation method based on neural network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication |