CN113807336A

CN113807336A - Semi-automatic labeling method, system, computer equipment and medium for image text detection

Info

Publication number: CN113807336A
Application number: CN202110906651.7A
Authority: CN
Inventors: 黄双萍; 刘宗昊; 王庆丰
Original assignee: Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou; South China University of Technology SCUT
Current assignee: Guangdong Provincial Laboratory Of Artificial Intelligence And Digital Economy Guangzhou; South China University of Technology SCUT
Priority date: 2021-08-09
Filing date: 2021-08-09
Publication date: 2021-12-17
Anticipated expiration: 2041-08-09
Also published as: CN113807336B

Abstract

The invention discloses a semi-automatic labeling method, a system, computer equipment and a medium for image text detection, wherein the method comprises the following steps: acquiring a text image; acquiring a text center line from a text image; generating N candidate bounding boxes around a text centerline; inputting the N candidate text regions into a loose recognizer and a strict recognizer simultaneously, recognizing the N candidate text regions through the loose recognizer to obtain estimated text contents, and predicting the content recognition result of each candidate text region through the strict recognizer; comparing the N content identification results with the estimated text content, and respectively calculating identification losses to obtain N identification losses; obtaining the index of the most accurate candidate bounding box by determining the index of the minimum loss in all the identification losses, and further obtaining the final text box label; and optimizing the text box label by taking the recognition loss as a guide, and finally obtaining a compact text box label. The invention can improve the text detection labeling efficiency and labeling effect.

Description

Semi-automatic labeling method, system, computer equipment and medium for image text detection

Technical Field

The invention relates to a semi-automatic labeling method, a semi-automatic labeling system, computer equipment and a storage medium for image text detection, and belongs to the technical field of artificial intelligence and OCR.

Background

With the development of artificial intelligence technology, text detection technology has been greatly developed as a fundamental computer vision task. Text detection refers to the positioning of a text area in an image, and the technology can be widely applied to the industries of unmanned driving, robot navigation, blind person assistance and the like. With the great data-driven machine learning algorithm such as deep learning, great success is achieved in the fields of natural language processing, computer vision, voice recognition and the like, the image text detection technology based on deep learning is greatly developed, and the performance of the image text detection algorithm is remarkably improved. However, these methods rely on a large amount of detection annotation data.

To date, the acquisition of the detection annotation data has been largely manual, time consuming, labor intensive, and expensive. Especially for irregular text region positions, more points are usually needed for labeling, the efficiency is extremely low, and the precision is low due to human subjectivity. Therefore, a semi-automatic or automatic labeling algorithm is required to replace a manual labeling method, so that the labeling efficiency and accuracy are improved. Automatic algorithms have so far been implemented using detection algorithms, so-called pre-labeling. Because the performance of the detection algorithm is limited, the automatic algorithm pre-labeled by the detection algorithm cannot generate high-quality labels, more manual checks are still needed to obtain truly usable labels, and the difficult problem of image text detection labeling is not substantially solved.

Disclosure of Invention

In view of the above, the present invention provides a semi-automatic labeling method, system, computer device and storage medium for image text detection, which can improve the text detection labeling efficiency and labeling effect.

The invention aims to provide a semi-automatic image text detection labeling method.

The invention also provides a semi-automatic image text detection labeling system.

It is a third object of the invention to provide a computer apparatus.

It is a fourth object of the present invention to provide a storage medium.

The first purpose of the invention can be achieved by adopting the following technical scheme:

a semi-automatic labeling method for image text detection, the method comprising:

acquiring a text image;

acquiring a text center line from a text image, wherein the text center line is a bent broken line penetrating through the center of the text and is formed by sequentially connecting K +1 points to form K straight line segments;

generating N candidate boundary boxes surrounding the text center line, wherein each candidate boundary box is a polygonal area outline and surrounds N candidate text areas;

inputting the N candidate text regions into a loose recognizer and a strict recognizer simultaneously, recognizing the N candidate text regions through the loose recognizer to obtain estimated text contents, and predicting the content recognition result of each candidate text region through the strict recognizer;

comparing the N content identification results with the estimated text content, and respectively calculating identification losses to obtain N identification losses;

obtaining the index of the most accurate candidate bounding box by determining the index of the minimum loss in all the identification losses, and further obtaining the final text box label;

and optimizing the text box label by taking the recognition loss as a guide, and finally obtaining a compact text box label.

Further, the generating N candidate bounding boxes around the text center line specifically includes:

determining K +1 normals, each normal n_iIntersects the text centerline at point c_iAt the same time as the text center line at point c_iThe tangent of (A) is vertical;

at each normal n_iIn the above, a length h is determined_jLine segment l of_iLine segment l_iQuilt point c_iHalving, regarding the end points of all the line segments as the vertexes of the polygon, and connecting all the vertexes in sequence to obtain a candidate bounding box B_j；

By determining N differencesVariable h of_jTo obtain N candidate bounding boxes.

Further, the variable h_jIs determined as follows:

wherein j is 1, 2.

Further, the identifying, by the loose recognizer, the estimated text content from the N candidate text regions specifically includes:

the N candidate text regions are identified through a loose identifier to obtain N identification results { T }_jI j ═ 1, 2,.. N }, and the recognition result

Is a matrix of L C in shape;

calculating adjacent recognition results T_jAnd T_j-1Difference d of_jFrom the difference d_jMinimum recognition result T_j*To obtain an estimated text content T^*Wherein

The slave difference d_jMinimum recognition result

To obtain an estimated text content T^*The following formula:

wherein component (a)

Represents the recognition result T_jProbability that the u-th character belongs to the v-th class.

Further, the structure of the loose recognizer is an image text recognizer based on a convolutional neural network, and the image text recognizer comprises a corrector, a first encoder, a first sequence model and a first decoder;

the corrector for correcting the input image region R_jThe text image shape of (2);

the first encoder is used for extracting the characteristics of the corrected text image;

the first sequence model is used for extracting context dependent features;

the first decoder is used for translating the context dependent features and outputting a recognition result T_j；

The loose text recognizer is trained by using a loose text region of the synthesized image, wherein the loose text region is a region in the regional image, except for text, and a proper amount of background interference is introduced.

Further, the structure of the rigid recognizer is an image text recognizer based on a convolutional neural network, and the rigid recognizer comprises a second encoder, a second sequence model and a second decoder;

the second encoder is used for encoding the input image region R_jCarrying out feature extraction;

the second sequence model is used for extracting context dependent features;

the second decoder is used for decoding the context dependent characteristic and outputting a recognition result s_j；

The rigid recognizer is trained using a compact text region of the synthetic image, the compact text region being an image region within the regional image that is free of background interference other than text.

Further, the labeling of the text box to identify the loss as a guide is optimized as follows:

wherein the content of the first and second substances,

indicating a loss of recognition

To pair

And μ represents the update step.

The second purpose of the invention can be achieved by adopting the following technical scheme:

a semi-automatic annotation system for image text detection, the system comprising:

the text image acquisition module is used for acquiring a text image;

the text center line acquisition module is used for acquiring a text center line from a text image, wherein the text center line is a bent broken line penetrating through the center of the text and is formed by sequentially connecting K +1 points to form K straight line segments;

the candidate bounding box generating module is used for generating N candidate bounding boxes surrounding the text center line, wherein each candidate bounding box is a polygonal area outline and encloses N candidate text areas;

the identification module is used for simultaneously inputting the N candidate text regions into the loose recognizer and the strict recognizer, obtaining estimated text contents by identifying the N candidate text regions through the loose recognizer, and predicting the content identification result of each candidate text region through the strict recognizer;

the identification loss calculation module is used for comparing the N content identification results with the estimated text content, and respectively calculating identification loss to obtain N identification losses;

the text box label acquisition module is used for acquiring the index of the most accurate candidate boundary box by determining the index of the minimum loss in all the identification losses so as to obtain the final text box label;

and the optimization module is used for optimizing the text box label by taking the identification loss as a guide to finally obtain a compact text box label.

The third purpose of the invention can be achieved by adopting the following technical scheme:

a computer device comprises a processor and a memory for storing a program executable by the processor, wherein when the processor executes the program stored in the memory, the semi-automatic labeling method for detecting the image text is realized.

The fourth purpose of the invention can be achieved by adopting the following technical scheme:

a storage medium stores a program, and when the program is executed by a processor, the semi-automatic labeling method for image text detection is realized.

Compared with the prior art, the invention has the following beneficial effects:

the method comprises the steps of obtaining a text center line from a text image, generating a candidate boundary box surrounding the text center line, inputting the candidate boundary box into a loose recognizer and a strict recognizer, recognizing and obtaining estimated text contents through the loose recognizer, predicting a content recognition result through the strict recognizer, further calculating recognition loss, obtaining an index of the most accurate candidate boundary box through determining an index with the minimum loss in all recognition losses, further obtaining final text box labeling, optimizing the text box labeling by taking the recognition loss as a guide, obtaining compact text box labeling, achieving semi-automatic labeling, enabling the semi-automatic labeling to be between manual labeling and automatic labeling algorithms, and considering both labeling efficiency and labeling effect.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the structures shown in the drawings without creative efforts.

Fig. 1 is a simple flowchart of a semi-automatic labeling method for image text detection in embodiment 1 of the present invention.

Fig. 2 is a specific flowchart of the image text detection semi-automatic labeling method according to embodiment 1 of the present invention.

Fig. 3 is a schematic flowchart of candidate boundary generation in the image text detection semi-automatic labeling method according to embodiment 1 of the present invention.

Fig. 4 is a schematic flowchart of compact boundary estimation in the image text detection semi-automatic labeling method according to embodiment 1 of the present invention.

Fig. 5 is a block diagram of a semi-automatic image text detection annotation system according to embodiment 2 of the present invention.

Fig. 6 is a block diagram of a computer device according to embodiment 3 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer and more complete, the technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments, and all other embodiments obtained by a person of ordinary skill in the art without creative efforts based on the embodiments of the present invention belong to the protection scope of the present invention.

Example 1:

as shown in fig. 1 and fig. 2, the present embodiment provides a semi-automatic labeling method for image text detection, which includes the following steps:

s201, acquiring a text image.

The text image of the embodiment is a scene text image, and can be acquired by collection, for example, by shooting the scene text image with a camera, or by searching from a database, for example, by storing the scene text image in the database in advance, and by searching the scene text image from the database, the text image can be acquired, and the acquired text image is used as an input.

S202, acquiring a text center line from the text image.

In this embodiment, the text center line is a curved fold line passing through the center of the textThe linear array is formed by sequentially connecting K +1 points to form K linear segments, wherein the K +1 points are marked as { c_iI 1, 2., K +1}, with the text centerline as input.

S203, generating N candidate boundary boxes surrounding the text center line, wherein each candidate boundary box is a polygonal area outline and surrounds N candidate text areas.

In this embodiment, the N candidate bounding boxes are denoted as { B }_j1, 2.,. N }, where N candidate text regions are denoted as { R |_j|j＝1，2，...，N}，N＝30。

With reference to fig. 3, the step S203 is a step of candidate bounding boxes, which specifically includes:

s2031, determining K +1 normals, each normal n_iIntersects the text centerline at point c_iAt the same time as the text center line at point c_iThe tangent of (a) is vertical.

S2032, at each normal n_iIn the above, a length h is determined_jLine segment l of_iLine segment l_iQuilt point c_iHalving, regarding the end points of all the line segments as the vertexes of the polygon, and connecting all the vertexes in sequence to obtain a candidate bounding box B_j。

S2033, determining N different variables h_jTo obtain N candidate bounding boxes.

In this embodiment, the variable h is different_jDifferent polygon bounding boxes may be determined, candidate N different h_jN candidate bounding boxes, variable h, may be obtained_jIs determined as follows:

wherein j is 1, 2.

The following steps S204 to S206 are semantic boundary decision steps:

s204, inputting the N candidate text regions into a loose recognizer and a strict recognizer simultaneously, recognizing the N candidate text regions through the loose recognizer to obtain estimated text contents, and predicting the content recognition result of each candidate text region through the strict recognizer.

In this embodiment, identifying and obtaining the estimated text content from the N candidate text regions by using the loose identifier specifically includes:

1) for N candidate text regions R by a relaxed recognizer_jIdentifying the 1, 2, N to obtain N identification results { T }_jI j ═ 1, 2,.. N }, and the recognition result

Is a matrix of L x C shape.

2) Calculating adjacent recognition results T_jAnd T_j-1Difference d of_jFrom the difference d_jMinimum recognition result

To obtain an estimated text content T^*Wherein

Calculating adjacent recognition results T_jAnd T_j-1Difference d of_jIncluding but not limited to cross entropy loss function, CTC (connectionist Temporal Classification) loss function, and edit distance, from the difference d_jMinimum recognition result

To obtain an estimated text content T^*The following formula:

wherein component (a)

The structure of the relaxed recognizer of the embodiment is an image text recognizer based on a convolutional neural network, and comprises a corrector, a first encoder, a first sequence model and a first decoder, and the descriptions of the parts are as follows:

corrector for correcting an input image region R_jThe text image shape of (2);

a first sequence model for extracting context dependent features;

a first decoder for translating the context dependent features and outputting a recognition result T_j；

The loose recognizer is trained by using a loose text region of the synthesized image, wherein the loose text region is a region in the regional image, except for the text, and a proper amount of background interference is introduced.

The structure of the rigid recognizer of this embodiment is an image text recognizer based on a convolutional neural network, and includes a second encoder, a second sequence model and a second decoder, and the descriptions of the respective parts are as follows:

a second encoder for encoding the input image region R_jCarrying out feature extraction;

a second sequence model for extracting context dependent features;

a second decoder for decoding the context dependent feature and outputting a recognition result s_j；

The rigid recognizer is trained using the compacted text regions of the synthetic image, which are regions of the image within the regional image that, except for text, do not have background interference.

S205, comparing the N content recognition results with the estimated text content, respectively calculating recognition losses, and obtaining N recognition losses.

In this embodiment, the recognition loss is given as { l }_j1, 2.., N }, calculating a recognition penalty including, but not limited to, crossingAn entropy loss function, a ctc (connectionist Temporal classification) loss function, and an edit distance.

S206, obtaining the index of the most accurate candidate bounding box by determining the index of the minimum loss in all the identification losses, and further obtaining the final text box label.

And S207, marking the text box, optimizing by taking the recognition loss as a guide, and finally obtaining a compact text box mark.

Referring to fig. 4, the step S107 is a tight boundary estimation step, and the text box is labeled to identify the loss as a guide for optimization, as follows:

wherein the content of the first and second substances,

indicating a loss of recognition

To pair

And μ represents the update step.

In the above embodiment, the strict identifier performs training using the compact text region of the synthesized image, and is sensitive to the text image background region, so that the obtained text detection box is compact and has high accuracy.

It should be noted that although the method operations of the above-described embodiments are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the depicted steps may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

Example 2:

as shown in fig. 5, the present embodiment provides a semi-automatic labeling system for image text detection, which includes a text image obtaining module 501, a text centerline obtaining module 502, a candidate bounding box generating module 503, a recognition module 504, a recognition loss calculating module 505, a text box labeling obtaining module 506, and an optimizing module 507, where the specific functions of each module are as follows:

a text image obtaining module 501, configured to obtain a text image.

The text center line obtaining module 502 is configured to obtain a text center line from the text image, where the text center line is a curved broken line that runs through the center of the text and is formed by sequentially connecting K +1 points to form K straight line segments.

The candidate bounding box generating module 503 is configured to generate N candidate bounding boxes around a text centerline, where each candidate bounding box is a polygon region outline and encloses N candidate text regions.

The recognition module 504 is configured to input the N candidate text regions into the relaxed recognizer and the rigid recognizer at the same time, recognize the estimated text content from the N candidate text regions by the relaxed recognizer, and predict the content recognition result of each candidate text region by the rigid recognizer.

And the recognition loss calculating module 505 is configured to compare the N content recognition results with the estimated text content, and calculate recognition losses respectively to obtain N recognition losses.

The text box label obtaining module 506 is configured to obtain an index of the most accurate candidate bounding box by determining an index of the smallest loss among all the identification losses, and further obtain a final text box label.

And the optimizing module 507 is configured to optimize the text box label by using the recognition loss as a guide, so as to finally obtain a compact text box label.

The specific implementation of each module in this embodiment may refer to embodiment 1, which is not described herein any more; it should be noted that the system provided in this embodiment is only illustrated by the division of the functional units, and in practical applications, the functions may be distributed by different functional modules according to needs, that is, the internal structure is divided into different functional modules to complete all or part of the functions described above.

Example 3:

the present embodiment provides a computer device, which may be a computer, as shown in fig. 6, and includes a system bus 601, a connected processor 602, a memory, an input device 603, a display device 604, and a network interface 605, where the processor is configured to provide computing and control capabilities, the memory includes a nonvolatile storage medium 606 and an internal memory 607, the nonvolatile storage medium 606 stores an operating system, a computer program, and a database, the internal memory 607 provides an environment for the operating system and the computer program in the nonvolatile storage medium to run, and when the processor 602 executes the computer program stored in the memory, the image text detection semi-automatic labeling method of embodiment 1 is implemented as follows:

acquiring a text image;

Example 4:

the present embodiment provides a storage medium, which is a computer-readable storage medium, and stores a computer program, and when the computer program is executed by a processor, the method for semi-automatically labeling image text detection in embodiment 1 is implemented as follows:

acquiring a text image;

It should be noted that the computer readable storage medium of the present embodiment may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

In summary, the invention obtains a text center line from a text image, generates a candidate bounding box around the text center line, inputs the candidate bounding box into a loose recognizer and a strict recognizer, obtains estimated text content through recognition of the loose recognizer, predicts a content recognition result through the strict recognizer, further calculates recognition loss, obtains an index of the most accurate candidate bounding box through determining an index of the minimum loss in all recognition losses, further obtains a final text box label, optimizes the text box label by taking the recognition loss as a guide, obtains a compact text box label, realizes semi-automatic label, and can give consideration to both label efficiency and label effect through the intermediate between a manual label and an automatic label algorithm.

The above description is only for the preferred embodiments of the present invention, but the protection scope of the present invention is not limited thereto, and any person skilled in the art can substitute or change the technical solution and the inventive concept of the present invention within the scope of the present invention.

Claims

1. A semi-automatic labeling method for image text detection is characterized by comprising the following steps:

acquiring a text image;

2. The image text detection semi-automatic labeling method according to claim 1, wherein the generating N candidate bounding boxes around the text center line specifically comprises:

By determining N different variables h_jTo obtain N candidate bounding boxes.

3. The image text detection semi-automatic labeling method of claim 2, characterized in that the variable h_jIs determined as follows:

wherein j is 1, 2.

4. The method for semi-automatic annotation of image text detection according to claim 1, wherein the identifying of the estimated text content from the N candidate text regions by the relaxed identifier specifically comprises:

Is a matrix of L C in shape;

calculating adjacent recognition results T_jAnd T_j-1Difference d of_jFrom the difference d_jMinimum recognition result

To obtain an estimated text content T^*Wherein

The slave difference d_jMinimum recognition result

To obtain an estimated text content T^*The following formula:

wherein component (a)

5. The image text detection semi-automatic labeling method of any one of claims 1-4, characterized in that the structure of the relaxed recognizer is a convolutional neural network-based image text recognizer, comprising a corrector, a first encoder, a first sequence model and a first decoder;

the first sequence model is used for extracting context dependent features;

6. The image text detection semi-automatic labeling method of any one of claims 1-4, characterized in that the structure of the rigid recognizer is a convolutional neural network-based image text recognizer, which comprises a second encoder, a second sequence model and a second decoder;

the second sequence model is used for extracting context dependent features;

7. The image text detection semi-automatic labeling method according to any one of claims 1 to 4, characterized in that the labeling of the text box is optimized by taking loss recognition as a guide, as follows:

wherein the content of the first and second substances,

indicating a loss of recognition

To pair

And μ represents the update step.

8. A semi-automatic annotation system for image text detection, the system comprising:

the text image acquisition module is used for acquiring a text image;

9. A computer device comprising a processor and a memory for storing a program executable by the processor, wherein the processor, when executing the program stored in the memory, implements the image text detection semi-automatic labeling method of any one of claims 1 to 7.

10. A storage medium storing a program, wherein the program, when executed by a processor, implements the image text detection semi-automatic labeling method of any one of claims 1 to 7.