CN111798480A - Character detection method and device based on single character and character connection relation prediction - Google Patents
- Publication number: CN111798480A
- Application number: CN202010719772.6A
- Authority: CN (China)
- Prior art keywords: character, map, text, word, training
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T7/187: Segmentation; Edge detection involving region growing; involving region merging; involving connected component labelling
- G06T7/0002: Inspection of images, e.g. flaw detection
- G06T7/11: Region-based segmentation
- G06T7/136: Segmentation; Edge detection involving thresholding
- G06V30/153: Segmentation of character regions using recognition of characters or words
- G06T2207/20081: Indexing scheme for image analysis or image enhancement; Training; Learning
- G06T2207/20084: Indexing scheme for image analysis or image enhancement; Artificial neural networks [ANN]
- G06V30/287: Character recognition specially adapted to the type of the alphabet, e.g. of Kanji, Hiragana or Katakana characters
Abstract
The invention discloses a character detection method and device based on single-character and character connection relation prediction, comprising the following steps: training a neural network model; performing feature extraction through the neural network model to obtain feature maps, namely a character position score map (Text-map) and an inter-character connection relation score map (Link-map); post-processing the Text-map and the Link-map; and calculating the minimum circumscribed rectangle of each connected domain to detect the positions of characters in the picture. The character-level prediction forces the convolution kernels of the model to attend to character-level features, and the post-processing branch effectively associates the character-level features learned by the model, enabling the detection of long texts and avoiding missed characters.
Description
Technical Field
The invention belongs to the field of character detection, and particularly relates to a character detection method based on single character and character connection relation prediction.
Background
Before the rapid rise of neural networks, text detection was usually built on hand-designed feature extraction methods as basic components, such as region-based feature extraction with MSER (Maximally Stable Extremal Regions) and the Stroke Width Transform (SWT). With the development of neural network technology in recent years, more and more neural-network-based methods have become popular, and new text detection methods have been proposed in succession. In general, however, these methods are essentially improvements on general object detection or general instance segmentation methods, such as SSD (Single Shot MultiBox Detector), Faster R-CNN, and Fully Convolutional Networks (FCN). According to their design, text detection algorithms can be divided into two broad categories: text detection methods based on bounding-box regression and text detection methods based on pixel segmentation.
Text detection methods based on bounding-box regression. Some text detection methods adapt the bounding-box regression used in general object detection. Unlike general objects, the targets in text detection are often irregular text fields at various angles, so a general-purpose object detector cannot be used directly as a text detection model. To address this, Textboxes-style methods were designed that adapt to text objects of different shapes by changing the shapes of the convolution kernels and anchor boxes of a general object detector, with good detection results on horizontal rectangular text. The Deep Matching Prior Network further applies quadrilateral sliding windows to filter out erroneous detection regions. In a more recent method, a Rotation-Sensitive regression detector (RSDD) fully exploits rotation-invariant features by actively rotating the convolutional filters, improving the detection of inclined text targets. However, such methods still require prior knowledge to constrain the shape of targets in natural scenes.
Text detection methods based on pixel segmentation. Another common family of methods is based on segmentation, aiming to find the coverage region of text in the image at the pixel level by detecting text boundary regions. Similar algorithms optimize on this basis by attempting to reduce background interference at the feature level, using attention mechanisms to enhance text regions. Segmentation-based text detection represents the input image as text/non-text regions in the form of a binary map, and then further separates the text instances within the text regions using a post-processing method.
Disclosure of Invention
The invention provides a character detection method and device based on single-character and character connection relation prediction, so as to detect each character in a picture.
The technical scheme adopted by the invention is as follows:
In a first aspect, the present invention provides a text detection method based on single-character and character connection relation prediction, comprising the following steps:

training a neural network model; performing feature extraction through the neural network model to obtain feature maps, namely a character position score map (Text-map) and an inter-character connection relation score map (Link-map), and post-processing the Text-map and the Link-map, the post-processing comprising:

converting the Text-map into a binary map Bm1 by setting a threshold λ1 for the Text-map;

converting the Link-map into a binary map Bm2 by setting a threshold λ2 for the Link-map;

initializing the binary map Bm1 to 0, and setting a position on Bm1 to 1 when the feature map value at the corresponding position is greater than the threshold λ1;

initializing the binary map Bm2 to 0, and setting a position on Bm2 to 1 when the feature map value at the corresponding position is greater than the threshold λ2;

performing connected component analysis on the binary maps Bm1 and Bm2 to obtain the pixels of all text regions;

calculating the minimum circumscribed rectangle of each connected domain, thereby detecting the positions of the text in the picture.
With this post-processing, no additional method such as Non-Maximum Suppression (NMS) is used beyond the three lightweight computations above; after the connected domains representing text regions are obtained, the minimum circumscribed rectangle of each is taken as the bounding rectangle of the text instance, so the binary maps are obtained by simple thresholding without time-consuming design. In essence this is still a pixel-segmentation-based text detection method, but unlike methods that directly segment text instances it breaks through their limitations: by predicting a feature map for single characters and a feature map for the connection relations of the instance each character belongs to, and combining them into complete text-instance bounding boxes through a search procedure, it handles the up-down connection problem in text detection, the representation of curved text fields, and the detection of long text fields.
In one possible design, training the neural network model includes:

performing label conversion, converting labels in the form of bounding-box coordinates into score-map labels with character-level annotations;
for each training image, a corresponding Text-map and Link-map are generated for each instance in the picture: on the Text-map, a specific Gaussian map unit is generated for each character position in the original image; on the Link-map, a binary map unit representing the instance connection relation is generated for each instance in the original image. To generate the Link-map, a diagonal is first drawn in each character bounding box, yielding an upper triangle and a lower triangle; the centroids of the upper triangles and of the lower triangles belonging to the same instance are then connected, giving an upper line and a lower line, and closing these two polylines yields the polygonal region representing the instance connection relation. First, the binary connection map generated in this way differs from the binary semantic segmentation map of generic segmentation-based text detection schemes: the connection relation here is more compact, and the tightness of the connection between two characters can be measured by the width of the connection binary map. Second, encoding a single character as a Gaussian heat map reflects the center and edges of the character and flexibly represents the relation between the ground-truth label and the image, while representing the intra-instance connection relation as a binary segmentation map helps the deep model learn the semantic connection information within a word or field.
In one possible design, the Gaussian map unit is generated as follows:

generating a two-dimensional standard Gaussian feature map in advance;

calculating the perspective transformation matrix between the four corner coordinates of the standard Gaussian feature map and those of the character bounding box;

transferring the standard Gaussian feature map into the bounding-box region through the perspective transformation.
In one possible design, when the neural network model is trained, character-level labels are generated from the word-level labels of the training samples in a weakly supervised manner, the character-level labels being represented in the form of character bounding boxes.
In one possible design, the character-level labels are generated as follows:

for a real training sample with word-level labels, forward inference is carried out on the sample using a fully trained model to predict the sample's Text-map; a feature slice of the text is cut from the feature map according to the original word-level label; a watershed algorithm is applied to the text feature slice to estimate the position of each character, represented in the form of a bounding box; finally, the bounding boxes are mapped back onto the original image.
In one possible design, during model training the number of generated character bounding boxes is matched against the number of real characters to obtain a confidence, which is used to weigh the character bounding boxes generated during training.

In one possible design, the confidence is used to weigh the character bounding boxes generated during model training as follows:
for a word-level label instance w in the training data, let Tw and Lw respectively denote the region of the character bounding boxes generated for instance w and the number of its real characters; applying a watershed image segmentation algorithm yields Lcw, the number of generated character bounding boxes; the confidence score Sconf(w) of the instance's character bounding boxes is then calculated by the following formula,
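Since the formula itself is not reproduced in this text, the following reconstruction from the surrounding definitions is stated as an assumption:

$$S_{conf}(w) = \frac{L_w - \min\!\left(L_w,\; \lvert L_w - L_{cw} \rvert\right)}{L_w}$$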
where the min function returns the minimum of its arguments;
the confidence score map Sc(p) describing the complete image can then be expressed as follows,
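Likewise a reconstruction stated as an assumption, taking Tw as the word-level bounding-box region of instance w:

$$S_c(p) = \begin{cases} S_{conf}(w), & p \in T_w \\ 1, & \text{otherwise} \end{cases}$$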
where p ranges over the pixels of the corresponding real word-level bounding box.
The confidence score Sconf(w) measures the confidence of the generated labels, making the labels generated by the model during training more accurate, and the recognition performance of the whole network model can be monitored through the confidence score map Sc(p) describing the complete image.
In one possible design, the neural network model uses DetNet as the base network. DetNet is a novel backbone network specially designed for object detection tasks; by using fewer down-sampling operations, it preserves the ability to locate small target objects in object detection.
In a second aspect, the present invention provides a word detection apparatus based on single character and word-word connection relation prediction, comprising a memory, a processor and a transceiver connected in sequence, wherein the memory is used for storing a computer program, the transceiver is used for sending and receiving messages, and the processor is used for reading the computer program and executing the method according to the first aspect.
In a third aspect, the present invention provides a computer-readable storage medium having stored thereon instructions which, when run on a computer, perform the method according to the first aspect.
The invention has the following advantages and beneficial effects:
1. With the post-processing method of the invention, no additional method such as Non-Maximum Suppression (NMS) is used beyond the three lightweight computations; after the connected domains representing text regions are obtained, the minimum circumscribed rectangle of each is taken as the bounding rectangle of the text instance, so the binary maps are obtained by simple thresholding without time-consuming design. The method is in essence still a pixel-segmentation-based text detection method, but unlike methods that directly segment text instances it breaks through their limitations: by predicting a feature map for single characters and a feature map for the connection relations of the instance each character belongs to, and combining them into complete text-instance bounding boxes through a search procedure, it handles the up-down connection problem in text detection, the representation of curved text fields, and the detection of long text fields. In other words, traditional text detection methods, whether based on bounding-box regression or on pixel segmentation, often need a deeper network structure to enlarge the receptive field of the model in order to detect long or large text targets; the present method avoids this by predicting at the character level and linking the characters in post-processing.
2. The binary connection map generated by the method differs from the binary semantic segmentation map of generic segmentation-based text detection schemes: the connection relation is more compact, and the tightness of the connection between two characters can be measured by the width of the connection binary map. Moreover, encoding a single character as a Gaussian heat map reflects the center and edges of the character and flexibly represents the relation between the ground-truth label and the image, while representing the intrinsic connection relation of an instance as a binary segmentation map helps the deep model learn the semantic connection information within words or fields;
3. The confidence score Sconf(w) measures the confidence of the labels generated during training, making the labels generated by the training model more accurate, and the recognition performance of the whole network model can be monitored through the confidence score map Sc(p) describing the complete image.
Drawings
The accompanying drawings, which are included to provide a further understanding of the embodiments of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
FIG. 1 is an overall structure diagram of a neural network model according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to examples and accompanying drawings, and the exemplary embodiments and descriptions thereof are only used for explaining the present invention and are not meant to limit the present invention.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments of the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including," when used herein, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, numbers, steps, operations, elements, components, and/or groups thereof.
It should also be noted that, in some alternative implementations, the functions/acts noted may occur out of the order noted in the figures. For example, two figures shown in succession may, in fact, be executed substantially concurrently, or the figures may sometimes be executed in the reverse order, depending upon the functionality/acts involved.
It should be understood that specific details are provided in the following description to facilitate a thorough understanding of example embodiments. However, it will be understood by those of ordinary skill in the art that the example embodiments may be practiced without these specific details. For example, systems may be shown in block diagrams in order not to obscure the examples in unnecessary detail. In other instances, well-known processes, structures and techniques may be shown without unnecessary detail in order to avoid obscuring example embodiments.
The aim of text detection is to accurately locate the position of text in an image. Text detection technology has gone through the following stages in recent years: a first stage detecting horizontal text, a second stage detecting straight text inclined at certain angles, and a third stage detecting text of arbitrary shape, including text curved at arbitrary angles. Recent research tends to locate text positions with arbitrary polygons, which localize text in images more accurately. The character detection method of this embodiment, based on single-character detection and word relation prediction, also belongs to this new category.
Our method aims to accurately locate the position of each character in the input image and the connection relationships between characters within a field.
Examples
In a first aspect, this embodiment provides a text detection method based on single-character and inter-character connection relation prediction, comprising the following steps:

training a neural network model; performing feature extraction through the neural network model to obtain feature maps, namely a character position score map (Text-map) and an inter-character connection relation score map (Link-map), and post-processing the Text-map and the Link-map, the post-processing comprising:

converting the Text-map into a binary map Bm1 by setting a threshold λ1 for the Text-map;

converting the Link-map into a binary map Bm2 by setting a threshold λ2 for the Link-map;

initializing the binary map Bm1 to 0, and setting a position on Bm1 to 1 when the feature map value at the corresponding position is greater than the threshold λ1;

initializing the binary map Bm2 to 0, and setting a position on Bm2 to 1 when the feature map value at the corresponding position is greater than the threshold λ2;
performing Connected Component Labelling (CCL) on the binary maps Bm1 and Bm2 to obtain the pixels of all text regions; in a specific implementation, this step can be carried out with the connectedComponents function of OpenCV, a cross-platform computer vision and machine learning software library.
calculating the minimum circumscribed rectangle of each connected domain to detect the positions of the text in the picture; in a specific implementation, this step is carried out with the minAreaRect function of OpenCV.
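In a specific implementation, the post-processing described above can be sketched as follows in Python with OpenCV and NumPy. The threshold values and the pixel-wise union of Bm1 and Bm2 are illustrative assumptions of this sketch; the description above fixes neither.

```python
import cv2
import numpy as np

def postprocess(text_map: np.ndarray, link_map: np.ndarray,
                lambda1: float = 0.7, lambda2: float = 0.4):
    """Threshold the two score maps, label connected components, and
    return the minimum circumscribed rectangle of each text region."""
    # Binary maps Bm1/Bm2: initialised to 0, set to 1 where the score
    # at the corresponding position exceeds its threshold.
    bm1 = (text_map > lambda1).astype(np.uint8)
    bm2 = (link_map > lambda2).astype(np.uint8)

    # Character pixels and link pixels together form the text regions
    # (pixel-wise union; an assumption of this sketch).
    combined = np.logical_or(bm1, bm2).astype(np.uint8)

    # Connected Component Labelling (CCL) via OpenCV.
    n_labels, labels = cv2.connectedComponents(combined, connectivity=4)

    boxes = []
    for label in range(1, n_labels):            # label 0 is background
        ys, xs = np.nonzero(labels == label)
        points = np.stack([xs, ys], axis=1).astype(np.float32)
        rect = cv2.minAreaRect(points)          # minimum circumscribed rectangle
        boxes.append(cv2.boxPoints(rect))       # its 4 corner points
    return boxes
```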
The overall structure of the text detection model provided by this embodiment mainly comprises three parts, as shown in FIG. 1. The first part is a backbone network built from DetNet, used for feature extraction; DetNet is a novel backbone specially designed for object detection tasks, and with fewer down-sampling operations it preserves the ability to locate small target objects. The second part is an up-sampling feature-fusion part. The third part is the post-processing part based on connected-domain analysis, also called the inference part, since this stage exists only at inference time. In implementation, the model can output text position representations of various shapes according to specific requirements, typically including character bounding boxes, word bounding boxes, polygonal curved-edge bounding boxes, and the like.
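For concreteness, a minimal PyTorch sketch of this three-part structure follows. The channel widths, feature strides, concatenation-based fusion, and sigmoid output head are illustrative assumptions, and the backbone argument is a stand-in for DetNet rather than an implementation of it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextLinkNet(nn.Module):
    """Backbone + up-sampling feature fusion + 2-channel score-map head."""

    def __init__(self, backbone: nn.Module):
        super().__init__()
        # Stand-in for DetNet; assumed to return three feature maps with
        # 256/128/64 channels at strides 8/4/2 (illustrative choices).
        self.backbone = backbone
        self.fuse1 = nn.Conv2d(256 + 128, 128, kernel_size=3, padding=1)
        self.fuse2 = nn.Conv2d(128 + 64, 64, kernel_size=3, padding=1)
        self.head = nn.Conv2d(64, 2, kernel_size=1)   # Text-map, Link-map

    def forward(self, x: torch.Tensor):
        f8, f4, f2 = self.backbone(x)                 # deepest feature first
        y = F.interpolate(f8, size=f4.shape[2:], mode="bilinear",
                          align_corners=False)        # up-sample ...
        y = F.relu(self.fuse1(torch.cat([y, f4], dim=1)))   # ... and fuse
        y = F.interpolate(y, size=f2.shape[2:], mode="bilinear",
                          align_corners=False)
        y = F.relu(self.fuse2(torch.cat([y, f2], dim=1)))
        scores = torch.sigmoid(self.head(y))
        return scores[:, 0], scores[:, 1]             # Text-map, Link-map
```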
With this post-processing, no additional method such as Non-Maximum Suppression (NMS) is used beyond the three lightweight computations above; after the connected domains representing text regions are obtained, the minimum circumscribed rectangle of each connected domain represents a text instance, so the binary maps are obtained by simple thresholding without time-consuming design. In essence this is still a pixel-segmentation-based text detection method, but unlike methods that directly segment text instances it breaks through their limitations: by predicting a feature map for single characters and a feature map for the connection relations of the instance each character belongs to, and combining them into complete text-instance bounding boxes through a search procedure, it handles the up-down connection problem in text detection, the representation of curved text fields, and the detection of long text fields.
In one possible design, training the neural network model includes performing label conversion. Because the training process of the algorithm does not directly regress character coordinates but relies on a dedicated post-processing branch operating on the feature maps, labels in bounding-box numerical form must be converted into score-map labels with character-level annotations.
The character-level labels for synthetic data are generated as follows: for each training image, a corresponding Text-map and Link-map are generated for each instance in the picture. On the Text-map, a specific Gaussian map unit is generated for each character position in the original image; on the Link-map, a binary map unit representing the instance connection relation is generated for each instance in the original image. To generate the Link-map, a diagonal is first drawn in each character bounding box, yielding an upper triangle and a lower triangle; the centroids of the upper triangles and of the lower triangles belonging to the same instance are then connected, giving an upper line and a lower line, and closing these two polylines yields the polygonal region representing the instance connection relation (see the sketch below). First, the binary connection map generated in this way differs from the binary semantic segmentation map of generic segmentation-based text detection schemes: the connection relation here is more compact, and the tightness of the connection between two characters can be measured by the width of the connection binary map. Second, encoding a single character as a Gaussian heat map reflects the center and edges of the character and flexibly represents the relation between the ground-truth label and the image, while representing the intra-instance connection relation as a binary segmentation map helps the deep model learn the semantic connection information within a word or field.
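The diagonal-and-centroid construction can be illustrated with a short NumPy sketch. The corner ordering of the character boxes and the choice of diagonal are assumptions of the sketch, not details fixed by the description above.

```python
import numpy as np

def link_polygon(char_boxes):
    """Build the polygonal Link-map region for one word instance.

    char_boxes: character bounding boxes of the instance in reading order,
    each a (4, 2) array ordered top-left, top-right, bottom-right,
    bottom-left (an assumed convention).
    """
    upper, lower = [], []
    for box in char_boxes:
        tl, tr, br, bl = np.asarray(box, dtype=np.float64)
        # One diagonal (tl -> br) splits the box into an upper triangle
        # (tl, tr, br) and a lower triangle (tl, br, bl).
        upper.append((tl + tr + br) / 3.0)   # centroid of upper triangle
        lower.append((tl + br + bl) / 3.0)   # centroid of lower triangle
    # Upper line left-to-right plus lower line right-to-left closes the
    # polygon that represents the instance connection relation.
    return np.array(upper + lower[::-1])
```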
Computing the Gaussian distribution value of every pixel inside each character bounding box directly takes a great deal of time and is not feasible in a practical training process. Note that in training general object detection models, where the labels are horizontal quadrilateral bounding boxes, such algorithms generate the internal Gaussian feature map directly from the target bounding-box label; in the present task, however, the character bounding box is often an arbitrary convex quadrilateral, so, considering time complexity, the Gaussian map unit is generated with the following steps:
generating a two-dimensional standard Gaussian feature map in advance;

calculating the perspective transformation matrix between the four corner coordinates of the standard Gaussian feature map and those of the character bounding box;

transferring the standard Gaussian feature map into the bounding-box region through the perspective transformation.
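A minimal OpenCV sketch of these three steps follows; the template size, the Gaussian sigma, and compositing by per-pixel maximum are illustrative assumptions.

```python
import cv2
import numpy as np

def render_gaussian_unit(canvas: np.ndarray, char_box,
                         size: int = 64, sigma_ratio: float = 0.25):
    """Warp a pre-generated 2-D standard Gaussian onto an arbitrary convex
    quadrilateral character box on the Text-map canvas (float32, HxW)."""
    # Step 1: two-dimensional standard Gaussian feature map, generated once.
    ax = np.arange(size) - (size - 1) / 2.0
    xx, yy = np.meshgrid(ax, ax)
    sigma = size * sigma_ratio
    gaussian = np.exp(-(xx ** 2 + yy ** 2) / (2.0 * sigma ** 2)).astype(np.float32)

    # Step 2: perspective transform between the Gaussian map's four corners
    # and the four coordinates of the character bounding box.
    src = np.float32([[0, 0], [size - 1, 0], [size - 1, size - 1], [0, size - 1]])
    dst = np.float32(char_box)                # (4, 2), same corner order
    matrix = cv2.getPerspectiveTransform(src, dst)

    # Step 3: transfer the standard Gaussian into the bounding-box region.
    h, w = canvas.shape[:2]
    warped = cv2.warpPerspective(gaussian, matrix, (w, h))
    np.maximum(canvas, warped, out=canvas)    # composite by per-pixel maximum
    return canvas
```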
In one possible design, when the neural network model is trained, character-level labels are generated from the word-level labels of the training samples in a weakly supervised manner, the character-level labels being represented in the form of character bounding boxes. In a specific implementation, the character-level labels are generated as follows:
for a real training sample with word-level labels, forward inference is carried out on the sample using a fully trained model to predict the sample's Text-map; a feature slice of the text is cut from the feature map according to the original word-level label; a watershed algorithm is applied to the text feature slice to estimate the position of each character, represented in the form of a bounding box; finally, the bounding boxes are mapped back onto the original image.
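A sketch of this crop-and-split procedure on a single word-level Text-map slice, assuming a scikit-image watershed and illustrative seed/region thresholds (the description above fixes neither):

```python
import numpy as np
from scipy import ndimage
from skimage.segmentation import watershed

def char_boxes_from_word_slice(text_slice: np.ndarray,
                               seed_thresh: float = 0.6,
                               region_thresh: float = 0.2):
    """Estimate character bounding boxes on a cropped word-level slice of
    the predicted Text-map; thresholds are illustrative assumptions."""
    # Seeds: confident character centres; region: the whole text area.
    seeds, _ = ndimage.label(text_slice > seed_thresh)
    region = text_slice > region_thresh
    # Watershed on the inverted score map separates touching characters.
    labels = watershed(-text_slice, markers=seeds, mask=region)

    boxes = []
    for obj in ndimage.find_objects(labels):
        if obj is None:
            continue
        ys, xs = obj
        # Axis-aligned box in slice coordinates; mapping back onto the
        # original image inverts the crop transform (not shown here).
        boxes.append((xs.start, ys.start, xs.stop, ys.stop))
    return boxes
```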
In one possible design, during model training the number of generated character bounding boxes is matched against the number of real characters to obtain a confidence, which is used to weigh the character bounding boxes generated during training. In a specific implementation, the confidence is used to weigh the generated character bounding boxes as follows:
for a word-level label instance w in the training data, let Tw and Lw respectively denote the region of the character bounding boxes generated for instance w and the number of its real characters; applying a watershed image segmentation algorithm yields Lcw, the number of generated character bounding boxes; the confidence score Sconf(w) of the instance's character bounding boxes is then calculated by the following formula,
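As in the disclosure above, the original formula is not reproduced in this text; the following reconstruction from the surrounding definitions is stated as an assumption:

$$S_{conf}(w) = \frac{L_w - \min\!\left(L_w,\; \lvert L_w - L_{cw} \rvert\right)}{L_w}$$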
where the min function returns the minimum of its arguments;
the confidence score map Sc(p) describing the complete image can then be expressed as follows,
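Again a reconstruction stated as an assumption, taking Tw as the word-level bounding-box region of instance w:

$$S_c(p) = \begin{cases} S_{conf}(w), & p \in T_w \\ 1, & \text{otherwise} \end{cases}$$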
where p ranges over the pixels of the corresponding real word-level bounding box.
The confidence score Sconf(w) measures the confidence of the generated labels, making the labels generated by the model during training more accurate, and the recognition performance of the whole network model can be monitored through the confidence score map Sc(p) describing the complete image.
In other words, conventional text detection methods, whether based on bounding-box regression or on pixel segmentation, often need a deeper network structure to enlarge the receptive field of the model in order to detect long or large text targets; the present method links character-level predictions in post-processing and therefore does not depend on an enlarged receptive field.
In a second aspect, the present embodiment provides a word detection apparatus based on single character and word-word connection relation prediction, including a memory, a processor, and a transceiver connected in sequence, where the memory is used for storing a computer program, the transceiver is used for sending and receiving messages, and the processor is used for reading the computer program and executing the method according to the first aspect.
For example, the memory may include, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Flash Memory, First-In-First-Out (FIFO) memory, and/or First-In-Last-Out (FILO) memory; the processor may be, but is not limited to, a microprocessor of the STM32F105 family; the transceiver may be, but is not limited to, a Wireless Fidelity (WiFi) transceiver, a Bluetooth transceiver, a General Packet Radio Service (GPRS) transceiver, and/or a ZigBee transceiver. In addition, the device may include, but is not limited to, a power module, a display screen, and other necessary components.
A third aspect of the present embodiment provides a computer-readable storage medium, on which instructions are stored, and when the instructions are executed on a computer, the method according to the first aspect or any one of the possible designs of the first aspect of the present embodiment is executed. The computer-readable storage medium refers to a carrier for storing data, and may include, but is not limited to, floppy disks, optical disks, hard disks, flash memories, flash disks and/or Memory sticks (Memory sticks), etc., and the computer may be a general purpose computer, special purpose computer, computer network, or other programmable device.
For the working process, the working details, and the technical effects of the computer-readable storage medium provided in this embodiment, reference may be made to the first aspect of the embodiment, which is not described herein again.
The present invention also provides a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method according to the first aspect of the embodiments; the computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus.
The embodiments described above are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device to perform the methods described in the embodiments or some portions of the embodiments.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. The character detection method based on single character and character connection relation prediction is characterized by comprising the following steps of:
training a neural network model; performing feature extraction through the neural network model to obtain a feature map, wherein the feature map comprises a character position score map Text-map and an inter-character connection relation score map Link-map, and performing post-processing on the Text-map and the Link-map, and the post-processing comprises the following steps:
converting the Text-map into a binary map Bm1 by setting a threshold λ1 for the Text-map,

converting the Link-map into a binary map Bm2 by setting a threshold λ2 for the Link-map,

initializing the binary map Bm1 to 0, wherein a position on the binary map Bm1 is set to 1 when the feature map value at the corresponding position is greater than the threshold λ1;

initializing the binary map Bm2 to 0, wherein a position on the binary map Bm2 is set to 1 when the feature map value at the corresponding position is greater than the threshold λ2;

performing connected component analysis on the obtained binary maps Bm1 and Bm2, thereby obtaining the pixels of all character regions;
and calculating the minimum circumscribed rectangle of each connected domain to realize the detection of the position of the characters from the picture.
2. The method of claim 1, wherein the model training for training the neural network model comprises:
performing label conversion, and converting the label in the form of the bounding box numerical value into a score map label with character level labels;
for each training image, generating a corresponding Text-map and a corresponding Link-map for each instance in the picture; and generating a specific Gaussian map unit on the Text-map for each character position in the original image, and generating a binary map unit representing example connection relation on the Link-map for each example in the original image.
3. The method of claim 2, wherein the Gaussian map unit is generated by:
generating a two-dimensional standard Gaussian feature map in advance;
calculating the perspective transformation matrix between the four corner coordinates of the standard Gaussian feature map and those of the character bounding box;

and transferring the standard Gaussian feature map into the bounding-box region through the perspective transformation.
4. The method as claimed in claim 1, wherein when the neural network model is trained, character-level labels are generated from the word-level labels of the training samples in a weakly supervised manner, the character-level labels being represented in the form of character bounding boxes.
5. The method of claim 4, wherein the method for generating the character-level labels comprises:
for a real training sample with word-level labels, carrying out forward inference on the sample by using a fully trained model and predicting the sample's Text-map; cutting out a feature slice of the text from the feature map according to the original word-level label; applying a watershed algorithm to the text feature slice to estimate the position of each character, represented in the form of a bounding box; and finally mapping the bounding boxes back onto the original image.
6. The word detection method based on single character and word connection relation prediction as claimed in claim 4, wherein: in the model training process, the number of the generated character bounding boxes is matched with the number of the real characters to obtain a confidence coefficient, and the confidence coefficient is used for measuring the character bounding boxes generated in the model training process.
7. The method for detecting words based on single character and word connection relation prediction as claimed in claim 6, wherein the method for measuring the character bounding box generated in the model training process by using confidence coefficient comprises:
for a word-level label instance w in the training data, Tw and Lw respectively denote the region of the character bounding boxes generated for instance w and the number of real characters; applying a watershed image segmentation algorithm yields Lcw, the number of generated character bounding boxes; the confidence score Sconf(w) of the instance's character bounding boxes is then calculated by the following formula,

where the min function returns the minimum of its arguments;

the confidence score map Sc(p) describing the complete image can then be expressed as follows,

where p ranges over the pixels of the corresponding real word-level bounding box.
8. The word detection method based on single character and word connection relation prediction as claimed in claim 1, wherein: the neural network model adopts a basic network DetNet.
9. A character detection device based on single character and character connection relation prediction is characterized in that: the system comprises a memory, a processor and a transceiver which are connected in sequence, wherein the memory is used for storing a computer program, the transceiver is used for transmitting and receiving messages, and the processor is used for reading the computer program and executing the method according to any one of claims 1 to 8.
10. A computer-readable storage medium, characterized in that: the computer-readable storage medium has stored thereon instructions that, when executed on a computer, perform the method of any of claims 1-8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010719772.6A CN111798480B (en) | 2020-07-23 | 2020-07-23 | Character detection method and device based on single character and character connection relation prediction |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111798480A true CN111798480A (en) | 2020-10-20 |
CN111798480B CN111798480B (en) | 2024-07-26 |
Family
ID=72828691
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010719772.6A Active CN111798480B (en) | 2020-07-23 | 2020-07-23 | Character detection method and device based on single character and character connection relation prediction |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111798480B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH05174187A (en) * | 1991-12-25 | 1993-07-13 | Matsushita Electric Ind Co Ltd | Character recognizing device |
CN108764036A (en) * | 2018-04-24 | 2018-11-06 | 西安电子科技大学 | A kind of handwritten form Tibetan language word fourth recognition methods |
WO2019192397A1 (en) * | 2018-04-04 | 2019-10-10 | 华中科技大学 | End-to-end recognition method for scene text in any shape |
CN111179251A (en) * | 2019-12-30 | 2020-05-19 | 上海交通大学 | Defect detection system and method based on twin neural network and by utilizing template comparison |
CN111242129A (en) * | 2020-01-03 | 2020-06-05 | 创新工场(广州)人工智能研究有限公司 | Method and device for end-to-end character detection and identification |
Non-Patent Citations (3)
Title |
---|
吴财贵; 唐权华: "Sensitive text detection in images based on deep learning" (基于深度学习的图片敏感文字检测), Computer Engineering and Applications (计算机工程与应用), vol. 51, no. 14, 14 October 2014 (2014-10-14) *
陈善雄; 韩旭; 林小渝; 刘云; 王明贵: "Character detection method for ancient Yi-script documents based on MSER and CNN" (基于MSER和CNN的彝文古籍文献的字符检测方法), Journal of South China University of Technology (Natural Science Edition) (华南理工大学学报(自然科学版)), no. 06, 15 June 2020 (2020-06-15) *
陈红洁; 顾国弟; 巢国平; 罗文彬; 林瑜; 李俊彦; 肖永来; 罗忆; 杨学晨; 姚良群; 汪明浩; 张华; 沈峰; 王全荣: "Vehicle license plate detection and analysis system based on video images" (基于视频图像的车辆号牌检测和分析系统), National Science and Technology Achievements Compilation 2010, batches 2-3 (2010年国家科技成果网科技成果汇编第二批-第三批), 30 November 2009 (2009-11-30) *
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112257708A (en) * | 2020-10-22 | 2021-01-22 | 润联软件系统(深圳)有限公司 | Character-level text detection method and device, computer equipment and storage medium |
CN112541491A (en) * | 2020-12-07 | 2021-03-23 | 沈阳雅译网络技术有限公司 | End-to-end text detection and identification method based on image character region perception |
CN112541491B (en) * | 2020-12-07 | 2024-02-02 | 沈阳雅译网络技术有限公司 | End-to-end text detection and recognition method based on image character region perception |
CN112580629A (en) * | 2020-12-23 | 2021-03-30 | 深圳市捷顺科技实业股份有限公司 | License plate character recognition method based on deep learning and related device |
CN113065547A (en) * | 2021-03-10 | 2021-07-02 | 国网河北省电力有限公司 | Character supervision information-based weak supervision text detection method |
CN113673338A (en) * | 2021-07-16 | 2021-11-19 | 华南理工大学 | Natural scene text image character pixel weak supervision automatic labeling method, system and medium |
CN113673338B (en) * | 2021-07-16 | 2023-09-26 | 华南理工大学 | Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels |
CN113837168A (en) * | 2021-09-22 | 2021-12-24 | 易联众智鼎(厦门)科技有限公司 | Image text detection and OCR recognition method, device and storage medium |
CN114708591A (en) * | 2022-04-19 | 2022-07-05 | 复旦大学 | Document image Chinese character detection method based on single character connection |
CN114708591B (en) * | 2022-04-19 | 2024-10-15 | 复旦大学 | Document image Chinese character detection method based on single word connection |
Also Published As
Publication number | Publication date |
---|---|
CN111798480B (en) | 2024-07-26 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |