CN107093172B - Character detection method and system

Character detection method and system

Info

Publication number
CN107093172B
CN107093172B (application number CN201610091568.8A)
Authority
CN
China
Prior art keywords
image
color
blocks
connected blocks
merging
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610091568.8A
Other languages
Chinese (zh)
Other versions
CN107093172A (en)
Inventor
徐昆
郭晓威
黄飞跃
郑宇飞
张惜今
卢艺帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University and Tencent Technology (Shenzhen) Co Ltd
Priority to CN201610091568.8A
Priority to PCT/CN2017/073407 (published as WO2017140233A1)
Publication of CN107093172A
Application granted
Publication of CN107093172B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10004Still image; Photographic image
    • G06T2207/10008Still image; Photographic image from scanner, fax or copier
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30176Document

Abstract

The invention discloses a character detection method and system. The method comprises: performing color-reduction processing on each of the three color channels of a target image to obtain a color-reduced image, and converting the target image into a binary image; merging connected blocks with the same color in the color-reduced image, and merging connected blocks with the same color in the binary image; merging, by connection in the vertical and horizontal directions respectively, the connected blocks of each color channel of the color-reduced image and the connected blocks in the binary image, to obtain candidate character regions in the target image; and extracting a specific region at the position of each candidate character region on the target image, and judging whether the extracted specific region contains a character row or character column based on a comparison of the probability that the extracted specific region contains a character region against a preset probability threshold. By implementing the invention, text in an image can be detected accurately.

Description

Character detection method and system
Technical Field
The present invention relates to character detection technology for images, and in particular to a character detection method and a character detection system.
Background
A document image is a document in image format, i.e., a document that has been converted into an image by some means (such as scanning) for electronic reading; typical examples of document images are Portable Document Format (PDF) images and DjVu-format images.
The current text detection technology can detect text in a document image (locate a text-bearing area in the image), and perform text recognition based on the detected text-bearing area.
Images in the general sense include not only document images but also non-document images (e.g., images uploaded by users to a web album), which may be Joint Photographic Experts Group (JPG) images, Bitmap (BMP) images, Tagged Image File Format (TIFF) images, Graphics Interchange Format (GIF) images, Exchangeable Image File Format (EXIF) images, and the like.
If the characters in a non-document-format image can be recognized, accurate semantic information can be obtained, helping users retrieve and manage their images. Detecting the characters in the image is a necessary preliminary step to recognizing them. Existing character detection techniques mostly use manually specified features to judge whether an image contains characters, and mostly target English text; because Chinese and English differ significantly in character form and structure, the detection accuracy achieved for Chinese in images differs greatly from that achieved for English, and it is difficult to meet the requirements of practical applications.
Disclosure of Invention
The embodiment of the invention provides a character detection method and a character detection system, which can accurately detect a text in an image.
The technical scheme of the embodiments of the invention is realized as follows:
In a first aspect, an embodiment of the present invention provides a text detection method, where the method includes:
performing color-reduction processing on each of the three color channels of a target image to obtain a color-reduced image, and converting the target image into a binary image;
merging the connected blocks with the same color in the color reduction image, and merging the connected blocks with the same color in the binary image;
merging, by connection in the vertical and horizontal directions respectively, the connected blocks of each color channel of the color-reduced image and the connected blocks in the binary image, to obtain candidate character regions in the target image;
and extracting a specific region at the position on the target image corresponding to the candidate character region, and judging whether the extracted specific region contains character rows or character columns or not based on the comparison result of the probability of containing the character region in the extracted specific region and a preset probability threshold.
Preferably, the color-reducing processing of each image in the three color channels of the target image to obtain a color-reduced image includes:
quantizing each channel of the red, green and blue channels of the target image by K levels respectively to obtain K level intervals;
and mapping the brightness of each pixel of the target image in the RGB three-color channels into the corresponding channel's quantized interval, wherein K is an integer and 255 > K > 1.
Preferably, the merging connected blocks with the same color in the color-reduced image and the merging connected blocks with the same color in the binary image includes:
taking each pixel in the color-reduced image and in the binary image as a separate connected block, establishing a union-find (disjoint-set) structure over the pixels, and performing the following:
if the color of a pixel is the same as the color of any one of its 8 adjacent pixels, merging the connected blocks to which the two adjacent same-colored pixels belong into the same connected block;
and judging the pixel area of each connected block; if the pixel area of a connected block is smaller than a pixel area threshold, merging it into an adjacent connected block and setting its color as the color of the connected block into which it is merged.
Preferably, after the merging the connected blocks with the same color in the color-reduced image and the merging the connected blocks with the same color in the binary image, the method further comprises:
discarding connected blocks which are in the color reduction image and in the binary image and accord with preset characteristics; the preset features include at least one of:
the area of the connected blocks is smaller than the pixel area threshold value;
any side length of the connected block is larger than a first preset proportion of the corresponding image side length;
and any side length in the connected blocks is larger than the frame length threshold, and the ratio of the pixel area to the bounding box area is smaller than the ratio threshold.
Preferably, after the merging the connected blocks with the same color in the color-reduced image and the merging the connected blocks with the same color in the binary image, the method further comprises:
merging the connected blocks of each color channel in the color-reduced image into new connected blocks based on their positional relationships, and merging the connected blocks in the binary image into new connected blocks based on their positional relationships; wherein at least one of the following processes is executed:
merging the connected blocks with the distance smaller than the distance threshold;
taking the maximum value of the average values of the respective lengths and the widths of any two connected blocks, and if the maximum value meets a preset condition, combining the two selected connected blocks;
combining connected blocks of which the bounding boxes are crossed and the crossed parts accord with preset crossed characteristics;
and merging the connected blocks of which the bounding boxes are aligned and meet a preset alignment merging rule.
Preferably, the merging, by connection in the vertical and horizontal directions respectively, of the connected blocks of each color channel of the color-reduced image and the connected blocks in the binary image to obtain candidate text regions in the target image includes:
combining in the horizontal direction, combining in the vertical direction and combining in the horizontal direction in sequence based on a connection combination rule; wherein the connection merging rule comprises:
and connecting the two selected connected blocks to form a new connected block according to at least one of the following conditions:
the smaller of the center distance and the edge distance between the bounding boxes of the two connected blocks in the reference axial direction is less than a first preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction;
the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axial direction is less than a second preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes perpendicular to the reference axial direction;
the difference between the side lengths of the bounding boxes of the two connected blocks in the reference axial direction is less than a third preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction.
Preferably, the extracting a specific region from the position of the candidate text region on the target image, and determining whether the extracted specific region includes a text row or a text column based on a comparison result between a probability that the extracted specific region includes a text region and a preset probability threshold includes:
obtaining connected bounding boxes from the color-reduced image and the binary image, extracting the corresponding specific regions from the target image, sliding a window with a specific sliding-window step over each region and sending each window into a convolutional neural network classifier for discrimination, and obtaining the probability that each sliding window contains characters;
averaging the probabilities of characters contained in the sliding window to obtain the probability that the candidate character area comprises character rows or character columns;
and if the obtained probability is greater than a preset probability threshold, determining that the character row or the character column exists in the specific area.
In a second aspect, an embodiment of the present invention provides a text detection system, where the system includes:
a color-reduction binary processing unit, configured to perform color-reduction processing on each of the three color channels of a target image to obtain a color-reduced image, and to convert the target image into a binary image;
a first merging unit, configured to merge connected blocks with the same color in the color-reduced image, and merge connected blocks with the same color in the binary image;
a second merging unit, configured to merge a connected block of each color channel of the three color channels of the color-reduced image and a connected block in the binary image in a connected manner in the vertical and horizontal directions, respectively, to obtain a candidate text region in the target image;
and the judging unit is used for extracting a specific area from the position, corresponding to the candidate character area, of the target image and judging whether the extracted specific area contains character rows or character columns or not based on the comparison result of the probability of containing the character area in the extracted specific area and a preset probability threshold.
Preferably, the color-reducing binary processing unit is further configured to quantize each of the red, green, and blue channels of the target image into K levels respectively to obtain K levels of intervals;
and mapping the brightness of each pixel of the target image in the RGB three-color channels into the corresponding channel's quantized interval, wherein K is an integer and 255 > K > 1.
Preferably, the first merging unit is further configured to take each pixel in the color-reduced image and in the binary image as a separate connected block, establish a union-find (disjoint-set) structure over the pixels, and perform the following processing:
the first merging unit is further configured to merge two adjacent connected blocks to which pixels with the same color belong into the same connected block if the color of the pixel is the same as that of any one of the 8 adjacent pixels
The first merging unit is further configured to determine a pixel area of each connected block, merge the connected blocks into a connected block adjacent to the connected block if the pixel area of the connected block is smaller than a pixel area threshold, and set the color of the connected block as the color of the merged connected block.
Preferably, the system further comprises:
a discarding processing unit, configured to discard connected blocks in the color-reduced image and connected blocks in the binary image that meet a preset feature after the first merging unit merges the connected blocks in the color-reduced image that have the same color and merges the connected blocks in the binary image that have the same color; the preset features include at least one of:
discarding connected blocks of which the area is smaller than a pixel area threshold value in the connected blocks;
discarding connected blocks in which any side length is larger than a first preset proportion of the corresponding image side length;
and discarding the connected blocks of which any side length is larger than the frame length threshold value and the ratio of the pixel area to the bounding box area is smaller than the ratio threshold value.
Preferably, the system further comprises:
a fourth merging unit, configured to, after the first merging unit merges the connected blocks with the same color in the color-reduced image and merges the connected blocks with the same color in the binary image, merge the connected blocks of each color channel in the color-reduced image into new connected blocks based on their positional relationships, and merge the connected blocks in the binary image into new connected blocks based on their positional relationships;
the fourth merging unit is further configured to perform at least one of the following processes:
merging the connected blocks with the distance smaller than the distance threshold;
taking the maximum value of the average values of the respective lengths and the widths of any two connected blocks, and if the maximum value meets a preset condition, combining the two selected connected blocks;
combining connected blocks of which the bounding boxes are crossed and the crossed parts accord with preset crossed characteristics;
and merging the connected blocks of which the bounding boxes are aligned and meet a preset alignment merging rule.
Preferably, the second merging unit is further configured to sequentially perform merging in the horizontal direction, merging in the vertical direction, and merging in the horizontal direction based on a connection merging rule; wherein the connection merging rule comprises:
and connecting the two selected connected blocks to form a new connected block according to at least one of the following conditions:
the smaller of the center distance and the edge distance between the bounding boxes of the two connected blocks in the reference axial direction is less than a first preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction;
the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axial direction is less than a second preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes perpendicular to the reference axial direction;
the difference between the side lengths of the bounding boxes of the two connected blocks in the reference axial direction is less than a third preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction.
Preferably, the judging unit is further configured to obtain connected bounding boxes from the color-reduced image and the binary image, extract the corresponding specific regions from the target image, slide a window with a specific sliding-window step over each region and send each window into a convolutional neural network classifier for discrimination, and obtain the probability that each sliding window contains characters;
the judging unit is further configured to average probabilities of the characters included in the sliding window to obtain a probability that the candidate character region includes a character row or a character column;
the judging unit is further configured to judge that a text row or a text column exists in the specific region if the obtained probability is greater than a preset probability threshold.
According to the method, the image is divided into connected blocks according to color; the connected blocks are potential bounding boxes containing characters. The probability that each bounding box contains a character row (or character column) is then verified with a convolutional neural network over sliding windows, and when the probability is greater than a preset probability threshold the bounding box is judged to contain a character row (or character column). This processing suits both document images and non-document images, so the text in an image can be detected accurately.
Drawings
FIG. 1 is a first flowchart of a text detection method according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a second exemplary embodiment of a text detection method;
fig. 3 to 6 are schematic diagrams illustrating detection results of the text detection method according to the embodiment of the invention;
FIGS. 7-8 are schematic diagrams of convolutional neural networks in an embodiment of the present invention;
fig. 9 is an alternative structural diagram of the text detection system according to the embodiment of the invention.
Detailed Description
Embodiments of the present invention provide a method and system for detecting text in images. The images include not only conventional document images, such as PDF-format images, but also non-document images, such as Joint Photographic Experts Group (JPG) images, Bitmap (BMP) images, Tagged Image File Format (TIFF) images, Graphics Interchange Format (GIF) images, and Exchangeable Image File Format (EXIF) images.
The text detection system disclosed by the embodiments of the invention locates the regions bearing text in an image by implementing the text detection method. The image in which the system detects text can be a document image, such as a PDF document, or a non-document image, such as a JPG, BMP, TIFF, GIF, or EXIF image. Sources of such images mainly include screen captures from electronic equipment (such as smart phones, tablet computers, and notebook computers), scanned electronic versions of printed matter such as posters and magazines, and other digital images containing printed Chinese characters.
Referring to fig. 1, in the embodiment of the present invention: in step 101, color-reduction processing is performed on each of the three color channels of a target image to obtain a color-reduced image, and the target image is converted into a binary image; in step 102, the connected blocks with the same color in the color-reduced image are merged, and the connected blocks with the same color in the binary image are merged; in step 103, the connected blocks of each color channel of the color-reduced image and the connected blocks in the binary image are merged by connection in the vertical and horizontal directions respectively, to obtain candidate text regions in the target image; in step 104, a specific region is extracted at the position of each candidate text region on the target image, and whether the extracted region contains a text row or text column is judged based on comparing the probability that it contains a text region against a preset probability threshold.
It can be seen that the text detection system locates text lines (or text columns; for example, lines of Chinese characters, lines of letters such as English letters, numbers, and symbols, or lines formed by any combination of these) in the images shown in fig. 3 to 6 by clustering and layering the image's colors, merging and filtering connected blocks, and discriminating with a deep convolutional neural network, so that the characters in a text line can then be recognized based on the located line.
The present invention will be described in further detail below with reference to the accompanying drawings and specific embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
Example one
Referring to fig. 2, the method for detecting text by the text detection system of the embodiment includes the following steps:
step 201, performing color reduction processing on the target image to obtain a color reduction image of the target image.
A target image to be detected is input, and each of its red, green and blue (RGB) channels is quantized into K levels (K is an integer and 255 > K > 1, for example 4). That is, the 0-255 luminance range of each RGB channel is divided (for example, uniformly) into K intervals (bins), reducing the 256 luminance levels 0-255 to the K levels 0 to K-1, and the luminance of each pixel of the target image in each RGB channel is mapped into the corresponding channel's bin. Since each RGB channel originally has 256 luminance levels (0-255), the target image can have 256^3 colors; after each channel's luminance is divided into K intervals, the image has only K^3 colors (far fewer than 256^3). A color-reduced image f1 is thus obtained.
Taking K = 2, each channel has two quantized luminance levels, 0 and 1: luminances 0-127 of each channel map to quantized level 0, and luminances 128-255 map to quantized level 1. If the luminance of a pixel of the target image across the RGB channels is (0, 122, 255), its luminance after color-reduction processing is (0, 0, 1). This luminance mapping is performed for every pixel in the target image.
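The quantization in step 201 amounts to a per-channel bin mapping. A minimal Python/NumPy sketch follows; the uniform bin width of 256/K is an assumption (the text only requires dividing each channel's 0-255 range into K intervals), and the sketch reproduces the K = 2 example above.

```python
import numpy as np

def reduce_colors(image_rgb: np.ndarray, k: int = 4) -> np.ndarray:
    """Map each RGB channel's 0-255 luminance into one of k quantized levels."""
    assert 1 < k < 255, "the text requires an integer K with 255 > K > 1"
    bin_width = 256.0 / k            # assumed uniform division into K bins
    return (image_rgb.astype(np.float32) // bin_width).astype(np.uint8)

# Example from the text: with K = 2, the pixel (0, 122, 255) maps to (0, 0, 1).
pixel = np.array([[[0, 122, 255]]], dtype=np.uint8)
print(reduce_colors(pixel, k=2))  # [[[0 0 1]]]
```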
Characters in an image usually fall into two cases: 1) the characters are monochromatic; or 2) the luminance of the characters differs obviously from that of the surrounding area. For these two cases, step 201 achieves the technical effect of making the characters in the color-reduced image take one of the K^3 colors.
Step 202, local binarization processing is carried out on the target image to obtain a binary image of the target image.
The target image is converted into a grayscale image (a single gray channel), and locally adaptive binarization is performed on the grayscale image: the image is divided into N windows, and for each of the N windows the pixels within it are split into two classes according to a uniform threshold T, yielding a binary image f2, where T is the Gaussian-weighted sum over a window of preset size (e.g., 25 × 25 pixels) centered on the pixel.
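A sketch of step 202's binarization, assuming OpenCV's per-pixel Gaussian adaptive threshold as a concrete realization of the Gaussian-weighted threshold T described above; the constant offset C = 0 is an assumption, as the text does not specify it.

```python
import cv2

def binarize(image_bgr):
    """Local adaptive binarization of the target image (step 202)."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)  # single gray channel
    return cv2.adaptiveThreshold(
        gray, 255,
        cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # threshold T is a Gaussian-weighted sum
        cv2.THRESH_BINARY,
        25,   # 25 x 25 pixel window centered on each pixel, as in the text
        0)    # constant offset C; assumed 0, not specified in the text
```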
Characters in an image usually fall into two cases: 1) the characters are monochromatic; or 2) the luminance of the characters differs obviously from that of the surrounding area. For these two cases, step 202 achieves the technical effect of making the characters in the binary image either black or white.
The pixels corresponding to characters in the color-reduced image of step 201 and in the binary image of step 202 have uniform colors; in step 203, each pixel is taken as a connected block and the connected blocks with the same color are merged, so that the characters become connected.
And step 203, identifying connected blocks in the color reduction image and the binary image, merging the connected blocks with the same color in the color reduction image, and merging the connected blocks with the same color in the binary image.
For the connected blocks of each of the RGB color channels of the color-reduced image f1, and for the connected blocks of the binary image f2 (a single grayscale channel), the following processing is performed:
1) Each pixel is taken as a separate connected block (i.e., a connected subgraph, a concept from graph theory: each pixel of the image is taken as a vertex of an undirected graph, adjacent pixels are joined by an edge, and the whole image is treated as an undirected graph).
2) A union-find (disjoint-set) structure is built over the pixels; union-find is a classical algorithm for performing connected-block merging efficiently.
3) The color-reduced image f1 and the binary image f2 are traversed, and each pixel is processed as follows:
Traversing the pixels of the color-reduced image f1: for a given pixel, if the color of any one of its 8 adjacent pixels (the 8 neighbors above, below, left, right, and at the ends of the 2 diagonals) is the same (the color in an RGB channel means the pixel's luminance value in that channel; the color in a grayscale image means the pixel's gray value), the connected blocks to which the two adjacent same-colored pixels belong are merged into the same connected block. Then each connected block is traversed and its pixel area judged: if the pixel area of connected block k (k ranging over the number of connected blocks) is smaller than the pixel area threshold (4 pixels), block k is merged into an adjacent connected block, and its color is set to the color of the block into which it is merged.
For example, for a pixel i in the color-reduced image f1 (where I1 ≥ i ≥ 1 and I1 is the number of pixels in f1), consider its luminance in some channel X of the RGB channels (X being any one of the three, say the R channel): if pixel i and any pixel j among its 8 adjacent pixels (the 8 neighbors above, below, left, right, and at the ends of the 2 diagonals) have the same luminance in that channel (the R channel in this example), the connected block to which pixel i belongs and the connected block to which pixel j belongs are merged into one connected block. Then each connected block is traversed and its pixel area judged: if the pixel area of connected block k (k ranging over the number of connected blocks) is smaller than the threshold (4 pixels), block k is merged into an adjacent connected block, and the luminance of the pixels in block k is set to that of the block into which it is merged.
For another example, if a pixel i in the grayscale map of the target image (where I2 ≥ i ≥ 1 and I2 is the number of pixels in the grayscale map) has the same color (gray value) as a pixel j among its 8 adjacent pixels (the 8 neighbors above, below, left, right, and at the ends of the 2 diagonals), the connected blocks to which the adjacent pixels i and j belong are merged into the same connected block. Then each connected block is traversed and its pixel area judged: if the pixel area of connected block k is smaller than the threshold (4 pixels), block k is merged into an adjacent connected block, and the gray value of the pixels in block k is set to that of the pixels in the block into which it is merged.
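The same-color merging of step 203 can be sketched with union-find as described. The sketch below operates on a single channel (one 2-D array) and covers only the union of same-colored 8-neighbors; the follow-up merging of blocks smaller than 4 pixels into a neighboring block is omitted for brevity.

```python
import numpy as np

def connected_blocks(channel: np.ndarray) -> np.ndarray:
    """Label same-colored 8-connected blocks of one channel via union-find."""
    h, w = channel.shape
    parent = list(range(h * w))      # every pixel starts as its own block

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Scanning every pixel with these 4 offsets covers all 8-adjacencies once.
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, -1), (1, 0), (1, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and channel[y, x] == channel[ny, nx]:
                    union(y * w + x, ny * w + nx)

    # Labels are union-find roots (not consecutive integers).
    labels = np.fromiter((find(i) for i in range(h * w)), dtype=np.int64)
    return labels.reshape(h, w)
```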
Step 203 merges the pixels belonging to the same character (or at least, for a Chinese character, to the same stroke) into one connected block for subsequent processing.
A subsequent step 204 discards connected blocks in the color-reduced image and in the binary image that meet the preset features (where the preset features correspond to features of non-text regions in the image).
Step 204, after merging the connected blocks in the color-reduced image and the binary image, discarding the connected blocks in the color-reduced image and the binary image that meet the preset features (where the preset features correspond to the features of the non-character region in the image).
The connected blocks of each color channel in the color reduction image f1 and the connected blocks of the binary image f2 are respectively subjected to at least one of the following processes:
1) Discarding connected blocks whose area is still smaller than the pixel area threshold (e.g., 4 pixels); such connected blocks are regarded as not bearing characters;
2) Discarding the connected blocks corresponding to background colors: any side of the connected block is longer than a first preset proportion (e.g., 0.8 times) of the corresponding image side length;
3) Discarding the connected blocks corresponding to frames or borders: any side of the connected block is longer than the frame length threshold (e.g., 65 pixels), and the ratio of the connected block's pixel area to its bounding-box area is smaller than the ratio threshold (e.g., 0.22). The bounding box of a connected block is the smallest rectangle that contains all the pixels of the block (its sides are parallel to the x and y axes of the image, so it is uniquely determined). These three discard rules are sketched below.
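A sketch of the three discard rules, using the example thresholds from the text (4-pixel minimum area, 0.8 times the image side, 65-pixel frame threshold, 0.22 fill ratio); the Block container is a hypothetical helper, not part of the patent.

```python
from dataclasses import dataclass

@dataclass
class Block:
    area: int    # number of pixels in the connected block
    box_w: int   # bounding-box width
    box_h: int   # bounding-box height

def keep_block(b: Block, img_w: int, img_h: int) -> bool:
    if b.area < 4:                                        # rule 1: too small
        return False
    if b.box_w > 0.8 * img_w or b.box_h > 0.8 * img_h:    # rule 2: background
        return False
    fill = b.area / float(b.box_w * b.box_h)
    if (b.box_w > 65 or b.box_h > 65) and fill < 0.22:    # rule 3: frame/border
        return False
    return True
```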
Optionally, step 205 may further be performed, given that an image may contain characters whose strokes are not connected, to merge the disconnected strokes of such characters (for example, certain Chinese characters, or the English letters i and j).
Step 205: merging the connected blocks of each color channel in the color-reduced image into new connected blocks based on their positional relationships (such as distance and intersection), and merging the connected blocks in the binary image into new connected blocks based on their positional relationships (such as distance and intersection).
1) Merging the connected blocks whose distance is smaller than the distance threshold (the distance is the Chebyshev distance d between the center points of the bounding boxes of the two connected blocks).
2) Taking the maximum of the averages of each connected block's length and width, ms = max((a1 + b1)/2.0, (a2 + b2)/2.0), where a1 and b1 are the length and width of the bounding box of the first connected block and a2 and b2 are the length and width of the bounding box of the second connected block, and taking 0.4·ms as the distance threshold; then, if the preset condition is met (0.4·ms < 1, or 1 < 0.4·ms < 3 with distance d < 3), merging the two selected connected blocks.
3) For the connected blocks of each of the RGB color channels of the color-reduced image f1 and the connected blocks of the binary image f2, merging the connected blocks whose bounding boxes intersect and whose intersection meets the preset intersection features. For example, if the bounding boxes of two connected blocks intersect, the area of the intersection is greater than the preset 10% of the area of the smaller of the two bounding boxes, and the area of the intersection is less than 10% of the image area, the two connected blocks whose bounding boxes intersect are merged.
4) Merging the connected blocks whose bounding boxes are aligned and which satisfy the preset alignment merging rule. Alignment means the bounding boxes of the connected blocks are aligned in the horizontal or vertical direction, i.e.: 1) the bounding boxes of the two connected blocks have consistent heights and consistent vertical positions; or 2) the bounding boxes of the two connected blocks have consistent widths and consistent horizontal positions.
An example of an alignment merging rule: two aligned connected blocks are merged, i.e., their bounding boxes are replaced by the smallest bounding box containing both, if the area of the merged bounding box exceeds the sum of the two original bounding-box areas by less than an area-increment proportion threshold (e.g., 10%) and the area of the merged bounding box is less than a proportion threshold (e.g., 10%) of the image area, as sketched below.
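A sketch of the alignment-merge test of rule 4. Reading "area increment" as the merged box's area minus the sum of the two original box areas, relative to that sum, is an interpretation of the text; both 10% thresholds follow the examples given.

```python
def merge_aligned(box_a, box_b, img_area, grow_thresh=0.10, img_frac_thresh=0.10):
    """Boxes are (x0, y0, x1, y1); return the merged box if the rule passes."""
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    merged = (min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
              max(box_a[2], box_b[2]), max(box_a[3], box_b[3]))
    original = area(box_a) + area(box_b)
    grown = area(merged) - original       # how much empty area merging adds
    if grown < grow_thresh * original and area(merged) < img_frac_thresh * img_area:
        return merged
    return None
```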
In step 206, the connected blocks of each color channel of the RGB three-color channels of the color-reduced image f1 and the connected blocks in the binary image f2 are respectively combined in a connected manner in the vertical and horizontal directions, so as to obtain candidate text regions (including text row regions and text column regions) in the image.
The aim is to connect single characters (such as Chinese characters) into text rows or text columns. Based on the connection merging rule (the same rule, described later, is used for merging in the horizontal direction and in the vertical direction), the connected blocks are first merged in the horizontal direction, then in the vertical direction, and finally in the horizontal direction again.
Characters arranged horizontally are generally more common in images than characters arranged vertically. Therefore, in step 206 the connected blocks are merged in the horizontal direction first, guaranteeing that horizontally arranged characters are merged first and reducing the chance that horizontal text is wrongly merged vertically; the blocks are then merged in the vertical direction, so that characters that do not satisfy the horizontal merging rule but satisfy the vertical one are merged. Because this process can change the bounding boxes of the connected blocks, new pairs of bounding boxes satisfying the horizontal rule may appear, so the connected blocks are finally merged in the horizontal direction once more.
One example of a connection merging rule is that the bounding boxes of two connected blocks are connected into a new connected block when at least one of the following conditions is satisfied (a sketch of the horizontal case follows this list):
1) the smaller of the center distance (the distance between the center coordinates of the two bounding boxes along the reference axial direction, a horizontal or vertical axis) and the edge distance (the distance between the edge coordinates of the two bounding boxes along the reference axial direction) is less than a first preset proportion (e.g., 0.15 times) of the smallest of the two bounding boxes' side lengths along the reference axial direction;
Since the coordinate ranges of the two bounding boxes along the reference axial direction may be disjoint or partially overlapping, taking the smaller of the center distance and the edge distance characterizes the distance between the bounding boxes of the two connected blocks along the reference axial direction most accurately.
2) the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axial direction is less than a second preset proportion (e.g., two times) of the smallest of the two bounding boxes' side lengths perpendicular to the reference axial direction;
3) the difference between the side lengths of the bounding boxes of the two connected blocks along the reference axial direction is less than a third preset proportion (e.g., 30%) of the smallest of the two bounding boxes' side lengths along the reference axial direction.
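A sketch of the connection-merge test of step 206 for the horizontal case (reference axis = x), with boxes given as (x0, y0, x1, y1). The 0.15, 2.0, and 0.30 proportions follow the examples in conditions 1) to 3), and, per the text, the conditions are treated as alternatives ("at least one").

```python
def connectable_horizontally(a, b, p1=0.15, p2=2.0, p3=0.30):
    """Test whether two boxes (x0, y0, x1, y1) may be connected along x."""
    wa, wb = a[2] - a[0], b[2] - b[0]   # side lengths along the reference axis
    ha, hb = a[3] - a[1], b[3] - b[1]   # side lengths perpendicular to it

    center_dist = abs((a[0] + a[2]) - (b[0] + b[2])) / 2.0
    edge_dist = max(b[0] - a[2], a[0] - b[2], 0)   # 0 when x-ranges overlap
    cond1 = min(center_dist, edge_dist) < p1 * min(wa, wb)   # condition 1)

    v_gap = max(b[1] - a[3], a[1] - b[3], 0)       # 0 when y-ranges overlap
    cond2 = v_gap < p2 * min(ha, hb)               # condition 2)

    cond3 = abs(wa - wb) < p3 * min(wa, wb)        # condition 3)
    return cond1 or cond2 or cond3
```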
Step 207: extracting specific regions at the positions, on the target image, of the bounding boxes corresponding to the connected blocks (i.e., the candidate text regions potentially containing text rows or text columns), and, for each extracted specific region, judging whether it contains a text row or text column based on the probability that it contains one.
Through the foregoing steps 201 to 206, new bounding boxes are obtained from the connected bounding boxes of the color-reduced image f1 and the binary image f2, i.e., the union of the bounding boxes connected into a row; each is a rectangular region potentially containing a text row or text column (a candidate text region). A region of interest (ROI, i.e., the aforementioned specific region: a region to be processed, outlined in the target image I by a box, circle, ellipse, irregular polygon, or the like) is extracted from the target image I at the position of each candidate region. With a specific sliding-window step, for example taking the shortest side length S of the region as the window side length and 0.5S as the step, each window is sent into a pre-trained convolutional neural network (CNN) classifier for discrimination, yielding the probability p_w that the window contains text. All p_w are averaged to obtain the probability p_l that the candidate region contains a text row (or text column); if p_l is greater than the preset probability threshold (0.5), it is judged that a text row (or text column) exists in the region of interest.
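A sketch of step 207's sliding-window verification. The classifier argument stands in for the pre-trained CNN and is assumed to map a square crop to the probability p_w that it contains text; the window side S and step 0.5S follow the example above.

```python
import numpy as np

def region_text_probability(roi: np.ndarray, classifier) -> float:
    """Average the per-window text probabilities p_w over the region (p_l)."""
    h, w = roi.shape[:2]
    s = min(h, w)                   # shortest side = window side length S
    step = max(1, int(0.5 * s))     # sliding-window step of 0.5 * S
    probs = [classifier(roi[y:y + s, x:x + s])
             for y in range(0, h - s + 1, step)
             for x in range(0, w - s + 1, step)]
    return sum(probs) / len(probs) if probs else 0.0

# A region is judged to contain a text row/column when
# region_text_probability(roi, cnn) exceeds the preset threshold (0.5).
```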
In step 208, overlapping bounding boxes are merged into single bounding boxes and output as text-bearing regions.
Steps 201 to 204 ensure the positional accuracy of the bounding boxes (i.e., the potential text regions): even when a bounding box contains some other image element instead of a text row (or text column), the elements not corresponding to text rows can be discarded accurately. The probability-threshold filtering in step 207 ensures that a bounding box passing the filter contains a text row (or text column) and has a fairly accurate position, so all overlapping bounding boxes can simply be merged into one bounding box and output, without non-maximum suppression.
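A sketch of step 208's merging of overlapping boxes: since the filtered boxes are already positionally accurate, transitively overlapping boxes are simply united, with no non-maximum suppression. Treating boxes that merely touch as non-overlapping is an implementation choice.

```python
def merge_overlapping(boxes):
    """Union all transitively overlapping boxes; boxes are (x0, y0, x1, y1)."""
    boxes = list(boxes)
    changed = True
    while changed:                  # repeat until no pair of boxes overlaps
        changed = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                a, b = boxes[i], boxes[j]
                if a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]:
                    boxes[i] = (min(a[0], b[0]), min(a[1], b[1]),
                                max(a[2], b[2]), max(a[3], b[3]))
                    del boxes[j]
                    changed = True
                    break
            if changed:
                break
    return boxes
```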
Training a convolutional neural network:
Chinese characters in the collected data (images containing characters) are annotated. The output of step 206 (before filtering by the convolutional neural network) is then screened, selecting the parts close to the annotations; the bounding boxes are cut into sliding windows following the method of step 207, windows belonging to characters and windows not belonging to characters are separated manually, and all windows are scaled to 32 × 32 pixels.
These windows are assembled into training and validation data for training the neural networks shown in fig. 7 and 8, each sample being cropped to 27 × 27 pixels about a random center and flipped randomly. Training uses stochastic gradient descent (SGD) with a batch size of 50, weight decay (weight_decay) of 0.0005, and momentum of 0.9; the learning rate is computed as lr = base_lr · (1 + 0.0001 · iter)^(-0.75), where iter is the iteration number, and base_lr is 0.001 for the first 100,000 iterations and 0.0001 thereafter.
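As a minimal sketch, the learning-rate schedule above can be written out as follows; it matches an inverse-decay ("inv") policy with gamma = 0.0001 and power = 0.75, and the 100,000-iteration switch of base_lr follows the text.

```python
def learning_rate(iteration: int) -> float:
    """lr = base_lr * (1 + 0.0001 * iter) ** (-0.75), per the training setup."""
    base_lr = 0.001 if iteration < 100_000 else 0.0001
    return base_lr * (1 + 0.0001 * iteration) ** (-0.75)
```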
An embodiment of the present invention provides a text detection system, which is shown in fig. 9 and includes:
a color-reduction binary processing unit 100, configured to perform color-reduction processing on each of the three color channels of a target image to obtain a color-reduced image, and to convert the target image into a binary image;
a first merging unit 200, configured to merge connected blocks with the same color in the color-reduced image, and merge connected blocks with the same color in the binary image;
a second merging unit 300, configured to merge a connected block of each color channel of the three color channels of the color-reduced image and a connected block in the binary image in a connected manner in the vertical and horizontal directions, respectively, to obtain a candidate text region in the target image;
a determining unit 400, configured to extract a specific region from a position on the target image corresponding to the candidate text region, and determine whether the extracted specific region includes a text row or a text column based on a comparison result between a probability that the extracted specific region includes a text region and a preset probability threshold.
Preferably, the color-reduction binary processing unit 100 is further configured to quantize each of the red, green, and blue channels of the target image into K levels respectively, obtaining K level intervals;
and mapping the brightness of each pixel of the target image in the RGB three-color channels into the corresponding channel's quantized interval, wherein K is an integer and 255 > K > 1.
Preferably, the first merging unit 200 is further configured to take each pixel in the color-reduced image and in the binary image as a separate connected block, establish a union-find (disjoint-set) structure over the pixels, and perform the following processing:
the first merging unit 200 is further configured to merge two adjacent connected blocks with the same color into the same connected block if the color of the pixel is the same as that of any one of the 8 adjacent pixels
The first merging unit 200 is further configured to determine a pixel area of each connected block, merge the connected block into a connected block adjacent to the connected block if the pixel area of the connected block is smaller than a pixel area threshold, and set a color of the connected block as a color of the merged connected block.
Preferably, the system further comprises:
a discarding processing unit 500, configured to discard connected blocks in the color-reduced image and connected blocks in the binary image that meet a preset feature after the first merging unit 200 merges connected blocks in the color-reduced image that have the same color and merges connected blocks in the binary image that have the same color; the preset features include at least one of:
discarding connected blocks of which the area is smaller than a pixel area threshold value in the connected blocks;
discarding connected blocks in which any side length is larger than a first preset proportion of the corresponding image side length;
and discarding the connected blocks of which any side length is larger than the frame length threshold value and the ratio of the pixel area to the bounding box area is smaller than the ratio threshold value.
Preferably, the system further comprises
a third merging unit 600, configured to, after the first merging unit 200 merges the connected blocks with the same color in the color-reduced image and merges the connected blocks with the same color in the binary image, merge the connected blocks of each color channel in the color-reduced image into new connected blocks based on their positional relationships, and merge the connected blocks in the binary image into new connected blocks based on their positional relationships;
the third merging unit 600 is further configured to perform at least one of the following processes:
merging the connected blocks with the distance smaller than the distance threshold;
taking the maximum value of the average values of the respective lengths and the widths of any two connected blocks, and if the maximum value meets a preset condition, combining the two selected connected blocks;
combining connected blocks of which the bounding boxes are crossed and the crossed parts accord with preset crossed characteristics;
and merging the connected blocks of which the bounding boxes are aligned and meet a preset alignment merging rule.
Preferably, the second merging unit 300 is further configured to sequentially perform merging in the horizontal direction, merging in the vertical direction, and merging in the horizontal direction based on a connection merging rule; wherein the connection merging rule comprises:
and connecting the two selected connected blocks to form a new connected block according to at least one of the following conditions:
the smaller of the center distance and the edge distance between the bounding boxes of the two connected blocks in the reference axial direction is less than a first preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction;
the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axial direction is less than a second preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes perpendicular to the reference axial direction;
the difference between the side lengths of the bounding boxes of the two connected blocks in the reference axial direction is less than a third preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction.
Preferably, the determining unit 400 is further configured to obtain connected bounding boxes from the color-reduced image and the binary image, extract the corresponding regions of interest from the target image, slide a window with a specific sliding-window step over each region and send each window into a convolutional neural network classifier for discrimination, and obtain the probability that each sliding window contains characters;
the determining unit 400 is further configured to average probabilities of the characters included in the sliding window to obtain a probability that the candidate character region includes a character row or a character column;
the determining unit 400 is further configured to determine that a text row or a text column exists in the region of interest if the obtained probability is greater than a preset probability threshold.
An embodiment of the present invention provides a computer storage medium in which executable instructions are stored, the executable instructions being used to execute the text detection method shown in fig. 1 or fig. 2.
In summary, the embodiments of the present invention have the following beneficial effects:
the invention provides a method and a system for detecting characters in an image, which are suitable for positioning characters such as print Chinese characters and the like in the image in a network album, and the output result can be used as the input of a character recognition system to help to finally generate an accurate character recognition result.
Those skilled in the art will understand that: all or part of the steps for implementing the method embodiments may be implemented by hardware related to program instructions, and the program may be stored in a computer readable storage medium, and when executed, the program performs the steps including the method embodiments; and the aforementioned storage medium includes: various media that can store program codes, such as a removable Memory device, a Random Access Memory (RAM), a Read-Only Memory (ROM), a magnetic disk, and an optical disk.
Alternatively, the integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. Based on such understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the methods described in the embodiments of the present invention. And the aforementioned storage medium includes: a removable storage device, a RAM, a ROM, a magnetic or optical disk, or various other media that can store program code.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.

Claims (11)

1. A method for detecting text, the method comprising:
quantizing each channel of the red, green and blue channels of the target image by K levels respectively to obtain K level intervals, wherein K is an integer and 255 > K > 1;
mapping the brightness of each pixel of the target image in the RGB three-color channels into the corresponding channel's quantized interval to obtain a color-reduced image, and converting the target image into a binary image;
merging the connected blocks with the same color in the color reduction image, and merging the connected blocks with the same color in the binary image;
sequentially performing merging in the horizontal direction, merging in the vertical direction, and merging in the horizontal direction on the connected blocks of each color channel of the color-reduced image and the connected blocks in the binary image based on a connection merging rule, to obtain candidate character regions in the target image; wherein the connection merging rule comprises:
and connecting the two selected connected blocks to form a new connected block according to at least one of the following conditions:
the smaller of the center distance and the edge distance between the bounding boxes of the two connected blocks in the reference axial direction is less than a first preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction; the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axial direction is less than a second preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes perpendicular to the reference axial direction; the difference between the side lengths of the bounding boxes of the two connected blocks in the reference axial direction is less than a third preset proportion of the smallest of the side lengths of the two connected blocks' bounding boxes along the reference axial direction;
and extracting a specific region at the position on the target image corresponding to the candidate character region, and judging whether the extracted specific region contains character rows or character columns or not based on the comparison result of the probability of containing the character region in the extracted specific region and a preset probability threshold.
2. The method of claim 1, wherein the merging connected blocks having the same color in the color-reduced image and merging connected blocks having the same color in the binary image comprises:
taking each pixel in the color-reduced image and in the binary image as a separate connected block, establishing a union-find (disjoint-set) structure over the pixels, and performing the following:
if the color of a pixel is the same as the color of any one of its 8 adjacent pixels, merging the connected blocks to which the two adjacent same-colored pixels belong into the same connected block;
and judging the pixel area of each connected block, merging the connected blocks into the connected blocks adjacent to the connected blocks if the pixel area of the connected blocks is smaller than a pixel area threshold value, and setting the color of the connected blocks as the color of the merged connected blocks.
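A minimal sketch, in Python, of the per-pixel union-find merge of claim 2, assuming a single-channel image whose pixel values encode color; the small-block absorption step is omitted because the claim's pixel area threshold is unspecified:

```python
import numpy as np

def merge_same_color(channel: np.ndarray) -> np.ndarray:
    """Union-find merge: 8-adjacent pixels of equal value join one block.

    Returns a label map in which each pixel carries the id of the
    connected block it belongs to.
    """
    h, w = channel.shape
    parent = list(range(h * w))              # each pixel starts as its own block

    def find(x: int) -> int:
        while parent[x] != x:
            parent[x] = parent[parent[x]]    # path halving keeps trees flat
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[ra] = rb

    # Checking 4 "forward" neighbors per pixel covers all 8-neighbor
    # pairs exactly once (the other 4 are covered by symmetry).
    for y in range(h):
        for x in range(w):
            for dy, dx in ((0, 1), (1, -1), (1, 0), (1, 1)):
                ny, nx = y + dy, x + dx
                if 0 <= ny < h and 0 <= nx < w and channel[y, x] == channel[ny, nx]:
                    union(y * w + x, ny * w + nx)

    labels = np.fromiter((find(i) for i in range(h * w)), dtype=np.int64)
    return labels.reshape(h, w)
```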
3. The method of claim 1, wherein after merging connected blocks having the same color in the color-reduced image and merging connected blocks having the same color in the binary image, the method further comprises:
discarding connected blocks in the color-reduced image and in the binary image that match preset features; the preset features include at least one of:
the pixel area of the connected block is smaller than the pixel area threshold;
any side length of the connected block is greater than a first preset proportion of the corresponding image side length;
and any side length of the connected block is greater than a frame length threshold while the ratio of its pixel area to its bounding-box area is smaller than a ratio threshold.
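A minimal sketch of claim 3's discard test, assuming each connected block carries its pixel area and bounding-box size; every threshold below is an illustrative placeholder, since the claim leaves them all unspecified:

```python
from dataclasses import dataclass

@dataclass
class Block:
    area: int    # pixel area of the connected block
    box_w: int   # bounding-box width
    box_h: int   # bounding-box height

def keep_block(b: Block, img_w: int, img_h: int,
               min_area: int = 10,        # pixel area threshold (assumed)
               edge_ratio: float = 0.8,   # first preset proportion (assumed)
               frame_len: int = 100,      # frame length threshold (assumed)
               fill_ratio: float = 0.1    # ratio threshold (assumed)
               ) -> bool:
    """Return False if the block matches any of the preset discard features."""
    if b.area < min_area:
        return False                      # too small to be a character stroke
    if b.box_w > edge_ratio * img_w or b.box_h > edge_ratio * img_h:
        return False                      # nearly spans the whole image
    if (b.box_w > frame_len or b.box_h > frame_len) and \
            b.area < fill_ratio * (b.box_w * b.box_h):
        return False                      # large but sparsely filled, e.g. a frame
    return True
```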
4. The method of claim 1, wherein after merging connected blocks having the same color in the color-reduced image and merging connected blocks having the same color in the binary image, the method further comprises:
merging the connected blocks of each color channel of the color-reduced image into new connected blocks based on their positional relationships, and merging the connected blocks in the binary image into new connected blocks based on their positional relationships; wherein the merging comprises performing at least one of the following:
merging connected blocks whose distance is smaller than a distance threshold;
taking the maximum of the mean length and the mean width of any two connected blocks, and merging the two selected connected blocks if the maximum satisfies a preset condition;
merging connected blocks whose bounding boxes intersect and whose intersecting portions match preset intersection features;
and merging connected blocks whose bounding boxes are aligned and satisfy a preset alignment merging rule.
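A minimal sketch of two of claim 4's position-based merge tests (the distance test and the bounding-box intersection test); the distance threshold is an assumed value, and the intersection-feature and alignment rules are left out because the claim does not define them:

```python
from dataclasses import dataclass

@dataclass
class Box:
    x0: int
    y0: int
    x1: int
    y1: int   # bounding-box corners

def close_enough(a: Box, b: Box, dist_threshold: float = 5.0) -> bool:
    """Distance test: the gap between the two bounding boxes is small."""
    gap_x = max(0, max(a.x0, b.x0) - min(a.x1, b.x1))   # 0 if they overlap in x
    gap_y = max(0, max(a.y0, b.y0) - min(a.y1, b.y1))   # 0 if they overlap in y
    return (gap_x ** 2 + gap_y ** 2) ** 0.5 < dist_threshold

def boxes_intersect(a: Box, b: Box) -> bool:
    """Intersection test: the bounding boxes cross each other."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1

def merge_boxes(a: Box, b: Box) -> Box:
    """Merge two blocks by taking the union of their bounding boxes."""
    return Box(min(a.x0, b.x0), min(a.y0, b.y0),
               max(a.x1, b.x1), max(a.y1, b.y1))
```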
5. The method according to any one of claims 1 to 4, wherein extracting a specific region at the position on the target image corresponding to the candidate character region, and determining whether the extracted specific region contains a character row or a character column based on a comparison between the probability that the extracted specific region contains a character region and a preset probability threshold, comprises:
extracting the specific region from the target image, obtaining the bounding boxes of the connected blocks on the color-reduced image and the binary image, and sliding a window with a specific sliding-window stride over the obtained bounding boxes, feeding each window into a convolutional neural network classifier to obtain the probability that each sliding window contains characters;
averaging the probabilities that the sliding windows contain characters to obtain the probability that the candidate character region comprises a character row or a character column;
and determining that a character row or a character column exists in the specific region if the obtained probability is greater than the preset probability threshold.
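A minimal sketch of claim 5's sliding-window scoring, assuming a `classifier` callable that maps one image window to a text probability in [0, 1]; the window width, stride, and horizontal sweep (one text row) are illustrative assumptions:

```python
import numpy as np

def region_text_probability(region: np.ndarray, classifier,
                            win: int = 32, stride: int = 16) -> float:
    """Average the per-window text probabilities over a candidate region."""
    h, w = region.shape[:2]
    probs = []
    for x in range(0, max(w - win, 0) + 1, stride):
        window = region[:, x:x + win]     # horizontal sweep across a text row
        probs.append(classifier(window))  # CNN probability of text in this window
    return float(np.mean(probs)) if probs else 0.0

def contains_text(region: np.ndarray, classifier,
                  prob_threshold: float = 0.5) -> bool:
    """Claim 5's decision: mean window probability vs. preset threshold."""
    return region_text_probability(region, classifier) > prob_threshold
```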
6. A text detection system, the system comprising:
a color-reduction and binarization processing unit, configured to quantize each of the red, green, and blue channels of a target image into K levels respectively to obtain K level intervals, wherein K is an integer and 255 > K > 1;
wherein the color-reduction and binarization processing unit is further configured to map the intensity of each pixel of the target image in each of the RGB color channels to the corresponding quantized interval of that channel to obtain a color-reduced image, and to convert the target image into a binary image;
a first merging unit, configured to merge connected blocks having the same color in the color-reduced image, and to merge connected blocks having the same color in the binary image;
a second merging unit, configured to sequentially perform merging in the horizontal direction, merging in the vertical direction, and merging again in the horizontal direction on the connected blocks of each color channel of the color-reduced image and on the connected blocks in the binary image, based on a connection merging rule, to obtain candidate character regions in the target image; wherein the connection merging rule comprises:
connecting two selected connected blocks to form a new connected block when at least one of the following conditions is met:
the smaller of the center distance and the edge distance between the bounding boxes of the two connected blocks along a reference axis is less than a first preset proportion of the smallest of the bounding-box side lengths along the reference axis; the distance between the bounding boxes of the two connected blocks perpendicular to the reference axis is less than a second preset proportion of the smallest of the bounding-box side lengths perpendicular to the reference axis; the difference between the side lengths of the bounding boxes of the two connected blocks along the reference axis is less than a third preset proportion of the smallest of the bounding-box side lengths along the reference axis;
and a judging unit, configured to extract a specific region at the position on the target image corresponding to the candidate character region, and to determine whether the extracted specific region contains a character row or a character column based on a comparison between the probability that the extracted specific region contains a character region and a preset probability threshold.
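A minimal sketch of the three connection-merge conditions restated in claim 6, taking the horizontal (x) axis as the reference axis; the three preset proportions are assumed values, since neither claim fixes them:

```python
from dataclasses import dataclass

@dataclass
class BBox:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def w(self) -> float:
        return self.x1 - self.x0

    @property
    def h(self) -> float:
        return self.y1 - self.y0

def may_connect(a: BBox, b: BBox,
                p1: float = 1.0,   # first preset proportion (assumed)
                p2: float = 0.5,   # second preset proportion (assumed)
                p3: float = 0.7    # third preset proportion (assumed)
                ) -> bool:
    """Meeting any one of the three conditions connects the two blocks."""
    min_w, min_h = min(a.w, b.w), min(a.h, b.h)

    center_dist = abs((a.x0 + a.x1) / 2 - (b.x0 + b.x1) / 2)
    edge_dist = max(0.0, max(a.x0, b.x0) - min(a.x1, b.x1))
    cond1 = min(center_dist, edge_dist) < p1 * min_w   # close along the reference axis

    gap_y = max(0.0, max(a.y0, b.y0) - min(a.y1, b.y1))
    cond2 = gap_y < p2 * min_h                         # close perpendicular to it

    cond3 = abs(a.w - b.w) < p3 * min_w                # similar size along the axis
    return cond1 or cond2 or cond3
```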
7. The system of claim 6, wherein:
the first merging unit is further configured to treat each pixel in the color-reduced image and in the binary image as a separate connected block, establish a union-find (disjoint-set) structure over the pixels, and perform the following;
the first merging unit is further configured to merge the connected blocks to which two same-colored adjacent pixels belong into a single connected block if the color of a pixel is the same as the color of any of its 8 neighboring pixels;
and the first merging unit is further configured to determine the pixel area of each connected block, merge a connected block into an adjacent connected block if its pixel area is smaller than a pixel area threshold, and set its color to that of the connected block it is merged into.
8. The system of claim 6, wherein the system further comprises:
a discarding processing unit, configured to discard connected blocks in the color-reduced image and in the binary image that match preset features, after the first merging unit has merged the connected blocks having the same color in the color-reduced image and the connected blocks having the same color in the binary image; the preset features include at least one of:
the pixel area of the connected block is smaller than the pixel area threshold;
any side length of the connected block is greater than a first preset proportion of the corresponding image side length;
and any side length of the connected block is greater than a frame length threshold while the ratio of its pixel area to its bounding-box area is smaller than a ratio threshold.
9. The system of claim 6, further comprising:
a fourth merging unit, configured to, after the first merging unit has merged the connected blocks having the same color in the color-reduced image and the connected blocks having the same color in the binary image, merge the connected blocks of each color channel of the color-reduced image into new connected blocks based on their positional relationships, and merge the connected blocks in the binary image into new connected blocks based on their positional relationships;
wherein the fourth merging unit is further configured to perform at least one of the following:
merging connected blocks whose distance is smaller than a distance threshold;
taking the maximum of the mean length and the mean width of any two connected blocks, and merging the two selected connected blocks if the maximum satisfies a preset condition;
merging connected blocks whose bounding boxes intersect and whose intersecting portions match preset intersection features;
and merging connected blocks whose bounding boxes are aligned and satisfy a preset alignment merging rule.
10. The system according to any one of claims 6 to 9, wherein:
the judging unit is further configured to extract the specific region from the target image, obtain the bounding boxes of the connected blocks in the color-reduced image and the binary image, and slide a window with a specific sliding-window stride over the obtained bounding boxes, feeding each window into a convolutional neural network classifier to obtain the probability that each sliding window contains characters;
the judging unit is further configured to average the probabilities that the sliding windows contain characters to obtain the probability that the candidate character region comprises a character row or a character column;
and the judging unit is further configured to determine that a character row or a character column exists in the specific region if the obtained probability is greater than the preset probability threshold.
11. A storage medium having stored thereon executable instructions which, when executed, cause a processor to perform the text detection method of any one of claims 1 to 5.
CN201610091568.8A 2016-02-18 2016-02-18 Character detection method and system Active CN107093172B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610091568.8A CN107093172B (en) 2016-02-18 2016-02-18 Character detection method and system
PCT/CN2017/073407 WO2017140233A1 (en) 2016-02-18 2017-02-13 Text detection method and system, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610091568.8A CN107093172B (en) 2016-02-18 2016-02-18 Character detection method and system

Publications (2)

Publication Number Publication Date
CN107093172A (en) 2017-08-25
CN107093172B (en) 2020-03-17

Family

ID=59625563

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610091568.8A Active CN107093172B (en) 2016-02-18 2016-02-18 Character detection method and system

Country Status (2)

Country Link
CN (1) CN107093172B (en)
WO (1) WO2017140233A1 (en)

Families Citing this family (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108205676B (en) * 2017-11-22 2019-06-07 西安万像电子科技有限公司 The method and apparatus for extracting pictograph region
CN108989793A (en) * 2018-07-20 2018-12-11 深圳市华星光电技术有限公司 A kind of detection method and detection device of text pixel
CN109191539B (en) * 2018-07-20 2023-01-06 广东数相智能科技有限公司 Oil painting generation method and device based on image and computer readable storage medium
CN109389150B (en) * 2018-08-28 2022-04-05 东软集团股份有限公司 Image consistency comparison method and device, storage medium and electronic equipment
CN111222368B (en) * 2018-11-26 2023-09-19 北京金山办公软件股份有限公司 Method and device for identifying document paragraphs and electronic equipment
CN111325199B (en) * 2018-12-14 2023-10-27 中移(杭州)信息技术有限公司 Text inclination angle detection method and device
CN111401110A (en) * 2019-01-03 2020-07-10 百度在线网络技术(北京)有限公司 Method and device for extracting information
CN109815957A (en) * 2019-01-30 2019-05-28 邓悟 A kind of character recognition method based on color image under complex background
CN110059685B (en) * 2019-04-26 2022-10-21 腾讯科技(深圳)有限公司 Character area detection method, device and storage medium
CN110058838B (en) * 2019-04-28 2021-03-16 腾讯科技(深圳)有限公司 Voice control method, device, computer readable storage medium and computer equipment
CN109977956B (en) * 2019-04-29 2022-11-18 腾讯科技(深圳)有限公司 Image processing method and device, electronic equipment and storage medium
CN111178346B (en) * 2019-11-22 2023-12-08 京东科技控股股份有限公司 Text region positioning method, text region positioning device, text region positioning equipment and storage medium
CN111062365B (en) * 2019-12-30 2023-05-26 上海肇观电子科技有限公司 Method, apparatus, chip circuit and computer readable storage medium for recognizing mixed typeset text
CN111369441B (en) * 2020-03-09 2022-11-15 稿定(厦门)科技有限公司 Word processing method, medium, device and apparatus
CN111340028A (en) * 2020-05-18 2020-06-26 创新奇智(北京)科技有限公司 Text positioning method and device, electronic equipment and storage medium
CN111681229B (en) * 2020-06-10 2023-04-18 创新奇智(上海)科技有限公司 Deep learning model training method, wearable clothes flaw identification method and wearable clothes flaw identification device
CN112149523B (en) * 2020-09-04 2021-05-28 开普云信息科技股份有限公司 Method and device for identifying and extracting pictures based on deep learning and parallel-searching algorithm
CN112418204A (en) * 2020-11-18 2021-02-26 杭州未名信科科技有限公司 Text recognition method, system and computer medium based on paper document
CN112650832B (en) * 2020-12-14 2022-09-06 中国电子科技集团公司第二十八研究所 Knowledge correlation network key node discovery method based on topology and literature characteristics

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090148043A1 (en) * 2007-12-06 2009-06-11 International Business Machines Corporation Method for extracting text from a compound digital image
CN101615252B (en) * 2008-06-25 2012-07-04 中国科学院自动化研究所 Method for extracting text information from adaptive images
CN101763516B (en) * 2010-01-15 2012-02-29 南京航空航天大学 Character recognition method based on fitting functions
JP5826081B2 (en) * 2012-03-19 2015-12-02 株式会社Pfu Image processing apparatus, character recognition method, and computer program
CN103034856B (en) * 2012-12-18 2016-01-20 深圳深讯和科技有限公司 The method of character area and device in positioning image

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398894A (en) * 2008-06-17 2009-04-01 浙江师范大学 Automobile license plate automatic recognition method and implementing device thereof
CN101447027A (en) * 2008-12-25 2009-06-03 东莞市微模式软件有限公司 Binaryzation method of magnetic code character area and application thereof
CN102136064A (en) * 2011-03-24 2011-07-27 成都四方信息技术有限公司 System for recognizing characters from image
CN103632159A (en) * 2012-08-23 2014-03-12 阿里巴巴集团控股有限公司 Method and system for training classifier and detecting text area in image
CN103839062A (en) * 2014-03-11 2014-06-04 东方网力科技股份有限公司 Image character positioning method and device

Also Published As

Publication number Publication date
CN107093172A (en) 2017-08-25
WO2017140233A1 (en) 2017-08-24

Similar Documents

Publication Publication Date Title
CN107093172B (en) Character detection method and system
CN111476067B (en) Character recognition method and device for image, electronic equipment and readable storage medium
CN101122953B (en) Picture words segmentation method
JP6139396B2 (en) Method and program for compressing binary image representing document
KR100339691B1 (en) Apparatus for recognizing code and method therefor
CN107590491B (en) Image processing method and device
US8462394B2 (en) Document type classification for scanned bitmaps
CN100527156C (en) Picture words detecting method
CN111191695A (en) Website picture tampering detection method based on deep learning
CN104298982A (en) Text recognition method and device
KR101169140B1 (en) Apparatus and method for generating image for text region extraction
CN109241861B (en) Mathematical formula identification method, device, equipment and storage medium
CN105447522A (en) Complex image character identification system
US20150371100A1 (en) Character recognition method and system using digit segmentation and recombination
CN103577818A (en) Method and device for recognizing image characters
CN107977658B (en) Image character area identification method, television and readable storage medium
JP2009169948A (en) Device and method for determining orientation of document, and program and recording medium thereof
WO2015002719A1 (en) Method of improving contrast for text extraction and recognition applications
EP0949579A2 (en) Multiple size reductions for image segmentation
CN110598566A (en) Image processing method, device, terminal and computer readable storage medium
JP4077919B2 (en) Image processing method and apparatus and storage medium therefor
US7146047B2 (en) Image processing apparatus and method generating binary image from a multilevel image
CN110210467B (en) Formula positioning method of text image, image processing device and storage medium
CN111461131A (en) Identification method, device, equipment and storage medium for ID card number information
CN109948598B (en) Document layout intelligent analysis method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant