WO2017140233A1

WO2017140233A1 - Text detection method and system, device and storage medium

Info

Publication number: WO2017140233A1
Application number: PCT/CN2017/073407
Authority: WO
Inventors: 徐昆; 郭晓威; 黄飞跃; 郑宇飞; 张惜今; 卢艺帆
Original assignee: 腾讯科技（深圳）有限公司
Priority date: 2016-02-18
Filing date: 2017-02-13
Publication date: 2017-08-24
Also published as: CN107093172B; CN107093172A

Abstract

A text detection method and system, a device and a storage medium. The method comprises: performing subtractive colour processing on each image in a three-colour channel of a target image to obtain a subtractive colour image, and converting the target image into a binary image (101); merging connected blocks with the same colour in the subtractive colour image and merging connected blocks with the same colour in the binary image (102); respectively merging the connected blocks of each colour channel of the three-colour channel of the subtractive colour image and the connected blocks in the binary image in the vertical and horizontal directions in a connected manner, so as to obtain a candidate text area in the target image (103); and extracting a specific area on a position corresponding to the candidate text area in the target image, and based on a comparison result of the probability of a text area being included in the specific extracted area and a pre-set probability threshold value, determining whether the extracted specific area includes a text row or a text column (104). Accurate detection can be performed on text in an image.

Description

Text detection method and system, device, storage medium

Technical field

The invention relates to a text detection technology in an image, in particular to a text detection method and system, a device and a storage medium.

Background

A document image is a document in image format, obtained by converting a paper document or the like into an image by some means (such as scanning) so that a user can read it electronically. Typical examples of document images are Portable Document Format (PDF) images and DjVu images.

Current text detection technology can detect the text in a document image (that is, locate the area of the image that carries text) and perform text recognition on the detected area.

Images in the general sense include not only document images but also non-document images, that is, images in a non-scan format such as those uploaded by users to a web album. These may be Joint Photographic Experts Group (JPG) images, bitmap (BMP) images, Tag Image File Format (TIFF) images, Graphics Interchange Format (GIF) images, or Exchangeable Image File Format (EXIF) images.

If text in non-document images can be recognized, accurate semantic information can be obtained to help users retrieve and manage images. Detecting the text in an image is a necessary preliminary step for recognizing text in a non-scan format image. Current text detection technology uses manually designed features to determine whether an image contains text, and is mostly aimed at detecting English characters. Because the glyph structures of Chinese and English differ significantly, the accuracy achieved when this technology is applied to Chinese text in a document image falls well short of the accuracy achieved for English text, which makes it difficult to meet the needs of practical applications.

Summary of the invention

Embodiments of the present invention provide a text detection method, system, device, and storage medium, which can accurately detect text in an image.

The technical solution of the embodiment of the present invention is implemented as follows:

In a first aspect, an embodiment of the present invention provides a text detection method, including:

performing color reduction processing on each of the three color channels of the target image to obtain a subtractive image, and converting the target image into a binary image;

merging the connected blocks having the same color in the subtractive image, and merging the connected blocks having the same color in the binary image;

merging, in the vertical and horizontal directions respectively, the connected blocks of each color channel of the three color channels of the subtractive image and the connected blocks in the binary image, to obtain a candidate text region in the target image; and

extracting a specific region of the target image at the position corresponding to the candidate text region, and determining, based on a comparison between the probability that the extracted specific region contains a text region and a preset probability threshold, whether the extracted specific region contains a text line or a text column.

In a second aspect, an embodiment of the present invention provides a text detection system, including:

a color reduction and binarization unit configured to perform color reduction processing on each of the three color channels of the target image to obtain a subtractive image, and to convert the target image into a binary image;

a first merging unit configured to merge connected blocks having the same color in the subtractive image, and to merge connected blocks having the same color in the binary image;

a second merging unit configured to merge, in the vertical and horizontal directions respectively, the connected blocks of each color channel of the three color channels of the subtractive image and the connected blocks in the binary image, to obtain a candidate text region in the target image; and

a determining unit configured to extract a specific region of the target image at the position corresponding to the candidate text region, and to determine, based on a comparison between the probability that the extracted specific region contains a text region and a preset probability threshold, whether the extracted specific region contains a text line or a text column.

In a third aspect, an embodiment of the present invention provides a text detection device, including a memory and a processor, where the memory stores executable instructions, and the executable instructions are used to cause the processor to perform the following operations:

performing color reduction processing on each of the three color channels of the target image to obtain a subtractive image;

converting the target image into a binary image;

merging the connected blocks having the same color in the subtractive image, and merging the connected blocks having the same color in the binary image;

extracting a specific region of the target image at the position corresponding to the candidate text region; and

determining, based on a comparison between the probability that the extracted specific region contains a text region and a preset probability threshold, whether the extracted specific region contains a text line or a text column.

In a fourth aspect, an embodiment of the present invention provides a storage medium, where executable instructions are stored for performing a text detection method provided by an embodiment of the present invention.

In the embodiments of the present invention, the image is divided into connected blocks according to color, each connected block being a potential bounding box containing text. A convolutional neural network is then applied in a sliding-window manner to estimate the probability that each bounding box contains a text line (or a text column). When the probability is greater than the preset probability threshold, the bounding box is determined to contain a text line (or a text column). This processing applies to both document images and non-document images, so the text in an image can be detected accurately.

DRAWINGS

FIGS. 1-1 to 1-6 are schematic diagrams of pixel relationships according to an embodiment of the present invention;

FIG. 2 is an optional schematic structural diagram of a text detection system according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a text detection device according to an embodiment of the present invention;

FIG. 4 is a first schematic flowchart of a text detection method according to an embodiment of the present invention;

FIG. 5 is a second schematic flowchart of a text detection method according to an embodiment of the present invention;

FIGS. 6 to 9 are schematic diagrams of detection results of a text detection method according to an embodiment of the present invention;

FIGS. 10 and 11 are schematic diagrams of a convolutional neural network according to an embodiment of the present invention;

FIG. 12 is a schematic structural diagram of a text detection system according to an embodiment of the present invention.

Detailed description

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the embodiments are provided to illustrate the invention, not to limit it. In addition, the embodiments provided below are some, not all, of the embodiments for carrying out the invention. Embodiments obtained by those skilled in the art by recombining the technical solutions of the following embodiments without inventive work, and other embodiments based on the invention, all fall within the scope of the invention.

It should be noted that, in the embodiments of the present invention, the terms "including", "comprising", and any variations thereof are intended to cover non-exclusive inclusion, such that a method or apparatus comprising a series of elements includes not only the elements explicitly listed but also other elements not explicitly listed, or elements inherent to the implementation of the method or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional related elements in the method or apparatus including that element (e.g., a step in the method or a unit in the apparatus).

The terms used in the embodiments of the present invention are explained as follows.

1) Gray value: an integer indicating the brightness of a pixel. For example, if pixel values range from 0 to 255, the image is said to have 256 gray levels.

2) Adjacency: two pixels that touch each other are adjacent. A pixel touches the pixels in its neighborhood. Adjacency considers only the spatial relationship between pixels.

Adjacencies include the following types:

2.1) 4-adjacency: as shown in FIG. 1-1, the 4-neighborhood of the pixel p(x, y), denoted N4(p), consists of the adjacent pixels (x+1, y), (x-1, y), (x, y+1), and (x, y-1).

2.2) D-adjacency: as shown in FIG. 1-2, the D-neighborhood of the pixel p(x, y), denoted ND(p), consists of the pixels on the diagonals: (x+1, y+1), (x+1, y-1), (x-1, y+1), and (x-1, y-1).

2.3) 8-adjacency: as shown in FIG. 1-3, the 8-neighborhood of the pixel p(x, y), denoted N8(p), consists of the pixels of the 4-neighborhood plus the pixels of the D-neighborhood.
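The three neighborhoods defined above can be written down directly. The following is a minimal Python sketch; the function names n4, nd, and n8 are illustrative and do not appear in the original text, and image bounds are deliberately not checked.

```python
def n4(x, y):
    """4-neighborhood N4(p): the horizontal and vertical neighbors of p(x, y)."""
    return {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}


def nd(x, y):
    """D-neighborhood ND(p): the four diagonal neighbors of p(x, y)."""
    return {(x + 1, y + 1), (x + 1, y - 1), (x - 1, y + 1), (x - 1, y - 1)}


def n8(x, y):
    """8-neighborhood N8(p): the union of N4(p) and ND(p)."""
    return n4(x, y) | nd(x, y)
```

Because N4(p) and ND(p) are disjoint, N8(p) always contains exactly eight coordinates; a caller working on a real image would still need to drop coordinates that fall outside the image.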

3) Connectivity: two pixels are connected if (1) they are adjacent, and (2) their gray values (or other attributes) satisfy a particular similarity criterion (the gray values are equal, or both belong to a given set).

Connectivity includes the following types:

3.1) 4 connectivity

As shown in FIGS. 1-4, for pixels p and q having a gray value V, if q is in the set N4(p), the two pixels are said to be 4 connected.

3.2) 8 connectivity

As shown in FIGS. 1-5, for pixels p and q having a value of V, if q is in the set N8(p), the two pixels are said to be 8-connected.

3.3) m-connectivity

As shown in FIG. 1-6, pixels p and q having gray value V are m-connected, that is, connected by a mix of 4-connectivity and D-connectivity, if:

I. q is in the set N4(p); or

II. q is in the set ND(p), and the intersection of N4(p) and N4(q) contains no pixel having gray value V.
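The two conditions for m-connectivity can be checked mechanically. The sketch below is an illustration only: the dictionary-based image representation, the function name m_connected, and the parameter values (the set V of admissible gray values) are assumptions made for the example.

```python
def n4(p):
    """4-neighborhood of pixel p = (x, y)."""
    x, y = p
    return {(x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)}


def nd(p):
    """Diagonal neighborhood of pixel p = (x, y)."""
    x, y = p
    return {(x + 1, y + 1), (x + 1, y - 1), (x - 1, y + 1), (x - 1, y - 1)}


def m_connected(p, q, image, values):
    """Return True if p and q are m-connected.

    image  -- dict mapping (x, y) to a gray value (assumed representation)
    values -- the set V of gray values that count as similar
    """
    if image.get(p) not in values or image.get(q) not in values:
        return False
    if q in n4(p):  # condition I: plain 4-connectivity
        return True
    if q in nd(p):  # condition II: diagonal, with no shared 4-neighbor in V
        shared = n4(p) & n4(q)
        return not any(image.get(s) in values for s in shared)
    return False
```

The point of condition II is to avoid counting two paths between the same pair of pixels: a diagonal link is accepted only when no 4-connected detour through a shared neighbor exists.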

4) Connected domain: pixels that are connected to one another (by any of the above kinds of connectivity) form one region, and pixels that are not connected form different regions. Such a set of pixels, in which all pixels are connected to one another, is called a connected domain.
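To make the notion of a connected domain concrete, the following sketch labels the connected domains of a small image using breadth-first search. It assumes that the similarity criterion is equality of values and that the image is a list of rows; the function name label_connected_domains is hypothetical.

```python
from collections import deque


def label_connected_domains(grid, connectivity=8):
    """Label every pixel of a 2D list with the index of its connected domain.

    Two adjacent pixels are treated as connected when their values are equal.
    Returns (labels, count).
    """
    if connectivity == 4:
        offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    else:
        offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0)]
    height, width = len(grid), len(grid[0])
    labels = [[-1] * width for _ in range(height)]
    count = 0
    for y in range(height):
        for x in range(width):
            if labels[y][x] != -1:
                continue  # already assigned to a domain
            labels[y][x] = count
            queue = deque([(x, y)])
            while queue:
                cx, cy = queue.popleft()
                for dx, dy in offsets:
                    nx, ny = cx + dx, cy + dy
                    if (0 <= nx < width and 0 <= ny < height
                            and labels[ny][nx] == -1
                            and grid[ny][nx] == grid[cy][cx]):
                        labels[ny][nx] = count
                        queue.append((nx, ny))
            count += 1
    return labels, count
```

On a 2x2 checkerboard, 8-connectivity yields two domains (each diagonal pair is joined), whereas 4-connectivity yields four, which shows how the choice of connectivity changes the resulting regions.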

Embodiments of the present invention provide a method, system, device, and storage medium for detecting text in images (including images in a scan format and images in a non-scan format). The images described herein include not only images in a conventional scan format, such as PDF, but also non-document images of any form, such as Joint Photographic Experts Group (JPG) images, bitmap (BMP) images, Tagged Image File Format (TIFF) images, Graphics Interchange Format (GIF) images, and Exchangeable Image File Format (EXIF) images.

The text detection system, device, and storage medium according to the embodiments of the present invention perform the text detection method to locate the areas of an image that carry text. The image processed by the text detection system may be a document image, such as a PDF document, or a non-document image, such as a JPG, BMP, TIFF, GIF, or EXIF image. The images mainly come from electronic devices (such as smartphones, tablets, and laptops), from scanned electronic versions of printed matter such as posters, and from other digital images containing printed Chinese characters.

Referring to FIG. 4, an optional flowchart of the text detection method provided by the embodiment of the present invention is shown. In step 101, color reduction processing is performed on each of the three color channels of the target image to obtain a subtractive image, and the target image is converted into a binary image. In step 102, connected blocks having the same color in the subtractive image are merged, and connected blocks having the same color in the binary image are merged. In step 103, the connected blocks of each color channel of the three color channels of the subtractive image, and the connected blocks in the binary image, are merged in the vertical and horizontal directions respectively, to obtain a candidate text region in the target image. In step 104, a specific region of the target image at the position corresponding to the candidate text region is extracted, and whether the extracted specific region contains a text line or a text column is determined based on a comparison between the probability that the extracted specific region contains a text region and a preset probability threshold.

It can be seen that the text detection system locates the text lines (or text columns) in the image, as shown in FIGS. 6 to 9, by color clustering, layering, merging and filtering of connected blocks, and discrimination based on a deep convolutional neural network. A text line may consist of Chinese characters, letters (such as English letters), numbers, symbols, or any combination of these character types. The text in a text line can then be recognized based on the located text line.

The text detection system provided by the embodiment of the invention can be implemented by a plurality of servers arranged in a distributed manner.

For example, referring to the optional structural diagram of the text detection system 20 shown in FIG. 2, a plurality of servers (server 21 to server 24) cooperate to detect text in an image: each server completes at least some of the steps of the text detection method and sends its processing result to the other servers that depend on that result, so as to form the final text detection result.

Of course, the servers in the text detection system shown in FIG. 2 can also perform text detection on different images (or the same image) in parallel; that is, server 21 does not depend on the detection results of the other servers (server 22 to server 24) when performing text detection.

For the text detection device provided by the embodiment of the present invention, FIG. 3 shows an exemplary schematic structural diagram of an electronic device 30 for text detection. The structure of the electronic device 30 shown in FIG. 3 is only one example of a suitable structure and is not intended to suggest any limitation on the structure of the electronic device. The electronic device 30 may be, for example, a personal computer, a server computer, a handheld or laptop device, a mobile device (such as a mobile phone, a personal digital assistant (PDA), or a media player), a consumer electronic device, a minicomputer, a mainframe computer, or a distributed computing environment including any of the above devices.

Although not required, embodiments are described in the general context in which "computer readable instructions" are executed by one or more electronic devices. Computer readable instructions may be distributed via computer readable media (discussed below). Computer readable instructions may be implemented as program modules, such as functions, objects, application programming interfaces (APIs), data structures, etc. that perform particular tasks or implement particular abstract data types. Typically, the functionality of the computer readable instructions can be arbitrarily combined in various environments or distributed.

FIG. 3 illustrates an example of the structure of an electronic device 30 provided in accordance with an embodiment of the present invention. In one configuration, electronic device 30 includes at least one processing unit 31 and a storage unit 32. Depending on the exact configuration and type of electronic device, storage unit 32 may be volatile (such as RAM), non-volatile (such as ROM or flash memory), or some combination of the two. This configuration is illustrated by dashed lines in FIG. 3.

In other embodiments, electronic device 30 may include additional features and/or functionality. For example, electronic device 30 may also include additional storage devices (e.g., removable and/or non-removable) including, but not limited to, magnetic storage devices, optical storage devices, and the like. This additional storage is illustrated by storage unit 33 in FIG. 3. In one embodiment, computer readable instructions for implementing one or more embodiments of the present invention may be stored in storage unit 33. The storage unit 33 may also store other computer readable instructions for implementing an operating system, an application, and the like. Computer readable instructions may be loaded into storage unit 32 for execution by, for example, processing unit 31.

The term "computer readable medium" as used in the embodiments of the invention includes computer storage media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions or other data. The storage unit 32 and the storage unit 33 are examples of computer storage media. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital versatile disk (DVD) or other optical storage device, magnetic tape cassette, magnetic tape, magnetic disk storage device or other magnetic storage device, or any other medium that can be used to store desired information and that can be accessed by electronic device 30. Any such computer storage media may be part of the electronic device 30.

Electronic device 30 may also include a communication connection 36 that allows electronic device 30 to communicate with other devices. Communication connection 36 may include, but is not limited to, a modem, a network interface card (NIC), an integrated network interface, a radio frequency transmitter/receiver, an infrared port, a USB connection, or other interface for connecting electronic device 30 to other electronic devices. Communication connection 36 may include a wired connection or wireless connection. Communication connection 36 can transmit and/or receive communication media.

The term "computer readable medium" can include a communication medium. Communication media typically embodies computer readable instructions or other data in a "modulated data signal" such as a carrier wave or other transport mechanism, and includes any information delivery media. The term "modulated data signal" can include a signal that one or more of the signal characteristics are set or changed in such a manner as to encode the information into the signal.

Electronic device 30 may include an input unit 35 such as a keyboard, mouse, pen, voice input device, touch input device, infrared camera, video input device, and/or any other input device. Output unit 34 may also be included in electronic device 30, such as one or more displays, speakers, printers, and/or any other output device. Input unit 35 and output unit 34 may be connected to electronic device 30 via a wired connection, a wireless connection, or any combination thereof. In one embodiment, an input device or output device from another electronic device can be used as the input unit 35 or output unit 34 of the electronic device 30.

The components of electronic device 30 can be connected by various interconnects, such as a bus. Such interconnects may include Peripheral Component Interconnect (PCI) (such as PCI Express), Universal Serial Bus (USB), Firewire (IEEE 1394), optical bus architecture, and the like. In another embodiment, the components of electronic device 30 may be interconnected by a network. For example, storage unit 32 may be composed of a plurality of physical memory units that are located in different physical locations and interconnected by a network.

Referring again to FIG. 5, the text detection method in the embodiment of the present invention can be applied to a text detection system or a text detection device such as those described above, and includes the following steps:

Step 201: Perform color reduction processing on the target image to obtain a subtractive image of the target image.

The target image to be detected is input into the text detection system or the text detection device, and each of the red, green, and blue (RGB) color channels of the target image is quantized to K levels (K is an integer with 255 > K > 1, for example 4). That is, the brightness range of each of the three RGB channels is divided (for example, uniformly) into K intervals (bins), so that the brightness levels 0-255 are reduced to the levels 0-(K-1), and the brightness of each pixel of the target image in each RGB channel is mapped to the corresponding bin of that channel.

Since each of the three RGB channels of the target image has 256 brightness levels (0-255), the target image can have 256^3 (256 cubed) colors. After the brightness of each of the three RGB channels is divided into K intervals, the target image has K^3 (K cubed, less than 256^3) colors, and the subtractive image f1 is thus obtained.

Taking K = 2 as an example, each channel has two brightness levels, 0 and 1, after quantization: brightness levels 0-127 of each channel are mapped to the quantized brightness 0, and brightness levels 128-255 are mapped to the quantized brightness 1. If the brightness of a pixel of the target image in the three RGB channels is (0, 122, 255), its brightness after color reduction is (0, 0, 1). This brightness mapping is performed for every pixel in the target image.
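The uniform quantization described above can be sketched in a few lines of Python. The function names and the list-of-rows image representation are assumptions made for this illustration; only uniform bins are shown, although the text allows other divisions.

```python
def quantize_channel(value, k):
    """Map a brightness value in 0..255 to one of k uniform bins, 0..k-1."""
    if not 1 < k < 255:
        raise ValueError("K must satisfy 255 > K > 1")
    return value * k // 256


def reduce_colors(pixels, k):
    """Apply the quantization to every channel of every pixel.

    pixels -- list of rows, each row a list of (r, g, b) tuples (assumed layout)
    """
    return [[tuple(quantize_channel(c, k) for c in px) for px in row]
            for row in pixels]
```

For K = 2 this maps 0-127 to 0 and 128-255 to 1, so the pixel (0, 122, 255) becomes (0, 0, 1), matching the example in the text. For K = 4 the palette shrinks from 256^3 = 16,777,216 possible colors to 4^3 = 64.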

Text in an image usually falls into two cases: 1) the text is monochrome; 2) the brightness of the text differs significantly from that of the area around it. For these two cases, step 201 achieves the following technical effect: the text in the subtractive image has one of the K^3 colors.

Step 202: Perform local binarization processing on the target image to obtain a binary image of the target image.

The target image is converted into a grayscale image (with only one grayscale channel), and local adaptive binarization is applied to the grayscale image: the grayscale image is divided into N windows, and within each of the N windows the pixels are divided into two parts according to a threshold T, yielding the binary image f2. The threshold T for a pixel is the Gaussian-weighted sum over a window of preset size (e.g., 25*25 pixels) centered on that pixel.
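The local adaptive binarization can be illustrated with the following pure-Python sketch, which computes a Gaussian-weighted threshold for every pixel. Several details are assumptions not stated in the text: the Gaussian weights are normalized to sum to 1, the standard deviation defaults to one sixth of the window size, image borders are handled by repeating edge pixels, and a small constant offset is subtracted from the threshold so that uniform regions are classified consistently.

```python
import math


def gaussian_kernel(size, sigma):
    """Return a normalized 1D Gaussian kernel with an odd number of taps."""
    half = size // 2
    weights = [math.exp(-(i * i) / (2.0 * sigma * sigma))
               for i in range(-half, half + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def local_binarize(gray, window=25, sigma=None, offset=2):
    """Binarize a grayscale image (list of rows) with a per-pixel threshold.

    A pixel becomes 1 if it is brighter than its Gaussian-weighted local
    threshold minus `offset` (an assumed constant), and 0 otherwise.
    """
    if window % 2 == 0:
        raise ValueError("window must be odd")
    if sigma is None:
        sigma = window / 6.0  # assumed default, not from the source
    kernel = gaussian_kernel(window, sigma)
    half = window // 2
    height, width = len(gray), len(gray[0])

    def clamp(i, n):
        return max(0, min(n - 1, i))  # repeat edge pixels at the border

    # The 2D Gaussian is separable: filter rows first, then columns.
    rows = [[sum(kernel[k] * row[clamp(x + k - half, width)]
                 for k in range(window)) for x in range(width)]
            for row in gray]
    binary = []
    for y in range(height):
        out_row = []
        for x in range(width):
            t = sum(kernel[k] * rows[clamp(y + k - half, height)][x]
                    for k in range(window))
            out_row.append(1 if gray[y][x] > t - offset else 0)
        binary.append(out_row)
    return binary
```

Because the threshold follows the local brightness, a dark stroke on a bright background and a bright stroke on a dark background both separate cleanly from their surroundings, which a single global threshold would not guarantee.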

Text in an image usually falls into two cases: 1) the text is monochrome; 2) the brightness of the text differs significantly from that of the area around it. For these two cases, step 202 achieves the following technical effect: the text in the binary image is either black or white.

After steps 201 and 202, the pixels belonging to the same text in the subtractive image, and likewise in the binary image, have the same color. In step 203, each pixel is initially treated as a connected block, and connected blocks having the same color are merged so that the pixels of the text are connected.

Step 203: Identify the connected blocks in the subtractive image and in the binary image, merge the connected blocks having the same color in the subtractive image, and merge the connected blocks having the same color in the binary image.

For the connected blocks of each color channel of the three RGB channels of the subtractive image f1, and for the connected blocks of the binary image f2 (which has only one grayscale channel), the following processing is performed:

1) Treat each pixel as a separate connected block (that is, a connected subgraph, a concept from graph theory: each pixel of the image is regarded as a vertex of an undirected graph, adjacent pixels are regarded as joined by an edge, and the entire image is treated as an undirected graph).

2) Build a union-find (disjoint-set) structure; union-find is a classic algorithm for merging connected blocks efficiently.

3) Traverse each pixel of the subtractive image f1 and of the binary image f2, and perform the following processing:

Traverse the pixels of the subtractive image f1. For a given pixel, if it has the same color as one of its 8 adjacent pixels (the pixels above, below, to the left and right, and at the ends of the two diagonals; the color in any RGB channel means the brightness value of the pixel in that channel, and the color in the grayscale image means the gray value of the pixel in the grayscale image), the connected blocks to which the two adjacent same-colored pixels belong are merged into one connected block. Then traverse the connected blocks and check the pixel area of each: if the pixel area (number of pixels) of connected block k (k ranges over the connected blocks) is smaller than the pixel area threshold (4 pixels), connected block k is merged into a connected block adjacent to it, and the color of connected block k is set to the color of the connected block into which it is merged.

For example, consider pixel i of the subtractive image f1 (I1 ≥ i ≥ 1, where I1 is the number of pixels in the subtractive image f1) and any channel X of the three RGB channels (assume here that X is the R channel). If pixel i has the same brightness in channel X as any pixel j among its 8 adjacent pixels (the pixels above, below, to the left and right of pixel i, and at the ends of the two diagonals), the connected block to which pixel i belongs and the connected block to which pixel j belongs are merged into one connected block. Then each connected block is traversed and its pixel area is checked: if the pixel area of connected block k (k ranges over the connected blocks) is smaller than the threshold (4 pixels), connected block k is merged into a connected block adjacent to it, and the brightness of the pixels in connected block k is set to the brightness of the connected block into which connected block k is merged.

As another example, consider pixel i in the grayscale image of the target image (I2 ≥ i ≥ 1, where I2 is the number of pixels in the grayscale image). If pixel i has the same color (gray value) as a pixel j among its 8 adjacent pixels (the pixels above, below, to the left and right of pixel i, and at the ends of the two diagonals), the connected blocks to which pixel i and pixel j belong are merged into one connected block. Then each connected block is traversed and its pixel area is checked: if the pixel area of connected block k (k ranges over the connected blocks) is smaller than the threshold (4 pixels), connected block k is merged into a connected block adjacent to it, and the gray value of the pixels in connected block k is set to the gray value of the pixels in the connected block into which connected block k is merged.

Step 203 merges the pixels belonging to the same character (for a Chinese character, at least the same stroke) into one connected block for subsequent processing.
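The merging in step 203 can be sketched with a union-find structure over the pixels of one channel. This is an illustration under stated assumptions, not the patented implementation: the names UnionFind and merge_connected_blocks are hypothetical, small blocks are absorbed in a single pass, and a small block is merged into its largest adjacent block (the text only says "a connected block adjacent to it").

```python
class UnionFind:
    """Disjoint-set structure with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, i):
        while self.parent[i] != i:
            self.parent[i] = self.parent[self.parent[i]]  # path halving
            i = self.parent[i]
        return i

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return ra
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return ra


NEIGHBORS_8 = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
               if (dx, dy) != (0, 0)]


def merge_connected_blocks(channel, min_area=4):
    """Return (labels, recolored) for one channel given as a list of rows."""
    height, width = len(channel), len(channel[0])
    out = [row[:] for row in channel]
    uf = UnionFind(height * width)

    def neighbors(idx):
        y, x = divmod(idx, width)
        for dx, dy in NEIGHBORS_8:
            nx, ny = x + dx, y + dy
            if 0 <= nx < width and 0 <= ny < height:
                yield ny * width + nx

    # Pass 1: merge 8-adjacent pixels that have the same value.
    for idx in range(height * width):
        y, x = divmod(idx, width)
        for nidx in neighbors(idx):
            ny, nx = divmod(nidx, width)
            if out[ny][nx] == out[y][x]:
                uf.union(idx, nidx)

    # Group pixel indices by the root of their block.
    members = {}
    for idx in range(height * width):
        members.setdefault(uf.find(idx), []).append(idx)

    # Pass 2: absorb blocks smaller than min_area into an adjacent block.
    for root in list(members):
        if root not in members or len(members[root]) >= min_area:
            continue
        adjacent = {uf.find(n) for idx in members[root]
                    for n in neighbors(idx)} - {root}
        if not adjacent:
            continue  # the block covers the whole image
        target = max(adjacent, key=lambda r: len(members[r]))  # assumed choice
        ty, tx = divmod(members[target][0], width)
        for idx in members[root]:
            y, x = divmod(idx, width)
            out[y][x] = out[ty][tx]  # take over the absorbing block's color
        merged = members.pop(root) + members.pop(target)
        members[uf.union(root, target)] = merged

    labels = [[uf.find(y * width + x) for x in range(width)]
              for y in range(height)]
    return labels, out
```

In the described method this routine would be run once per RGB channel of the subtractive image and once on the binary image. Blocks that remain below the area threshold after this single pass are left for the filtering in step 204.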

Subsequent step 204 discards the connected blocks in the subtractive image and in the binary image that match preset features (the preset features here correspond to features of non-text regions in the image).

Step 204: After the connected blocks in the subtractive image and in the binary image have been merged, discard the connected blocks in the subtractive image and in the binary image that match the preset features (the preset features here correspond to features of non-text regions in the image).

At least one of the following kinds of processing is performed on the connected blocks of each color channel of the subtractive image f1 and on the connected blocks of the binary image f2:

1) Discard the connected blocks whose pixel area is still smaller than the pixel area threshold (for example, 4 pixels); such a connected block is regarded as not carrying text;

2) Discard the connected blocks corresponding to the background color: a connected block either side of which is longer than a first preset ratio (for example, 0.8) of the corresponding side length of the image;

3) Discard the connected blocks corresponding to borders: a connected block any side of which is longer than the border length threshold (such as 65 pixels) and whose ratio of pixel area to bounding-box area is less than the ratio threshold (such as 0.22). The bounding box of a connected block is the smallest rectangle that contains all the pixels of the connected block (the sides of the rectangle are parallel to the x and y axes of the image, so the rectangle is uniquely determined).
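The three discard rules can be combined into a single predicate, as in the sketch below. The function names, the representation of a connected block as a list of (x, y) pixels, and the interpretation of "side length" as the width and height of the bounding box compared against the image width and height are assumptions made for illustration; the default thresholds are the example values given in the text.

```python
def bounding_box(pixels):
    """Smallest axis-aligned rectangle containing all (x, y) pixels."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return min(xs), min(ys), max(xs), max(ys)


def keep_block(pixels, image_width, image_height, min_area=4,
               background_ratio=0.8, border_side=65, fill_ratio=0.22):
    """Return False if the block matches one of the non-text features."""
    x0, y0, x1, y1 = bounding_box(pixels)
    box_w, box_h = x1 - x0 + 1, y1 - y0 + 1
    area = len(pixels)
    # Rule 1: too few pixels to carry text.
    if area < min_area:
        return False
    # Rule 2: spans most of the image, so it is treated as background.
    if box_w > background_ratio * image_width or \
            box_h > background_ratio * image_height:
        return False
    # Rule 3: long but sparsely filled, so it is treated as a border.
    if max(box_w, box_h) > border_side and area / (box_w * box_h) < fill_ratio:
        return False
    return True
```

For a 100x100 image, a solid 10x10 block is kept, while a 3-pixel speck, a 90-pixel-wide band, and a thin 70x70 rectangular outline are all discarded.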

Optionally, considering that an image may include characters whose strokes are not connected (such as Chinese characters, and i and j among English characters), step 206 may be performed to merge the strokes of such characters.

Step 205: Merge the connected blocks of each color channel in the subtractive image into new connected blocks based on their positional relationship (such as distance or intersection), and merge the connected blocks in the binary image into new connected blocks based on their positional relationship (such as distance or intersection).

1) Merge connected blocks whose distance is smaller than the distance threshold (the distance is the Chebyshev distance d between the center points of the bounding boxes of the two connected blocks).

2) Take the maximum of the averages of the length and width of the two connected blocks, denoted ms = max((a1+b1)/2.0, (a2+b2)/2.0), where a1 and b1 are the length and width of the bounding box of the first connected block and a2 and b2 are the length and width of the bounding box of the second connected block, and take 0.4 ms as the distance threshold. If the preset condition is met, for example, 0.4 ms < 1 or 1 < 0.4 ms < 3, and the distance d < 3, the two selected connected blocks are merged.

3) For the connected blocks of each of the RGB color channels of the subtractive image f1 and for the connected blocks of the binary image f2, merge the connected blocks whose bounding boxes intersect and whose intersecting portion conforms to the preset intersection feature. For example, two connected blocks are merged if their bounding boxes intersect, the area of the intersection is greater than a preset 10% of the area of the smaller of the two bounding boxes, and the area of the intersection is less than 10% of the area of the image.

4) Merge the connected blocks whose bounding boxes are aligned and satisfy the preset alignment merge rule (alignment means that the bounding boxes of the connected blocks are aligned horizontally or vertically, that is: 1) the bounding boxes of the two connected blocks have the same height and the same position in the vertical direction; or 2) the bounding boxes of the two connected blocks have the same width and the same position in the horizontal direction).

An example of an alignment merge rule: compute the merged bounding box of the two connected blocks (that is, the smallest bounding box containing both bounding boxes); if the area of the merged bounding box is smaller than an area threshold expressed as a fraction of the image area (for example, 10%), the two connected blocks are merged.
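The distance and intersection tests of step 205 can be sketched as follows. The box layout `(x0, y0, x1, y1)` and all function names are assumptions made for illustration. The distance test is also simplified to d < 0.4 ms; the additional example conditions on 0.4 ms quoted above are not modeled.

```python
def center(box):
    """Center point of a box given as (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def chebyshev(box_a, box_b):
    """Chebyshev distance d between the centers of two bounding boxes."""
    (ax, ay), (bx, by) = center(box_a), center(box_b)
    return max(abs(ax - bx), abs(ay - by))


def distance_threshold(box_a, box_b, factor=0.4):
    """0.4 * ms, where ms is the larger of the two mean side lengths."""
    a1, b1 = box_a[2] - box_a[0], box_a[3] - box_a[1]
    a2, b2 = box_b[2] - box_b[0], box_b[3] - box_b[1]
    ms = max((a1 + b1) / 2.0, (a2 + b2) / 2.0)
    return factor * ms


def close_enough(box_a, box_b):
    """Simplified distance rule (assumption): merge when d < 0.4 * ms."""
    return chebyshev(box_a, box_b) < distance_threshold(box_a, box_b)


def area(box):
    return (box[2] - box[0]) * (box[3] - box[1])


def intersection_area(box_a, box_b):
    w = min(box_a[2], box_b[2]) - max(box_a[0], box_b[0])
    h = min(box_a[3], box_b[3]) - max(box_a[1], box_b[1])
    return max(w, 0) * max(h, 0)


def overlap_mergeable(box_a, box_b, image_area,
                      min_frac=0.10, max_image_frac=0.10):
    """Intersection rule: large relative to the smaller box, small relative to the image."""
    inter = intersection_area(box_a, box_b)
    return (inter > min_frac * min(area(box_a), area(box_b))
            and inter < max_image_frac * image_area)
```

Scaling the threshold by ms makes the test size-aware: the strokes of a large character may lie further apart than those of a small one and still be merged.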

Step 206: Connect and merge the connected blocks of each of the RGB color channels of the subtractive image f1, and the connected blocks of the binary image f2, in the vertical and horizontal directions respectively, to obtain the candidate text areas in the image (including text line areas and text column areas).

The purpose is to connect single characters (such as Chinese characters) into text lines or text columns. Based on the connection merge rule (the same rule is used for merging in the horizontal direction and in the vertical direction; it is described later), a horizontal merge is performed first, then a vertical merge, and finally another horizontal merge.

Generally, horizontally arranged text is more common in images than vertical text. In step 206 the connected blocks are therefore merged horizontally first, so that horizontally arranged characters are merged first and the possibility of erroneously merging horizontal characters in the vertical direction is reduced. The connected blocks are then merged vertically, which merges those that do not satisfy the horizontal merge rule but satisfy the vertical merge rule. Because this process may change the bounding boxes of connected blocks and thereby produce new pairs of bounding boxes that satisfy the horizontal merge rule, the connected blocks are merged once more in the horizontal direction.
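The horizontal, vertical, horizontal pass order can be sketched with a generic merge loop. This is only an illustration under stated assumptions: boxes are `(x0, y0, x1, y1)` tuples, the rule is passed in as a callable `can_connect(a, b, axis)`, and `toy_rule` is a hypothetical stand-in, not the connection merge rule of the embodiment.

```python
def union(a, b):
    """Smallest box containing both boxes, each given as (x0, y0, x1, y1)."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def merge_pass(boxes, can_connect, axis):
    """Merge pairs accepted by can_connect(a, b, axis) until no pair is left."""
    boxes = list(boxes)
    merged = True
    while merged:
        merged = False
        for i in range(len(boxes)):
            for j in range(i + 1, len(boxes)):
                if can_connect(boxes[i], boxes[j], axis):
                    boxes[i] = union(boxes[i], boxes[j])
                    del boxes[j]
                    merged = True
                    break
            if merged:
                break
    return boxes


def connect_lines_and_columns(boxes, can_connect):
    """Horizontal pass (axis 0), vertical pass (axis 1), then horizontal again."""
    for axis in (0, 1, 0):
        boxes = merge_pass(boxes, can_connect, axis)
    return boxes


def toy_rule(a, b, axis):
    """Hypothetical rule: small gap along the axis and overlap across it."""
    lo, hi = (0, 2) if axis == 0 else (1, 3)
    plo, phi = (1, 3) if axis == 0 else (0, 2)
    gap = max(a[lo], b[lo]) - min(a[hi], b[hi])
    overlap = min(a[phi], b[phi]) - max(a[plo], b[plo])
    return gap <= 2 and overlap > 0


chars = [(0, 0, 10, 10), (11, 0, 21, 10), (22, 0, 32, 10), (0, 50, 10, 60)]
lines = connect_lines_and_columns(chars, toy_rule)
```

The final horizontal pass exists only to pick up pairs that became mergeable after the vertical pass enlarged some bounding boxes.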

An example of a connection merge rule: two connected blocks are connected into a new connected block when their bounding boxes satisfy at least one of the following conditions:

1) the smaller of the center distance of the bounding boxes of the two connected blocks on the reference axis (the horizontal axis or the vertical axis; the center distance is the distance between the center coordinates of the two bounding boxes on that axis) and their edge distance (the distance between the edge coordinates of the two bounding boxes on the reference axis) is less than a first preset ratio (e.g., 0.15 times) of the smaller of the two bounding boxes' side lengths along the reference axis (the side parallel to the reference axis);

Since the coordinate ranges of the two bounding boxes on the reference axis may be separated or partially coincident, using the smaller of the center distance and the edge distance represents the distance between the two bounding boxes along the reference axis most accurately.

2) the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axis is smaller than a second preset ratio (e.g., twice) of the smaller of the two bounding boxes' side lengths along the reference axis;

3) the difference between the side lengths of the two bounding boxes along the reference axis is smaller than a third preset ratio (e.g., 30%) of the smaller of those two side lengths.
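The three conditions can be written out as a small sketch. Several details are assumptions made here because the text leaves them open: boxes are `(x0, y0, x1, y1)` tuples, the edge distance is taken as the gap between the nearest edges (0 when the ranges overlap), and the perpendicular distance is measured between box centers. The function names are hypothetical.

```python
def span(box, axis):
    """Coordinate range of a box on axis 0 (horizontal) or axis 1 (vertical)."""
    return (box[0], box[2]) if axis == 0 else (box[1], box[3])


def axis_distance(a, b, axis):
    """Smaller of the center distance and the edge distance on the reference axis."""
    a0, a1 = span(a, axis)
    b0, b1 = span(b, axis)
    center_dist = abs((a0 + a1) / 2.0 - (b0 + b1) / 2.0)
    edge_dist = max(0.0, max(a0, b0) - min(a1, b1))  # assumption: gap between nearest edges
    return min(center_dist, edge_dist)


def connection_conditions(a, b, axis, r1=0.15, r2=2.0, r3=0.30):
    """Evaluate conditions 1) to 3) for two boxes along the reference axis."""
    a0, a1 = span(a, axis)
    b0, b1 = span(b, axis)
    len_a, len_b = a1 - a0, b1 - b0
    min_len = min(len_a, len_b)
    p0, p1 = span(a, 1 - axis)
    q0, q1 = span(b, 1 - axis)
    perp_dist = abs((p0 + p1) / 2.0 - (q0 + q1) / 2.0)  # assumption: center offset
    return (axis_distance(a, b, axis) < r1 * min_len,
            perp_dist < r2 * min_len,
            abs(len_a - len_b) < r3 * min_len)


def can_connect(a, b, axis, combine=any):
    """The text says 'at least one' condition, hence combine=any by default."""
    return combine(connection_conditions(a, b, axis))
```

Passing `combine=all` gives a stricter variant in which the two boxes must be close, roughly level, and of similar size at the same time; whether that suits a given data set is a tuning decision outside the scope of this sketch.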

Step 207: Extract, from the target image, the specific area located at the position of the bounding box of each group of connected blocks that were connected together (that is, the candidate text area containing a text line or a text column), and for each extracted specific area determine, based on the probability that the area includes a text line or a text column, whether the area includes a text line or a text column.

In the foregoing steps 201 to 206, the bounding boxes obtained by connection in the subtractive image f1 and the binary image f2 (that is, the new bounding boxes formed by the union of the bounding boxes connected into a line) are rectangular and are potential areas containing a text line or a text column (that is, candidate text areas). For each of them, a region of interest (ROI, that is, the aforementioned specific area; in general a region to be processed that is outlined by a box, circle, ellipse, irregular polygon, or the like) is extracted from the target image I. A sliding window is moved over the region with a specific window size and step, for example with the shortest side length S of the region as the window side length and 0.5S as the step. Each window is judged by a pre-trained convolutional neural network (CNN) classifier to obtain the probability p_w that the window contains text, and all p_w are averaged to obtain the probability p_l that the candidate text area is a text line (or text column). If p_l is greater than a preset probability threshold (for example, 0.5), it is determined that a text line (or text column) exists in the region of interest.
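The window layout and the averaging of p_w into p_l can be sketched as below. The CNN itself is not modeled: `classify_window` is a hypothetical callable standing in for the pre-trained classifier, the box layout `(x0, y0, x1, y1)` is an assumption, and the step 0.5S is rounded down to an integer of at least 1.

```python
def sliding_windows(box):
    """Square windows of side S (shortest side) moved in steps of about 0.5 * S."""
    x0, y0, x1, y1 = box
    w, h = x1 - x0, y1 - y0
    if w <= 0 or h <= 0:
        raise ValueError("empty bounding box")
    s = min(w, h)
    step = max(1, s // 2)
    windows = []
    if w >= h:
        # Text line candidate: slide along the horizontal side.
        for x in range(x0, x1 - s + 1, step):
            windows.append((x, y0, x + s, y0 + s))
    else:
        # Text column candidate: slide along the vertical side.
        for y in range(y0, y1 - s + 1, step):
            windows.append((x0, y, x0 + s, y + s))
    return windows


def line_probability(box, classify_window, threshold=0.5):
    """Average the per-window probabilities p_w into p_l and compare with the threshold."""
    probs = [classify_window(win) for win in sliding_windows(box)]
    p_l = sum(probs) / len(probs)
    return p_l, p_l > threshold


# Stub classifier (assumption): pretends that only the left half contains text.
p_l, is_text = line_probability((0, 0, 40, 10),
                                lambda win: 1.0 if win[0] < 20 else 0.0)
```

Because neighboring windows overlap by half, each character is usually seen by two windows, which smooths the averaged score.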

Step 208: The overlapping bounding boxes are merged into one bounding box and output as a region containing text.

Steps 201 to 204 ensure the positional accuracy of the bounding boxes (that is, the potential text areas): even if a bounding box encloses another image element rather than a text line (or text column), it still corresponds accurately to that image element. The probability threshold filtering in step 207 ensures that the bounding boxes that pass the filtering contain text lines (or text columns). Because the bounding boxes that pass the filtering already have relatively accurate positions, non-maximum suppression is not required, and all overlapping bounding boxes are directly combined into one bounding box and output.
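Merging all overlapping bounding boxes without non-maximum suppression can be sketched as a repeat-until-stable loop. The box layout `(x0, y0, x1, y1)` and the function names are assumptions for illustration; touching edges are not counted as overlap here.

```python
def overlaps(a, b):
    """True if two boxes (x0, y0, x1, y1) share interior area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def merge_overlapping(boxes):
    """Repeatedly replace overlapping boxes by their union until none overlap."""
    boxes = list(boxes)
    changed = True
    while changed:
        changed = False
        kept = []
        for box in boxes:
            for k, other in enumerate(kept):
                if overlaps(box, other):
                    kept[k] = (min(box[0], other[0]), min(box[1], other[1]),
                               max(box[2], other[2]), max(box[3], other[3]))
                    changed = True
                    break
            else:
                kept.append(box)
        boxes = kept
    return boxes
```

The outer loop is needed because a union may grow enough to overlap a box that was previously separate.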

Convolutional neural network training steps:

For the data (images containing text), the Chinese characters therein are labeled. The output of the above step 206 (before the convolutional neural network filtering) is then filtered to select the portion close to the labels, and the bounding boxes are cut into sliding windows according to the method of the above step 207. The windows that belong to text and those that do not are separated manually, and all windows are scaled to 32*32 pixels.

These windows form the training and validation data with which the convolutional neural networks shown in Figures 10 and 11 are trained. During training, each sample is randomly cropped to 27*27 pixels and randomly flipped. The network is trained by stochastic gradient descent (SGD) with a batch size (batch_size) of 50, a weight decay (weight_decay) of 0.0005, and a momentum of 0.9. The learning rate is calculated by the following formula: lr = base_lr * (1 + 0.0001 * iter) ^ (-0.75), where iter is the number of iterations; base_lr is 0.001 for the first 100,000 iterations and 0.0001 thereafter.
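The learning rate schedule can be evaluated directly from the formula. The function name and the choice to switch base_lr at exactly iteration 100,000 (counting from 0) are assumptions of this sketch.

```python
def learning_rate(iteration, switch_at=100_000):
    """lr = base_lr * (1 + 0.0001 * iter) ^ (-0.75), with a stepped base_lr."""
    base_lr = 0.001 if iteration < switch_at else 0.0001
    return base_lr * (1 + 0.0001 * iteration) ** (-0.75)


# Print the schedule at a few checkpoints.
for it in (0, 10_000, 99_999, 100_000):
    print(it, learning_rate(it))
```

The rate decays smoothly within each phase and drops by a further factor of ten when base_lr changes.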

The division of the functional structure of the text detection system provided by the embodiment of the present invention is described below. Referring to FIG. 12, the text detection system provided by the embodiment of the present invention includes a subtractive binary processing unit 121, a first merging unit 122, a second merging unit 123, and a determining unit 124.

The subtractive color binary processing unit 121 is configured to perform color reduction processing on each of the three color channels of the target image to obtain a subtractive color image, and convert the target image into a binary image;

a first merging unit 122 configured to merge connected blocks having the same color in the reduced color image, and merge connected blocks having the same color in the binary image;

a second merging unit 123 configured to connect and merge the connected blocks of each color channel of the three color channels of the subtractive image, and the connected blocks in the binary image, in the vertical and horizontal directions respectively, to obtain candidate text areas in the target image;

The determining unit 124 is configured to extract a specific area on the target image at the position corresponding to the candidate text area, and to determine, based on the result of comparing the probability that the extracted specific area includes a text area with a preset probability threshold, whether the extracted specific area contains a text line or a text column.

Optionally, the subtractive binary processing unit 121 is further configured to quantize each of the red, green, and blue color channels of the target image into K levels to obtain K intervals, and to map the luminance of each pixel of the target image in each of the RGB color channels to the corresponding quantization interval of that channel, where K is an integer and 255 > K > 1.
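Quantizing a channel into K levels can be sketched with integer arithmetic. The text does not state how the K intervals are laid out; equal-width intervals over 0 to 255 are an assumption of this sketch, as are the function names.

```python
def quantize_channel(value, k):
    """Map an 8-bit channel value (0..255) to an interval index in 0..k-1."""
    if not 1 < k < 255:
        raise ValueError("K must satisfy 255 > K > 1")
    if not 0 <= value <= 255:
        raise ValueError("channel value must be in 0..255")
    return value * k // 256


def reduce_colors(pixel, k):
    """Quantize each of the R, G, B channels of one pixel independently."""
    return tuple(quantize_channel(v, k) for v in pixel)
```

With K = 4, for example, the pixel (10, 130, 250) maps to the interval indices (0, 2, 3), so nearby shades of the same stroke color fall into the same level.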

Optionally, the first merging unit 122 is further configured to treat each pixel in the subtractive image and in the binary image as a single connected block, establish a union-find (disjoint-set) structure over the pixels, and perform the following processing:

The first merging unit 122 is further configured to, if a pixel has the same color as any pixel adjacent to it, merge the connected blocks to which the two adjacent same-colored pixels belong into the same connected block.

The first merging unit 122 is further configured to determine the pixel area of each connected block and, if the pixel area of a connected block is smaller than a pixel area threshold, merge the connected block into a connected block adjacent to it and set the color of the connected block to the color of the connected block into which it is merged.
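Merging same-colored neighbors with a union-find structure can be sketched as follows. The 4-neighborhood, the nested-list image representation, and the function name `label_blocks` are assumptions; absorbing blocks below the pixel area threshold into a neighbor is not shown.

```python
def label_blocks(image):
    """Label same-colored 4-connected pixels of a 2D grid using union-find."""
    h, w = len(image), len(image[0])
    parent = list(range(h * w))  # every pixel starts as its own connected block

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(i, j):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri

    for y in range(h):
        for x in range(w):
            # Checking the right and lower neighbors covers every adjacent pair once.
            if x + 1 < w and image[y][x] == image[y][x + 1]:
                union(y * w + x, y * w + x + 1)
            if y + 1 < h and image[y][x] == image[y + 1][x]:
                union(y * w + x, (y + 1) * w + x)

    return [[find(y * w + x) for x in range(w)] for y in range(h)]


grid = [[1, 1, 0],
        [0, 1, 0],
        [0, 0, 0]]
labels = label_blocks(grid)
```

For a subtractive image the same routine would be run once per color channel on the quantized values, and once on the binary image.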

Optionally, the system further includes:

a discarding processing unit 125 configured to, after the first merging unit 122 has merged the connected blocks having the same color in the subtractive image and merged the connected blocks having the same color in the binary image, discard the connected blocks in the subtractive image and in the binary image that match the preset features; the preset features include at least one of the following:

a connected block whose area is smaller than the pixel area threshold;

a connected block of which the length of any side is greater than a first preset ratio of the corresponding side length of the image;

a connected block of which the length of any side is greater than a frame length threshold and the ratio of the pixel area to the bounding box area is smaller than a ratio threshold.

Optionally, the system further includes:

a third merging unit 126 configured to, after the first merging unit 122 has merged the connected blocks having the same color in the subtractive image and merged the connected blocks having the same color in the binary image, merge the connected blocks of each color channel in the subtractive image into new connected blocks based on their positional relationship, and merge the connected blocks in the binary image into new connected blocks based on their positional relationship;

The third merging unit 126 is further configured to perform at least one of the following processes:

merging connected blocks whose distance is less than a distance threshold;

taking the maximum of the averages of the length and width of any two of the connected blocks and, if the maximum satisfies a preset condition, merging the two selected connected blocks;

merging connected blocks whose bounding boxes intersect and whose intersecting portion conforms to the preset intersection feature;

merging connected blocks whose bounding boxes are aligned and satisfy the preset alignment merge rule.

Optionally, the second merging unit 123 is further configured to perform merging in the horizontal direction, merging in the vertical direction, and merging in the horizontal direction according to different types of connection merge rules; wherein the connection merge rule includes:

connecting two selected connected blocks that satisfy at least one of the following conditions into a new connected block:

the smaller of the center distance and the edge distance of the bounding boxes of the two connected blocks on the reference axis is less than a first preset ratio of the smaller of the two bounding boxes' side lengths along the reference axis;

the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axis is smaller than a second preset ratio of the smaller of the two bounding boxes' side lengths perpendicular to the reference axis;

the difference between the side lengths of the bounding boxes of the two connected blocks along the reference axis is smaller than a third preset ratio of the smaller of the two bounding boxes' side lengths along the reference axis.

Optionally, the determining unit 124 is further configured to extract a region of interest on the target image, obtain the bounding boxes produced by connection in the subtractive image and the binary image, slide a window over each bounding box with a specific sliding window step, and send each window to the convolutional neural network classifier to obtain the probability that each sliding window contains text;

The determining unit 124 is further configured to average the probabilities that the sliding windows contain text, and obtain the probability that the candidate text area includes a text line or a text column;

The determining unit 124 is further configured to determine that a text line or a text column exists in the region of interest if the obtained probability is greater than a preset probability threshold.

It is to be understood that the functional division of the text detection system shown in FIG. 12 is merely an example of how the functional structure of the electronic device provided by the embodiment of the present invention may be divided. Based on FIG. 12 and its description, a person skilled in the art can easily modify the functional structure, for example by merging some of the functional units or by further dividing them. Therefore, the functional structure of the text detection system provided by the embodiment of the present invention is not limited to that shown in FIG. 12.

An embodiment of the present invention provides a non-volatile storage medium storing executable instructions for executing the text detection method illustrated in FIG. 2 or FIG. 5. The storage medium includes any medium that can store program code, such as a mobile storage device, a random access memory (RAM), a read-only memory (ROM), a magnetic disk, or an optical disk.

In summary, the embodiments of the present invention have the following beneficial effects:

The embodiment of the invention provides a method, a system, a device and a storage medium for detecting characters in an image, which are suitable for locating characters such as printed Chinese characters in an image in a network album, and the output result can be used as an input of a character recognition system to help ultimately generate Accurate text recognition results.

It can be understood by those skilled in the art that all or part of the steps of the above method embodiments may be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes any medium that can store program code, such as a mobile storage device, a random access memory (RAM), a read-only memory (ROM), a magnetic disk, or an optical disk.

Alternatively, the above-described integrated unit of the present invention may be stored in a computer-readable storage medium if it is implemented in the form of a software function module and sold or used as a standalone product. Based on this understanding, the technical solution of the embodiments of the present invention may in essence be embodied in the form of a software product that is stored in a storage medium and includes a plurality of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the methods described in the various embodiments of the present invention. The foregoing storage medium includes various media that can store program code, such as a mobile storage device, a RAM, a ROM, a magnetic disk, or an optical disk.

The above is only a specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily think of changes or substitutions within the technical scope of the present invention. It should be covered by the scope of the present invention. Therefore, the scope of the invention should be determined by the scope of the appended claims.

Claims

A text detection method comprising:

Performing color reduction processing on each of the three color channels of the target image to obtain a subtractive image;

Converting the target image into a binary image;

Combining the connected blocks having the same color in the subtracted image, and merging the connected blocks having the same color in the binary image;

Connecting and merging the connected blocks of each color channel of the three color channels of the subtractive image, and the connected blocks in the binary image, in the vertical direction and the horizontal direction respectively, to obtain a candidate text area in the target image;

Extracting a specific area on the target image at the position corresponding to the candidate text area;

And determining, based on the result of comparing the probability that the extracted specific area includes a text area with a preset probability threshold, whether the extracted specific area includes a text line or a text column.
The method of claim 1, wherein the color reduction processing is performed on each of the three color channels of the target image to obtain a subtractive image, comprising:

Each of the three color channels of the target image is quantized by K levels to obtain K levels of intervals;

The luminance of each pixel in the target image in the red, green and blue color channels is mapped into the corresponding channel quantization interval, K is an integer and 255>K>1.
The method of claim 1, wherein the combining the connected blocks having the same color in the subtractive image, and combining the connected blocks having the same color in the binary image, comprises:

Treating each pixel in the subtractive image and in the binary image as a single connected block, establishing a union-find (disjoint-set) structure over the pixels, and performing the following processing:

If a pixel has the same color as any pixel adjacent to it, merging the connected blocks to which the two adjacent same-colored pixels belong into the same connected block;

Determining the pixel area of each connected block and, if the pixel area of a connected block is less than the pixel area threshold, merging the connected block into a connected block adjacent to it and setting the color of the connected block to the color of the connected block into which it is merged.
The method of claim 1 further comprising:

After merging the connected blocks having the same color in the subtractive image and merging the connected blocks having the same color in the binary image,

Discarding the connected blocks in the subtractive image and in the binary image that match a preset feature;

The preset feature includes at least one of the following:

a connected block whose area is smaller than a pixel area threshold;

a connected block of which the length of any side is greater than a first preset ratio of the corresponding side length of the image;

a connected block of which the length of any side is greater than a frame length threshold and the ratio of the pixel area to the bounding box area is smaller than a ratio threshold.
The method of claim 1 further comprising:

After merging the connected blocks having the same color in the subtractive image and merging the connected blocks having the same color in the binary image,

Merging the connected blocks of each color channel in the subtractive image into new connected blocks based on their positional relationship;

And merging the connected blocks in the binary image into new connected blocks based on their positional relationship;

The merging into new connected blocks includes at least one of the following:

merging connected blocks whose distance is less than a distance threshold;

taking the maximum of the averages of the length and width of any two of the connected blocks and, if the maximum satisfies a preset condition, merging the two selected connected blocks;

merging connected blocks whose bounding boxes intersect and whose intersecting portion conforms to the preset intersection feature;

merging connected blocks whose bounding boxes are aligned and satisfy the preset alignment merge rule.
The method of claim 1, wherein the connecting and merging of the connected blocks of each color channel of the three color channels of the subtractive image and the connected blocks in the binary image in the vertical and horizontal directions respectively, to obtain candidate text areas in the target image, comprises:

Merging in the horizontal direction, merging in the vertical direction, and merging in the horizontal direction based on different types of connection merge rules;

The connection merge rule includes at least one of the following:

the smaller of the center distance and the edge distance of the bounding boxes of the two connected blocks on the reference axis is less than a first preset ratio of the smaller of the two bounding boxes' side lengths along the reference axis;

the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axis is smaller than a second preset ratio of the smaller of the two bounding boxes' side lengths perpendicular to the reference axis;

the difference between the side lengths of the bounding boxes of the two connected blocks along the reference axis is smaller than a third preset ratio of the smaller of the two bounding boxes' side lengths along the reference axis.
The method according to any one of claims 1 to 6, wherein the extracting of a specific area on the target image at the position corresponding to the candidate text area, and the determining, based on the result of comparing the probability that the extracted specific area includes a text area with the preset probability threshold, of whether the extracted specific area contains a text line or a text column, comprise:

Extracting a specific area on the target image, and obtaining the bounding boxes produced by connection in the subtractive image and the binary image;

Sliding a window over each of the bounding boxes with a specific sliding window step, sending each window to a convolutional neural network classifier, and obtaining the probability that each sliding window contains text;

Averaging the probabilities that the sliding windows contain text, and obtaining the probability that the candidate text area includes a text line or a text column;

If the obtained probability is greater than the preset probability threshold, determining that a text line or a text column exists in the specific area.
A text detection system comprising:

a subtractive binary processing unit configured to perform color reduction processing on each of the three color channels of the target image to obtain a subtractive image;

The subtractive binary processing unit is further configured to convert the target image into a binary image;

a first merging unit configured to merge connected blocks having the same color in the reduced color image;

The first merging unit is further configured to merge the connected blocks having the same color in the binary image;

a second merging unit configured to connect and merge the connected blocks of each color channel of the three color channels of the subtractive image, and the connected blocks in the binary image, in the vertical and horizontal directions respectively, to obtain a candidate text area in the target image;

a determining unit configured to extract a specific area on the target image corresponding to a position of the candidate text area;

The determining unit is further configured to determine whether the extracted specific area includes a character line or a character string based on a comparison result between the extracted probability of including the text area in the specific area and a preset probability threshold.
The system of claim 8 wherein

The subtractive binary processing unit is further configured to quantize each of the three color channels of the target image into K levels to obtain K intervals;

The subtractive binary processing unit is further configured to map the luminance of each pixel of the target image in each color channel to the corresponding quantization interval of that channel, where K is an integer and 255 > K > 1.
The system of claim 8 wherein

The first merging unit is further configured to treat each pixel in the subtractive image and in the binary image as a single connected block, establish a union-find (disjoint-set) structure over the pixels, and perform the following processing:

The first merging unit is further configured to, if a pixel has the same color as any pixel adjacent to it, merge the connected blocks to which the two adjacent same-colored pixels belong into the same connected block;

The first merging unit is further configured to determine the pixel area of each connected block and, if the pixel area of a connected block is smaller than a pixel area threshold, merge the connected block into a connected block adjacent to it and set the color of the connected block to the color of the connected block into which it is merged.
The system of claim 8 further comprising:

a discarding processing unit configured to, after the first merging unit has merged the connected blocks having the same color in the subtractive image and merged the connected blocks having the same color in the binary image, discard the connected blocks in the subtractive image and in the binary image that match a preset feature; the preset feature includes at least one of the following:

a connected block whose area is smaller than a pixel area threshold;

a connected block of which the length of any side is greater than a first preset ratio of the corresponding side length of the image;

a connected block of which the length of any side is greater than a frame length threshold and the ratio of the pixel area to the bounding box area is smaller than a ratio threshold.
The system of claim 8 further comprising:

a fourth merging unit configured to, after the first merging unit has merged the connected blocks having the same color in the subtractive image and merged the connected blocks having the same color in the binary image, merge the connected blocks of each color channel in the subtractive image into new connected blocks based on their positional relationship, and merge the connected blocks in the binary image into new connected blocks based on their positional relationship;

The fourth merging unit is further configured to merge into new connected blocks in at least one of the following manners:

merging connected blocks whose distance is less than a distance threshold;

taking the maximum of the averages of the length and width of any two of the connected blocks and, if the maximum satisfies a preset condition, merging the two selected connected blocks;

merging connected blocks whose bounding boxes intersect and whose intersecting portion conforms to the preset intersection feature;

merging connected blocks whose bounding boxes are aligned and satisfy the preset alignment merge rule.
The system of claim 8 wherein

The second merging unit is further configured to perform merging in the horizontal direction, merging in the vertical direction, and merging in the horizontal direction according to different types of connection merge rules; wherein the connection merge rule includes:

connecting two selected connected blocks that satisfy at least one of the following conditions into a new connected block:

the smaller of the center distance and the edge distance of the bounding boxes of the two connected blocks on the reference axis is less than a first preset ratio of the smaller of the two bounding boxes' side lengths along the reference axis;

the distance between the bounding boxes of the two connected blocks in the direction perpendicular to the reference axis is smaller than a second preset ratio of the smaller of the two bounding boxes' side lengths perpendicular to the reference axis;

the difference between the side lengths of the bounding boxes of the two connected blocks along the reference axis is smaller than a third preset ratio of the smaller of the two bounding boxes' side lengths along the reference axis.
A system according to any one of claims 8 to 13, wherein

The determining unit is further configured to extract the specific area on the target image, slide a window with a specific sliding-window step over the bounding boxes in the subtractive image and the binary image, send the content of each sliding window to a convolutional neural network classifier, and obtain a probability that each of the sliding windows contains characters;

The determining unit is further configured to average the probabilities that the sliding windows contain characters, and obtain a probability that the candidate text area includes a text row or a text column;

The determining unit is further configured to determine that a text row or a text column exists in the specific area if the obtained probability is greater than the preset probability threshold.
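The sliding-window scoring described above can be sketched in Python as follows. The convolutional neural network classifier is represented by a caller-supplied function `classify`, which is a placeholder and not a real library call; the horizontal sliding direction, the tail-window handling and all names are assumptions for illustration.

```python
def window_starts(length, window, step):
    """Start offsets of sliding windows covering the range [0, length)."""
    if length <= window:
        return [0]
    starts = list(range(0, length - window + 1, step))
    if starts[-1] != length - window:
        starts.append(length - window)  # extra window so the tail is covered
    return starts


def text_probability(box, window, step, classify):
    """Average per-window probability over a box (x0, y0, x1, y1).

    `classify` is a hypothetical stand-in for the CNN classifier: it takes a
    window box and returns the probability that the window contains characters.
    """
    x0, y0, x1, y1 = box
    probs = [
        classify((x0 + s, y0, min(x0 + s + window, x1), y1))
        for s in window_starts(x1 - x0, window, step)
    ]
    return sum(probs) / len(probs)


def contains_text(box, window, step, classify, threshold=0.5):
    """True if the averaged probability exceeds the preset threshold."""
    return text_probability(box, window, step, classify) > threshold
```

For a vertical text column, the same idea would apply with the window sliding along the y axis instead.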
A text detection device, comprising: a memory and a processor, wherein the memory stores executable instructions for causing the processor to perform the following operations:

Performing color reduction processing on each of the three color channels of the target image to obtain a subtractive image;

Converting the target image into a binary image;

Merging the connected blocks having the same color in the subtractive image, and merging the connected blocks having the same color in the binary image;

Merging, by connection in the vertical direction and in the horizontal direction respectively, the connected blocks of each of the three color channels of the subtractive image and the connected blocks in the binary image, to obtain a candidate text area in the target image;

Extracting a specific region on the target image corresponding to a position of the candidate text region;

And determining, based on a comparison between the probability that the extracted specific area includes a text area and a preset probability threshold, whether the extracted specific area includes a text row or a text column.
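The first three operations above (color reduction per channel, binarization, and grouping same-colored pixels into connected blocks) can be sketched in plain Python on nested lists. The number of quantization levels, the luma weights, the binarization threshold, the use of 4-connectivity and all function names are assumptions; the claim does not fix any of them.

```python
def reduce_channel(channel, levels=4):
    """Quantize 0-255 values of one color channel into `levels` bin indices."""
    bin_width = 256 // levels
    return [[min(v // bin_width, levels - 1) for v in row] for row in channel]


def binarize(rgb, threshold=128):
    """1 where the integer luma of an (r, g, b) pixel >= threshold, else 0."""
    return [
        [1 if (299 * r + 587 * g + 114 * b) // 1000 >= threshold else 0
         for (r, g, b) in row]
        for row in rgb
    ]


def connected_blocks(grid):
    """List of (value, (x0, y0, x1, y1)) for 4-connected equal-valued regions."""
    h, w = len(grid), len(grid[0])
    seen = [[False] * w for _ in range(h)]
    blocks = []
    for y in range(h):
        for x in range(w):
            if seen[y][x]:
                continue
            value = grid[y][x]
            seen[y][x] = True
            stack = [(x, y)]
            x0 = x1 = x
            y0 = y1 = y
            while stack:  # iterative flood fill over same-valued neighbours
                cx, cy = stack.pop()
                x0, x1 = min(x0, cx), max(x1, cx)
                y0, y1 = min(y0, cy), max(y1, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                               (cx, cy + 1), (cx, cy - 1)):
                    if (0 <= nx < w and 0 <= ny < h
                            and not seen[ny][nx] and grid[ny][nx] == value):
                        seen[ny][nx] = True
                        stack.append((nx, ny))
            blocks.append((value, (x0, y0, x1 + 1, y1 + 1)))
    return blocks
```

Under these assumptions, `connected_blocks` would be applied to each reduced color channel and to the binary image, and the resulting bounding boxes passed to the later merging steps.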
A storage medium storing executable instructions for performing the text detection method according to any one of claims 1 to 7.