CN113498521A - Text detection method and device and storage medium

Info

Publication number
CN113498521A
Authority
CN
China
Prior art keywords
text
image
pixel
feature map
map group
Prior art date
Legal status
Pending
Application number
CN202080000057.5A
Other languages
Chinese (zh)
Inventor
李月
黄光伟
饶天珉
Current Assignee
BOE Technology Group Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Priority date
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Abstract

A text detection method and device and a storage medium are provided. The text detection method comprises: acquiring a text feature image corresponding to a text image based on the text image; taking a partial region of the text feature image close to a first edge of the text feature image as a base region, wherein the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected is close to the first edge of the text image, and at least part of the pixels in the base region are positive pixels; grouping at least part of the positive pixels in the base region to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box onto the text image to obtain at least one text box, wherein the at least one text box comprises a text box of the text to be detected.

Description

Text detection method and device and storage medium

Technical Field
Embodiments of the present disclosure relate to a text detection method, a text detection apparatus, and a storage medium.
Background
With the development of science and technology, when a user reading a foreign-language article encounters an unfamiliar word, the user is no longer limited to looking it up in a paper dictionary, an electronic dictionary, a mobile phone application (APP), or the like, and may also, for example, use a translation pen. A paper dictionary is not easy to carry and is inefficient for browsing and lookup; a mobile phone APP or an electronic dictionary requires keyboard input, which is time-consuming and cumbersome, tends to interrupt the reader's train of thought, and disperses attention. By contrast, a translation pen is convenient to use, easy to carry, and closer to the user's reading habits, and can provide a good translation and lookup experience when the user reads foreign-language articles.
Disclosure of Invention
At least one embodiment of the present disclosure provides a text detection method, including: acquiring a text feature image corresponding to a text image based on the text image; taking a partial region of the text feature image close to a first edge of the text feature image as a base region, wherein the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the base region are positive pixels; grouping at least part of the positive pixels in the base region to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box onto the text image to obtain at least one text box, wherein the at least one text box comprises the text box of the text to be detected.
For example, in the text detection method provided by some embodiments of the present disclosure, in the case that the text feature image includes h rows and w columns of pixels, the base region includes h_base rows and w columns of pixels, where h, w, and h_base are all positive integers, and h_base/h ≤ 1/2.
For example, in the text detection method provided by some embodiments of the present disclosure, each pixel in the text feature image has a connection probability with each of its directly adjacent pixels; and grouping at least part of the positive pixels in the base region to obtain the at least one connected domain comprises: grouping, based on a union-find algorithm, the at least part of the positive pixels in the base region according to the connection probability between each of these positive pixels and its directly adjacent pixels, to obtain the at least one connected domain.
For example, in a text detection method provided by some embodiments of the present disclosure, grouping, based on the union-find algorithm, the at least part of the positive pixels in the base region according to the connection probability between each positive pixel and its directly adjacent pixels to obtain the at least one connected domain comprises: constructing an index set based on the at least part of the positive pixels in the base region, wherein the index set comprises the at least part of the positive pixels in the base region, each positive pixel corresponds to one root node in the index set, and the initial value of the root node of each positive pixel is the positive pixel itself; in response to any directly adjacent pixel of a positive pixel in the index set being a positive pixel and the two pixels having a positive connection relationship, setting the value of the root node of the directly adjacent pixel to the value of the root node of that positive pixel; and taking each group of positive pixels having the same root node value as one connected domain, so as to obtain the at least one connected domain.
For example, in a text detection method provided by some embodiments of the present disclosure, in the case that the connection probability between a positive pixel in the base region and an immediately adjacent pixel is greater than a connection probability threshold, it is determined that the positive pixel and the immediately adjacent pixel have the positive connection relationship.
For example, in the text detection method provided by some embodiments of the present disclosure, the pixels immediately adjacent to each positive pixel in the base region include: the pixels directly adjacent to the positive pixel in a first direction perpendicular to the first edge of the text feature image, and the pixels directly adjacent to the positive pixel in a second direction parallel to the first edge of the text feature image.
For example, in some embodiments of the present disclosure, a text detection method is provided in which each positive pixel in the base region has four directly adjacent pixels.
For example, in a text detection method provided by some embodiments of the present disclosure, expanding the at least one connected component in a direction away from the first edge of the text feature image to obtain the at least one final connected component corresponding to the at least one connected component includes: extracting a positive pixel in a current connected domain, which is farthest from a first edge of the text feature image in a first direction perpendicular to the first edge of the text feature image, as a first positive pixel; taking a pixel in the text feature image that is on a side of the first positive pixel away from a first edge of the text feature image and that is directly adjacent to the first positive pixel as a first adjacent pixel; in response to the first neighboring pixel being a positive pixel and the first positive pixel having a positive connection relationship with the first neighboring pixel, modifying a value of a root node of the first neighboring pixel to a value of a root node of the first positive pixel and adding the first neighboring pixel to a first set of neighboring pixels; expanding the first set of neighboring pixels in a second direction parallel to a first edge of the text feature image; and expanding the current connected component to include all pixels in the first adjacent pixel set, and continuing to expand the current connected component in a direction far away from the first edge of the text feature image until expansion cannot be continued.
For example, in some embodiments of the present disclosure, a text detection method is provided, in which expanding the first neighboring pixel set in a second direction parallel to a first edge of the text feature image includes: adding a positive pixel directly adjacent to any pixel in the first set of adjacent pixels in a second direction parallel to a first edge of the text feature image and having a positive connection relationship to the first set of adjacent pixels until expansion of the first set of adjacent pixels in a direction parallel to the first edge of the text feature image cannot continue.
For example, in the text detection method provided by some embodiments of the present disclosure, the at least one final connected domain includes any connected domain in the base region that cannot be expanded in the direction away from the first edge of the text feature image.
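The expansion procedure described above can be sketched as follows. This is a minimal sketch under several assumptions: the text feature image is a boolean NumPy array of positive pixels whose last row is the first (lower) edge, so "away from the first edge" means decreasing row index; a connected domain is a set of (row, column) positions; the positive connection relationships have been precomputed into two boolean arrays; and, for simplicity, the whole row of pixels farthest from the first edge is advanced at once rather than one first positive pixel at a time. All names are illustrative, not taken from the disclosure.

```python
import numpy as np

def expand_component(component: set, pos: np.ndarray,
                     link_up: np.ndarray, link_lr: np.ndarray) -> set:
    """Grow one connected domain away from the first edge (assumed to be the bottom edge).

    component: set of (row, col) positions of the current connected domain.
    pos: boolean text feature image, True for positive pixels.
    link_up[r, c]: True if (r, c) and (r - 1, c) have a positive connection relationship.
    link_lr[r, c]: True if (r, c) and (r, c + 1) have a positive connection relationship.
    """
    h, w = pos.shape
    component = set(component)
    while True:
        top_row = min(r for r, _ in component)   # row farthest from the first edge
        if top_row == 0:
            break                                # reached the opposite edge
        # First adjacent pixels: positive, positively connected, on the far side.
        frontier = {(top_row - 1, c) for r, c in component
                    if r == top_row and pos[top_row - 1, c] and link_up[top_row, c]}
        # Expand the adjacent-pixel set sideways, parallel to the first edge.
        changed = True
        while changed:
            changed = False
            for r, c in list(frontier):
                for cc in (c - 1, c + 1):
                    if 0 <= cc < w and (r, cc) not in frontier \
                            and pos[r, cc] and link_lr[r, min(c, cc)]:
                        frontier.add((r, cc))
                        changed = True
        if not frontier:
            break                                # expansion cannot continue
        component |= frontier                    # then keep growing away from the edge
    return component
```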
For example, in a text detection method provided in some embodiments of the present disclosure, acquiring the text feature image corresponding to the text image based on the text image includes: and processing the text image by using a text detection neural network to obtain the text characteristic image and obtain the connection probability between each pixel in the text characteristic image and the directly adjacent pixel.
For example, in a text detection method provided in some embodiments of the present disclosure, the text detection neural network includes first to sixth convolution modules, first to fifth downsampling modules, first to fifth dimension reduction modules, first to fourth upsampling modules, and a classifier; and processing the text image by using the text detection neural network to obtain the text feature image and obtain the connection probability between each pixel in the text feature image and its directly adjacent pixels comprises the following steps: performing convolution processing on the text image by using the first convolution module to obtain a first convolution feature map group; performing downsampling processing on the first convolution feature map group by using the first downsampling module to obtain a first downsampling feature map group; performing convolution processing on the first downsampling feature map group by using the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature map group by using the second downsampling module to obtain a second downsampling feature map group, and performing dimension reduction processing on the second convolution feature map group by using the fifth dimension reduction module to obtain a fifth dimension reduction feature map group; performing convolution processing on the second downsampling feature map group by using the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature map group by using the third downsampling module to obtain a third downsampling feature map group, and performing dimension reduction processing on the third convolution feature map group by using the fourth dimension reduction module to obtain a fourth dimension reduction feature map group; performing convolution processing on the third downsampling feature map group by using the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature map group by using the fourth downsampling module to obtain a fourth downsampling feature map group, and performing dimension reduction processing on the fourth convolution feature map group by using the third dimension reduction module to obtain a third dimension reduction feature map group; performing convolution processing on the fourth downsampling feature map group by using the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature map group by using the fifth downsampling module to obtain a fifth downsampling feature map group, and performing dimension reduction processing on the fifth convolution feature map group by using the second dimension reduction module to obtain a second dimension reduction feature map group; performing convolution processing on the fifth downsampling feature map group by using the sixth convolution module to obtain a sixth convolution feature map group; performing upsampling processing on the sixth convolution feature map group by using the first upsampling module to obtain a first upsampling feature map group; performing dimension reduction processing on the first upsampling feature map group by using the first dimension reduction module to obtain a first dimension reduction feature map group; fusing the first dimension reduction feature map group and the second dimension reduction feature map group to obtain a first fused feature map group; performing upsampling processing on the first fused feature map group by using the second upsampling module to obtain a second upsampling feature map group; fusing the second upsampling feature map group and the third dimension reduction feature map group to obtain a second fused feature map group; performing upsampling processing on the second fused feature map group by using the third upsampling module to obtain a third upsampling feature map group; fusing the third upsampling feature map group and the fourth dimension reduction feature map group to obtain a third fused feature map group; performing upsampling processing on the third fused feature map group by using the fourth upsampling module to obtain a fourth upsampling feature map group; fusing the fourth upsampling feature map group and the fifth dimension reduction feature map group to obtain a fourth fused feature map group; classifying the fourth fused feature map group by using the classifier to obtain a text classification prediction image and a connection probability prediction image; and obtaining the text feature image based on the text classification prediction image and the connection probability prediction image, and obtaining the connection probability between each pixel in the text feature image and its directly adjacent pixels.
For example, in a text detection method provided by some embodiments of the present disclosure, each pixel in the text classification prediction image has a type probability, and each pixel in the connection probability prediction image has a connection probability with each of its directly adjacent pixels; and obtaining the text feature image and the connection probability between each pixel in the text feature image and its directly adjacent pixels based on the text classification prediction image and the connection probability prediction image comprises: taking each pixel in the text classification prediction image whose type probability is greater than or equal to a type probability threshold as a positive pixel, and taking each pixel in the text classification prediction image whose type probability is smaller than the type probability threshold as a negative pixel, so as to obtain the text feature image, wherein the connection probability between each pixel in the text feature image and its directly adjacent pixels can be correspondingly queried from the connection probability prediction image.
For example, in a text detection method provided in some embodiments of the present disclosure, determining the at least one feature box corresponding to the at least one final connected domain includes: performing contour detection on the at least one final connected domain by using a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain by using a minimum bounding rectangle (minimum circumscribed rectangle) algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.
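As one possible concrete form of this step, the sketch below uses OpenCV contour detection and minimum-area rectangles; the choice of library and the OpenCV 4 return signature of findContours are assumptions, since the disclosure does not name a specific implementation.

```python
import cv2
import numpy as np

def feature_boxes_from_components(component_mask: np.ndarray):
    """component_mask: uint8 binary image in which pixels of final connected domains are 255."""
    # Contour detection on the final connected domains.
    contours, _ = cv2.findContours(component_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        # Minimum bounding (circumscribed) rectangle of the contour; may be rotated.
        rect = cv2.minAreaRect(contour)
        box = cv2.boxPoints(rect).astype(np.int64)  # 4 corner points of the feature box
        boxes.append(box)
    return boxes
```

Each resulting feature box can then be mapped onto the text image to obtain the corresponding text box, e.g. by scaling the corner coordinates by the ratio between the text image size and the text feature image size.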
For example, some embodiments of the present disclosure provide a text detection method, further including: and determining the text box of the text to be detected from the at least one text box.
For example, in a text detection method provided in some embodiments of the present disclosure, determining a text box of the text to be detected from the at least one text box includes: constructing a virtual detection box in the text image; and calculating the overlapping area of the virtual detection box and each text box, and taking the text box with the largest overlapping area with the virtual detection box as the text box of the text to be detected.
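A minimal sketch of this selection step, assuming each text box has been reduced to an axis-aligned rectangle (x0, y0, x1, y1) and that the virtual detection box has been constructed near the known pen-tip position; the function names are illustrative.

```python
def box_overlap_area(a, b):
    """Overlap area of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)

def select_target_text_box(text_boxes, virtual_box):
    """Take the text box with the largest overlap area with the virtual detection box."""
    return max(text_boxes, key=lambda box: box_overlap_area(box, virtual_box))
```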
For example, some embodiments of the present disclosure provide a text detection method, further including: and identifying the text to be detected based on the text box of the text to be detected.
For example, some embodiments of the present disclosure provide a text detection method, further including: acquiring the text image by using an image acquisition element of a translation pen; when the text image is acquired, the pen tip of the translation pen is located on the side of the text to be detected that is close to the first edge of the text image, and the text image includes the text to be detected.
At least one embodiment of the present disclosure further provides a text detection apparatus, including: a memory for storing a text image and computer readable instructions; and the processor is used for reading the text image and executing the computer readable instructions, and when the computer readable instructions are executed by the processor, the text detection method provided by any embodiment of the disclosure is executed.
For example, some embodiments of the present disclosure provide a text detection apparatus, further including: and the image acquisition element is used for acquiring the text image.
For example, in the text detection apparatus provided in some embodiments of the present disclosure, the text detection apparatus is a translation pen, wherein the image acquisition element is disposed on the translation pen, and the translation pen is used for selecting the text to be detected.
At least one embodiment of the present disclosure also provides a storage medium that stores non-transitory computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, can perform the text detection method provided by any one of the embodiments of the present disclosure.
Drawings
To more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings of the embodiments will be briefly introduced below, and it is apparent that the drawings in the following description relate only to some embodiments of the present disclosure and are not limiting to the present disclosure.
FIG. 1 is a schematic diagram illustrating the operation of a translation pen;
fig. 2 is an exemplary flowchart of a text detection method according to at least one embodiment of the present disclosure;
fig. 3 is a schematic diagram of a text image according to at least one embodiment of the present disclosure;
fig. 4 is a schematic diagram of a text detection neural network according to at least one embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a pixel adjacency relationship provided by at least one embodiment of the present disclosure;
fig. 6 is a schematic diagram of a text feature image according to at least one embodiment of the present disclosure;
fig. 7 is an exemplary flowchart corresponding to step S400 shown in fig. 2 provided in at least one embodiment of the present disclosure;
fig. 8 is an exemplary flowchart corresponding to step S600 shown in fig. 2 provided in at least one embodiment of the present disclosure;
fig. 9 is an operation diagram corresponding to step S600 shown in fig. 2 according to at least one embodiment of the present disclosure;
fig. 10 is a schematic block diagram of a text detection apparatus according to at least one embodiment of the present disclosure; and
fig. 11 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described below clearly and completely in conjunction with the accompanying drawings of the embodiments of the present disclosure. It is to be understood that the described embodiments are only a few embodiments of the present disclosure, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the described embodiments of the disclosure without any inventive step, are within the scope of protection of the disclosure.
Unless otherwise defined, technical or scientific terms used herein shall have the ordinary meaning as understood by one of ordinary skill in the art to which this disclosure belongs. The use of "first," "second," and similar terms in this disclosure is not intended to indicate any order, quantity, or importance, but rather is used to distinguish one element from another. The word "comprising" or "comprises", and the like, means that the element or item listed before the word covers the element or item listed after the word and its equivalents, but does not exclude other elements or items. The terms "connected" or "coupled" and the like are not restricted to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "upper", "lower", "left", "right", and the like are used merely to indicate relative positional relationships, and when the absolute position of the object being described is changed, the relative positional relationships may also be changed accordingly.
The present disclosure is illustrated by the following specific examples. A detailed description of known functions and known parts (elements) may be omitted in order to keep the following description of the embodiments of the present disclosure clear and concise. When any element of an embodiment of the present disclosure appears in more than one drawing, that element is identified in each drawing by the same or similar reference numeral.
Translation pens generally include scanning translation pens and point-translation (click-to-translate) pens. When a scanning translation pen is used, the pen body must be held vertically and slid (i.e., scanned) over the text to be translated; this usage differs from ordinary pen-using habits, so a user usually needs an adaptation process when starting to use a scanning translation pen. Different from the scanning translation pen, when a point-translation pen is used, the pen tip only needs to be pointed slightly below the text to be translated, and the corresponding recognition and translation can then be performed.
The working principle of a current translation pen is mainly as follows: first, the pen tip of the translation pen is pointed below the text to be detected (such as, but not limited to, an English word), and a camera on the pen body of the translation pen is used to capture a text image, for example the text image shown in fig. 1; then, traversal text detection processing is performed at every pixel position of the whole text image to obtain all text boxes in the text image (as shown by the solid-line boxes surrounding each word in fig. 1); finally, the text box near the pen tip, namely the text box of the text to be detected (i.e., the text box surrounding the text to be detected), is found, and the text in that text box is recognized and translated. When text detection is performed in this way, the whole text image needs to be traversed, yet most of the text boxes detected in the text image are redundant (i.e., irrelevant to the text to be detected), which limits the response speed of the translation pen and reduces its working efficiency.
If text detection and recognition can focus only on the region near the pen tip (i.e., the lower region of the text image shown in fig. 1), the processing speed can be greatly increased, and the response time and the occupation of computing resources can be reduced. However, since the translation pen needs to recognize texts with different font sizes, artificially fixing the region to be detected in the text image may cause the following problems: on the one hand, if the artificially defined region to be detected is too large, the beneficial effects (i.e., increased processing speed, reduced response time and computing resource occupation, etc.) are not obvious; on the other hand, if the artificially defined region to be detected is too small, it may not cover text in a large font, so that such text cannot be completely detected and recognized, which limits the applicable range of the translation pen.
At least one embodiment of the present disclosure provides a text detection method. The detection method includes: acquiring a text feature image corresponding to a text image based on the text image; taking a partial region of the text feature image close to a first edge of the text feature image as a base region, wherein the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the base region are positive pixels; grouping at least part of the positive pixels in the base region to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box onto the text image to obtain at least one text box, wherein the at least one text box includes a text box of the text to be detected.
Some embodiments of the present disclosure also provide a text detection apparatus and a storage medium corresponding to the above text detection method.
The text detection method provided by the embodiments of the present disclosure performs text detection using the idea of connected domains starting from a preset base region, thereby reducing the amount of computation of text detection (i.e., reducing the number of traversals) and reducing the response time of text detection. The text detection method is suitable for translation pens and the like, and can improve the processing speed of a translation pen and the user experience.
Some embodiments of the present disclosure and examples thereof are described in detail below with reference to the accompanying drawings.
Fig. 2 is an exemplary flowchart of a text detection method according to at least one embodiment of the present disclosure. For example, the text detection method provided by the embodiment of the disclosure may be applied to a text image acquired by a translation pen, but is not limited thereto. For example, as shown in fig. 2, the text detection method includes, but is not limited to, steps S100 to S600.
Step S100: based on the text image, a text feature image corresponding to the text image is acquired.
For example, in step S100, the text image may include an image photographed by an image pickup device or element. For example, in some embodiments, the text detection method further includes, before step S100, step S000: a text image is collected.
For example, in some examples, a text image may be captured using, for example, a translation pen. For example, the translation pen may include an image acquisition element, such as a camera; for example, the camera may be disposed on the body of the translation pen. Thus, step S000, i.e., capturing a text image, may be performed using the translation pen (the camera on the translation pen). For example, when capturing a text image using the image acquisition element of the translation pen, the pen tip of the translation pen is generally pointed below the text to be detected, so that, relative to the text image, the pen tip of the translation pen points at the side of the text to be detected that is close to an edge of the text image. To distinguish it from the other edges of the text image, this edge is referred to as the first edge of the text image (see the first edge FE of the text image in fig. 3).
For example, the text image may be a grayscale image or a color image. The shape of the text image may be rectangular, diamond, circular, etc., and embodiments of the present disclosure are not limited in this regard. In the embodiments of the present disclosure, the text image is illustrated as a rectangle, but should not be construed as a limitation to the present disclosure.
For example, the text image may be an original image directly captured by the image capturing device or element, or may be an image obtained by preprocessing the original image. For example, to avoid the influence of data quality, data imbalance and the like of the text image on the character recognition, the text detection method provided by the embodiment of the disclosure may further include an operation of preprocessing the text image before performing text detection on the text image. Preprocessing may eliminate extraneous or noisy information in the text image to facilitate better processing of the text image. The pre-processing may include, for example, scaling, cropping, Gamma (Gamma) correction, image enhancement, or noise reduction filtering of the text image.
For example, the text image includes at least one text including the text to be detected. For example, the text to be detected is typically near a first edge (e.g., a lower edge) of the text image. It should be noted that the text to be detected is the text that the user wants to detect. A text image refers to a form in which text is visually presented, such as a picture, video, etc. including the text.
For example, the text to be detected may include: a word in a language such as English, French, German, or Spanish, or a word or phrase in a language such as Chinese, Japanese, or Korean; but is not limited thereto.
Fig. 3 is a schematic diagram of a text image according to at least one embodiment of the present disclosure. For example, as shown in fig. 3, the text image includes a plurality of texts; for example, a text may be an English word (e.g., "technology", "the" in fig. 3, etc.), one or a string of digits (e.g., "61622214" in fig. 3, etc.), and the like, but is not limited thereto. For example, in the text image shown in fig. 3, the text to be detected may be "technology"; for example, in some examples, when "Tencent" is selected as the text to be detected using the translation pen, the pen tip of the translation pen is placed below "Tencent" (near the first edge FE), and a camera provided on the body of the translation pen is used to capture the text image shown in fig. 3.
For example, in some embodiments, in step S100, the text image may be processed using a text detection neural network to obtain a text feature image, and obtain a connection probability between each pixel in the text feature image and an immediately adjacent pixel.
Fig. 4 is a schematic diagram of a text detection neural network according to at least one embodiment of the present disclosure. For example, as shown in fig. 4, the text detection neural network includes first to sixth convolution modules, first to fifth downsampling modules, first to fifth dimension reduction modules, first to fourth upsampling modules, and a classifier.
For example, each of the first through sixth convolution modules may include a convolution layer. Convolution layers are the core layers of a convolutional neural network. A convolution layer applies several convolution kernels (also called filters) to the input image to extract various types of features of the input image; each convolution kernel can extract one type of feature. A convolution kernel is generally initialized as a matrix of random fractional numbers, and reasonable weights are learned during the training of the convolutional neural network. The result obtained after applying one convolution kernel to the input image is called a feature image (feature map), and the number of feature images is equal to the number of convolution kernels. For example, in the embodiments of the present disclosure, as shown in fig. 4, the text image is taken as the input image. It should be noted that the embodiments of the present disclosure do not limit the number of convolution layers included in the first to sixth convolution modules.
For example, in some embodiments, each of the convolution modules described above may also include an activation layer. The activation layer includes an activation function that introduces non-linear factors into the convolutional neural network so that the convolutional neural network can better solve more complex problems. The activation function may include a rectified linear unit (ReLU) function, a leaky rectified linear unit (LeakyReLU) function, a sigmoid function, or a hyperbolic tangent (tanh) function, etc. The ReLU function and the LeakyReLU function are unsaturated nonlinear functions, and the sigmoid function and the tanh function are saturated nonlinear functions.
For example, in some embodiments, each convolution module described above may also include, for example, a Batch Normalization (BN) layer or the like. For example, the batch normalization layer is configured to perform batch normalization processing on feature images of a small batch (mini-batch) of samples (i.e., input images) so that the gray-scale values of pixels of each feature image vary within a predetermined range, thereby reducing the calculation difficulty and improving the contrast. For example, the predetermined range may be [ -1, 1], but is not limited thereto. For example, the batch normalization layer may perform batch normalization on each feature image according to the mean and variance of the feature images of each small batch of samples.
For example, each of the first to fifth downsampling modules may include a downsampling layer. On one hand, the down-sampling layer can be used for reducing the scale of an input image, simplifying the complexity of calculation and reducing the phenomenon of overfitting to a certain extent; on the other hand, the downsampling layer may perform feature compression to extract main features of the input image. The downsampling layer can reduce the size of the feature images without changing the number of feature images. For example, an input image with a size of 12 × 12 is sampled by a 2 × 2 downsampling layer filter, and then a 6 × 6 feature image can be obtained, which means that 4 pixels on the input image are combined into 1 pixel in the feature image.
For example, the downsampling layer may perform downsampling processing by using a downsampling method such as max pooling, average pooling, strided convolution, decimation (e.g., selecting fixed pixels), or demultiplexing output (splitting an input image into a plurality of smaller images). For example, in some embodiments, the downsampling factors of the downsampling layers in the first to fifth downsampling modules are each 1/(2 × 2), and the present disclosure includes but is not limited to this.
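As a concrete instance of one of these options, the short sketch below (PyTorch is an assumed choice of framework) shows 2 × 2 max pooling turning a 12 × 12 input into a 6 × 6 feature image, matching the example above.

```python
import torch
import torch.nn.functional as F

# A 12 x 12 single-channel input downsampled by a 2 x 2 max-pooling layer
# (downsampling factor 1/(2 x 2)) yields a 6 x 6 feature image.
x = torch.randn(1, 1, 12, 12)          # (batch, channels, height, width)
y = F.max_pool2d(x, kernel_size=2)     # every 2 x 2 block of pixels becomes 1 pixel
print(y.shape)                         # torch.Size([1, 1, 6, 6])
```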
For example, each of the first to fourth upsampling modules may include an upsampling layer. For example, the upsampling layer may perform upsampling processing by using an upsampling method such as strided transposed convolution or an interpolation algorithm. The interpolation algorithm may include, for example, bilinear interpolation, bicubic interpolation, and the like. The upsampling process is used to increase the size of the feature images, thereby increasing the data amount of the feature images. For example, in some embodiments, the upsampling factors of the upsampling layers in the first through fourth upsampling modules are each 2 × 2, and the present disclosure includes but is not limited to this.
For example, each of the first through fifth dimension reduction modules may include a convolution layer employing 1 × 1 convolution kernels. For example, each dimension reduction module can use 1 × 1 convolution kernels to perform dimension reduction on the data and reduce the number of feature images, thereby reducing the number of parameters in subsequent processing and reducing the amount of calculation to increase the processing speed. For example, in some embodiments, each of the first to fifth dimension reduction modules may include 10 convolution kernels of size 1 × 1, so that each dimension reduction module correspondingly outputs 10 feature images.
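A minimal sketch of such a dimension reduction module, assuming PyTorch and an input feature map group of 128 channels (the input channel count is an assumption; only the 10 output feature images are fixed by the description above):

```python
import torch
import torch.nn as nn

# A dimension reduction module sketched as 10 convolution kernels of size 1 x 1:
# it keeps the spatial size and reduces the number of feature images to 10.
dim_reduce = nn.Conv2d(in_channels=128, out_channels=10, kernel_size=1)  # 128 channels assumed
x = torch.randn(1, 128, 40, 60)   # an assumed feature map group: 128 feature images of 40 x 60
y = dim_reduce(x)
print(y.shape)                    # torch.Size([1, 10, 40, 60])
```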
For example, the classifier may include two softmax classifiers, a first softmax classifier and a second softmax classifier, respectively. The first softmax classifier is used for performing text classification prediction on whether each pixel is a text pixel (i.e., a positive pixel) or a non-text pixel (i.e., a negative pixel), and the second softmax classifier performs connection classification prediction on whether each pixel has a connection (link) relationship with four pixels directly adjacent to the pixel. It should be noted that any other feasible method for performing the text classification prediction and the connection classification prediction may be adopted in the present disclosure, including but not limited to the first and second softmax classifiers described above.
Note that, in the present disclosure, each of these layers, such as a convolution layer, a downsampling layer, and an upsampling layer, refers to a corresponding processing operation, that is, convolution processing, downsampling processing, upsampling processing, and the like, and description thereof will not be repeated below.
For example, processing the text image by using the text detection neural network to obtain the corresponding text feature image and the connection probabilities includes: performing convolution processing on the text image by using the first convolution module to obtain a first convolution feature map group; performing downsampling processing on the first convolution feature map group by using the first downsampling module to obtain a first downsampling feature map group; performing convolution processing on the first downsampling feature map group by using the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature map group by using the second downsampling module to obtain a second downsampling feature map group, and performing dimension reduction processing on the second convolution feature map group by using the fifth dimension reduction module to obtain a fifth dimension reduction feature map group; performing convolution processing on the second downsampling feature map group by using the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature map group by using the third downsampling module to obtain a third downsampling feature map group, and performing dimension reduction processing on the third convolution feature map group by using the fourth dimension reduction module to obtain a fourth dimension reduction feature map group; performing convolution processing on the third downsampling feature map group by using the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature map group by using the fourth downsampling module to obtain a fourth downsampling feature map group, and performing dimension reduction processing on the fourth convolution feature map group by using the third dimension reduction module to obtain a third dimension reduction feature map group; performing convolution processing on the fourth downsampling feature map group by using the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature map group by using the fifth downsampling module to obtain a fifth downsampling feature map group, and performing dimension reduction processing on the fifth convolution feature map group by using the second dimension reduction module to obtain a second dimension reduction feature map group; performing convolution processing on the fifth downsampling feature map group by using the sixth convolution module to obtain a sixth convolution feature map group; performing upsampling processing on the sixth convolution feature map group by using the first upsampling module to obtain a first upsampling feature map group; performing dimension reduction processing on the first upsampling feature map group by using the first dimension reduction module to obtain a first dimension reduction feature map group; fusing the first dimension reduction feature map group and the second dimension reduction feature map group to obtain a first fused feature map group; performing upsampling processing on the first fused feature map group by using the second upsampling module to obtain a second upsampling feature map group; fusing the second upsampling feature map group and the third dimension reduction feature map group to obtain a second fused feature map group; performing upsampling processing on the second fused feature map group by using the third upsampling module to obtain a third upsampling feature map group; fusing the third upsampling feature map group and the fourth dimension reduction feature map group to obtain a third fused feature map group; performing upsampling processing on the third fused feature map group by using the fourth upsampling module to obtain a fourth upsampling feature map group; fusing the fourth upsampling feature map group and the fifth dimension reduction feature map group to obtain a fourth fused feature map group; classifying the fourth fused feature map group by using the classifier to obtain a text classification prediction image and a connection probability prediction image; and obtaining the text feature image based on the text classification prediction image and the connection probability prediction image, and obtaining the connection probability between each pixel in the text feature image and its directly adjacent pixels.
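Under the assumption that PyTorch is used (the disclosure does not prescribe a framework), the data flow above can be sketched as the module below. The channel counts, the use of max pooling for downsampling and bilinear interpolation for upsampling, and the single 1 × 1 convolution with pairwise softmax standing in for the two softmax classifiers are illustrative choices from the options described in this section, not the only possible implementation; only the 2 × 2 sampling factors, the 10-channel dimension reduction modules, and the 2 + 8 output prediction maps follow the description.

```python
import torch
import torch.nn as nn

class TextDetectionNet(nn.Module):
    """Sketch of the text detection neural network of fig. 4 (channel counts are assumed)."""

    def __init__(self, chs=(16, 32, 64, 64, 128, 128)):
        super().__init__()
        def conv_module(cin, cout):
            # convolution + batch normalization + ReLU activation
            return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1),
                                 nn.BatchNorm2d(cout), nn.ReLU(inplace=True))
        self.conv = nn.ModuleList(
            [conv_module(cin, cout) for cin, cout in zip((3,) + chs[:-1], chs)])
        self.down = nn.MaxPool2d(2)                        # downsampling factor 1/(2 x 2)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        # Dimension reduction modules 1..5 (10 feature images each); module 1 acts on the
        # first upsampling feature map group, modules 2..5 on the conv5..conv2 outputs.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, 10, 1) for c in (chs[5], chs[4], chs[3], chs[2], chs[1])])
        self.classifier = nn.Conv2d(10, 2 + 8, 1)  # text (2) + connection (8) prediction maps

    def forward(self, image):
        c1 = self.conv[0](image)
        c2 = self.conv[1](self.down(c1)); r5 = self.reduce[4](c2)
        c3 = self.conv[2](self.down(c2)); r4 = self.reduce[3](c3)
        c4 = self.conv[3](self.down(c3)); r3 = self.reduce[2](c4)
        c5 = self.conv[4](self.down(c4)); r2 = self.reduce[1](c5)
        c6 = self.conv[5](self.down(c5))
        r1 = self.reduce[0](self.up(c6))
        f1 = r1 + r2                      # fusion = element-wise addition
        f2 = self.up(f1) + r3
        f3 = self.up(f2) + r4
        f4 = self.up(f3) + r5
        out = self.classifier(f4)
        text_pred = torch.softmax(out[:, :2], dim=1)           # text / non-text probabilities
        link_pred = torch.stack([torch.softmax(out[:, 2 + 2*i:4 + 2*i], dim=1)
                                 for i in range(4)], dim=1)    # 4 neighbor directions x 2
        return text_pred, link_pred
```

With these particular factors (five downsampling stages versus four upsampling stages), the prediction images are half the height and width of the text image, so input sizes divisible by 32 are assumed in this sketch.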
For example, in embodiments of the present disclosure, each feature map group typically includes a plurality of feature images.
For example, in the embodiments of the present disclosure, as shown in fig. 4, the fusion process may include an element-wise addition process ADD. For example, the element-wise addition ADD generally refers to adding the value at each row and column of the image matrix of each channel of one set of input images to the value at the corresponding row and column of the image matrix of the corresponding channel of the other set of input images. For example, the two sets of images input to the element-wise addition ADD have the same number of channels, and the number of channels of the output image is also the same as the number of channels of either set of input images. Thus, "fusion processing" means that the value of each pixel in each feature image of one feature map group is added to the value of the corresponding pixel of the corresponding feature image in the other feature map group to obtain a new feature image. The "fusion processing" does not change the number or the size of the feature images.
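A minimal illustration of this ADD fusion (the shapes are assumed for the example):

```python
import torch

# Fusion by element-wise addition: both feature map groups must have the same number of
# channels and the same spatial size; the result keeps both unchanged.
group_a = torch.randn(1, 10, 40, 60)   # e.g., a dimension reduction feature map group
group_b = torch.randn(1, 10, 40, 60)   # e.g., an upsampling feature map group
fused = group_a + group_b              # same shape: (1, 10, 40, 60)
```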
For example, in some embodiments, the text classification prediction image includes 2 feature images and the connection probability prediction image includes 8 feature images. Note that the value of each pixel in each of the feature images in the text classification prediction image and the connection probability prediction image is greater than or equal to 0 and less than or equal to 1, and indicates a text prediction probability or a connection prediction probability. The feature images in the text classification prediction image represent a probability map of whether each pixel is text, and the feature images in the connection probability prediction image represent probability maps of whether each pixel is connected with its immediately adjacent pixels.
For example, the 2 feature images in the text classification prediction image include a text probability image and a non-text probability image: the text probability image represents the prediction probability that each pixel belongs to text (i.e., the type probability of each pixel), the non-text probability image represents the prediction probability that each pixel belongs to non-text, and the values of the corresponding pixels of the 2 feature images sum to 1. For example, in some embodiments, a type probability threshold may be set, e.g., 0.75; if the prediction probability of a pixel belonging to text is greater than or equal to the type probability threshold, the pixel belongs to text, that is, the pixel is a positive pixel; if the prediction probability of a pixel belonging to text is less than the type probability threshold, the pixel belongs to non-text, i.e., the pixel is a negative pixel.
Fig. 5 is a schematic diagram of a pixel adjacency relationship provided in at least one embodiment of the present disclosure. For example, in some embodiments, as shown in fig. 5, in the direction R1 the pixel PX3 and the pixel PX4 are directly adjacent to the pixel PX0, and in the direction C1 the pixel PX1 and the pixel PX2 are directly adjacent to the pixel PX0; that is, the pixels PX1 to PX4 are the four pixels directly adjacent to the pixel PX0, and are respectively located above, below, to the left of, and to the right of the pixel PX0. For example, in some embodiments, the pixels in each feature image are arranged in an array of a plurality of rows and columns. For example, the direction C1 may represent a first direction, e.g., a column direction, perpendicular to the first edge (including the first edge of the text image and the first edge of the text feature image); the direction R1 may represent a second direction, e.g., a row direction, parallel to the first edge (including the first edge of the text image and the first edge of the text feature image).
For example, the 8 feature images in the connection probability prediction image may include a first connection classification image, a second connection classification image, a third connection classification image, a fourth connection classification image, a fifth connection classification image, a sixth connection classification image, a seventh connection classification image, and an eighth connection classification image. For example, as shown in fig. 5, for the pixel PX0: the value of the pixel PX0 in the first connection classification image represents the connection prediction probability pointing from the pixel PX0 to the pixel PX1, and the value of the pixel PX0 in the second connection classification image represents the non-connection prediction probability pointing from the pixel PX0 to the pixel PX1; the value of the pixel PX0 in the third connection classification image represents the connection prediction probability pointing from the pixel PX0 to the pixel PX2, and the value of the pixel PX0 in the fourth connection classification image represents the non-connection prediction probability pointing from the pixel PX0 to the pixel PX2; the value of the pixel PX0 in the fifth connection classification image represents the connection prediction probability pointing from the pixel PX0 to the pixel PX3, and the value of the pixel PX0 in the sixth connection classification image represents the non-connection prediction probability pointing from the pixel PX0 to the pixel PX3; the value of the pixel PX0 in the seventh connection classification image represents the connection prediction probability pointing from the pixel PX0 to the pixel PX4, and the value of the pixel PX0 in the eighth connection classification image represents the non-connection prediction probability pointing from the pixel PX0 to the pixel PX4. It should be understood that the values of the corresponding pixels of the first and second connection classification images sum to 1, the values of the corresponding pixels of the third and fourth connection classification images sum to 1, the values of the corresponding pixels of the fifth and sixth connection classification images sum to 1, and the values of the corresponding pixels of the seventh and eighth connection classification images sum to 1.
For example, in some embodiments, a connection probability threshold may be set, e.g., 0.7; when the connection prediction probability of two directly adjacent pixels is larger than or equal to the connection probability threshold value, the two adjacent pixels can be connected with each other; when the connection prediction probability of two directly adjacent pixels is smaller than the connection probability threshold value, the two directly adjacent pixels cannot be connected with each other.
It should be noted that the type probability threshold and the connection probability threshold are merely exemplary, and the type probability threshold and the connection probability threshold may be set according to the actual application requirements.
For example, in some embodiments, the text feature image is a binary image, but is not so limited. For example, in some embodiments, classifying the predicted image and the connection probability predicted image based on text, obtaining a text feature image, and obtaining a connection probability between each pixel in the text feature image and an immediately adjacent pixel may include: each pixel in the text probability image in the text classification prediction image is binarized according to the comparison relationship between the pixel value (prediction probability of text, namely type probability) and the type probability threshold value, so as to obtain a text feature image, and the connection probability between each pixel in the text feature image and the directly adjacent pixel can be correspondingly inquired from the connection probability prediction image. For example, in a text probability image, if the prediction probability of a pixel belonging to text is greater than or equal to a type probability threshold, the pixel is regarded as a positive pixel (positive pixel), that is, the text prediction probability of the positive pixel is greater than or equal to the type probability threshold; if the prediction probability of a pixel belonging to the text is smaller than the type probability threshold, the pixel is taken as a negative pixel (negative pixel), that is, the text prediction probability of the negative pixel is smaller than the type probability threshold; thereby, a text feature image comprising positive and negative pixels can be obtained.
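A minimal sketch of this binarization and of the positive-connection check. The threshold values are the example values given above; the (4, h, w) layout of the connection probabilities and the direction ordering are assumptions made only for illustration.

```python
import numpy as np

TYPE_PROB_THRESHOLD = 0.75   # example value from the description
LINK_PROB_THRESHOLD = 0.7    # example value from the description

def binarize_text_prediction(text_prob: np.ndarray) -> np.ndarray:
    """text_prob: the text probability image (probability that each pixel belongs to text).
    Returns the text feature image: True for positive pixels, False for negative pixels."""
    return text_prob >= TYPE_PROB_THRESHOLD

def has_positive_connection(link_prob: np.ndarray, r: int, c: int, direction: int) -> bool:
    """link_prob: (4, h, w) connection prediction probabilities of each pixel towards its
    four directly adjacent pixels (assumed order: up, down, left, right).
    Two directly adjacent pixels can be connected when the probability reaches the threshold."""
    return link_prob[direction, r, c] >= LINK_PROB_THRESHOLD
```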
Fig. 6 is a schematic diagram of a text feature image according to at least one embodiment of the present disclosure. As shown in fig. 6, the text feature image includes positive pixels (as shown by each gray square in fig. 6) and negative pixels (as shown by each white square in fig. 6).
It should be understood that the size of the text feature image is the same as the size of each feature image in the text classification prediction image and the connection probability prediction image.
It should be noted that the text detection neural network shown in fig. 4 is schematic. In practical applications, the operation of step S100 may also be performed by using a neural network having another structural form; of course, the text detection neural network shown in fig. 4 may also be partially modified to obtain a new text detection neural network that can also perform the operation of step S100. For example, in some examples, the fourth upsampling module and the fifth dimensionality reduction module and the corresponding fusion process in the text detection neural network shown in fig. 4 may be omitted, while the third fused feature map group is classified using a classifier to obtain a text classification predicted image and a connection probability predicted image. It should be noted that the embodiments of the present disclosure are not limited to this.
It should be understood that, in the text detection method provided in some examples, it may also be set that: each pixel in the text feature image is directly adjacent to 8 pixels above, below, to the left, to the right, above left, below left, above right, and below right; in this case, the connection probability prediction image may correspondingly include 16 feature images. Embodiments of the present disclosure include, but are not limited to, this. For example, compared with a text detection method in which each pixel has 8 directly adjacent pixels, the text detection method in which each pixel has 4 directly adjacent pixels can reduce the amount of calculation, increase the processing speed, and improve the problem that text blocking may occur in a subsequently obtained text box.
Step S200: taking a partial region of the text feature image close to the first edge of the text feature image as a base region, wherein at least part of the pixels in the base region are positive pixels.
For example, the first edge of the text feature image corresponds to the first edge of the text image, and the text to be detected in the text image is close to the first edge of the text image (refer to the related description of fig. 3).
For example, in some embodiments, as shown in fig. 6, the lower partial region in the text feature image (i.e., the partial region near the first edge of the text feature image, as shown by the dashed box in fig. 6) may be used as the base region, and at least part of the pixels in the base region are positive pixels (as shown by the gray squares in the dashed box in fig. 6).
For example, in some embodiments, assuming that the size of the text feature image is h × w (i.e., it includes h rows and w columns of pixels), the size of the base region may be set to h_base × w (i.e., it includes h_base rows and w columns of pixels), where h, w, and h_base are all positive integers, and h_base/h ≤ 1. For example, in some examples, h_base/h ≤ 1/2; for example, in some examples, the value range of h_base/h is, for example, 1/10 ~ 1/2, such as 1/5 ~ 2/5, such as 1/4 ~ 1/3. For example, the value of h_base/h can be set according to the actual application requirements, for example, according to the font size to be recognized and the size of the coverage range of the text image. It should be noted that if the value of h_base/h is too small, the base region may not include any positive pixel, so that the text detection method provided by the embodiments of the present disclosure cannot be effectively implemented; if the value of h_base/h is too large, the reduction in the amount of computation of text detection is not obvious, which in turn weakens the beneficial effect of the embodiments of the present disclosure; thus, the value of h_base/h should be set reasonably according to the actual application requirements.
For example, since the length of the text to be detected may not be fixed (for example, English words usually differ in length), in the embodiments of the present disclosure the width of the base region may be set to be the same as the width of the text feature image, i.e., both equal to w.
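For example, under the sizing described above, extracting the base region amounts to taking the h_base rows of the text feature image closest to its first edge; the sketch below assumes the first edge is the bottom edge of the image and uses a ratio of 1/3 purely as an example within the range discussed above.

```python
# Illustrative sketch: taking the partial region near the first (bottom) edge
# of the text feature image as the base region. The ratio 1/3 is an assumption
# within the 1/10 ~ 1/2 range discussed above.
import numpy as np

def extract_base_region(text_feature_image: np.ndarray,
                        ratio: float = 1 / 3) -> np.ndarray:
    h, w = text_feature_image.shape
    h_base = max(1, int(h * ratio))
    # The base region keeps the full width w and the h_base rows nearest the
    # bottom edge of the text feature image.
    return text_feature_image[h - h_base:, :]
```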
Step S300: at least a portion of the positive pixels in the base region are grouped to obtain at least one connected component.
For example, in step S300, at least some of the positive pixels in the base region may be grouped, based on a union-find algorithm, according to the connection probability between each positive pixel and its directly adjacent pixels in the base region, so as to obtain at least one connected component (Connected Components).
For example, in some embodiments, the union-find algorithm may include: first, constructing an index set based on at least some of the positive pixels in the base region, where the index set includes the at least some positive pixels in the base region, each positive pixel corresponds to one root node in the index set, and the initial value of the root node of each positive pixel is the pixel itself; then, in response to any directly adjacent pixel of a positive pixel in the index set being a positive pixel and the positive pixel having a positive connection relationship with that directly adjacent pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of the positive pixel; and finally, taking each group of positive pixels having the same root node value as one connected domain, so as to obtain the at least one connected domain. It should be noted that the above specific process of the union-find algorithm is illustrative, and the embodiments of the present disclosure are not limited thereto. For example, in some examples, the at least some positive pixels in the base region used for constructing the index set include all positive pixels in the base region; for example, in other examples, they include the positive pixels in one or several rows (which may be set according to actual requirements) of the base region, for example, the row or rows closest to the first edge of the text feature image, so that the amount of computation may be reduced and the processing speed increased. Embodiments of the present disclosure are not limited in this regard.
For example, the directly adjacent pixels of each positive pixel include pixels directly adjacent to each positive pixel in a first direction perpendicular to a first edge of the text feature image and pixels directly adjacent to each positive pixel in a second direction parallel to the first edge of the text feature image. For example, each positive pixel has four directly adjacent pixels.
For example, in an embodiment of the present disclosure, when the connection probability between two directly adjacent pixels is greater than the connection probability threshold, there is a positive connection relationship between the two.
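For example, the grouping of step S300 can be sketched with a small union-find structure over the positive pixels of the base region, as shown below; the 4-neighbour link layout, the array names, and the link-probability threshold of 0.5 are illustrative assumptions rather than the exact procedure of the present disclosure.

```python
# Illustrative union-find grouping of positive pixels in the base region.
# `feature` is the binary text feature image restricted to the base region;
# `link_prob[d, r, c]` is the connection probability from pixel (r, c) towards
# direction d in (up, down, left, right), also restricted to the base region.
def group_positive_pixels(feature, link_prob, link_threshold=0.5):
    h, w = feature.shape
    parent = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path compression
            p = parent[p]
        return p

    def union(p, q):
        rp, rq = find(p), find(q)
        if rp != rq:
            parent[rq] = rp

    # Each positive pixel starts as its own root node.
    for r in range(h):
        for c in range(w):
            if feature[r, c]:
                parent[(r, c)] = (r, c)

    # up, down, left, right neighbours and their direction indices.
    offsets = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}
    for (r, c) in list(parent):
        for d, (dr, dc) in offsets.items():
            nr, nc = r + dr, c + dc
            if (nr, nc) in parent and link_prob[d, r, c] > link_threshold:
                union((r, c), (nr, nc))

    # Pixels sharing a root node form one connected domain.
    groups = {}
    for p in parent:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())
```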
Illustratively, in the text feature image shown in fig. 6, all positive pixels in the base region are grouped, resulting in four connected domains.
For example, in some embodiments, in order to prevent the influence of noise, the at least one connected domain may be subjected to denoising processing. For example, in some examples, a connected domain whose area is smaller than T1 pixels or whose width (or height) is smaller than T2 pixels may be removed from the at least one connected domain, and the one or more connected domains remaining after the denoising processing may be used to determine the final connected domain corresponding to the text to be detected (refer to the related description of step S400 below). For example, in some examples, T1 may be, for example, 100 ~ 300, such as 200, but is not limited thereto; for example, in some examples, T2 may be, for example, 5 ~ 15, such as 10, but is not limited thereto. It should be understood that the values of T1 and T2 can be set according to the actual application requirements.
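For example, such a denoising pass might be sketched as below; the default thresholds follow the illustrative values T1 = 200 and T2 = 10 above, and measuring width and height via the bounding box of each connected domain is an assumption.

```python
# Illustrative sketch: removing small connected domains to suppress noise.
# Each connected domain is a list of (row, col) pixel coordinates.
def denoise_connected_domains(domains, t1_area=200, t2_size=10):
    kept = []
    for pixels in domains:
        rows = [r for r, _ in pixels]
        cols = [c for _, c in pixels]
        width = max(cols) - min(cols) + 1
        height = max(rows) - min(rows) + 1
        # Keep the domain only if it is large enough in area and extent.
        if len(pixels) >= t1_area and width >= t2_size and height >= t2_size:
            kept.append(pixels)
    return kept
```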
Step S400: and expanding the at least one connected domain along the direction far away from the first edge of the text characteristic image to obtain at least one final connected domain corresponding to the at least one connected domain.
For example, in step S400, the at least one final connected component includes a final connected component corresponding to the text to be detected.
Fig. 7 is an exemplary flowchart corresponding to step S400 shown in fig. 2 provided in at least one embodiment of the present disclosure. Step S400 shown in fig. 7 will be described in detail below with reference to the text feature image shown in fig. 6.
For example, as shown in fig. 7, at least one connected component is expanded in a direction away from the first edge of the text feature image to obtain at least one final connected component corresponding to the at least one connected component, that is, step S400, which includes steps S410 to S450.
Step S410: and extracting a positive pixel in the current connected domain which is farthest away from the first edge of the text feature image in a first direction perpendicular to the first edge of the text feature image as a first positive pixel.
For example, in step S410, the current connected domain is at least one connected domain in the base area. For example, as shown in fig. 6, the positive pixels in the current connected domain that are farthest from the first edge of the text feature image in a first direction (i.e., a bottom-to-top column direction) perpendicular to the first edge (i.e., the lower edge of the text feature image shown in fig. 6) include pixel points 1-5, and thus, pixel points 1-5 are all taken as the first positive pixels. For example, as shown in FIG. 6, the first positive pixels (i.e., pixels 1-5) are in the same row. For example, as shown in fig. 6, pixel points 1-2 belong to the same connected domain, and therefore pixel points 1-2 have the same root node; pixel point 3-5 belongs to the same connected domain, so pixel point 3-5 has the same root node (different from the root node of pixel point 1-2).
Step S420: and taking a pixel which is on one side of the first positive pixel far away from the first edge of the text feature image and is directly adjacent to the first positive pixel in the text feature image as a first adjacent pixel.
For example, as shown in fig. 6, five pixel points on the row of pixel points 1-5 that are directly adjacent to pixel points 1-5, respectively, are taken as first neighboring pixels. For example, as shown in FIG. 6, the first neighboring pixel includes pixels 6-8, etc.; wherein pixel 6 is directly adjacent to pixel 1, pixel 7 is directly adjacent to pixel 2, pixel 8 is directly adjacent to pixel 4, and neither of the first adjacent pixels of pixels 3 and 5 has a reference numeral.
Step S430: in response to the first neighboring pixel being a positive pixel and the first positive pixel having a positive connection relationship with the first neighboring pixel, modifying a value of a root node of the first neighboring pixel to a value of the root node of the first positive pixel, and adding the first neighboring pixel to the first set of neighboring pixels.
For example, in some embodiments, when the probability of connection between a first positive pixel and a first neighboring pixel is greater than a connection probability threshold, there is a positive connection relationship between the two.
For example, in some embodiments, the first set of neighboring pixels has a form similar to that of the aforementioned index set, i.e., each pixel in the first set of neighboring pixels also has a corresponding root node. For example, in some examples, as shown in fig. 6, pixel 6 is a positive pixel and pixel 6 has a positive connection relationship with pixel 1, so that pixel 6 can be added to the first neighboring pixel set, and the value of the root node of pixel 6 is the same as the value of the root node of pixel 1. Similarly, pixel 7 may also be added to the first neighboring pixel set, and the value of the root node of pixel 7 is the same as the value of the root node of pixel 2, that is, the same as the values of the root nodes of pixels 1 and 6; pixel 8 may also be added to the first neighboring pixel set, and the value of the root node of pixel 8 is the same as the value of the root node of pixel 3.
Step S440: the first set of neighboring pixels is expanded in a second direction parallel to a first edge of the text feature image.
For example, in some embodiments, step S440 may include: adding a positive pixel that is directly adjacent to any pixel in the first set of adjacent pixels in a second direction parallel to the first edge of the text feature image and has a positive connection relationship to the first set of adjacent pixels until the first set of adjacent pixels cannot continue to be expanded in a direction parallel to the first edge of the text feature image.
For example, in some embodiments, the determination condition of the positive connection relationship in step S440 is the same as the determination condition in the aforementioned step S430.
For example, in some examples, as shown in fig. 6, the pixel 9 is a positive pixel and the pixel 9 has a positive connection relationship with the pixel 6, so that the pixel 9 can be added to the first neighboring pixel set, and the value of the root node of the pixel 9 is the same as the value of the root node of the pixel 6; further, the pixel 10 is a positive pixel and the pixel 10 has a positive connection relationship with the pixel 9, so that the pixel 10 can be added to the first neighboring pixel set, and the value of the root node of the pixel 10 is the same as the value of the root node of the pixel 9. For example, as shown in fig. 6, the first neighboring pixel set only includes pixels 6-8 before expansion, and includes pixels 6-11 after expansion. Where pixels 6-7 and 9-11 have the same root node.
Step S450: and expanding the current connected domain to include all pixels in the first adjacent pixel set, and continuing to expand the current connected domain in a direction far away from the first edge of the text feature image until expansion cannot be continued.
For example, as shown in fig. 6, the connected component (first connected component) in the base region including pixels 1-2 further includes pixels 6-11 after the first expansion, and the connected component (second connected component) including pixels 3-5 further includes pixel 8 after the first expansion.
For example, the operations of steps S410 to S450 may be repeated based on the connected components after the first expansion to complete a second expansion of the connected components. For example, in the second expansion, the pixels in the first neighboring pixel set obtained in the first expansion (i.e., pixels 6 to 11) may be used as the first positive pixels. For example, as shown in fig. 6, after the second expansion, the first connected component further includes pixels 12-14, and the second connected component further includes pixels 15-16.
By analogy, after multiple expansions, as shown in fig. 6, the first connected domain further includes pixel points 6 to 14, 17, and 19 to 20 outside the basic region, and the second connected domain further includes pixel points 8, 15 to 16, 18, and 21 outside the basic region. Thereby, two final connected domains can be obtained respectively.
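For example, the expansion of steps S410 to S450 can be sketched as a row-by-row growth of each connected domain away from the first edge, as below; treating the bottom edge as the first edge, the 4-neighbour links, and the names link_up and link_lr (assumed Boolean maps of positive connection relationships upwards and between left-right neighbours) are simplifying assumptions for illustration, not the exact patented procedure.

```python
# Illustrative sketch: expanding a connected domain upwards, away from the
# bottom (first) edge of the text feature image. `feature` is the full binary
# text feature image; `link_up[r, c]` indicates a positive connection between
# (r, c) and the pixel above it, `link_lr[r, c]` between (r, c) and (r, c + 1).
def expand_connected_domain(domain, feature, link_up, link_lr):
    h, w = feature.shape
    domain = set(domain)
    frontier = set(domain)
    while frontier:
        # Steps S410/S420: pixels directly above the current frontier.
        candidates = {(r - 1, c) for (r, c) in frontier if r > 0}
        # Step S430: keep positive pixels with a positive connection downwards.
        neighbor_set = {(r, c) for (r, c) in candidates
                        if feature[r, c] and link_up[r + 1, c]
                        and (r, c) not in domain}
        # Step S440: expand the neighbour set sideways along the row.
        grew = True
        while grew:
            grew = False
            for (r, c) in list(neighbor_set):
                for nc in (c - 1, c + 1):
                    if (0 <= nc < w and feature[r, nc]
                            and link_lr[r, min(c, nc)]
                            and (r, nc) not in neighbor_set
                            and (r, nc) not in domain):
                        neighbor_set.add((r, nc))
                        grew = True
        # Step S450: absorb the neighbour set and continue upwards.
        domain |= neighbor_set
        frontier = neighbor_set
    return domain
```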
It should be noted that, in the embodiment of the present disclosure, the expansion of the connected component in the text feature image shown in fig. 6 is exemplary, and not limiting. For example, in some embodiments, the connected domains in the base region that can be expanded outward (out of the base region) may be one or more, and are not limited to the two shown in fig. 6. For example, in some embodiments, after two or more connected domains in the base region are expanded outwards, a final connected domain may be formed together, and is not limited to one final connected domain for each connected domain. For example, in some embodiments, the basic region further includes a connected component whose area does not change after the processing of step S400, such as a connected component that cannot be expanded outward (i.e., cannot be expanded in a direction away from the first edge of the text feature image), and such a connected component also serves as a final connected component after the processing of step S400.
Step S500: and determining at least one characteristic box corresponding to at least one final connected domain, and mapping the at least one characteristic box to the text image to obtain at least one text box, wherein the at least one text box comprises a text box of the text to be detected.
For example, in some embodiments, determining at least one feature box corresponding to at least one final connected domain may include: carrying out contour detection on the at least one final connected domain by using a contour detection algorithm to obtain a contour of the at least one final connected domain; and processing the outline of the at least one final connected domain by using a minimum circumscribed rectangle algorithm to obtain at least one feature box corresponding to the at least one final connected domain. For example, the contour detection algorithm may include, but is not limited to, the OpenCV contour detection (findContours) function; for example, the minimum bounding rectangle algorithm may include, but is not limited to, the minimum bounding rectangle (minAreaRect) function of OpenCV.
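For example, using OpenCV this step might be sketched as below; rasterizing each final connected domain into a mask image and taking the largest contour are illustrative assumptions.

```python
# Illustrative sketch: obtaining a feature box for each final connected domain
# with OpenCV's findContours and minAreaRect.
import cv2
import numpy as np

def feature_boxes_from_domains(domains, height, width):
    boxes = []
    for pixels in domains:
        mask = np.zeros((height, width), dtype=np.uint8)
        for r, c in pixels:
            mask[r, c] = 255
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        if not contours:
            continue
        # The minimum-area bounding rectangle of the largest contour serves as
        # the feature box: ((cx, cy), (w, h), angle).
        contour = max(contours, key=cv2.contourArea)
        boxes.append(cv2.minAreaRect(contour))
    return boxes
```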
For example, in an embodiment of the present disclosure, the feature box may be a rectangular box, and accordingly, the text box may also be a rectangular box. It should be noted that the embodiments of the present disclosure include but are not limited thereto.
For example, in some embodiments, as shown in FIG. 3, at least one text box (as shown by the solid line boxes in FIG. 3) may be obtained after the at least one feature box in the text feature image is mapped into the text image. For example, the mapping includes both a scale transformation process and a projection process. For example, taking the case where the size of the text feature image is 1/(2 × 2) of the size of the text image as an example, the width and the height of each feature box are respectively enlarged by a factor of two in the scale transformation process; in the projection process, the relative position of the text box with respect to the text image is kept consistent with the relative position of the feature box with respect to the text feature image, so that the corresponding text box can be obtained. For example, as shown in FIG. 3, each text box includes one text.
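For example, a sketch of this mapping is given below; it assumes an axis-aligned feature box given as (row, col, height, width) in feature-image coordinates and a uniform scale factor of 2 in each dimension.

```python
# Illustrative sketch: mapping a feature box from the text feature image back
# to the text image by scaling its size and position by the same factor.
def map_feature_box_to_text_box(feature_box, scale=2):
    row, col, height, width = feature_box  # coordinates in the feature image
    # Scale transformation: enlarge width and height by the scale factor;
    # projection: keep the relative position, i.e. scale the top-left corner.
    return (row * scale, col * scale, height * scale, width * scale)

# Usage: with a 2 x 2 down-sampled feature image, a (10, 20, 8, 30) feature box
# maps to a (20, 40, 16, 60) text box in the original text image.
```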
For example, as shown in fig. 3, in the text detection method provided by the embodiment of the present disclosure, only a partial region near a text to be detected (as indicated by "technology" in fig. 3) in the text image needs to be detected, so that only text boxes of the partial text in the text image (text boxes including the text to be detected) are obtained. In contrast, the conventional text detection method corresponding to the text image shown in fig. 1 needs to perform traversal detection on the entire area of the text image to obtain text boxes of all texts in the text image. Therefore, the text detection method provided by the embodiment of the disclosure can reduce the operation amount of text detection (i.e. reduce the number of traversal times), and reduce the response time of text detection.
Step S600: and determining the text box of the text to be detected from at least one text box.
For example, in some embodiments, the text image is captured by a camera disposed on the body of the translation pen, and the text to be detected is selected by the pen tip of the translation pen. Because the relative position of the pen tip of the translation pen and the camera is fixed, the relative position of the pen tip (assuming that the pen tip of the translation pen is virtualized into the plane of the text image, namely, a virtual pen tip) and the text image shot by the camera is also fixed. Thus, step S600 may be implemented based on this principle.
Fig. 8 is an exemplary flowchart corresponding to step S600 shown in fig. 2, provided in at least one embodiment of the present disclosure, and fig. 9 is an operation diagram corresponding to step S600 shown in fig. 2, provided in at least one embodiment of the present disclosure. Step S600 shown in fig. 8 will be described in detail below with reference to fig. 9.
For example, as shown in fig. 8, a text box of the text to be detected is determined from at least one text box, i.e., step S600 includes steps S610 to S620.
Step S610: constructing a virtual detection box in the text image;
step S620: and calculating the overlapping area of the virtual detection box and each text box, and taking the text box with the largest overlapping area with the virtual detection box as the text box of the text to be detected.
For example, in some embodiments, as shown in fig. 9, the pen tip of the translation pen may first be virtualized in the text image (as shown by the solid gray box in fig. 9), i.e., a virtual pen tip is obtained. For example, in some examples, the virtual pen tip (as indicated by the black dot in fig. 9) may be disposed on the first edge of the text image, but is not limited thereto; for example, in other examples, the virtual pen tip may be disposed outside the text image and near the first edge. For example, as shown in fig. 9, the virtual pen tip may generally be disposed on the perpendicular bisector of the first edge of the text image, or near that perpendicular bisector, which is not limited by the embodiments of the present disclosure. It should be understood that the virtual pen tip may be configured according to the actual application requirements, and the embodiments of the present disclosure are not limited thereto.
Then, a virtual detection box with a height H and a width W is constructed with the virtual pen tip as the middle point of the bottom side of the virtual detection box (as shown by the dashed line box in fig. 9). For example, in some embodiments, H = H1 + H2, where H1 represents the minimum value, over the text boxes in the text image, of the distance in the first direction (i.e., the column direction) perpendicular to the first edge between the virtual pen tip and the center of each text box, and H2 is a preset height value; for example, H2 may be set to a height of, for example, 30 pixels, but is not limited thereto. For example, in some embodiments, the width W is a preset width value; for example, W may be set to a width of, for example, 60 pixels, but is not limited thereto. It should be understood that H2 and W may be set according to practical application requirements, and embodiments of the present disclosure are not limited thereto.
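For example, steps S610 to S620 might be sketched as below; axis-aligned boxes in (x, y, width, height) form, the preset values of 30 and 60 pixels from the example above, and a coordinate system in which y increases towards the first edge are all assumptions made for illustration.

```python
# Illustrative sketch: constructing the virtual detection box around the
# virtual pen tip and picking the text box with the largest overlap area.
def select_text_box(text_boxes, pen_tip, h2=30, w=60):
    """text_boxes: list of (x, y, box_w, box_h); pen_tip: (tip_x, tip_y),
    with y increasing towards the first (bottom) edge of the text image."""
    tip_x, tip_y = pen_tip
    # H = H1 + H2, where H1 is the minimum vertical distance between the pen
    # tip and the centre of any text box.
    h1 = min(abs(tip_y - (y + box_h / 2)) for (x, y, box_w, box_h) in text_boxes)
    h = h1 + h2
    # Virtual detection box with the pen tip as the midpoint of its bottom side.
    det = (tip_x - w / 2, tip_y - h, w, h)

    def overlap_area(a, b):
        ax, ay, aw, ah = a
        bx, by, bw, bh = b
        ox = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
        oy = max(0.0, min(ay + ah, by + bh) - max(ay, by))
        return ox * oy

    # The text box with the largest overlap is taken as the box of the text
    # to be detected.
    return max(text_boxes, key=lambda box: overlap_area(det, box))
```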
For example, in some embodiments, after the text box of the text to be detected is determined, the text detection method provided in the embodiments of the present disclosure may further include: performing text recognition processing on the text to be detected based on the text box of the text to be detected. For example, a commonly used text recognition method may be used for the text recognition processing, and the embodiments of the present disclosure are not limited thereto. For example, commonly used text recognition methods may include, but are not limited to, text recognition processing using a neural network (e.g., a Multi-Object Rectified Attention Network (MORAN), etc.).
For example, in practical applications, text translation may be performed based on the result of the text recognition processing to obtain and output a translation result of the text to be detected. For example, the results of the text recognition process are indexed using a dictionary database to retrieve the translation results. For example, the translation result of the text to be detected may be displayed on a display, or may be output by voice through a speaker or the like.
It should be noted that, in the embodiment of the present disclosure, the flow of the text detection method may include more or less operations, and the operations may be executed sequentially or in parallel. Although the flow of the text detection method described above includes a plurality of operations that occur in a particular order, it should be clearly understood that the order of the plurality of operations is not limited. The text detection method described above may be performed once or may be performed a plurality of times according to a predetermined condition.
It should be noted that, in the embodiment of the present disclosure, various functional modules and functional layers in the text detection neural network and the text detection neural network may be implemented by software, hardware, firmware, or any combination thereof, so as to execute corresponding processing procedures.
The text detection method provided by the embodiment of the disclosure can be used for text detection by adopting the idea of a connected domain based on a preset basic region, so that the calculation amount of the text detection (namely, the number of traversal times) can be reduced, and the response time of the text detection is reduced.
At least one embodiment of the present disclosure further provides a text detection device. Fig. 10 is a schematic block diagram of a text detection apparatus according to at least one embodiment of the present disclosure.
For example, as shown in fig. 10, the text detection apparatus 1000 includes a memory 1001 and a processor 1002. It should be understood that the components of the text detection apparatus 1000 shown in fig. 10 are only exemplary and not restrictive, and the text detection apparatus 1000 may also include other components according to the actual application.
For example, memory 1001 is used to store textual images as well as computer-readable instructions; the processor 1002 is configured to read a text image and execute computer-readable instructions, and the computer-readable instructions are executed by the processor 1002 to perform one or more steps of the text detection method according to any one of the above embodiments.
For example, in some embodiments, as shown in fig. 10, the text detection apparatus may further include an image capture element 1003. For example, the image capture element 1003 is used to capture text images. For example, the image capturing element 1003 is an image capturing device or element described in the above embodiments of the text detection method, and for example, the image capturing element 1003 may be various types of cameras.
For example, in some embodiments, the text detection device 1000 may be a translation pen, but is not limited thereto. For example, the translation pen is used to select the text to be detected. For example, the image capture element 1003 may be disposed on the translation pen; for example, the image capture element 1003 may be a camera disposed on the translation pen.
It should be noted that the memory 1001 and the processor 1002 may also be integrated into the translation pen, that is, the image capturing element 1003, the memory 1001 and the processor 1002 may all be integrated into the translation pen, and the embodiments of the present disclosure include but are not limited thereto.
For example, the text detection apparatus 1000 may further include an output unit configured to output a recognition result and/or a translation result of the text to be detected. For example, the output unit may include a display, a speaker, and the like, the display may be configured to display a recognition result and/or a translation result of the text to be detected, and the speaker may be configured to output the recognition result and/or the translation result of the text to be detected in the form of voice. For example, the translation pen may further include a communication module, which is configured to implement communication between the translation pen and the output unit, for example, to transmit the translation result to the output unit.
For example, the processor 1002 may control other components in the text detection apparatus 1000 to perform desired functions. The processor 1002 may be a Central Processing Unit (CPU), a Tensor Processing Unit (TPU), a Graphics Processing Unit (GPU), or the like, having data processing capabilities and/or program execution capabilities. The Central Processing Unit (CPU) may be of an X86 or ARM architecture, etc. The GPU may be integrated directly onto the motherboard separately, or built into the north bridge chip of the motherboard. The GPU may also be built into the Central Processing Unit (CPU).
For example, memory 1001 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory can include, for example, Random Access Memory (RAM), cache memory (or the like). The non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, an Erasable Programmable Read Only Memory (EPROM), a portable compact disc read only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the computer-readable storage medium and executed by processor 1002 to implement the various functions of text detection apparatus 1000.
For example, components such as the image capturing element 1003, the memory 1001, the processor 1002, and the output unit may communicate with each other via a network connection. The network may include a wireless network, a wired network, and/or any combination of wireless and wired networks. The network may include a local area network, the Internet, a telecommunications network, an Internet of Things (Internet of Things) based on the Internet and/or a telecommunications network, and/or any combination thereof, and/or the like. The wired network may communicate by using, for example, twisted pair, coaxial cable, or optical fiber transmission, and the wireless network may communicate by using, for example, a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi. The type and function of the network are not limited in the present disclosure.
For example, for a detailed description of a process of performing the text detection processing by the text detection apparatus 1000, reference may be made to the related description in the embodiment of the text detection method, and repeated parts are not described herein again.
The technical effects of the text detection device provided by the embodiment of the present disclosure can refer to the corresponding descriptions about the text detection method in the above embodiments, and are not described herein again.
At least one embodiment of the present disclosure also provides a storage medium. Fig. 11 is a schematic diagram of a storage medium according to at least one embodiment of the present disclosure. For example, as shown in FIG. 11, one or more computer readable instructions 1101 may be stored non-transitory on a storage medium 1100. For example, the computer readable instructions 1101, when executed by a computer, are capable of performing one or more steps according to the text detection method described above.
For example, the storage medium 1100 may be applied to the text detection apparatus 1000 described above, for example, as the memory 1001 in the text detection apparatus 1000. For the description of the storage medium 1100, reference may be made to the description of the memory in the embodiment of the text detection apparatus 1000, and repeated descriptions are omitted.
For technical effects of the storage medium provided by the embodiments of the present disclosure, reference may be made to corresponding descriptions about the text detection method in the foregoing embodiments, and details are not repeated here.
For the present disclosure, there are the following points to be explained:
(1) the drawings of the embodiments of the disclosure only relate to the structures related to the embodiments of the disclosure, and other structures can refer to the common design.
(2) For purposes of clarity, the thickness of layers or the size of regions in the figures used to describe embodiments of the present disclosure are exaggerated or reduced, i.e., the figures are not drawn to scale. It will be understood that when an element such as a layer, film, region, or substrate is referred to as being "on" or "under" another element, it can be "directly on" or "under" the other element or intervening elements may be present.
(3) Without conflict, embodiments of the present disclosure and features of the embodiments may be combined with each other to arrive at new embodiments.
The above description is only exemplary of the present disclosure and is not intended to limit the scope of the present disclosure, and any person skilled in the art can easily conceive of changes or substitutions within the technical scope of the present disclosure and shall be covered by the scope of the present disclosure. Accordingly, the scope of the disclosure is to be determined by the claims that follow.

Claims (22)

  1. A text detection method, comprising:
    acquiring a text characteristic image corresponding to a text image based on the text image;
    taking a partial area, close to a first edge of the text feature image, in the text feature image as a basic area, wherein the first edge of the text feature image corresponds to the first edge of the text image, a text to be detected in the text image is close to the first edge of the text image, and at least part of pixels in the basic area are positive pixels;
    grouping at least part of the positive pixels in the basic area to obtain at least one connected domain;
    expanding the at least one connected domain along a direction far away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and
    and determining at least one characteristic box corresponding to the at least one final connected domain, and mapping the at least one characteristic box to the text image to obtain at least one text box, wherein the at least one text box comprises the text box of the text to be detected.
  2. The text detection method of claim 1, wherein, in the case that the text feature image comprises h rows and w columns of pixels, the base region comprises h_base rows and w columns of pixels,
    wherein h, w, and h_base are all positive integers, and h_base/h ≤ 1/2.
  3. The text detection method according to claim 1 or 2, wherein each pixel in the text feature image has a connection probability with an immediately adjacent pixel;
    grouping at least some of the positive pixels in the base region to obtain the at least one connected component, comprising:
    grouping the at least part of the positive pixels in the base region, based on a union-find algorithm, according to the connection probability between each positive pixel of the at least part of the positive pixels in the base region and the directly adjacent pixel, to obtain the at least one connected domain.
  4. The text detection method of claim 3, wherein grouping the at least some positive pixels in the base region according to the connection probability between each positive pixel and the directly adjacent pixel in the at least some positive pixels in the base region based on the union-find algorithm to obtain the at least one connected domain comprises:
    constructing an index set based on the at least part of the positive pixels in the base region, wherein the index set comprises the at least part of the positive pixels in the base region, and each positive pixel corresponds to one root node in the index set, and the initial value of the root node of each positive pixel is itself;
    in response to any directly adjacent pixel of each positive pixel in the index set being a positive pixel and the each positive pixel having a positive connection relationship with the directly adjacent pixel, setting a value of a root node of the directly adjacent pixel to a value of the root node of the each positive pixel; and
    and taking each group of positive pixels with the same root node value as a connected domain to obtain the at least one connected domain.
  5. The text detection method according to claim 4, wherein it is determined that each positive pixel and an immediately adjacent pixel in the base region have the positive connection relationship therebetween in a case where a connection probability between the each positive pixel and the immediately adjacent pixel is greater than a connection probability threshold.
  6. The text detection method of claim 4 or 5, wherein the directly neighboring pixels of each positive pixel in the base region comprise:
    a pixel directly adjacent to the each positive pixel in a first direction perpendicular to a first edge of the text feature image, and a pixel directly adjacent to the each positive pixel in a second direction parallel to the first edge of the text feature image.
  7. The text detection method of any one of claims 4-6, wherein each positive pixel in the base region has four directly adjacent pixels.
  8. The text detection method of any one of claims 4-7, wherein expanding the at least one connected component in a direction away from the first edge of the text feature image to obtain the at least one final connected component corresponding to the at least one connected component comprises:
    extracting a positive pixel in a current connected domain, which is farthest from a first edge of the text feature image in a first direction perpendicular to the first edge of the text feature image, as a first positive pixel;
    taking a pixel in the text feature image that is on a side of the first positive pixel away from a first edge of the text feature image and that is directly adjacent to the first positive pixel as a first adjacent pixel;
    in response to the first neighboring pixel being a positive pixel and the first positive pixel having a positive connection relationship with the first neighboring pixel, modifying a value of a root node of the first neighboring pixel to a value of a root node of the first positive pixel and adding the first neighboring pixel to a first set of neighboring pixels;
    expanding the first set of neighboring pixels in a second direction parallel to a first edge of the text feature image; and
    and expanding the current connected domain to include all pixels in the first adjacent pixel set, and continuing to expand the current connected domain in a direction far away from the first edge of the text feature image until expansion cannot be continued.
  9. The text detection method of claim 8, wherein expanding the first set of adjacent pixels in a second direction parallel to a first edge of the text feature image comprises:
    adding a positive pixel directly adjacent to any pixel in the first set of adjacent pixels in a second direction parallel to a first edge of the text feature image and having a positive connection relationship to the first set of adjacent pixels until expansion of the first set of adjacent pixels in a direction parallel to the first edge of the text feature image cannot continue.
  10. The text detection method according to claim 8 or 9, wherein the at least one final connected component comprises a connected component within the basic region that cannot expand in a direction away from the first edge of the text feature image.
  11. The text detection method of any one of claims 3-10, wherein obtaining the text feature image corresponding to the text image based on the text image comprises:
    and processing the text image by using a text detection neural network to obtain the text characteristic image and obtain the connection probability between each pixel in the text characteristic image and the directly adjacent pixel.
  12. The text detection method of claim 11, wherein the text detection neural network comprises first to sixth convolution modules, first to fifth downsampling modules, first to fourth upsampling modules, and a classifier;
    processing the text image by using the text detection neural network to obtain the text feature image and obtain the connection probability between each pixel in the text feature image and the directly adjacent pixel, wherein the processing comprises the following steps:
    performing convolution processing on the text image by using a first convolution module to obtain a first convolution feature map group;
    using a first downsampling module to perform downsampling processing on the first convolution feature map group to obtain a first downsampling feature map group;
    performing convolution processing on the first downsampling feature map group by using a second convolution module to obtain a second convolution feature map group;
    using a second downsampling module to perform downsampling processing on the second convolution feature map group to obtain a second downsampling feature map group, and using a fifth dimensionality reduction module to perform dimensionality reduction processing on the second convolution feature map group to obtain a fifth dimensionality reduction feature map group;
    performing convolution processing on the second downsampling feature map group by using a third convolution module to obtain a third convolution feature map group;
    using a third downsampling module to perform downsampling processing on the third convolution feature map group to obtain a third downsampling feature map group, and using a fourth dimensionality reduction module to perform dimensionality reduction processing on the third convolution feature map group to obtain a fourth dimensionality reduction feature map group;
    performing convolution processing on the third downsampling feature map group by using a fourth convolution module to obtain a fourth convolution feature map group;
    using a fourth downsampling module to perform downsampling processing on the fourth convolution feature map group to obtain a fourth downsampling feature map group, and using a third dimensionality reduction module to perform dimensionality reduction processing on the fourth convolution feature map group to obtain a third dimensionality reduction feature map group;
    performing convolution processing on the fourth downsampling feature map group by using a fifth convolution module to obtain a fifth convolution feature map group;
    using a fifth downsampling module to perform downsampling processing on the fifth convolution feature map group to obtain a fifth downsampling feature map group, and using a second dimensionality reduction module to perform dimensionality reduction processing on the fifth convolution feature map group to obtain a second dimensionality reduction feature map group;
    performing convolution processing on the fifth downsampling feature map group by using a sixth convolution module to obtain a sixth convolution feature map group;
    performing upsampling processing on the sixth convolution feature map group by using a first upsampling module to obtain a first upsampling feature map group;
    using a first dimensionality reduction module to perform dimensionality reduction processing on the first upsampling feature map group to obtain a first dimensionality reduction feature map group;
    fusing the first dimension reduction feature map group and the second dimension reduction feature map group to obtain a first fused feature map group;
    using a second upsampling module to perform upsampling processing on the first fused feature map group to obtain a second upsampled feature map group;
    fusing the second up-sampling feature map group and the third dimension reduction feature map group to obtain a second fused feature map group;
    using a third upsampling module to perform upsampling processing on the second fused feature map group to obtain a third upsampled feature map group;
    performing fusion processing on the third up-sampling feature map group and the fourth dimension reduction feature map group to obtain a third fusion feature map group;
    performing upsampling processing on the third fused feature map group by using a fourth upsampling module to obtain a fourth upsampled feature map group;
    performing fusion processing on the fourth up-sampling feature map group and the fifth dimension reduction feature map group to obtain a fourth fusion feature map group;
    classifying the fourth fusion characteristic image group by using a classifier to obtain a text classification predicted image and a connection probability predicted image; and
    and obtaining the text characteristic image based on the text classification predicted image and the connection probability predicted image, and obtaining the connection probability between each pixel in the text characteristic image and the directly adjacent pixel.
  13. The text detection method of claim 12, wherein each pixel in the text classification predictive image has a type probability and each pixel in the connection probability predictive image has a connection probability between the pixel and an immediately adjacent pixel;
    based on the text classification predicted image and the connection probability predicted image, obtaining the text feature image and obtaining the connection probability between each pixel and the adjacent pixel in the text feature image, including:
    and taking the pixel with the type probability being greater than or equal to the type probability threshold value in the text classification prediction image as a positive pixel, and taking the pixel with the type probability being smaller than the type probability threshold value in the text classification prediction image as a negative pixel to obtain the text feature image, wherein the connection probability between each pixel and the directly adjacent pixel in the text feature image can be correspondingly inquired from the connection probability prediction image.
  14. The text detection method of any one of claims 1-13, wherein determining the at least one feature box corresponding to the at least one final connected component comprises:
    performing contour detection on the at least one final connected domain by using a contour detection algorithm to obtain a contour of the at least one final connected domain; and processing the outline of the at least one final connected domain by using a minimum circumscribed rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.
  15. The text detection method according to any one of claims 1-14, further comprising: and determining the text box of the text to be detected from the at least one text box.
  16. The text detection method of claim 15, wherein determining the text box of the text to be detected from the at least one text box comprises:
    constructing a virtual detection box in the text image; and
    and calculating the overlapping area of the virtual detection box and each text box, and taking the text box with the largest overlapping area with the virtual detection box as the text box of the text to be detected.
  17. The text detection method according to claim 15 or 16, further comprising: and identifying the text to be detected based on the text box of the text to be detected.
  18. The text detection method according to any one of claims 1-17, further comprising: acquiring the text image by using an image acquisition element of an interpretive pen;
    wherein, when the text image is collected, the pen point of the translation pen is arranged on one side of the text to be detected, which is close to the first edge of the text image,
    the text image comprises the text to be detected.
  19. A text detection apparatus comprising:
    a memory for storing a text image and computer readable instructions;
    a processor for reading the text image and executing the computer readable instructions, which when executed by the processor perform the text detection method according to any one of claims 1-18.
  20. The text detection apparatus of claim 19, further comprising:
    an image capture element for capturing the text image.
  21. The text detection device of claim 20, wherein the text detection device is a translation pen, wherein,
    the image acquisition element is arranged on the interpretation pen, and the interpretation pen is used for selecting the text to be detected.
  22. A storage medium storing, non-transitory, computer-readable instructions, wherein the computer-readable instructions, when executed by a computer, are capable of performing the text detection method of any of claims 1-18.
CN202080000057.5A 2020-01-21 2020-01-21 Text detection method and device and storage medium Pending CN113498521A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073622 WO2021146951A1 (en) 2020-01-21 2020-01-21 Text detection method and apparatus, and storage medium

Publications (1)

Publication Number Publication Date
CN113498521A true CN113498521A (en) 2021-10-12

Family

ID=76991755

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080000057.5A Pending CN113498521A (en) 2020-01-21 2020-01-21 Text detection method and device and storage medium

Country Status (2)

Country Link
CN (1) CN113498521A (en)
WO (1) WO2021146951A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092087A (en) * 2023-04-10 2023-05-09 上海蜜度信息技术有限公司 OCR (optical character recognition) method, system, storage medium and electronic equipment

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807351B (en) * 2021-09-18 2024-01-16 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device
CN116993976B (en) * 2023-07-17 2024-06-14 中国科学院自动化研究所 Reference image segmentation model training method and reference image segmentation method
CN116916047B (en) * 2023-09-12 2023-11-10 北京点聚信息技术有限公司 Intelligent storage method for layout file identification data
CN117894030A (en) * 2024-01-18 2024-04-16 广州宏途数字科技有限公司 Text recognition method and system for campus smart pen

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050249430A1 (en) * 2004-05-07 2005-11-10 Samsung Electronics Co., Ltd. Image quality improving apparatus and method
CN110222695A (en) * 2019-06-19 2019-09-10 拉扎斯网络科技(上海)有限公司 Certificate picture processing method and device, medium and electronic equipment
CN110610166A (en) * 2019-09-18 2019-12-24 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120250985A1 (en) * 2011-03-30 2012-10-04 Jing Xiao Context Constraints for Correcting Mis-Detection of Text Contents in Scanned Images

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050249430A1 (en) * 2004-05-07 2005-11-10 Samsung Electronics Co., Ltd. Image quality improving apparatus and method
CN110222695A (en) * 2019-06-19 2019-09-10 拉扎斯网络科技(上海)有限公司 Certificate picture processing method and device, medium and electronic equipment
CN110610166A (en) * 2019-09-18 2019-12-24 北京猎户星空科技有限公司 Text region detection model training method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Li Mo et al.: "Caption bar detection and text content extraction algorithm in news video" (新闻视频中标题条检测及文字内容提取算法), Video Engineering (《电视技术》, Supplement), no. 8, pages 147-149 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092087A (en) * 2023-04-10 2023-05-09 上海蜜度信息技术有限公司 OCR (optical character recognition) method, system, storage medium and electronic equipment
CN116092087B (en) * 2023-04-10 2023-08-08 上海蜜度信息技术有限公司 OCR (optical character recognition) method, system, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2021146951A1 (en) 2021-07-29

Similar Documents

Publication Publication Date Title
WO2020200030A1 (en) Neural network training method, image processing method, image processing device, and storage medium
CN107424159B (en) Image semantic segmentation method based on super-pixel edge and full convolution network
CN113498521A (en) Text detection method and device and storage medium
WO2021073493A1 (en) Image processing method and device, neural network training method, image processing method of combined neural network model, construction method of combined neural network model, neural network processor and storage medium
CN108830855B (en) Full convolution network semantic segmentation method based on multi-scale low-level feature fusion
CN111488826B (en) Text recognition method and device, electronic equipment and storage medium
CN110020676A (en) Method for text detection, system, equipment and medium based on more receptive field depth characteristics
US20190108411A1 (en) Image processing method and processing device
WO2021146937A1 (en) Character recognition method, character recognition device and storage medium
JP2023501820A (en) Face parsing methods and related devices
CN110517270B (en) Indoor scene semantic segmentation method based on super-pixel depth network
CN111414913B (en) Character recognition method, recognition device and electronic equipment
WO2021169160A1 (en) Image normalization processing method and device, and storage medium
CN112597918A (en) Text detection method and device, electronic equipment and storage medium
CN111985525A (en) Text recognition method based on multi-mode information fusion processing
EP3970112A1 (en) System and method for single-modal or multi-modal style transfer and system for random stylization using the same
CN115187456A (en) Text recognition method, device, equipment and medium based on image enhancement processing
CN111832390B (en) Handwritten ancient character detection method
CN118135584A (en) Automatic handwriting form recognition method and system based on deep learning
CN111476226B (en) Text positioning method and device and model training method
CN111724309B (en) Image processing method and device, training method of neural network and storage medium
CN117115632A (en) Underwater target detection method, device, equipment and medium
CN116956214A (en) Multi-mode fine granularity paper classification method and system based on regularized ensemble learning
CN111709338A (en) Method and device for detecting table and training method of detection model
US20230196093A1 (en) Neural network processing

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination