WO2021146951A1 - Text detection method and apparatus, and storage medium - Google Patents
Text detection method and apparatus, and storage medium
- Publication number
- WO2021146951A1 (PCT application PCT/CN2020/073622)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords: text, image, pixel, feature, feature map
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
Definitions
- the embodiments of the present disclosure relate to a text detection method, a text detection device, and a storage medium.
- In related technologies, users can query unfamiliar words with paper dictionaries, electronic dictionaries, mobile applications (apps), or, for example, a translation pen.
- A paper dictionary is not easy to carry, and flipping through it to query a word is inefficient; a mobile app or an electronic dictionary relies on keyboard input, which is time-consuming and cumbersome to operate, and easily interrupts the user's train of thought and distracts attention.
- By contrast, a translation pen is convenient to use, easy to carry, and closer to the user's reading habits, so it can provide users with a good translation and query experience when they read foreign-language articles.
- At least one embodiment of the present disclosure provides a text detection method, including: obtaining, based on a text image, a text feature image corresponding to the text image; taking a partial area of the text feature image close to a first edge of the text feature image as a basic area, wherein the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the basic area are positive pixels; grouping at least part of the positive pixels in the basic area to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, wherein the at least one text box includes the text box of the text to be detected.
- the basic area includes h_base rows and w columns of pixels, where h is the number of pixel rows of the text feature image, h, w, and h_base are positive integers, and h_base / h ≤ 1/2.
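The basic-area selection above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the function name, the NumPy representation, and the assumption that the first edge is the bottom row of the feature map are all choices of this sketch.

```python
import numpy as np

# Hypothetical sketch: take the h_base rows of the text feature image that are
# closest to the first edge (assumed here to be the bottom edge) as the basic
# area, enforcing the claimed ratio h_base / h <= 1/2.
def basic_area(feature_map: np.ndarray, h_base: int) -> np.ndarray:
    h, w = feature_map.shape[:2]
    assert 1 <= h_base <= h // 2, "claim requires h_base / h <= 1/2"
    return feature_map[h - h_base:, :]  # rows adjacent to the first edge

feat = np.arange(8 * 4).reshape(8, 4)   # toy 8-row, 4-column feature map
base = basic_area(feat, 2)              # 2 of 8 rows -> ratio 1/4
```

Only this small strip near the pen tip is traversed in the grouping step, which is where the reduction in computation comes from.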
- each pixel in the text feature image has a connection probability with directly adjacent pixels; and at least part of the positive pixels in the basic area are grouped ,
- To obtain the at least one connected domain including: based on a union search algorithm, according to the connection probability between each positive pixel in the at least part of the positive pixels in the basic region and the directly adjacent pixels, The at least part of the positive pixels in the basic area are grouped to obtain the at least one connected domain.
- Grouping the at least part of the positive pixels in the basic area to obtain the at least one connected domain includes: constructing an index set based on the at least part of the positive pixels in the basic area, wherein the index set includes the at least part of the positive pixels in the basic area, and in the index set each positive pixel corresponds to a root node whose initial value is the pixel itself; in response to any pixel directly adjacent to a positive pixel in the index set being a positive pixel that has a positive connection relationship with that positive pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of that positive pixel; and taking each group of positive pixels having the same root node value as one connected domain, so as to obtain the at least one connected domain.
- in response to the connection probability between each positive pixel in the basic area and the directly adjacent pixel being greater than a connection probability threshold, it is determined that the positive pixel has the positive connection relationship with the directly adjacent pixel.
- the pixels directly adjacent to each positive pixel in the basic area include: in a first direction perpendicular to the first edge of the text feature image A pixel directly adjacent to each positive pixel, and a pixel directly adjacent to each positive pixel in a second direction parallel to the first edge of the text feature image.
- each positive pixel in the basic area has four directly adjacent pixels.
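The union-search (union-find) grouping over 4-connected positive pixels can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the `(h, w, 4)` layout of per-neighbour connection probabilities, and the up/down/left/right neighbour ordering are inventions of this sketch, not the patent's data format.

```python
import numpy as np

# Minimal union-find sketch of the grouping step: positive pixels are merged
# into one connected domain when the connection probability between directly
# adjacent pixels exceeds a threshold (the "positive connection relationship").
def group_positive_pixels(positive, conn_prob, threshold=0.5):
    """positive: (h, w) bool; conn_prob: (h, w, 4) probs to up/down/left/right."""
    parent = {}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    def union(a, b):
        parent[find(a)] = find(b)

    for p in zip(*np.nonzero(positive)):
        parent[p] = p  # each positive pixel's root node is initially itself
    offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right
    for (y, x) in list(parent):
        for k, (dy, dx) in enumerate(offsets):
            q = (y + dy, x + dx)
            # positive connection: neighbour is positive and prob > threshold
            if q in parent and conn_prob[y, x, k] > threshold:
                union((y, x), q)

    groups = {}
    for p in parent:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())

positive = np.zeros((3, 5), dtype=bool)
positive[0, 0:2] = True               # one two-pixel run
positive[2, 3:5] = True               # a second, disconnected run
conn_prob = np.full((3, 5, 4), 0.9)   # every neighbour link is "likely"
domains = group_positive_pixels(positive, conn_prob)
```

Pixels sharing a root node value end up in the same list, matching the "same root node value = one connected domain" rule described above.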
- expanding the at least one connected domain in a direction away from the first edge of the text feature image, so as to obtain the at least one final connected domain corresponding to the at least one connected domain, includes: extracting the positive pixel in the current connected domain that is farthest from the first edge of the text feature image in a first direction perpendicular to the first edge as a first positive pixel; and taking a pixel in the text feature image that is on the side of the first positive pixel away from the first edge and directly adjacent to the first positive pixel as a first adjacent pixel;
- the value of the root node of the first adjacent pixel is modified to the value of the root node of the first positive pixel, and the first adjacent pixel is added to the first adjacent pixel set;
- expanding the first adjacent pixel set in a second direction parallel to the first edge of the text feature image includes: in the second direction, adding to the first adjacent pixel set each positive pixel that is directly adjacent to any pixel in the first adjacent pixel set and has a positive connection relationship with it, until the first adjacent pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.
- the at least one final connected domain includes a connected domain in the basic region that cannot be expanded in a direction away from the first edge of the text feature image.
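The row-by-row expansion away from the first edge can be sketched as below. This is a simplification under stated assumptions: it uses a plain binary positivity mask instead of per-link connection probabilities, assumes the first edge is the bottom row, and the function name and set-based representation are inventions of this sketch.

```python
import numpy as np

# Simplified sketch of the expansion step: a connected domain that starts in
# the basic area (near the bottom edge) grows row by row away from that edge,
# absorbing the directly adjacent positive pixels above it; each newly
# absorbed row is then extended sideways until it cannot grow further.
def expand_domain(domain, positive):
    domain = set(domain)
    h, w = positive.shape
    while True:
        top = min(y for y, x in domain)   # row farthest from the bottom edge
        frontier = {(top - 1, x) for y, x in domain
                    if y == top and top - 1 >= 0 and positive[top - 1, x]}
        if not frontier:
            return domain                  # cannot expand any further
        changed = True                     # sideways expansion of the new row
        while changed:
            changed = False
            for (y, x) in list(frontier):
                for nx in (x - 1, x + 1):
                    if 0 <= nx < w and positive[y, nx] and (y, nx) not in frontier:
                        frontier.add((y, nx))
                        changed = True
        domain |= frontier

positive = np.array([[0, 1, 1, 0],
                     [0, 1, 0, 0],
                     [0, 1, 0, 0]], dtype=bool)
grown = expand_domain({(2, 1)}, positive)  # seed in the bottom row
```

A domain whose frontier is empty on the first pass is returned unchanged, matching the case of a connected domain that cannot be expanded away from the first edge.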
- obtaining the text feature image corresponding to the text image based on the text image includes: processing the text image using a text detection neural network, In order to obtain the text feature image, and obtain the connection probability between each pixel in the text feature image and the directly adjacent pixel.
- the text detection neural network includes first to sixth convolution modules, first to fifth down-sampling modules, first to fourth up-sampling modules, and a classifier; using the text detection neural network to process the text image to obtain the text feature image, and to obtain the connection probability between each pixel in the text feature image and the directly adjacent pixel, includes: using the first convolution module to perform convolution processing on the text image to obtain a first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain a first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain a second convolution feature map group; and using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain a second down-sampled feature map group, and using the fifth dimensionality reduction module to perform dimensionality reduction processing on the second convolution feature map group to obtain a fifth dimensionality reduction feature map group.
- each pixel in the text classification prediction image has a type probability;
- each pixel in the connection probability prediction image has a connection probability between the pixel and the directly adjacent pixel;
- obtaining the text feature image, and the connection probability between each pixel in the text feature image and its directly adjacent pixels, based on the text classification prediction image and the connection probability prediction image includes: taking each pixel in the text classification prediction image whose type probability is greater than or equal to a type probability threshold as a positive pixel, and taking each pixel whose type probability is less than the type probability threshold as a negative pixel, so as to obtain the text feature image; the connection probability between each pixel in the text feature image and the directly adjacent pixel can be correspondingly queried from the connection probability prediction image.
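The positive/negative thresholding step is a one-liner over the type-probability map. A minimal sketch, assuming the threshold value and the array layout (both are illustrative, not from the patent):

```python
import numpy as np

# Sketch of the thresholding described above: pixels whose text-type
# probability reaches the threshold become positive pixels; all others are
# negative pixels (the boolean complement of the returned mask).
def classify_pixels(type_prob: np.ndarray, type_threshold: float = 0.5):
    return type_prob >= type_threshold   # True = positive (text) pixel

type_prob = np.array([[0.9, 0.2],
                      [0.5, 0.1]])
positive = classify_pixels(type_prob)
```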
- determining the at least one feature box corresponding to the at least one final connected domain includes: performing contour detection on the at least one final connected domain using a contour detection algorithm to obtain the contour of the at least one final connected domain; and using a minimum bounding rectangle algorithm to process the contour of the at least one final connected domain to obtain the at least one feature box corresponding to the at least one final connected domain.
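In practice the contour-plus-minimum-bounding-rectangle step is usually done with an image library (for example OpenCV's `findContours` and `minAreaRect`, which handle rotated rectangles). As a dependency-free stand-in, this sketch computes only the axis-aligned bounding box of a final connected domain given as a binary mask; it is a simplification of, not a substitute for, the rotated minimum bounding rectangle.

```python
import numpy as np

# Simplified feature-box extraction: the axis-aligned bounding box of all
# positive pixels in one final connected domain's mask.
def feature_box(mask: np.ndarray):
    ys, xs = np.nonzero(mask)
    # (x_min, y_min, x_max, y_max) of the connected domain
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

mask = np.zeros((6, 8), dtype=bool)
mask[2:4, 1:5] = True                 # a 2x4 blob of positive pixels
box = feature_box(mask)
```

Mapping such a box back into the original text image then only requires scaling the coordinates by the down-sampling factor between the text image and the text feature image.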
- the text detection method provided by some embodiments of the present disclosure further includes: determining the text box of the text to be detected from the at least one text box.
- determining the text box of the text to be detected from the at least one text box includes: constructing a virtual detection box in the text image; and determining the overlap area between the virtual detection box and each text box, and taking the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.
- the text detection method provided by some embodiments of the present disclosure further includes: performing recognition processing on the text to be detected based on the text box of the text to be detected.
- the text detection method provided by some embodiments of the present disclosure further includes: using the image acquisition element of the translation pen to collect the text image; wherein, when the text image is collected, the pen tip of the translation pen points at the side of the text to be detected that is close to the first edge of the text image, and the text image includes the text to be detected.
- At least one embodiment of the present disclosure further provides a text detection device, including: a memory, configured to store text images and computer-readable instructions; a processor, configured to read the text images and run the computer-readable instructions, When the computer-readable instructions are executed by the processor, the text detection method provided in any embodiment of the present disclosure is executed.
- the text detection device provided by some embodiments of the present disclosure further includes: an image collection element for collecting the text image.
- the text detection device is a translation pen, wherein the image acquisition element is arranged on the translation pen, and the translation pen is used to select the to-be-detected text.
- At least one embodiment of the present disclosure further provides a storage medium that non-temporarily stores computer-readable instructions, wherein when the computer-readable instructions are executed by a computer, the text detection method provided in any embodiment of the present disclosure can be executed.
- Figure 1 is a schematic diagram of the working principle of a point translation pen
- FIG. 2 is an exemplary flowchart of a text detection method provided by at least one embodiment of the present disclosure
- FIG. 3 is a schematic diagram of a text image provided by at least one embodiment of the present disclosure.
- FIG. 4 is a schematic diagram of a text detection neural network provided by at least one embodiment of the present disclosure.
- FIG. 5 is a schematic diagram of a pixel adjacency relationship provided by at least one embodiment of the present disclosure
- FIG. 6 is a schematic diagram of a text feature image provided by at least one embodiment of the present disclosure.
- FIG. 7 is an exemplary flowchart corresponding to step S400 shown in FIG. 2 according to at least one embodiment of the present disclosure
- FIG. 8 is an exemplary flowchart corresponding to step S600 shown in FIG. 2 according to at least one embodiment of the present disclosure
- FIG. 9 is a schematic diagram of an operation corresponding to step S600 shown in FIG. 2 according to at least one embodiment of the present disclosure.
- FIG. 10 is a schematic block diagram of a text detection device provided by at least one embodiment of the present disclosure.
- FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
- Translation pens usually include scanning translation pens ("scan pens" for short) and point translation pens ("point pens" for short).
- the user usually needs an adaptation process when using a scan pen. Different from the scan pen, when using a point translation pen the user only needs to point the pen tip under the text to be translated, and the corresponding recognition and translation can be performed with a single tap; this usage is more flexible and closer to the user's pen-holding habits, and can provide a better user experience.
- the working principle of the current point translation pen is mainly as follows: first, the tip of the point translation pen is clicked under the text to be detected (for example, an English word, but not limited thereto), and the camera on the pen body captures a text image, for example, the text image shown in Figure 1; then, traversal text detection is performed on every pixel position of the entire text image to obtain all the text boxes on the text image (as shown by the solid-line boxes surrounding each word in Figure 1); then the text box near the pen tip, that is, the text box of the text to be detected (the text box surrounding the text to be detected), is found, and the text is recognized and translated.
- the processing speed can be greatly increased, and the response time and the occupation of computing resources can be reduced.
- because the point translation pen needs to recognize texts of different font sizes, artificially limiting the area to be detected in the text image may cause the following problems. On the one hand, if the artificially limited area is too large, the beneficial effect (that is, improved processing speed and reduced response time and computing resource occupancy) may not be obvious; on the other hand, if the artificially limited area is too small, it may fail to cover text in large fonts, so that text in large fonts cannot be completely detected and recognized, which limits the scope of use of the point translation pen.
- At least one embodiment of the present disclosure provides a text detection method.
- the detection method includes: obtaining, based on a text image, a text feature image corresponding to the text image; taking a partial area of the text feature image close to a first edge of the text feature image as a basic area, wherein the first edge of the text feature image corresponds to the first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the basic area are positive pixels; grouping at least part of the positive pixels in the basic area to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, where the at least one text box includes the text box of the text to be detected.
- Some embodiments of the present disclosure also provide a text detection device and a storage medium corresponding to the above text detection method.
- the text detection method provided by the embodiments of the present disclosure can perform text detection based on a preset basic area and the idea of connected domains, thereby reducing the amount of calculation for text detection (that is, reducing the number of traversals) and reducing the response time of text detection.
- the text detection method is suitable for point translation pens, etc., which can increase the processing speed of point translation pens and improve user experience.
- Fig. 2 is an exemplary flowchart of a text detection method provided by at least one embodiment of the present disclosure.
- the text detection method provided by the embodiment of the present disclosure can be applied to a text image obtained by a translation pen, but is not limited thereto.
- the text detection method includes but is not limited to step S100 to step S600.
- Step S100 Based on the text image, a text feature image corresponding to the text image is obtained.
- the text image may include an image captured by an image capture device or component.
- for example, before step S100, the text detection method further includes step S000: collecting a text image.
- a text image can be captured using, for example, a point translation pen.
- the translation pen may include an image acquisition element, such as a camera; the camera may be set on the pen body of the translation pen, for example. Therefore, the point translation pen (the camera on the point translation pen) can be used to execute step S000, that is, to collect text images.
- when the image acquisition element of a translation pen is used to collect a text image, the tip of the translation pen generally points below the text to be detected; therefore, relative to the text image, the tip of the translation pen is close to the one edge of the text image that is near the text to be detected. To distinguish it from the other edges of the text image, this edge is called the first edge of the text image (refer to the first edge FE of the text image in FIG. 3).
- the text image can be a grayscale image or a color image.
- the shape of the text image may be a rectangle, a diamond, a circle, etc., which is not limited in the embodiment of the present disclosure.
- the text image is a rectangle as an example for description, but it should not be regarded as a limitation of the present disclosure.
- the text image can be an original image directly collected by an image collection device or component, or an image obtained after preprocessing the original image.
- the text detection method provided by the embodiments of the present disclosure may further include an operation of preprocessing the text image .
- Preprocessing can eliminate irrelevant information or noise information in the text image, so as to better process the text image.
- the preprocessing may include, for example, processing such as scaling, cropping, gamma correction, image enhancement, or noise reduction filtering on the text image.
- the text image includes at least one text, and the at least one text includes the text to be detected.
- the text to be detected is usually close to the first edge (for example, the lower edge) of the text image. It should be noted that the text to be detected is the text that the user wants to detect.
- a text image refers to a form of presenting text in a visual manner, such as pictures and videos including text.
- the text to be detected may include: a word in one of languages such as English, French, German, and Spanish, or a word or word in one of languages such as Chinese, Japanese, and Korean; but it is not limited to this.
- FIG. 3 is a schematic diagram of a text image provided by at least one embodiment of the present disclosure.
- the text image includes multiple texts.
- a text can be an English word (for example, "Tecent", "the", etc. in FIG. 3) or one or a string of numbers (for example, "61622214" in FIG. 3), but is not limited thereto.
- in the text image shown in FIG. 3, the text to be detected may be "Tecent"; for example, in some examples, when the translation pen is used to select "Tecent" as the text to be detected, the tip of the translation pen is below "Tecent" (near the first edge FE), and the camera set on the pen body of the translation pen takes a picture to obtain the text image shown in FIG. 3.
- a text detection neural network may be used to process the text image to obtain a text feature image, and obtain the difference between each pixel in the text feature image and the directly adjacent pixel. Probability of connection between.
- FIG. 4 is a schematic diagram of a text detection neural network provided by at least one embodiment of the present disclosure.
- the text detection neural network includes first to sixth convolution modules, first to fifth down-sampling modules, first to fourth up-sampling modules, and classifiers.
- each of the first to sixth convolution modules may include a convolution layer.
- the convolutional layer is the core layer of the convolutional neural network.
- the convolutional layer can apply several convolution kernels (also called filters) to the input image to extract multiple types of features of the input image.
- Each convolution kernel can extract one type of feature.
- the convolution kernel is generally initialized in the form of a random decimal matrix.
- the convolution kernel will learn to obtain reasonable weights.
- the result obtained after applying one convolution kernel to the input image is called a feature map, and the number of feature maps is equal to the number of convolution kernels.
- a text image is used as an input image. It should be noted that the embodiments of the present disclosure do not limit the number of convolutional layers included in the first to sixth convolution modules.
- each convolution module described above may further include an activation layer.
- the activation layer includes an activation function, which is used to introduce non-linear factors to the convolutional neural network, so that the convolutional neural network can better solve more complex problems.
- the activation function may include a rectified linear unit (ReLU) function, a leaky rectified linear unit (LeakyReLU) function, a sigmoid function (Sigmoid function), or a hyperbolic tangent function (tanh function).
- the ReLU function and the LeakyReLU function are non-saturating nonlinear functions, while the Sigmoid function and the tanh function are saturating nonlinear functions.
- each of the foregoing convolution modules may further include, for example, a batch normalization (BN) layer.
- the batch normalization layer is used to perform batch normalization processing on the feature images of mini-batch samples (that is, the input images), so that the gray value of each pixel of each feature image changes within a predetermined range, thereby reducing the calculation difficulty and improving contrast.
- the predetermined range may be [-1, 1], but is not limited to this.
- the batch normalization layer may perform batch normalization processing on each feature image according to the mean value and variance of the feature image of each small batch of samples.
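The per-mini-batch normalization described above can be sketched as follows. Scale/shift parameters and running statistics are omitted; the `(batch, channels, h, w)` layout and function name are assumptions of this sketch:

```python
import numpy as np

# Sketch of batch normalization: each feature channel is normalized with the
# mean and variance computed over the mini-batch, so pixel values fall into a
# predictable range around zero.
def batch_norm(x: np.ndarray, eps: float = 1e-5) -> np.ndarray:
    """x: (batch, channels, h, w) feature images."""
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.random.default_rng(0).normal(5.0, 3.0, size=(4, 2, 8, 8))
y = batch_norm(x)
```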
- each of the first to fifth down-sampling modules may include a down-sampling layer.
- on the one hand, the down-sampling layer can be used to reduce the scale of the input image, simplify the calculation complexity, and reduce over-fitting to a certain extent; on the other hand, the down-sampling layer can also perform feature compression to extract the main features of the input image.
- the down-sampling layer can reduce the size of the feature images but does not change the number of feature images. For example, if an input image with a size of 12×12 is sampled by a 2×2 down-sampling filter, a 6×6 feature image is obtained, which means that every 4 pixels of the input image are merged into 1 pixel of the feature image.
- the down-sampling layer can use down-sampling methods such as max pooling, average pooling, strided convolution, decimation (for example, selecting fixed pixels), and demultiplexing output (demuxout, which splits the input image into multiple smaller images) for down-sampling processing.
- the downsampling factors of the downsampling layers in the first to fifth downsampling modules are all 1/(2 ⁇ 2), and the present disclosure includes but is not limited to this.
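The 12×12 → 6×6 example with a 1/(2×2) down-sampling factor can be verified with a short max-pooling sketch (max pooling is one of the methods listed above; the reshape trick is an implementation choice of this sketch):

```python
import numpy as np

# 2x2 max pooling: every non-overlapping 2x2 block of the input collapses to
# one output pixel, so a 12x12 image becomes 6x6 (factor 1/(2x2)).
def max_pool_2x2(img: np.ndarray) -> np.ndarray:
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

img = np.arange(12 * 12, dtype=float).reshape(12, 12)
pooled = max_pool_2x2(img)            # 4 input pixels merge into 1
```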
- each of the first to fourth upsampling modules may include an upsampling layer.
- the up-sampling layer may use up-sampling methods such as strided transposed convolution and interpolation algorithms for up-sampling processing.
- the interpolation algorithms may include, for example, bilinear interpolation and bicubic interpolation (Bicubic Interpolation).
- Upsampling is used to increase the size of the feature image, thereby increasing the data volume of the feature image.
- the upsampling factors of the upsampling layers in the first to fourth upsampling modules are all 2 ⁇ 2, and the present disclosure includes but is not limited to this.
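A 2×2 up-sampling factor can be illustrated with nearest-neighbour interpolation, the simplest of the interpolation methods mentioned above (the choice of nearest-neighbour here is for brevity; the modules may equally use transposed convolution or bilinear/bicubic interpolation):

```python
import numpy as np

# 2x2 up-sampling by nearest-neighbour interpolation: each pixel is repeated
# along both axes, doubling the feature image size in each direction.
def upsample_2x2(img: np.ndarray) -> np.ndarray:
    return np.repeat(np.repeat(img, 2, axis=0), 2, axis=1)

small = np.array([[1.0, 2.0],
                  [3.0, 4.0]])
big = upsample_2x2(small)             # 2x2 -> 4x4
```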
- each of the first to fifth dimensionality reduction modules may include a convolutional layer using a 1 ⁇ 1 convolution kernel.
- each of the aforementioned dimensionality reduction modules can use a 1 ⁇ 1 convolution kernel to reduce the dimensionality of the data, reduce the number of feature images, thereby reducing the number of parameters in the subsequent processing, and reducing the amount of calculation to increase the processing speed.
- each of the first to fifth dimensionality reduction modules may include 10 1 ⁇ 1 convolution kernels, so that each dimensionality reduction module can correspondingly output 10 feature images.
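A 1×1 convolution mixes channels at each pixel independently, which is why it reduces the number of feature images without touching their spatial size. A minimal sketch (the 64-channel input is an arbitrary example; the ten output channels match the ten-kernel example above):

```python
import numpy as np

# Sketch of 1x1-convolution dimensionality reduction: a (c_out, c_in) weight
# matrix is applied at every pixel, shrinking 64 feature images to 10 while
# leaving the 5x5 spatial size unchanged.
def conv_1x1(x: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """x: (c_in, h, w); weight: (c_out, c_in)."""
    return np.einsum('oi,ihw->ohw', weight, x)

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 5, 5))
w = rng.normal(size=(10, 64))
y = conv_1x1(x, w)                    # 64 feature images reduced to 10
```

Fewer channels after this step means fewer parameters and less computation in all subsequent modules, which is exactly the stated purpose of the dimensionality reduction modules.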
- the classifier may include two softmax classifiers, namely a first softmax classifier and a second softmax classifier.
- the first softmax classifier is used to predict whether each pixel is a text pixel (that is, a positive pixel) or a non-text pixel (that is, a negative pixel).
- the second softmax classifier performs link classification prediction, that is, it predicts whether each pixel has a link relationship with each of its four directly adjacent pixels. It should be noted that in the present disclosure, any other feasible method can also be used to perform text classification prediction and connection classification prediction, including but not limited to the above-mentioned first and second softmax classifiers.
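Both heads apply the same pixel-wise softmax; only the number of logit maps differs (2 maps for text/non-text, and 2 maps per neighbour direction for link/no-link). A sketch of the shared operation, with shapes chosen for illustration:

```python
import numpy as np

# Pixel-wise softmax over class logit maps: for each pixel, the values across
# the class axis are turned into probabilities that sum to 1. The first head
# uses 2 classes (text / non-text); the link head applies the same softmax to
# pairs of logit maps, one pair per neighbour direction.
def pixel_softmax(logits: np.ndarray) -> np.ndarray:
    """logits: (classes, h, w) -> per-pixel class probabilities."""
    e = np.exp(logits - logits.max(axis=0, keepdims=True))  # stability shift
    return e / e.sum(axis=0, keepdims=True)

text_logits = np.random.default_rng(1).normal(size=(2, 4, 4))
text_prob = pixel_softmax(text_logits)   # channel 0: non-text, 1: text
```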
- in the following, the convolutional layer, down-sampling layer, up-sampling layer, etc. each refer to the corresponding processing operation, that is, convolution processing, down-sampling processing, up-sampling processing, etc.; the description will not be repeated.
- for example, using the text detection neural network to process the text image to obtain the corresponding text feature image includes: using the first convolution module to perform convolution processing on the text image to obtain the first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain the first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain the second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain the second down-sampled feature map group, and using the fifth dimensionality reduction module to perform dimensionality reduction processing on the second convolution feature map group to obtain the fifth dimensionality reduction feature map group; using the third convolution module to perform convolution processing on the second down-sampled feature map group to obtain the third convolution feature map group; and using the third down-sampling module to perform down-sampling processing on the third convolution feature map group to obtain the third down-sampled feature map group.
- each feature map group usually includes multiple feature images.
- the fusion processing may include the bit addition processing ADD.
- bit-addition processing ADD usually refers to adding the value at each row and column of the image matrix of each channel of one group of input images to the value at the same row and column of the image matrix of the corresponding channel of the other group of input images.
- the number of channels of the two sets of images as input to the bit addition process ADD is the same.
- the number of channels of the image output from the bit addition process ADD is also the same as the number of channels of any set of images input.
- fusion processing means adding the value of each pixel in each feature image of one feature map group to the value of the corresponding pixel in the corresponding feature image of another feature map group to obtain a new feature image.
- Fusion processing does not change the number and size of feature images.
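The ADD fusion can be sketched with NumPy as follows (a minimal illustration of element-wise addition of two feature map groups, not the network implementation):

```python
import numpy as np

def fuse_add(group_a, group_b):
    """Element-wise (bit) addition fusion of two feature map groups.

    Each group is an array of shape (channels, height, width); the ADD
    fusion requires both groups to have the same number of channels and
    the same spatial size, and changes neither.
    """
    if group_a.shape != group_b.shape:
        raise ValueError("ADD fusion requires identical shapes")
    return group_a + group_b

# Example: fusing two 2-channel 4x4 feature map groups.
a = np.ones((2, 4, 4))
b = np.full((2, 4, 4), 2.0)
fused = fuse_add(a, b)
assert fused.shape == (2, 4, 4)   # number and size of feature maps unchanged
assert (fused == 3.0).all()
```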
- the text classification prediction image includes 2 feature images
- the connection probability prediction image includes 8 feature images.
- the value of the pixel in each feature image in the text classification prediction image and the connection probability prediction image is greater than or equal to 0 and less than or equal to 1, and represents the text prediction probability or the connection prediction probability.
- the feature image in the text classification prediction image represents the probability map of whether each pixel is text
- the feature image in the connection probability prediction image represents the probability map of whether each pixel is connected to the pixel directly adjacent to the pixel.
- the two feature images in the text classification prediction image include a text probability image and a non-text probability image.
- the text probability image indicates the predicted probability of each pixel belonging to text (that is, the type probability of each pixel), and the non-text probability image indicates the predicted probability of each pixel belonging to non-text.
- the values of the corresponding pixels of the two feature images add up to 1.
- a type probability threshold may be set, for example, 0.75; if the predicted probability of a pixel belonging to text is greater than or equal to the type probability threshold, the pixel belongs to text, that is, the pixel is a positive pixel; if the predicted probability of a pixel belonging to text is less than the type probability threshold, the pixel is non-text, that is, the pixel is a negative pixel.
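The binarization against the type probability threshold can be sketched as follows; the probability values are illustrative:

```python
import numpy as np

# Hypothetical 4x4 text-probability map (values in [0, 1]).
text_prob = np.array([
    [0.1, 0.2, 0.8, 0.9],
    [0.1, 0.8, 0.9, 0.9],
    [0.0, 0.1, 0.2, 0.8],
    [0.0, 0.0, 0.1, 0.2],
])

TYPE_PROB_THRESHOLD = 0.75  # example value from the description

# Pixels with predicted text probability >= threshold become positive
# pixels (1); the rest become negative pixels (0).
text_feature = (text_prob >= TYPE_PROB_THRESHOLD).astype(np.uint8)

assert text_feature.tolist() == [
    [0, 0, 1, 1],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]
```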
- FIG. 5 is a schematic diagram of a pixel adjacency relationship provided by at least one embodiment of the present disclosure.
- the pixels PX1 to PX4 are the four pixels directly adjacent to the pixel PX0, located above, below, to the left of, and to the right of the pixel PX0, respectively.
- the pixel array in each feature image is arranged in multiple rows and multiple columns.
- the direction C1 may indicate a first direction perpendicular to the first edge (including the first edge of the text image and the first edge of the text feature image), such as the column direction; the direction R1 may indicate a second direction parallel to the first edge, such as the row direction.
- the eight feature images in the connection probability prediction image may include the first connection classification image, the second connection classification image, the third connection classification image, the fourth connection classification image, the fifth connection classification image, the sixth connection classification image, The seventh connection classification image and the eighth connection classification image.
- the value of the pixel PX0 in the first connection classification image represents the connection prediction probability from the pixel PX0 toward the pixel PX1, and the value of the pixel PX0 in the second connection classification image represents the non-connection prediction probability from the pixel PX0 toward the pixel PX1;
- the value of the pixel PX0 in the third connection classification image represents the connection prediction probability from the pixel PX0 toward the pixel PX2, and the value of the pixel PX0 in the fourth connection classification image represents the non-connection prediction probability from the pixel PX0 toward the pixel PX2;
- the value of the pixel PX0 in the fifth connection classification image represents the connection prediction probability from the pixel PX0 toward the pixel PX3, and the value of the pixel PX0 in the sixth connection classification image represents the non-connection prediction probability from the pixel PX0 toward the pixel PX3; similarly, the seventh and eighth connection classification images represent the connection and non-connection prediction probabilities from the pixel PX0 toward the pixel PX4.
- the sum of the values of the corresponding pixels of the first connection classification image and the second connection classification image is 1, and the sum of the values of the corresponding pixels of the third connection classification image and the fourth connection classification image is 1.
- the sum of the values of the corresponding pixels of the fifth connection classification image and the sixth connection classification image is 1, and the sum of the values of the corresponding pixels of the seventh connection classification image and the eighth connection classification image is 1.
- a connection probability threshold can be set, for example, 0.7; when the connection prediction probability between two directly adjacent pixels is greater than or equal to the connection probability threshold, the two directly adjacent pixels can be connected to each other; when it is less than the connection probability threshold, the two directly adjacent pixels cannot be connected to each other.
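Checking whether two directly adjacent pixels can be connected then reduces to a threshold test on the corresponding connection classification image; the channel ordering and probability values below are illustrative assumptions:

```python
import numpy as np

LINK_PROB_THRESHOLD = 0.7  # example value from the description

# Hypothetical connection-probability prediction: 8 channels, one
# (connection, non-connection) pair per direction. Channel 0 is taken
# here as the "connected toward the pixel above" map -- the actual
# direction indexing is an assumption for illustration.
h, w = 3, 3
conn_prob = np.zeros((8, h, w))
conn_prob[0, 1, 1] = 0.9  # pixel (1,1) predicts a link to the pixel above
conn_prob[1, 1, 1] = 0.1  # complementary non-connection probability

def linked_up(conn_prob, row, col):
    """True if the pixel at (row, col) is predicted to connect upward."""
    return conn_prob[0, row, col] >= LINK_PROB_THRESHOLD

assert linked_up(conn_prob, 1, 1)
assert not linked_up(conn_prob, 0, 0)
```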
- the above values of the type probability threshold and the connection probability threshold are only illustrative, and both can be set according to actual application requirements.
- the text feature image is a binary image, but it is not limited thereto.
- obtaining the text feature image and the connection probability between each pixel in the text feature image and its directly adjacent pixels may include: binarizing each pixel in the text probability image of the text classification prediction image according to the comparison between its pixel value (the predicted probability of belonging to text, that is, the type probability) and the type probability threshold to obtain the text feature image; the connection probability between each pixel in the text feature image and its directly adjacent pixels can then be correspondingly queried from the connection probability prediction image.
- in this way, a text feature image including positive pixels and negative pixels can be obtained.
- FIG. 6 is a schematic diagram of a text feature image provided by at least one embodiment of the present disclosure. As shown in FIG. 6, the text feature image includes positive pixels (as shown by each gray square in FIG. 6) and negative pixels (as shown by each white square in FIG. 6).
- the size of the text feature image is the same as the size of each feature image in the text classification prediction image and the connection probability prediction image.
- the text detection neural network shown in FIG. 4 is schematic; in practical applications, a neural network with another structure can also be used to perform the operation of step S100. Of course, the text detection neural network shown in FIG. 4 can also be partially modified to obtain a new text detection neural network that can likewise perform the operation of step S100.
- for example, the fourth up-sampling module, the fifth dimensionality reduction module, and the corresponding fusion processing in the text detection neural network shown in FIG. 4 can be omitted, and the third fusion feature map group can be used directly for classification processing to obtain the text classification prediction image and the connection probability prediction image. It should be noted that the embodiments of the present disclosure do not limit this.
- each pixel in the text feature image may instead be regarded as directly adjacent to the 8 pixels above, below, to the left of, to the right of, and at the top left, bottom left, top right, and bottom right of it; in this case, the connection probability prediction image can correspondingly include 16 feature images.
- the embodiments of the present disclosure include but are not limited thereto.
- the text detection method in which each pixel has 4 directly adjacent pixels can reduce the amount of calculation, increase the processing speed, and alleviate the text-sticking problem that may occur in the subsequently obtained text boxes.
- Step S200 Use a partial area in the text feature image close to the first edge of the text feature image as a basic area, where at least some pixels in the basic area are positive pixels.
- the first edge of the text feature image corresponds to the first edge of the text image
- the to-be-detected text in the text image is close to the first edge of the text image (refer to the related description in FIG. 3).
- the lower partial area in the text feature image, that is, the partial area close to the first edge of the text feature image (as shown by the dashed box in FIG. 6), is used as the basic area; at least some of the pixels in the basic area are positive pixels (as shown by the gray squares in the dashed box in FIG. 6).
- for example, taking the size of the text feature image as h*w (h rows and w columns of pixels), the size of the basic area can be set to h_base*w (that is, h_base rows and w columns of pixels), where h, w, and h_base are all positive integers, and h_base/h < 1.
- for example, h_base/h ≤ 1/2; in some examples, the value range of h_base/h is, for example, 1/10 to 1/2, such as 1/5 to 2/5, for example, 1/4 to 1/3, etc.
- the value range of h_base/h can be set according to actual application requirements, for example, according to the range of font sizes that need to be recognized and the size of the coverage area of the text image. It should be noted that if the value of h_base/h is too small, the basic area may contain no positive pixels, and the text detection method provided by the embodiments of the present disclosure cannot be effectively implemented; if the value of h_base/h is too large, the reduction in the amount of calculation for text detection may be insignificant, weakening the beneficial effects of the embodiments of the present disclosure; therefore, the value of h_base/h should be set reasonably according to actual application requirements.
- the width of the basic area may be set to be the same as the width of the text feature image, that is, w.
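The selection of the basic area can be sketched as a simple slice; the orientation (first edge taken as the bottom edge) and the sizes are illustrative assumptions:

```python
import numpy as np

# Hypothetical binary text feature image of size h x w; the rows nearest
# the first edge are taken here to be the last rows (an assumption --
# the actual orientation depends on how the image is captured).
h, w = 8, 10
text_feature = np.zeros((h, w), dtype=np.uint8)
text_feature[6:, 2:7] = 1  # some positive pixels near the first edge

h_base = h // 4            # e.g. h_base / h = 1/4, within the 1/10..1/2 range
base_area = text_feature[h - h_base:, :]   # h_base rows, full width w

assert base_area.shape == (h_base, w)
assert base_area.any()     # the basic area must contain positive pixels
```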
- Step S300 Group at least part of the positive pixels in the basic area to obtain at least one connected domain.
- in step S300, based on the union-find algorithm, at least part of the positive pixels in the basic area can be grouped according to the connection probability between each positive pixel in the basic area and its directly adjacent pixels to obtain at least one connected domain.
- the union-find algorithm may include: first, constructing an index set based on at least part of the positive pixels in the basic area, where each positive pixel in the index set corresponds to a root node whose initial value is the pixel itself; then, in response to any directly adjacent pixel of a positive pixel in the index set being a positive pixel that has a positive connection relationship with that positive pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of the positive pixel; finally, regarding each group of positive pixels having the same root node value as one connected domain, thereby obtaining at least one connected domain.
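The union-find grouping described above can be sketched as follows; the pixel coordinates and the list of positively connected pairs are illustrative assumptions:

```python
def find(parent, x):
    """Find the root of x with path compression."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def union(parent, a, b):
    """Merge the sets containing a and b."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[rb] = ra

# Positive pixels in a hypothetical basic area, as (row, col) coordinates.
positives = {(0, 0), (0, 1), (0, 3), (1, 3)}
# Pairs of directly adjacent positive pixels whose connection prediction
# exceeded the connection probability threshold (assumed for illustration).
links = [((0, 0), (0, 1)), ((0, 3), (1, 3))]

parent = {p: p for p in positives}   # each root node starts as the pixel itself
for a, b in links:
    union(parent, a, b)

# Positive pixels sharing a root form one connected domain.
domains = {}
for p in positives:
    domains.setdefault(find(parent, p), []).append(p)

assert len(domains) == 2   # {(0,0),(0,1)} and {(0,3),(1,3)}
```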
- in some examples, the at least part of the positive pixels in the basic area used to construct the index set includes all the positive pixels in the basic area; in other examples, it includes only the positive pixels in the one or several rows (which can be set according to actual requirements) of the basic area closest to the first edge of the text feature image, so that the amount of calculation can be reduced and the processing speed improved.
- the embodiments of the present disclosure do not limit this.
- the directly adjacent pixels of each positive pixel include the pixels directly adjacent to it in the first direction perpendicular to the first edge of the text feature image and the pixels directly adjacent to it in the second direction parallel to the first edge of the text feature image; that is, each positive pixel has four directly adjacent pixels.
- when the connection probability between two directly adjacent pixels is greater than the connection probability threshold, there is a positive connection relationship between the two pixels.
- the above-mentioned at least one connected domain may be denoised.
- for example, connected domains with an area less than T1 pixels or with a width (or height) less than T2 pixels may be removed from the above-mentioned at least one connected domain, and the one or more connected domains remaining after denoising are used to determine the final connected domain corresponding to the text to be detected (refer to the related description of step S400 below).
- T1 may be, for example, 100 to 300, such as 200, but is not limited thereto; for example, in some examples, T2 may be, for example, 5 to 15, such as 10, but is not limited thereto. It should be understood that the values of T1 and T2 can be set according to actual application requirements.
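The denoising step might be sketched as follows; the domain contents and the use of bounding-box width/height are illustrative assumptions:

```python
# Hypothetical connected domains as lists of (row, col) pixels.
domains = [
    [(r, c) for r in range(20) for c in range(15)],  # 300 pixels, 15 wide
    [(0, c) for c in range(3)],                      # 3 pixels, 1 row high
]

T1 = 200   # minimum area in pixels (example value from the description)
T2 = 10    # minimum width/height in pixels (example value)

def keep(domain):
    """Keep a connected domain only if it is large enough."""
    rows = [p[0] for p in domain]
    cols = [p[1] for p in domain]
    height = max(rows) - min(rows) + 1
    width = max(cols) - min(cols) + 1
    return len(domain) >= T1 and width >= T2 and height >= T2

denoised = [d for d in domains if keep(d)]
assert len(denoised) == 1   # the tiny 3-pixel domain is removed as noise
```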
- Step S400 Expand at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain.
- At least one final connected domain includes a final connected domain corresponding to the text to be detected.
- FIG. 7 is an exemplary flowchart corresponding to step S400 shown in FIG. 2 provided by at least one embodiment of the present disclosure.
- step S400 shown in FIG. 7 will be described in detail with reference to the text feature image shown in FIG. 6.
- step S400 includes steps S410 to S450.
- Step S410 Extract the positive pixel farthest from the first edge of the text feature image in the first direction perpendicular to the first edge of the text feature image in the current connected domain as the first positive pixel.
- the current connected domain is at least one connected domain in the basic area.
- in the text feature image shown in FIG. 6, the positive pixels farthest from the first edge of the text feature image in the current connected domains include pixels 1-5, so that pixels 1-5 are all regarded as first positive pixels.
- the first positive pixels (i.e., pixels 1-5) are located in the same row.
- pixels 1-2 belong to the same connected domain, so pixels 1-2 have the same root node; pixels 3-5 belong to the same connected domain, so pixels 3-5 have the same root node (different from the root node of pixels 1-2).
- Step S420 Use a pixel in the text feature image that is on the side of the first positive pixel away from the first edge of the text feature image and directly adjacent to the first positive pixel as the first adjacent pixel.
- the first adjacent pixels include pixels 6-8, etc.; among them, pixel 6 is directly adjacent to pixel 1, pixel 7 is directly adjacent to pixel 2, and pixel 8 is directly adjacent to pixel 4; the first adjacent pixels of pixels 3 and 5 are not given reference numerals.
- Step S430 In response to the first neighboring pixel being a positive pixel and there is a positive connection relationship between the first positive pixel and the first neighboring pixel, the value of the root node of the first neighboring pixel is modified to the value of the root node of the first positive pixel , And add the first neighboring pixel to the first neighboring pixel set.
- when the connection probability between the first positive pixel and the first adjacent pixel is greater than the connection probability threshold, there is a positive connection relationship between the two.
- the first set of neighboring pixels has a form similar to the aforementioned index set, that is, each pixel in the first set of neighboring pixels also has a corresponding root node.
- for example, pixel 6 is a positive pixel and has a positive connection relationship with pixel 1, so that pixel 6 can be added to the first adjacent pixel set, and the value of the root node of pixel 6 is set to the value of the root node of pixel 1.
- similarly, pixel 7 can also be added to the first adjacent pixel set, and the value of the root node of pixel 7 is the same as that of pixel 2, that is, the same as that of pixels 1 and 6; pixel 8 can also be added to the first adjacent pixel set, and the value of the root node of pixel 8 is the same as that of pixel 3.
- Step S440 Expand the first set of adjacent pixels in a second direction parallel to the first edge of the text feature image.
- step S440 may include: adding, to the first adjacent pixel set, pixels that are directly adjacent to any pixel in the first adjacent pixel set in the second direction parallel to the first edge of the text feature image and that have a positive connection relationship with that pixel, until the first adjacent pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.
- the judgment condition of the positive connection relationship in step S440 is the same as the judgment condition in the aforementioned step S430.
- for example, pixel 9 is a positive pixel and has a positive connection relationship with pixel 6, so that pixel 9 can be added to the first adjacent pixel set, and the value of the root node of pixel 9 is the same as that of pixel 6; further, pixel 10 is a positive pixel and has a positive connection relationship with pixel 9, so that pixel 10 can also be added to the first adjacent pixel set, and the value of the root node of pixel 10 is the same as that of pixel 9.
- the first adjacent pixel set only includes pixels 6-8 before expansion, and includes pixels 6-11 after expansion. Among them, pixels 6-7 and 9-11 have the same root node.
- Step S450 Extend the current connected domain to include all pixels in the first adjacent pixel set, and continue to expand the current connected domain in a direction away from the first edge of the text feature image until it cannot continue to expand.
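The expansion loop of steps S410-S450 can be sketched as follows. For simplicity the sketch treats every pair of adjacent positive pixels as positively connected (the actual method also consults the connection predictions), and takes the first edge to be the bottom edge of the image; both are illustrative assumptions:

```python
import numpy as np

def expand_domain(positive, domain):
    """Row-by-row expansion of a connected domain away from the first
    edge (taken here as the bottom edge).

    `positive` is a binary map; `domain` is a set of (row, col) pixels.
    """
    h, w = positive.shape
    domain = set(domain)
    while True:
        # S410: positive pixels of the domain farthest from the first edge.
        top = min(r for r, _ in domain)
        frontier = {(r, c) for r, c in domain if r == top}
        # S420/S430: directly adjacent positive pixels on the far side.
        neighbors = {(r - 1, c) for r, c in frontier
                     if r - 1 >= 0 and positive[r - 1, c]}
        if not neighbors:
            break   # S450: cannot continue to expand
        # S440: grow the neighbor set sideways along the row.
        grew = True
        while grew:
            grew = False
            for r, c in list(neighbors):
                for nc in (c - 1, c + 1):
                    if 0 <= nc < w and positive[r, nc] and (r, nc) not in neighbors:
                        neighbors.add((r, nc))
                        grew = True
        domain |= neighbors   # S450: extend the current connected domain
    return domain

positive = np.array([
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 1, 1, 0],   # base-area row: the seed domain lives here
])
final = expand_domain(positive, {(2, 1), (2, 2)})
assert final == {(2, 1), (2, 2), (1, 2), (0, 1), (0, 2)}
```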
- after the first expansion, the connected domain including pixels 1-2 in the basic area (the first connected domain) also includes pixels 6-7 and 9-11, and the connected domain including pixels 3-5 (the second connected domain) also includes pixel 8.
- step S410 and step S450 may be repeated to complete the second expansion of the connected domain.
- in the second expansion, the pixels added to the connected domains during the first expansion (i.e., pixels 6-11) are processed in the same way; after the second expansion, the first connected domain further includes pixels 12-14, and the second connected domain further includes pixels 15-16.
- after the expansion can no longer continue, the first connected domain includes, outside the basic area, pixels 6-7, 9-14, 17, and 19-20, and the second connected domain includes, outside the basic area, pixels 8, 15-16, 18, and 21; thereby, two final connected domains are obtained.
- the expansion of the connected domain in the text feature image shown in FIG. 6 is exemplary rather than restrictive.
- if the basic area also includes connected domains whose area does not change after the processing of step S400 (that is, connected domains that cannot be expanded outward, i.e., cannot be expanded in the direction away from the first edge of the text feature image), such connected domains are also regarded as final connected domains after step S400.
- Step S500 Determine at least one feature box corresponding to the at least one final connected domain, and map the at least one feature box to the text image to obtain at least one text box, where the at least one text box includes a text box of the text to be detected.
- determining at least one feature box corresponding to the at least one final connected domain may include: performing contour detection on the at least one final connected domain using a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain using a minimum circumscribed rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.
- the contour detection algorithm may include but is not limited to the OpenCV contour detection (findContours) function;
- the minimum bounding rectangle algorithm may include but is not limited to the OpenCV minimum bounding rectangle (minAreaRect) function.
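Under the assumption that a final connected domain is available as a set of pixel coordinates, a simplified, axis-aligned stand-in for the OpenCV `findContours` + `minAreaRect` pipeline might look like this (the real `minAreaRect` can also return rotated rectangles):

```python
def bounding_box(domain):
    """Axis-aligned bounding box of a final connected domain given as a
    set of (row, col) pixels -- a simplified stand-in for the OpenCV
    findContours + minAreaRect pipeline named in the text.
    """
    rows = [r for r, _ in domain]
    cols = [c for _, c in domain]
    # (top, left, bottom, right), inclusive pixel coordinates.
    return (min(rows), min(cols), max(rows), max(cols))

domain = {(2, 1), (2, 2), (1, 2), (0, 1), (0, 2)}
assert bounding_box(domain) == (0, 1, 2, 2)
```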
- the characteristic box may be a rectangular box, and correspondingly, the text box may also be a rectangular box. It should be noted that the embodiments of the present disclosure include but are not limited to this.
- the mapping includes two processes: scale transformation and projection. For example, taking the size of the text feature image as 1/(2×2) of the size of the text image, in the scale transformation the width and height of the feature box are each doubled; in the projection, the relative position of the text box in the text image is kept consistent with the relative position of the feature box in the text feature image, so that the corresponding text box can be obtained.
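The scale transformation and projection can be sketched as follows for the 1/(2×2) case; the (top, left, bottom, right) box representation is an assumption for illustration:

```python
def map_box_to_text_image(feature_box, scale=2):
    """Map a feature box (top, left, bottom, right) from the text
    feature image onto the text image.

    With the text feature image at 1/(2x2) of the text image size, the
    scale transformation multiplies both coordinate axes by 2; keeping
    the relative position fixed, the projection is then just this scaling.
    """
    top, left, bottom, right = feature_box
    return (top * scale, left * scale, bottom * scale, right * scale)

assert map_box_to_text_image((10, 5, 20, 40)) == (20, 10, 40, 80)
```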
- each text box includes one text.
- in the text detection method provided by the embodiments of the present disclosure, only the partial area near the text to be detected (shown as "Tecent" in FIG. 3) in the text image needs to be detected, so that only the text boxes of part of the text in the text image (including the text box of the text to be detected) are obtained.
- the common text detection method corresponding to the text image shown in FIG. 1 requires traversal detection of the entire area of the text image to obtain the text box of all the text in the text image. Therefore, the text detection method provided by the embodiments of the present disclosure can reduce the calculation amount of text detection (that is, reduce the number of traversals), and reduce the response time of text detection.
- Step S600 Determine the text box of the text to be detected from at least one text box.
- for example, the text image is captured by a camera arranged on the pen body of the translation pen, and the text to be detected is selected by the tip of the translation pen. Since the relative position of the tip of the translation pen and the camera is fixed, the relative position of the virtual pen tip (the tip projected onto the plane where the text image is located) and the text image captured by the camera is also stable. Therefore, step S600 can be implemented based on this principle.
- FIG. 8 is an exemplary flowchart corresponding to step S600 shown in FIG. 2 provided by at least one embodiment of the present disclosure.
- FIG. 9 is a schematic diagram of the operation of step S600 shown in FIG. 8 provided by at least one embodiment of the present disclosure.
- step S600 shown in FIG. 8 will be described in detail with reference to FIG. 9.
- the text box of the text to be detected is determined from at least one text box, that is, step S600 includes step S610 to step S620.
- Step S610 construct a virtual detection frame in the text image
- Step S620 Calculate the overlap area between the virtual detection box and each text box, and use the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.
- for example, a virtual detection frame may be constructed in the text image based on the virtual pen tip of the translation pen (shown as the gray solid-line box in FIG. 9).
- the virtual pen tip (shown as the black dot in FIG. 9) can be set on the first edge of the text image, but it is not limited to this; for example, in other examples, the virtual pen tip can be set on the text image. Outside the image, and close to the first edge.
- the virtual pen tip may generally be set on the perpendicular bisector of the first edge of the text image, or near that perpendicular bisector; the embodiments of the present disclosure do not limit this. It should be understood that the virtual pen tip can be set according to actual application requirements.
- for example, the height of the virtual detection frame is H = H1 + H2, where H1 represents the smallest distance, in the first direction (i.e., the column direction) perpendicular to the first edge, between the virtual pen tip and the center of each text box in the text image.
- H2 is a preset height value; for example, H2 can be set to a height value of, for example, 30 pixels, but is not limited to this.
- the width W is a preset width value; for example, W can be set to a width value of, for example, 60 pixels, but it is not limited thereto. It should be understood that H2 and W can be set according to actual application requirements, which are not limited in the embodiments of the present disclosure.
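The construction of the virtual detection box (height H = H1 + H2, width W) and the overlap test of step S620 can be sketched as follows; the pen-tip coordinates, the values of H1, H2, and W, and the box layout are illustrative assumptions:

```python
def overlap_area(box_a, box_b):
    """Overlap area of two boxes given as (top, left, bottom, right)."""
    top = max(box_a[0], box_b[0])
    left = max(box_a[1], box_b[1])
    bottom = min(box_a[2], box_b[2])
    right = min(box_a[3], box_b[3])
    return max(0, bottom - top) * max(0, right - left)

def pick_text_box(virtual_box, text_boxes):
    """S620: the text box with the largest overlap with the virtual box."""
    return max(text_boxes, key=lambda b: overlap_area(virtual_box, b))

# Hypothetical virtual detection box built above the virtual pen tip at
# (row=100, col=50): height H = H1 + H2, width W (example values).
H1, H2, W = 40, 30, 60
tip_row, tip_col = 100, 50
virtual_box = (tip_row - (H1 + H2), tip_col - W // 2,
               tip_row, tip_col + W // 2)

text_boxes = [(20, 10, 40, 80), (40, 30, 70, 90)]
assert pick_text_box(virtual_box, text_boxes) == (40, 30, 70, 90)
```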
- the text detection method provided in the embodiments of the present disclosure may further include: performing text recognition processing on the text to be detected based on the text box of the text to be detected.
- a common text processing method can be used to perform text recognition processing, which is not limited in the embodiments of the present disclosure.
- commonly used text processing methods may include, but are not limited to, the use of neural networks (such as multi-objective corrective attention network (MORAN), etc.) for text recognition processing.
- text translation can also be performed based on the result of text recognition processing to obtain and output the translation result of the text to be detected.
- a dictionary database is used to index the results of text recognition processing to retrieve translation results.
- the translation result of the text to be detected can be displayed on a display, or can be output via a speaker or the like.
- the flow of the above-mentioned text detection method may include more or fewer operations, and these operations may be executed sequentially or in parallel.
- the flow of the text detection method described above includes multiple operations appearing in a specific order, it should be clearly understood that the order of the multiple operations is not limited.
- the text detection method described above can be executed once or multiple times according to predetermined conditions.
- the text detection neural network and various functional modules and functional layers in the text detection neural network can be implemented by software, hardware, firmware, or any combination thereof, so as to execute The corresponding process.
- the text detection method provided by the embodiments of the present disclosure performs text detection based on a preset basic area and the idea of connected domains, thereby reducing the amount of calculation for text detection (that is, reducing the number of traversals) and reducing the response time of text detection.
- this text detection method is therefore suitable for point translation pens, as it can increase the processing speed of a point translation pen and improve the user experience.
- FIG. 10 is a schematic block diagram of a text detection device provided by at least one embodiment of the present disclosure.
- the text detection device 1000 includes a memory 1001 and a processor 1002. It should be understood that the components of the text detection device 1000 shown in FIG. 10 are only exemplary and not restrictive. According to actual application requirements, the text detection device 1000 may also include other components.
- the memory 1001 is used to store text images and computer-readable instructions; the processor 1002 is used to read the text images and run the computer-readable instructions, which, when run by the processor 1002, perform the text detection method according to any of the above-mentioned embodiments.
- the text detection device may further include an image acquisition element 1003.
- the image capture element 1003 is used to capture text images.
- the image capture element 1003 is the image capture device or element described in the embodiment of the text detection method.
- the image capture element 1003 may be various types of cameras.
- the text detection device 1000 may be a point translation pen, but it is not limited thereto.
- the translation pen is used to select the text to be detected.
- the image acquisition component 1003 may be arranged on a point translation pen, for example, the image acquisition component 1003 may be a camera arranged on a point translation pen.
- the memory 1001 and the processor 1002 can also be integrated in the point translation pen, that is, the image acquisition element 1003, the memory 1001, and the processor 1002 can all be integrated in the point translation pen; the embodiments of the present disclosure include but are not limited to this.
- the text detection device 1000 may further include an output unit configured to output the recognition result and/or the translation result of the text to be detected.
- the output unit may include a display, a speaker, etc.
- the display may be used to display the recognition result and/or translation result of the text to be detected, and the speaker may be used to output the recognition result and/or translation result of the text to be detected in the form of voice.
- the translation pen may also include a communication module, which is used to implement communication between the translation pen and the output unit, for example, to transmit the translation result to the output unit.
- the processor 1002 may control other components in the text detection apparatus 1000 to perform desired functions.
- the processor 1002 may be a central processing unit (CPU), a graphics processing unit (GPU), a tensor processing unit (TPU), or another device with data processing capabilities and/or program execution capabilities.
- the central processing unit (CPU) can be an X86 or ARM architecture.
- the GPU can be directly integrated on the motherboard alone or built into the north bridge chip of the motherboard. The GPU can also be built into the central processing unit (CPU).
- the memory 1001 may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
- Volatile memory may include random access memory (RAM) and/or cache memory (cache), for example.
- the non-volatile memory may include, for example, read only memory (ROM), hard disk, erasable programmable read only memory (EPROM), portable compact disk read only memory (CD-ROM), USB memory, flash memory, and the like.
- One or more computer-readable instructions may be stored on the computer-readable storage medium, and the processor 1002 may run the computer-readable instructions to implement various functions of the text detection apparatus 1000.
- the network may include a wireless network, a wired network, and/or any combination of a wireless network and a wired network.
- the network may include a local area network, the Internet, a telecommunications network, the Internet of Things (Internet of Things) based on the Internet and/or a telecommunications network, and/or any combination of the above networks, and so on.
- the wired network may, for example, use twisted pair, coaxial cable, or optical fiber transmission for communication, and the wireless network may use, for example, a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi.
- the present disclosure does not limit the types and functions of the network here.
- FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
- one or more computer-readable instructions 1101 may be non-transitorily stored on the storage medium 1100.
- when the computer-readable instructions 1101 are executed by a computer, one or more steps of the text detection method described above can be performed.
- the storage medium 1100 can be applied to the above-mentioned text detection device 1000, for example, it can be used as the memory 1001 in the text detection device 1000.
- for the description of the storage medium 1100, reference may be made to the description of the memory 1001 in the embodiment of the text detection apparatus 1000, and the repeated description is omitted here.
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Theoretical Computer Science (AREA)
- Image Analysis (AREA)
Abstract
Description
Claims (22)
- A text detection method, comprising: acquiring, based on a text image, a text feature image corresponding to the text image; taking a partial region of the text feature image close to a first edge of the text feature image as a base region, wherein the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the base region are positive pixels; grouping at least part of the positive pixels in the base region to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box to the text image to obtain at least one text box, wherein the at least one text box includes a text box of the text to be detected.
- The text detection method according to claim 1, wherein, in a case where the text feature image includes h rows and w columns of pixels, the base region includes h_base rows and w columns of pixels, where h, w, and h_base are all positive integers, and h_base/h ≤ 1/2.
- The text detection method according to claim 1 or 2, wherein each pixel in the text feature image has a connection probability with each of its directly adjacent pixels; and grouping the at least part of the positive pixels in the base region to obtain the at least one connected domain comprises: based on a union-find algorithm, grouping the at least part of the positive pixels in the base region according to the connection probability between each positive pixel of the at least part of the positive pixels in the base region and its directly adjacent pixels, to obtain the at least one connected domain.
- The text detection method according to claim 3, wherein grouping the at least part of the positive pixels in the base region based on the union-find algorithm, according to the connection probability between each positive pixel of the at least part of the positive pixels in the base region and its directly adjacent pixels, to obtain the at least one connected domain comprises: constructing an index set based on the at least part of the positive pixels in the base region, wherein the index set includes the at least part of the positive pixels in the base region, and in the index set each positive pixel corresponds to a root node whose initial value is the positive pixel itself; in response to any directly adjacent pixel of each positive pixel in the index set being a positive pixel and a positive connection relationship existing between the positive pixel and the directly adjacent pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of the positive pixel; and taking each group of positive pixels having the same root-node value as one connected domain, to obtain the at least one connected domain.
- The text detection method according to claim 4, wherein, in a case where the connection probability between each positive pixel in the base region and a directly adjacent pixel is greater than a connection probability threshold, it is determined that the positive connection relationship exists between the positive pixel and the directly adjacent pixel.
- The text detection method according to claim 4 or 5, wherein the directly adjacent pixels of each positive pixel in the base region include: the pixels directly adjacent to the positive pixel in a first direction perpendicular to the first edge of the text feature image, and the pixels directly adjacent to the positive pixel in a second direction parallel to the first edge of the text feature image.
- The text detection method according to any one of claims 4-6, wherein each positive pixel in the base region has four directly adjacent pixels.
- The text detection method according to any one of claims 4-7, wherein expanding the at least one connected domain in the direction away from the first edge of the text feature image to obtain the at least one final connected domain corresponding to the at least one connected domain comprises: extracting, as a first positive pixel, the positive pixel in the current connected domain that is farthest from the first edge of the text feature image in a first direction perpendicular to the first edge of the text feature image; taking, as a first neighboring pixel, the pixel in the text feature image that is on the side of the first positive pixel away from the first edge of the text feature image and is directly adjacent to the first positive pixel; in response to the first neighboring pixel being a positive pixel and a positive connection relationship existing between the first positive pixel and the first neighboring pixel, modifying the value of the root node of the first neighboring pixel to the value of the root node of the first positive pixel, and adding the first neighboring pixel to a first neighboring pixel set; expanding the first neighboring pixel set in a second direction parallel to the first edge of the text feature image; and expanding the current connected domain to include all pixels in the first neighboring pixel set, and continuing to expand the current connected domain in the direction away from the first edge of the text feature image until no further expansion is possible.
- The text detection method according to claim 8, wherein expanding the first neighboring pixel set in the second direction parallel to the first edge of the text feature image comprises: adding, to the first neighboring pixel set, each positive pixel that is directly adjacent to any pixel in the first neighboring pixel set in the second direction parallel to the first edge of the text feature image and has a positive connection relationship with that pixel, until the first neighboring pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.
- The text detection method according to claim 8 or 9, wherein the at least one final connected domain includes a connected domain in the base region that cannot be expanded in the direction away from the first edge of the text feature image.
- The text detection method according to any one of claims 3-10, wherein acquiring, based on the text image, the text feature image corresponding to the text image comprises: processing the text image with a text detection neural network to obtain the text feature image and to obtain the connection probability between each pixel in the text feature image and its directly adjacent pixels.
- The text detection method according to claim 11, wherein the text detection neural network includes first to sixth convolution modules, first to fifth down-sampling modules, first to fourth up-sampling modules, and a classifier; and processing the text image with the text detection neural network to obtain the text feature image and to obtain the connection probability between each pixel in the text feature image and its directly adjacent pixels comprises: performing convolution processing on the text image with the first convolution module to obtain a first convolution feature map group; performing down-sampling processing on the first convolution feature map group with the first down-sampling module to obtain a first down-sampling feature map group; performing convolution processing on the first down-sampling feature map group with the second convolution module to obtain a second convolution feature map group; performing down-sampling processing on the second convolution feature map group with the second down-sampling module to obtain a second down-sampling feature map group, and performing dimension reduction processing on the second convolution feature map group with a fifth dimension reduction module to obtain a fifth dimension reduction feature map group; performing convolution processing on the second down-sampling feature map group with the third convolution module to obtain a third convolution feature map group; performing down-sampling processing on the third convolution feature map group with the third down-sampling module to obtain a third down-sampling feature map group, and performing dimension reduction processing on the third convolution feature map group with a fourth dimension reduction module to obtain a fourth dimension reduction feature map group; performing convolution processing on the third down-sampling feature map group with the fourth convolution module to obtain a fourth convolution feature map group; performing down-sampling processing on the fourth convolution feature map group with the fourth down-sampling module to obtain a fourth down-sampling feature map group, and performing dimension reduction processing on the fourth convolution feature map group with a third dimension reduction module to obtain a third dimension reduction feature map group; performing convolution processing on the fourth down-sampling feature map group with the fifth convolution module to obtain a fifth convolution feature map group; performing down-sampling processing on the fifth convolution feature map group with the fifth down-sampling module to obtain a fifth down-sampling feature map group, and performing dimension reduction processing on the fifth convolution feature map group with a second dimension reduction module to obtain a second dimension reduction feature map group; performing convolution processing on the fifth down-sampling feature map group with the sixth convolution module to obtain a sixth convolution feature map group; performing up-sampling processing on the sixth convolution feature map group with the first up-sampling module to obtain a first up-sampling feature map group; performing dimension reduction processing on the first up-sampling feature map group with a first dimension reduction module to obtain a first dimension reduction feature map group; performing fusion processing on the first dimension reduction feature map group and the second dimension reduction feature map group to obtain a first fusion feature map group; performing up-sampling processing on the first fusion feature map group with the second up-sampling module to obtain a second up-sampling feature map group; performing fusion processing on the second up-sampling feature map group and the third dimension reduction feature map group to obtain a second fusion feature map group; performing up-sampling processing on the second fusion feature map group with the third up-sampling module to obtain a third up-sampling feature map group; performing fusion processing on the third up-sampling feature map group and the fourth dimension reduction feature map group to obtain a third fusion feature map group; performing up-sampling processing on the third fusion feature map group with the fourth up-sampling module to obtain a fourth up-sampling feature map group; performing fusion processing on the fourth up-sampling feature map group and the fifth dimension reduction feature map group to obtain a fourth fusion feature map group; performing classification processing on the fourth fusion feature map group with the classifier to obtain a text classification prediction image and a connection probability prediction image; and obtaining, based on the text classification prediction image and the connection probability prediction image, the text feature image and the connection probability between each pixel in the text feature image and its directly adjacent pixels.
- The text detection method according to claim 12, wherein each pixel in the text classification prediction image has a type probability, and each pixel in the connection probability prediction image has a connection probability between that pixel and its directly adjacent pixels; and obtaining, based on the text classification prediction image and the connection probability prediction image, the text feature image and the connection probability between each pixel in the text feature image and its adjacent pixels comprises: taking the pixels in the text classification prediction image whose type probability is greater than or equal to a type probability threshold as positive pixels and taking the pixels in the text classification prediction image whose type probability is less than the type probability threshold as negative pixels, to obtain the text feature image, wherein the connection probability between each pixel in the text feature image and its directly adjacent pixels can be correspondingly queried from the connection probability prediction image.
- The text detection method according to any one of claims 1-13, wherein determining the at least one feature box corresponding to the at least one final connected domain comprises: performing contour detection on the at least one final connected domain with a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain with a minimum bounding rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.
- The text detection method according to any one of claims 1-14, further comprising: determining the text box of the text to be detected from the at least one text box.
- The text detection method according to claim 15, wherein determining the text box of the text to be detected from the at least one text box comprises: constructing a virtual detection box in the text image; and calculating the overlap area between the virtual detection box and each text box, and taking the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.
- The text detection method according to claim 15 or 16, further comprising: performing recognition processing on the text to be detected based on the text box of the text to be detected.
- The text detection method according to any one of claims 1-17, further comprising: acquiring the text image with an image acquisition component of a point translation pen, wherein, when the text image is acquired, the pen tip of the point translation pen points at the side of the text to be detected close to the first edge of the text image, and the text image includes the text to be detected.
- A text detection apparatus, comprising: a memory configured to store a text image and computer-readable instructions; and a processor configured to read the text image and run the computer-readable instructions, wherein the computer-readable instructions, when run by the processor, perform the text detection method according to any one of claims 1-18.
- The text detection apparatus according to claim 19, further comprising: an image acquisition component configured to acquire the text image.
- The text detection apparatus according to claim 20, wherein the text detection apparatus is a point translation pen, the image acquisition component is arranged on the point translation pen, and the point translation pen is used to select the text to be detected.
- A storage medium, non-transitorily storing computer-readable instructions, wherein, when the computer-readable instructions are executed by a computer, the text detection method according to any one of claims 1-18 can be performed.
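The union-find grouping described in claims 3-4 can be illustrated with a minimal sketch. The pixel representation, the `links` dictionary, and the 0.5 threshold are illustrative assumptions, not specified by the claims; only the root-node merging rule follows the claim wording.

```python
# Sketch of the union-find grouping of claims 3-4 (assumed data layout:
# `positive` is a set of (row, col) positive pixels in the base region,
# `links` maps an ordered pixel pair to its connection probability).

def find(parent, p):
    # Follow root-node values to the representative, compressing the path.
    while parent[p] != p:
        parent[p] = parent[parent[p]]
        p = parent[p]
    return p

def group_positive_pixels(positive, links, threshold=0.5):
    """Group positive pixels into connected domains; returns
    {representative pixel: list of member pixels}."""
    # Each positive pixel's root node initially points to itself (claim 4).
    parent = {p: p for p in positive}
    for (r, c) in positive:
        # Four directly adjacent pixels (claims 6-7).
        for nb in ((r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)):
            if nb in positive and links.get(((r, c), nb), 0.0) > threshold:
                # Positive link (claim 5): merge the two root nodes.
                parent[find(parent, nb)] = find(parent, (r, c))
    domains = {}
    for p in positive:
        domains.setdefault(find(parent, p), []).append(p)
    return domains
```

Pixels with the same root-node value end up in the same connected domain, exactly as the last step of claim 4 requires.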
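The expansion procedure of claims 8-9 can be sketched as follows. Two assumptions are made that the claims do not fix: the first edge is taken to be the top of the image (so "away from the first edge" means increasing row index), and the positive connection relationship is supplied as a caller-provided predicate `linked(p, q)`.

```python
# Simplified sketch of claims 8-9: grow a connected domain away from the
# first edge (assumed here to be the top row), with lateral expansion of
# each newly reached row of neighboring pixels.

def expand_domain(domain, positive, linked):
    domain = set(domain)
    while True:
        # First positive pixel(s): farthest from the first edge (claim 8).
        r_max = max(r for r, _ in domain)
        frontier = set()
        for (r, c) in [p for p in domain if p[0] == r_max]:
            below = (r + 1, c)  # neighbor on the side away from the edge
            if below in positive and below not in domain and linked((r, c), below):
                frontier.add(below)
        if not frontier:
            break  # no further expansion is possible (claim 10)
        # Lateral expansion in the second direction (claim 9).
        grew = True
        while grew:
            grew = False
            for (r, c) in list(frontier):
                for nb in ((r, c - 1), (r, c + 1)):
                    if (nb in positive and nb not in frontier
                            and nb not in domain and linked((r, c), nb)):
                        frontier.add(nb)
                        grew = True
        # The domain absorbs the whole neighboring pixel set, then repeats.
        domain |= frontier
    return domain
```

The loop terminates because each iteration only adds pixels from the finite positive set and never removes any.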
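The network of claim 12 follows an encoder-decoder pattern: five down-sampling stages, then four up-sampling stages, each fusion pairing an up-sampled group with the dimension-reduced encoder group of matching resolution. A back-of-the-envelope resolution trace makes this pairing visible; the factor-of-2 sampling stride is an assumption for illustration, as the claim does not fix the sampling factor.

```python
# Trace the spatial resolutions implied by claim 12, assuming every
# down-sampling module halves the feature-map size and every up-sampling
# module doubles it (stride 2 is an assumption, not stated in the claim).

def trace_resolutions(h, w, n_down=5, n_up=4):
    sizes = [(h, w)]
    for _ in range(n_down):   # five down-sampling modules
        h, w = h // 2, w // 2
        sizes.append((h, w))
    for _ in range(n_up):     # four up-sampling modules
        h, w = h * 2, w * 2
        sizes.append((h, w))
    return sizes
```

Under this assumption, a 512x512 input reaches 16x16 after the fifth down-sampling module, and the fourth fused feature map group (the classifier input) comes back up to 256x256, matching the resolution of the fifth dimension reduction feature map group.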
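The classification step of claim 13 is a per-pixel threshold on the type probability. A minimal sketch with NumPy, where the 0.5 threshold and the sample probabilities are illustrative choices not fixed by the claim:

```python
# Claim 13 sketch: pixels whose type probability is greater than or equal
# to the threshold become positive pixels; the rest become negative.
import numpy as np

def classify_pixels(type_prob, threshold=0.5):
    # True marks a positive (text) pixel, False a negative one.
    return type_prob >= threshold

probs = np.array([[0.9, 0.3],
                  [0.6, 0.1]])
mask = classify_pixels(probs)
# mask == [[True, False], [True, False]]
```

The connection probabilities are then looked up per pixel from the connection probability prediction image, as the claim states.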
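The selection rule of claim 16 keeps the candidate text box with the largest overlap area with the virtual detection box. A sketch, assuming axis-aligned boxes in `(x1, y1, x2, y2)` form (the claim does not fix the box representation):

```python
# Claim 16 sketch: pick the text box with the maximum overlap area with a
# virtual detection box. Boxes are assumed axis-aligned (x1, y1, x2, y2).

def overlap_area(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(w, 0) * max(h, 0)  # 0 when the boxes do not intersect

def pick_text_box(virtual_box, text_boxes):
    return max(text_boxes, key=lambda tb: overlap_area(virtual_box, tb))

virtual = (10, 10, 30, 30)
boxes = [(0, 0, 12, 12), (15, 5, 40, 25), (100, 100, 120, 120)]
best = pick_text_box(virtual, boxes)  # (15, 5, 40, 25): largest overlap
```

For a point translation pen, the virtual detection box would typically be placed near the pen tip position in the text image, so the text line being pointed at wins the overlap comparison.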
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202080000057.5A CN113498521A (zh) | 2020-01-21 | 2020-01-21 | 文本检测方法及装置、存储介质 |
PCT/CN2020/073622 WO2021146951A1 (zh) | 2020-01-21 | 2020-01-21 | 文本检测方法及装置、存储介质 |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/073622 WO2021146951A1 (zh) | 2020-01-21 | 2020-01-21 | 文本检测方法及装置、存储介质 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021146951A1 true WO2021146951A1 (zh) | 2021-07-29 |
Family
ID=76991755
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/073622 WO2021146951A1 (zh) | 2020-01-21 | 2020-01-21 | 文本检测方法及装置、存储介质 |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN113498521A (zh) |
WO (1) | WO2021146951A1 (zh) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116092087B (zh) * | 2023-04-10 | 2023-08-08 | 上海蜜度信息技术有限公司 | Ocr识别方法、系统、存储介质及电子设备 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20050249430A1 (en) * | 2004-05-07 | 2005-11-10 | Samsung Electronics Co., Ltd. | Image quality improving apparatus and method |
US20120250985A1 (en) * | 2011-03-30 | 2012-10-04 | Jing Xiao | Context Constraints for Correcting Mis-Detection of Text Contents in Scanned Images |
CN110222695A (zh) * | 2019-06-19 | 2019-09-10 | 拉扎斯网络科技(上海)有限公司 | 一种证件图片处理方法及装置、介质、电子设备 |
CN110610166A (zh) * | 2019-09-18 | 2019-12-24 | 北京猎户星空科技有限公司 | 文本区域检测模型训练方法、装置、电子设备和存储介质 |
2020
- 2020-01-21 WO PCT/CN2020/073622 patent/WO2021146951A1/zh active Application Filing
- 2020-01-21 CN CN202080000057.5A patent/CN113498521A/zh active Pending
Non-Patent Citations (1)
Title |
---|
LI, MO ET AL.: "Caption Detection and Text Content Extraction in News Video", DIANSHI-JISHU: YUEKAN - VIDEO ENGINEERING, vol. 8 (278), no. Suppl. 1, 31 December 2005 (2005-12-31), CN, pages 147 - 149, XP009529346, ISSN: 1002-8692 * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113807351A (zh) * | 2021-09-18 | 2021-12-17 | 京东鲲鹏(江苏)科技有限公司 | 一种场景文字检测方法和装置 |
CN113807351B (zh) * | 2021-09-18 | 2024-01-16 | 京东鲲鹏(江苏)科技有限公司 | 一种场景文字检测方法和装置 |
CN116993976A (zh) * | 2023-07-17 | 2023-11-03 | 中国科学院自动化研究所 | 引用图像分割模型训练方法及引用图像分割方法 |
CN116916047A (zh) * | 2023-09-12 | 2023-10-20 | 北京点聚信息技术有限公司 | 一种版式文件识别数据智能存储方法 |
CN116916047B (zh) * | 2023-09-12 | 2023-11-10 | 北京点聚信息技术有限公司 | 一种版式文件识别数据智能存储方法 |
CN117894030A (zh) * | 2024-01-18 | 2024-04-16 | 广州宏途数字科技有限公司 | 一种校园智慧纸笔的文本识别方法及系统 |
Also Published As
Publication number | Publication date |
---|---|
CN113498521A (zh) | 2021-10-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021146951A1 (zh) | 文本检测方法及装置、存储介质 | |
WO2020200030A1 (zh) | 神经网络的训练方法、图像处理方法、图像处理装置和存储介质 | |
CN107424159B (zh) | 基于超像素边缘和全卷积网络的图像语义分割方法 | |
WO2021073493A1 (zh) | 图像处理方法及装置、神经网络的训练方法、合并神经网络模型的图像处理方法、合并神经网络模型的构建方法、神经网络处理器及存储介质 | |
US11710293B2 (en) | Target detection method and apparatus, computer-readable storage medium, and computer device | |
WO2022148192A1 (zh) | 图像处理方法、图像处理装置以及非瞬时性存储介质 | |
WO2019201035A1 (zh) | 对图像中的对象节点的识别方法、装置、终端及计算机可读存储介质 | |
US11670071B2 (en) | Fine-grained image recognition | |
CN107506761B (zh) | 基于显著性学习卷积神经网络的脑部图像分割方法及系统 | |
US20210398287A1 (en) | Image processing method and image processing device | |
JP7464752B2 (ja) | 画像処理方法、装置、機器及びコンピュータプログラム | |
CN108427924B (zh) | 一种基于旋转敏感特征的文本回归检测方法 | |
CN113239782B (zh) | 一种融合多尺度gan和标签学习的行人重识别系统及方法 | |
WO2021146937A1 (zh) | 文字识别方法、文字识别装置和存储介质 | |
US20230222631A1 (en) | Method and device for removing handwritten content from text image, and storage medium | |
CN111626994B (zh) | 基于改进U-Net神经网络的设备故障缺陷诊断方法 | |
CN110517270B (zh) | 一种基于超像素深度网络的室内场景语义分割方法 | |
CN111353544A (zh) | 一种基于改进的Mixed Pooling-YOLOV3目标检测方法 | |
JP2023501820A (ja) | フェイスパーシング方法および関連デバイス | |
Rao et al. | Exploring deep learning techniques for Kannada handwritten character recognition: a boon for digitization | |
CN111985525A (zh) | 基于多模态信息融合处理的文本识别方法 | |
CN113487610B (zh) | 疱疹图像识别方法、装置、计算机设备和存储介质 | |
Cui et al. | Deep saliency detection via spatial-wise dilated convolutional attention | |
Zhang et al. | A simple and effective static gesture recognition method based on attention mechanism | |
CN111832390B (zh) | 一种手写古文字检测方法 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20915773 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20915773 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.03.2023) |
|