WO2021146951A1 - Text detection method and device, and storage medium

Text detection method and device, and storage medium

Info

Publication number
WO2021146951A1
Authority
WIPO (PCT)
Prior art keywords
text, image, pixel, feature, feature map
Application number
PCT/CN2020/073622
Other languages
English (en)
French (fr)
Inventor
李月
黄光伟
饶天珉
Original Assignee
京东方科技集团股份有限公司
Application filed by 京东方科技集团股份有限公司 (BOE Technology Group Co., Ltd.)
Priority to CN202080000057.5A priority Critical patent/CN113498521A/zh
Priority to PCT/CN2020/073622 priority patent/WO2021146951A1/zh
Publication of WO2021146951A1 publication Critical patent/WO2021146951A1/zh

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/26: Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion

Description

  • the embodiments of the present disclosure relate to a text detection method, a text detection device, and a storage medium.
  • For example, when reading foreign-language articles, users can look up words using dictionaries, electronic dictionaries, mobile applications (apps), or, for example, a translation pen.
  • A paper dictionary is not easy to carry, and flipping through it to look up a word is inefficient; mobile apps and electronic dictionaries rely on keyboard input, which is not only time-consuming and cumbersome but also tends to interrupt the reader's train of thought and distract attention.
  • By contrast, the translation pen is convenient to use, easy to carry, and closer to the user's reading habits, so it can provide a good translation and query experience when reading foreign-language articles.
  • At least one embodiment of the present disclosure provides a text detection method, including: obtaining, based on a text image, a text feature image corresponding to the text image; taking a partial area of the text feature image close to a first edge of the text feature image as a basic area, wherein the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the basic area are positive pixels; grouping at least part of the positive pixels in the basic area to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, wherein the at least one text box includes the text box of the text to be detected.
  • For example, the text feature image includes h rows and w columns of pixels, and the basic area includes h_base rows and w columns of pixels, where h, w, and h_base are positive integers, and h_base/h ≤ 1/2.
  • For example, each pixel in the text feature image has a connection probability with each of its directly adjacent pixels; and grouping at least part of the positive pixels in the basic area to obtain the at least one connected domain includes: based on a union-find algorithm, grouping the at least part of the positive pixels in the basic area according to the connection probability between each of these positive pixels and its directly adjacent pixels, to obtain the at least one connected domain.
  • For example, grouping the at least part of the positive pixels in the basic area to obtain the at least one connected domain includes: constructing an index set based on the at least part of the positive pixels in the basic area, wherein the index set includes the at least part of the positive pixels in the basic area, each positive pixel in the index set corresponds to a root node, and the initial value of the root node of each positive pixel is the pixel itself; in response to any directly adjacent pixel of a positive pixel in the index set being a positive pixel that has a positive connection relationship with that positive pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of that positive pixel; and taking each group of positive pixels having the same root-node value as one connected domain, to obtain the at least one connected domain.
  • For example, in response to the connection probability between a positive pixel in the basic area and a directly adjacent pixel being greater than the connection probability threshold, it is determined that the positive pixel has a positive connection relationship with that directly adjacent pixel.
  • For example, the pixels directly adjacent to each positive pixel in the basic area include: the pixels directly adjacent to that positive pixel in a first direction perpendicular to the first edge of the text feature image, and the pixels directly adjacent to that positive pixel in a second direction parallel to the first edge of the text feature image.
  • each positive pixel in the basic area has four directly adjacent pixels.
  • For example, expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain the at least one final connected domain corresponding to the at least one connected domain includes: extracting, as a first positive pixel, a positive pixel in the current connected domain that is farthest from the first edge of the text feature image in the first direction perpendicular to that edge; taking, as a first adjacent pixel, a pixel in the text feature image that is on the side of the first positive pixel away from the first edge of the text feature image and directly adjacent to the first positive pixel; and, in response to the first adjacent pixel being a positive pixel having a positive connection relationship with the first positive pixel, modifying the value of the root node of the first adjacent pixel to the value of the root node of the first positive pixel, and adding the first adjacent pixel to a first adjacent pixel set.
  • For example, expanding the first adjacent pixel set in the second direction parallel to the first edge of the text feature image includes: adding to the first adjacent pixel set any positive pixel that, in the second direction, is directly adjacent to a pixel in the first adjacent pixel set and has a positive connection relationship with it, until the first adjacent pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.
  • For example, the at least one final connected domain also includes any connected domain in the basic area that cannot be expanded in the direction away from the first edge of the text feature image.
  • For example, obtaining the text feature image corresponding to the text image based on the text image includes: processing the text image using a text detection neural network to obtain the text feature image and the connection probability between each pixel in the text feature image and its directly adjacent pixels.
  • For example, the text detection neural network includes first to sixth convolution modules, first to fifth down-sampling modules, first to fourth up-sampling modules, first to fifth dimensionality reduction modules, and a classifier. Using the text detection neural network to process the text image to obtain the text feature image, as well as the connection probability between each pixel in the text feature image and its directly adjacent pixels, includes: using the first convolution module to perform convolution processing on the text image to obtain a first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain a first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain a second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain a second down-sampled feature map group, and using the fifth dimensionality reduction module to perform dimensionality reduction processing on the second convolution feature map group; and so on through the remaining convolution, down-sampling, up-sampling, dimensionality reduction, and fusion stages shown in FIG. 4, until the classifier performs classification processing to obtain a text classification prediction image and a connection probability prediction image.
  • For example, each pixel in the text classification prediction image has a type probability, and each pixel in the connection probability prediction image has a connection probability between that pixel and a directly adjacent pixel. Obtaining the text feature image and the connection probability between each pixel in the text feature image and its directly adjacent pixels, based on the text classification prediction image and the connection probability prediction image, includes: taking pixels whose type probability in the text classification prediction image is greater than or equal to a type probability threshold as positive pixels, and pixels whose type probability is less than the type probability threshold as negative pixels, to obtain the text feature image; the connection probability between each pixel in the text feature image and its directly adjacent pixels can then be looked up in the connection probability prediction image.
  • For example, determining the at least one feature box corresponding to the at least one final connected domain includes: performing contour detection on the at least one final connected domain using a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain using the minimum bounding rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.
  • the text detection method provided by some embodiments of the present disclosure further includes: determining the text box of the text to be detected from the at least one text box.
  • For example, determining the text box of the text to be detected from the at least one text box includes: constructing a virtual detection box in the text image; and calculating the overlap area between the virtual detection box and each text box, and taking the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.
  • the text detection method provided by some embodiments of the present disclosure further includes: performing recognition processing on the text to be detected based on the text box of the text to be detected.
  • For example, the text detection method provided by some embodiments of the present disclosure further includes: collecting the text image using an image acquisition element of a translation pen; wherein, when the text image is collected, the tip of the translation pen points to the side of the text to be detected that is close to the first edge of the text image, and the text image includes the text to be detected.
  • At least one embodiment of the present disclosure further provides a text detection device, including: a memory configured to store text images and computer-readable instructions; and a processor configured to read the text images and run the computer-readable instructions, wherein when the computer-readable instructions are executed by the processor, the text detection method provided in any embodiment of the present disclosure is executed.
  • the text detection device provided by some embodiments of the present disclosure further includes: an image collection element for collecting the text image.
  • the text detection device is a translation pen, wherein the image acquisition element is arranged on the translation pen, and the translation pen is used to select the to-be-detected text.
  • At least one embodiment of the present disclosure further provides a storage medium that non-temporarily stores computer-readable instructions, wherein when the computer-readable instructions are executed by a computer, the text detection method provided in any embodiment of the present disclosure can be executed.
  • Figure 1 is a schematic diagram of the working principle of a point translation pen
  • FIG. 2 is an exemplary flowchart of a text detection method provided by at least one embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of a text image provided by at least one embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a text detection neural network provided by at least one embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of a pixel adjacency relationship provided by at least one embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of a text feature image provided by at least one embodiment of the present disclosure.
  • FIG. 7 is an exemplary flowchart corresponding to step S400 shown in FIG. 2 according to at least one embodiment of the present disclosure
  • FIG. 8 is an exemplary flowchart corresponding to step S600 shown in FIG. 2 according to at least one embodiment of the present disclosure
  • FIG. 9 is a schematic diagram of an operation corresponding to step S600 shown in FIG. 2 according to at least one embodiment of the present disclosure.
  • FIG. 10 is a schematic block diagram of a text detection device provided by at least one embodiment of the present disclosure.
  • FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
  • Translation pens usually include scanning translation pens ("scanning pens" for short) and point translation pens ("point pens" for short).
  • The user usually needs an adaptation process when using the scanning pen. Different from the scanning pen, when using the point translation pen, the user only needs to point the pen tip under the text to be translated and can then perform the corresponding recognition and translation with a single tap; this way of use is more flexible and closer to the user's pen-holding habits, and can provide a better user experience.
  • The working principle of the current point translation pen is mainly as follows: first, the tip of the point translation pen is pointed under the text to be detected (for example, an English word, but not limited thereto), and the camera on the pen body captures a text image, for example the text image shown in FIG. 1; then, traversal text detection is performed on every pixel position of the entire text image to obtain all the text boxes in the text image (as shown by the solid-line boxes surrounding each word in FIG. 1); finally, the text box near the pen tip, that is, the text box of the text to be detected (the text box surrounding the text to be detected), is located, and the text is recognized and translated.
  • If only a partial area of the text image needs to be detected, the processing speed can be greatly increased, and the response time and the occupation of computing resources can be reduced.
  • However, because the point translation pen needs to recognize text of different font sizes, artificially limiting the area to be detected in the text image may cause the following problems: on the one hand, if the artificially limited detection area is too large, the beneficial effects (that is, improved processing speed, reduced response time and computing resource occupation, etc.) may not be obvious; on the other hand, if the artificially limited detection area is too small, it may fail to cover text in large fonts, making it impossible to completely detect and recognize large-font text, which limits the scope of use of the point translation pen.
  • At least one embodiment of the present disclosure provides a text detection method.
  • The text detection method includes: obtaining a text feature image corresponding to the text image based on the text image; taking a partial area of the text feature image close to the first edge of the text feature image as a basic area, wherein the first edge of the text feature image corresponds to the first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the basic area are positive pixels; grouping at least part of the positive pixels in the basic area to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, where the at least one text box includes the text box of the text to be detected.
  • Some embodiments of the present disclosure also provide a text detection device and a storage medium corresponding to the above text detection method.
  • The text detection method provided by the embodiments of the present disclosure performs text detection based on a preset basic area and the idea of connected domains, thereby reducing the amount of calculation for text detection (that is, reducing the number of traversals) and reducing the response time of text detection.
  • The text detection method is suitable for point translation pens and the like, and can increase the processing speed of point translation pens and improve user experience.
  • Fig. 2 is an exemplary flowchart of a text detection method provided by at least one embodiment of the present disclosure.
  • the text detection method provided by the embodiment of the present disclosure can be applied to a text image obtained by a translation pen, but is not limited thereto.
  • the text detection method includes but is not limited to step S100 to step S600.
  • Step S100 Based on the text image, a text feature image corresponding to the text image is obtained.
  • the text image may include an image captured by an image capture device or component.
  • the text detection method before step S100, the text detection method further includes step S000: collecting text images.
  • a text image can be captured using, for example, a point translation pen.
  • For example, the translation pen may include an image acquisition element such as a camera, which may be set on the pen body of the translation pen. Therefore, the point translation pen (the camera on the point translation pen) can be used to execute step S000, that is, to collect the text image.
  • For example, when the image acquisition element of the translation pen is used to collect a text image, the tip of the translation pen generally points below the text to be detected; relative to the text image, the tip of the translation pen is therefore close to one edge of the image near the text to be detected. To distinguish it from the other edges of the text image, this edge is called the first edge of the text image (refer to the first edge FE of the text image in FIG. 3).
  • the text image can be a grayscale image or a color image.
  • the shape of the text image may be a rectangle, a diamond, a circle, etc., which is not limited in the embodiment of the present disclosure.
  • the text image is a rectangle as an example for description, but it should not be regarded as a limitation of the present disclosure.
  • the text image can be an original image directly collected by an image collection device or component, or an image obtained after preprocessing the original image.
  • the text detection method provided by the embodiments of the present disclosure may further include an operation of preprocessing the text image .
  • Preprocessing can eliminate irrelevant information or noise information in the text image, so as to better process the text image.
  • the preprocessing may include, for example, processing such as scaling, cropping, gamma correction, image enhancement, or noise reduction filtering on the text image.
  • the text image includes at least one text, and the at least one text includes the text to be detected.
  • the text to be detected is usually close to the first edge (for example, the lower edge) of the text image. It should be noted that the text to be detected is the text that the user wants to detect.
  • a text image refers to a form of presenting text in a visual manner, such as pictures and videos including text.
  • For example, the text to be detected may include: a word in a language such as English, French, German, or Spanish, or a character or word in a language such as Chinese, Japanese, or Korean; but it is not limited to this.
  • FIG. 3 is a schematic diagram of a text image provided by at least one embodiment of the present disclosure.
  • the text image includes multiple texts.
  • For example, a text can be an English word (for example, "Tecent", "the", etc. in FIG. 3) or one or a string of numbers (for example, "61622214" in FIG. 3), but is not limited thereto.
  • In the text image shown in FIG. 3, the text to be detected may be "Tecent"; for example, in some examples, when the translation pen is used to select "Tecent" as the text to be detected, the tip of the translation pen is below "Tecent" (near the first edge FE), and the camera set on the pen body of the translation pen takes a picture to obtain the text image shown in FIG. 3.
  • For example, in step S100, a text detection neural network may be used to process the text image to obtain a text feature image, as well as the connection probability between each pixel in the text feature image and its directly adjacent pixels.
  • FIG. 4 is a schematic diagram of a text detection neural network provided by at least one embodiment of the present disclosure.
  • As shown in FIG. 4, the text detection neural network includes first to sixth convolution modules, first to fifth down-sampling modules, first to fourth up-sampling modules, first to fifth dimensionality reduction modules, and a classifier.
  • each of the first to sixth convolution modules may include a convolution layer.
  • the convolutional layer is the core layer of the convolutional neural network.
  • the convolutional layer can apply several convolution kernels (also called filters) to the input image to extract multiple types of features of the input image.
  • Each convolution kernel can extract one type of feature.
  • the convolution kernel is generally initialized in the form of a random decimal matrix.
  • the convolution kernel will learn to obtain reasonable weights.
  • The result obtained by applying a convolution kernel to the input image is called a feature map, and the number of feature maps is equal to the number of convolution kernels.
  • a text image is used as an input image. It should be noted that the embodiments of the present disclosure do not limit the number of convolutional layers included in the first to sixth convolution modules.
  • each convolution module described above may further include an activation layer.
  • the activation layer includes an activation function, which is used to introduce non-linear factors to the convolutional neural network, so that the convolutional neural network can better solve more complex problems.
  • For example, the activation function may include a rectified linear unit (ReLU) function, a leaky rectified linear unit (LeakyReLU) function, a sigmoid function, or a hyperbolic tangent (tanh) function.
  • the ReLU function and LeakyReLU function are unsaturated nonlinear functions, and the Sigmoid function and tanh function are saturated nonlinear functions.
  • each of the foregoing convolution modules may further include, for example, a batch normalization (BN) layer.
  • The batch normalization layer is used to perform batch normalization processing on the feature images of mini-batch samples (that is, input images), so that the gray value of each pixel of each feature image changes within a predetermined range, thereby reducing the computational difficulty and improving contrast.
  • the predetermined range may be [-1, 1], but is not limited to this.
  • the batch normalization layer may perform batch normalization processing on each feature image according to the mean value and variance of the feature image of each small batch of samples.
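  • For illustration only, the following Python sketch shows one possible form of such a convolution module (a convolution layer followed by a batch normalization layer and a ReLU activation layer); the class name, channel counts, and kernel size are assumptions for the example and are not specified by the present disclosure.

```python
# Illustrative sketch of one convolution module: convolution + batch
# normalization + ReLU. Channel counts and kernel size are assumptions.
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        # Each 3x3 convolution kernel extracts one feature map.
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1)
        # Batch normalization keeps pixel values within a predictable range.
        self.bn = nn.BatchNorm2d(out_channels)
        # ReLU introduces the non-linearity described above.
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.act(self.bn(self.conv(x)))

# Example: a grayscale text image batch of shape (N, 1, H, W).
x = torch.randn(1, 1, 64, 128)
features = ConvModule(1, 16)(x)   # -> (1, 16, 64, 128)
```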
  • each of the first to fifth down-sampling modules may include a down-sampling layer.
  • the down-sampling layer can be used to reduce the scale of the input image, simplify the calculation complexity, and reduce the phenomenon of over-fitting to a certain extent; on the other hand, the down-sampling layer can also perform feature compression to extract the input image Main features.
  • The down-sampling layer can reduce the size of feature images without changing their number. For example, if an input image with a size of 12×12 is sampled by a 2×2 down-sampling layer filter, a 6×6 feature image is obtained, which means that every 4 pixels of the input image are merged into 1 pixel of the feature image.
  • For example, the down-sampling layer may perform down-sampling using methods such as max pooling, average pooling, strided convolution, decimation (for example, selecting fixed pixels), or demultiplexing output (demuxout, which splits the input image into multiple smaller images).
  • the downsampling factors of the downsampling layers in the first to fifth downsampling modules are all 1/(2 ⁇ 2), and the present disclosure includes but is not limited to this.
  • each of the first to fourth upsampling modules may include an upsampling layer.
  • the up-sampling layer may use up-sampling methods such as strided transposed convolution and interpolation algorithms for up-sampling processing.
  • The interpolation algorithms may include, for example, bilinear interpolation and bicubic interpolation.
  • Upsampling is used to increase the size of the feature image, thereby increasing the data volume of the feature image.
  • the upsampling factors of the upsampling layers in the first to fourth upsampling modules are all 2 ⁇ 2, and the present disclosure includes but is not limited to this.
  • each of the first to fifth dimensionality reduction modules may include a convolutional layer using a 1 ⁇ 1 convolution kernel.
  • Each of the aforementioned dimensionality reduction modules can use 1×1 convolution kernels to reduce the dimensionality of the data and the number of feature images, thereby reducing the number of parameters in subsequent processing and the amount of calculation, and increasing the processing speed.
  • each of the first to fifth dimensionality reduction modules may include 10 1 ⁇ 1 convolution kernels, so that each dimensionality reduction module can correspondingly output 10 feature images.
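  • The following sketch illustrates the down-sampling, up-sampling, and 1×1 dimensionality reduction operations described above, assuming (as one possible choice) max pooling with a factor of 1/(2×2), bilinear up-sampling with a factor of 2×2, and a dimensionality reduction module with 10 output feature maps; the input shape is an assumption for the example.

```python
# Sketch of the down-sampling, up-sampling, and 1x1 dimensionality
# reduction operations; the specific methods chosen here are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

x = torch.randn(1, 64, 48, 96)            # 64 feature maps

down = nn.MaxPool2d(kernel_size=2)(x)     # -> (1, 64, 24, 48): size halved,
                                          #    number of feature maps unchanged
up = F.interpolate(down, scale_factor=2,  # -> (1, 64, 48, 96): size doubled
                   mode="bilinear", align_corners=False)

reduce = nn.Conv2d(64, 10, kernel_size=1) # 1x1 kernels reduce 64 maps to 10
reduced = reduce(x)                       # -> (1, 10, 48, 96)
```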
  • the classifier may include two softmax classifiers, namely a first softmax classifier and a second softmax classifier.
  • the first softmax classifier is used to predict whether each pixel is a text pixel (that is, a positive pixel) or a non-text pixel (that is, a negative pixel).
  • The second softmax classifier is used to predict, for each pixel, whether a connection relationship exists with each of its four directly adjacent pixels, that is, to perform connection classification prediction. It should be noted that in the present disclosure, any other feasible method can also be used to perform text classification prediction and connection classification prediction, including but not limited to the above-mentioned first and second softmax classifiers.
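  • As a non-limiting sketch, the two classifiers can be modeled as 1×1 convolutions followed by softmax: 2 channels for text/non-text classification, and 8 channels (a connect/non-connect pair for each of the 4 directly adjacent pixels) for connection classification. The input channel count and the channel layout of the pairs are assumptions for the example.

```python
# Hedged sketch of the two softmax classifiers described above.
import torch
import torch.nn as nn
import torch.nn.functional as F

features = torch.randn(1, 10, 48, 96)          # fused feature map group (assumed)

text_head = nn.Conv2d(10, 2, kernel_size=1)    # text vs. non-text logits
link_head = nn.Conv2d(10, 8, kernel_size=1)    # 4 neighbors x (connect, non-connect)

# Softmax over the 2 text channels gives the text/non-text probabilities.
text_prob = F.softmax(text_head(features), dim=1)         # (1, 2, 48, 96)

# Softmax over each (connect, non-connect) pair gives the connection
# probabilities toward each of the four directly adjacent pixels.
link_logits = link_head(features).view(1, 4, 2, 48, 96)
link_prob = F.softmax(link_logits, dim=2)                  # (1, 4, 2, 48, 96)
```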
  • It should be noted that, in the embodiments of the present disclosure, the convolutional layer, the down-sampling layer, the up-sampling layer, and the like each refer to the corresponding processing operation, that is, convolution processing, down-sampling processing, up-sampling processing, and the like; the descriptions are not repeated below.
  • For example, using the text detection neural network to process the text image to obtain the corresponding text feature image includes: using the first convolution module to perform convolution processing on the text image to obtain the first convolution feature map group; using the first down-sampling module to perform down-sampling processing on the first convolution feature map group to obtain the first down-sampled feature map group; using the second convolution module to perform convolution processing on the first down-sampled feature map group to obtain the second convolution feature map group; using the second down-sampling module to perform down-sampling processing on the second convolution feature map group to obtain the second down-sampled feature map group, and using the fifth dimensionality reduction module to perform dimensionality reduction processing on the second convolution feature map group to obtain the fifth dimensionality reduction feature map group; using the third convolution module to perform convolution processing on the second down-sampled feature map group to obtain the third convolution feature map group; using the third down-sampling module to perform down-sampling processing on the third convolution feature map group to obtain the third down-sampled feature map group; and so on through the remaining convolution, down-sampling, up-sampling, dimensionality reduction, and fusion processing shown in FIG. 4, until the classifier performs classification processing to obtain the text classification prediction image and the connection probability prediction image.
  • each feature map group usually includes multiple feature images.
  • For example, the fusion processing may include bit-wise addition processing (ADD). Bit-wise addition (ADD) usually refers to adding the value at each row and column of the image matrix of each channel of one group of input images to the value at the corresponding row and column of the image matrix of the corresponding channel of another group of input images. The two groups of images input to the bit-wise addition ADD have the same number of channels, and the image output by the bit-wise addition ADD also has the same number of channels as either group of input images.
  • In other words, fusion processing means adding the value of each pixel of each feature image in one feature map group to the value of the corresponding pixel of the corresponding feature image in another feature map group to obtain a new feature image.
  • Fusion processing does not change the number and size of feature images.
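  • A minimal sketch of this fusion: two feature map groups of identical shape are added element by element, leaving the number and size of the feature images unchanged (shapes here are assumptions for the example).

```python
# Bit-wise addition (ADD) fusion of two feature map groups.
import torch

a = torch.randn(1, 10, 24, 48)   # one feature map group
b = torch.randn(1, 10, 24, 48)   # another group, same channel count and size
fused = a + b                    # (1, 10, 24, 48): same channels, same size
```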
  • For example, the text classification prediction image includes 2 feature images, and the connection probability prediction image includes 8 feature images.
  • the value of the pixel in each feature image in the text classification prediction image and the connection probability prediction image is greater than or equal to 0 and less than or equal to 1, and represents the text prediction probability or the connection prediction probability.
  • The feature images in the text classification prediction image represent probability maps of whether each pixel is text, and the feature images in the connection probability prediction image represent probability maps of whether each pixel is connected to its directly adjacent pixels.
  • For example, the two feature images in the text classification prediction image include a text probability image and a non-text probability image; the text probability image indicates the predicted probability of each pixel belonging to text (that is, the type probability of each pixel), the non-text probability image indicates the predicted probability of each pixel belonging to non-text, and the values of corresponding pixels of the two feature images add up to 1.
  • For example, a type probability threshold may be set, for example, 0.75; if the predicted probability of a pixel belonging to text is greater than or equal to the type probability threshold, the pixel belongs to text, that is, the pixel is a positive pixel; if the predicted probability of a pixel belonging to text is less than the type probability threshold, the pixel is non-text, that is, the pixel is a negative pixel.
  • FIG. 5 is a schematic diagram of a pixel adjacency relationship provided by at least one embodiment of the present disclosure.
  • As shown in FIG. 5, the pixels PX1 to PX4 are the four pixels directly adjacent to the pixel PX0, respectively located above, below, to the left of, and to the right of the pixel PX0.
  • the pixel array in each feature image is arranged in multiple rows and multiple columns.
  • For example, the direction C1 may indicate the first direction perpendicular to the first edge (of both the text image and the text feature image), such as the column direction, and the direction R1 may indicate the second direction parallel to the first edge, such as the row direction.
  • the eight feature images in the connection probability prediction image may include the first connection classification image, the second connection classification image, the third connection classification image, the fourth connection classification image, the fifth connection classification image, the sixth connection classification image, The seventh connection classification image and the eighth connection classification image.
  • For example, the value of the pixel PX0 in the first connection classification image represents the connection prediction probability from the pixel PX0 toward the pixel PX1, and its value in the second connection classification image represents the non-connection prediction probability from the pixel PX0 toward the pixel PX1; its value in the third connection classification image represents the connection prediction probability from the pixel PX0 toward the pixel PX2, and its value in the fourth connection classification image represents the non-connection prediction probability from the pixel PX0 toward the pixel PX2; its value in the fifth connection classification image represents the connection prediction probability from the pixel PX0 toward the pixel PX3, and its value in the sixth connection classification image represents the non-connection prediction probability from the pixel PX0 toward the pixel PX3; similarly, the seventh and eighth connection classification images represent the connection and non-connection prediction probabilities from the pixel PX0 toward the pixel PX4. Correspondingly, the sum of the values of corresponding pixels of the first and second connection classification images is 1, the sum for the third and fourth connection classification images is 1, the sum for the fifth and sixth connection classification images is 1, and the sum for the seventh and eighth connection classification images is 1.
  • For example, a connection probability threshold may be set, for example, 0.7; when the connection prediction probability between two directly adjacent pixels is greater than or equal to the connection probability threshold, the two directly adjacent pixels can be connected to each other; when the connection prediction probability between two directly adjacent pixels is less than the connection probability threshold, the two directly adjacent pixels cannot be connected to each other.
  • It should be noted that the above values of the type probability threshold and the connection probability threshold are only illustrative; both thresholds can be set according to actual application requirements.
  • the text feature image is a binary image, but it is not limited thereto.
  • For example, obtaining the text feature image and the connection probability between each pixel in the text feature image and its directly adjacent pixels may include: binarizing each pixel of the text probability image in the text classification prediction image according to the comparison between its pixel value (the predicted probability of belonging to text, that is, the type probability) and the type probability threshold, to obtain the text feature image; the connection probability between each pixel in the text feature image and its directly adjacent pixels can then be looked up in the connection probability prediction image. In this way, a text feature image including positive pixels and negative pixels can be obtained.
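  • This thresholding can be sketched as follows, assuming the illustrative thresholds of 0.75 (type probability) and 0.7 (connection probability) given above; the array names and shapes are assumptions for the example.

```python
# Sketch of obtaining the binary text feature image and the positive
# connection relationships from the prediction images.
import numpy as np

type_threshold = 0.75
link_threshold = 0.7

text_prob = np.random.rand(48, 96)      # per-pixel probability of being text
link_prob = np.random.rand(4, 48, 96)   # connection probability toward the
                                        # 4 directly adjacent pixels

# Positive pixels: type probability >= type probability threshold.
text_feature_image = (text_prob >= type_threshold).astype(np.uint8)

# Connection probabilities at or above the threshold indicate that two
# directly adjacent pixels can be connected.
positive_links = link_prob >= link_threshold   # boolean, shape (4, 48, 96)
```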
  • FIG. 6 is a schematic diagram of a text feature image provided by at least one embodiment of the present disclosure. As shown in FIG. 6, the text feature image includes positive pixels (as shown by each gray square in FIG. 6) and negative pixels (as shown by each white square in FIG. 6).
  • the size of the text feature image is the same as the size of each feature image in the text classification prediction image and the connection probability prediction image.
  • It should be noted that the text detection neural network shown in FIG. 4 is schematic. In practical applications, a neural network with another structure can also be used to perform the operation of step S100; of course, the text detection neural network shown in FIG. 4 can also be partially modified to obtain a new text detection neural network that can likewise perform the operation of step S100.
  • For example, the fourth up-sampling module and the fifth dimensionality reduction module in the text detection neural network shown in FIG. 4, together with the corresponding fusion processing, can be omitted, and the third fusion feature map group can be used directly for the classification processing to obtain the text classification prediction image and the connection probability prediction image. It should be noted that the embodiments of the present disclosure do not limit this.
  • For example, in some other embodiments, each pixel in the text feature image may be regarded as directly adjacent to the 8 pixels at its top, bottom, left, right, top-left, bottom-left, top-right, and bottom-right; in this case, the connection probability prediction image correspondingly includes 16 feature images.
  • the embodiments of the present disclosure include but are not limited thereto.
  • Compared with that case, the text detection method in which each pixel has 4 directly adjacent pixels can reduce the amount of calculation, increase the processing speed, and alleviate the text-sticking problem that may occur in the resulting text boxes.
  • Step S200 Use a partial area in the text feature image close to the first edge of the text feature image as a basic area, where at least some pixels in the basic area are positive pixels.
  • the first edge of the text feature image corresponds to the first edge of the text image
  • the to-be-detected text in the text image is close to the first edge of the text image (refer to the related description in FIG. 3).
  • For example, the lower partial area of the text feature image, that is, the partial area close to the first edge of the text feature image (as shown by the dashed box in FIG. 6), is taken as the basic area; at least part of the pixels in the basic area are positive pixels (as shown by the gray squares in the dashed box in FIG. 6).
  • For example, if the size of the text feature image is h×w (that is, h rows and w columns of pixels), the size of the basic area can be set to h_base×w (that is, h_base rows and w columns of pixels), where h, w, and h_base are all positive integers, and h_base/h < 1.
  • For example, h_base/h ≤ 1/2; for example, in some examples, the value range of h_base/h is, for example, 1/10 to 1/2, such as 1/5 to 2/5, or 1/4 to 1/3.
  • The value range of h_base/h can be set according to actual application requirements, for example, according to the range of font sizes that need to be recognized and the size of the coverage area of the text image. It should be noted that if the value of h_base/h is too small, the basic area may contain no positive pixels, and the text detection method provided by the embodiments of the present disclosure cannot be effectively implemented; if the value of h_base/h is too large, the reduction in the amount of calculation for text detection may be insignificant, weakening the beneficial effects of the embodiments of the present disclosure; therefore, the value of h_base/h should be set reasonably according to actual application requirements.
  • For example, the width of the basic area may be set to be the same as the width of the text feature image, that is, also w.
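  • Assuming, for illustration, that the first edge is the lower edge of the text feature image, the basic area can be sketched as a slice of the bottom h_base rows; the choice h_base = h // 3 below is only an illustrative value within the suggested range.

```python
# Sketch of selecting the basic area from a binary text feature image.
import numpy as np

text_feature_image = (np.random.rand(48, 96) >= 0.75)  # h=48, w=96, binary
h, w = text_feature_image.shape
h_base = h // 3                    # illustrative: h_base / h <= 1/2

# The h_base rows closest to the first (lower) edge, all w columns.
base_area = text_feature_image[h - h_base:, :]
```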
  • Step S300 Group at least part of the positive pixels in the basic area to obtain at least one connected domain.
  • For example, in step S300, based on the union-find algorithm, at least part of the positive pixels in the basic area can be grouped according to the connection probability between each positive pixel in the basic area and its directly adjacent pixels, to obtain at least one connected domain.
  • For example, the union-find algorithm may include: first, constructing an index set based on at least part of the positive pixels in the basic area, where the index set includes the at least part of the positive pixels in the basic area, each positive pixel in the index set corresponds to a root node, and the initial value of the root node of each positive pixel is the pixel itself; then, in response to any directly adjacent pixel of a positive pixel in the index set being a positive pixel that has a positive connection relationship with that positive pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of that positive pixel; and finally, taking each group of positive pixels having the same root-node value as one connected domain, to obtain the at least one connected domain.
  • For example, in some examples, the at least part of the positive pixels used to construct the index set includes all the positive pixels in the basic area; in other examples, it includes only the positive pixels in the one or several rows (settable according to actual requirements) of the basic area closest to the first edge of the text feature image, which reduces the amount of calculation and increases the processing speed.
  • the embodiments of the present disclosure do not limit this.
  • For example, the pixels directly adjacent to each positive pixel include the pixels directly adjacent to it in the first direction perpendicular to the first edge of the text feature image and the pixels directly adjacent to it in the second direction parallel to the first edge of the text feature image. For example, each positive pixel has four directly adjacent pixels.
  • For example, when the connection probability between two directly adjacent pixels is greater than the connection probability threshold, there is a positive connection relationship between the two pixels.
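  • A minimal union-find sketch of this grouping follows; is_positive and has_positive_link are hypothetical stand-ins for the thresholded outputs of the text detection neural network, and a symmetric connection relationship is assumed for the example.

```python
# Union-find grouping of positive pixels into connected domains,
# following the root-node description above.
def find(root, p):
    # Follow root-node values until a pixel is its own root
    # (with path compression for efficiency).
    while root[p] != p:
        root[p] = root[root[p]]
        p = root[p]
    return p

def union(root, p, q):
    rp, rq = find(root, p), find(root, q)
    if rp != rq:
        root[rq] = rp   # set one pixel's root node to the other's

def group_positive_pixels(is_positive, has_positive_link, rows, cols):
    # Each positive pixel initially has itself as its root node.
    root = {(r, c): (r, c) for r in rows for c in range(cols)
            if is_positive(r, c)}
    for (r, c) in list(root):
        # Checking only the right and down neighbors suffices; the other
        # two directions are covered when the neighbor itself is visited.
        for (nr, nc) in ((r, c + 1), (r + 1, c)):
            if (nr, nc) in root and has_positive_link((r, c), (nr, nc)):
                union(root, (r, c), (nr, nc))
    # Each group of pixels with the same root-node value is one
    # connected domain.
    domains = {}
    for p in root:
        domains.setdefault(find(root, p), []).append(p)
    return list(domains.values())
```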
  • the above-mentioned at least one connected domain may be denoised.
  • For example, connected domains in the above at least one connected domain whose area is less than T1 pixels or whose width (or height) is less than T2 pixels may be removed; the one or more connected domains remaining after the denoising are used to determine the final connected domain corresponding to the text to be detected (refer to the related description of step S400 below).
  • For example, in some examples, T1 may be 100 to 300, such as 200, but is not limited thereto; T2 may be 5 to 15, such as 10, but is not limited thereto. It should be understood that the values of T1 and T2 can be set according to actual application requirements.
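  • The denoising can be sketched as a simple filter over the connected domains obtained above, using the illustrative thresholds T1 = 200 and T2 = 10; each domain is represented as a list of (row, column) coordinates.

```python
# Remove noise-like connected domains: area < T1, or width/height < T2.
T1, T2 = 200, 10

def denoise(domains):
    kept = []
    for d in domains:
        rows = [r for r, _ in d]
        cols = [c for _, c in d]
        height = max(rows) - min(rows) + 1
        width = max(cols) - min(cols) + 1
        if len(d) >= T1 and width >= T2 and height >= T2:
            kept.append(d)
    return kept
```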
  • Step S400 Expand at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain.
  • At least one final connected domain includes a final connected domain corresponding to the text to be detected.
  • FIG. 7 is an exemplary flowchart corresponding to step S400 shown in FIG. 2 provided by at least one embodiment of the present disclosure.
  • step S400 shown in FIG. 7 will be described in detail with reference to the text feature image shown in FIG. 6.
  • Step S400 includes steps S410 to S450.
  • Step S410 Extract the positive pixel farthest from the first edge of the text feature image in the first direction perpendicular to the first edge of the text feature image in the current connected domain as the first positive pixel.
  • For example, the current connected domain is one of the at least one connected domain in the basic area.
  • For example, in the text feature image shown in FIG. 6, the positive pixels farthest from the first edge of the text feature image include pixels 1-5, so pixels 1-5 are all taken as first positive pixels. For example, the first positive pixels (i.e., pixels 1-5) are located in the same row. For example, pixels 1-2 belong to the same connected domain, so pixels 1-2 have the same root node; pixels 3-5 belong to the same connected domain, so pixels 3-5 have the same root node (different from the root node of pixels 1-2).
  • Step S420 Use a pixel in the text feature image that is on the side of the first positive pixel away from the first edge of the text feature image and directly adjacent to the first positive pixel as the first adjacent pixel.
  • For example, in FIG. 6, the first adjacent pixels include pixels 6-8, etc.; pixel 6 is directly adjacent to pixel 1, pixel 7 is directly adjacent to pixel 2, and pixel 8 is directly adjacent to pixel 4; the first adjacent pixels of pixels 3 and 5 are not given reference numerals.
  • Step S430 In response to the first adjacent pixel being a positive pixel and there being a positive connection relationship between the first positive pixel and the first adjacent pixel, modify the value of the root node of the first adjacent pixel to the value of the root node of the first positive pixel, and add the first adjacent pixel to a first adjacent pixel set.
  • For example, when the connection probability between the first positive pixel and the first adjacent pixel is greater than the connection probability threshold, there is a positive connection relationship between the two.
  • the first set of neighboring pixels has a form similar to the aforementioned index set, that is, each pixel in the first set of neighboring pixels also has a corresponding root node.
  • For example, in FIG. 6, pixel 6 is a positive pixel and has a positive connection relationship with pixel 1, so pixel 6 can be added to the first adjacent pixel set, and the value of the root node of pixel 6 is set to the value of the root node of pixel 1. Similarly, pixel 7 can be added to the first adjacent pixel set, and the value of the root node of pixel 7 is the same as that of pixel 2 (that is, the same as that of pixels 1 and 6); pixel 8 can also be added to the first adjacent pixel set, and the value of the root node of pixel 8 is the same as that of pixel 3.
  • Step S440 Expand the first set of adjacent pixels in a second direction parallel to the first edge of the text feature image.
  • For example, step S440 may include: adding, to the first adjacent pixel set, any positive pixel that is directly adjacent, in the second direction parallel to the first edge of the text feature image, to a pixel in the first adjacent pixel set and has a positive connection relationship with it, until the first adjacent pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.
  • the judgment condition of the positive connection relationship in step S440 is the same as the judgment condition in the aforementioned step S430.
  • For example, in FIG. 6, pixel 9 is a positive pixel and has a positive connection relationship with pixel 6, so pixel 9 can be added to the first adjacent pixel set, and the value of the root node of pixel 9 is the same as that of pixel 6; further, pixel 10 is a positive pixel and has a positive connection relationship with pixel 9, so pixel 10 can also be added to the first adjacent pixel set, and the value of the root node of pixel 10 is the same as that of pixel 9. In FIG. 6, the first adjacent pixel set includes only pixels 6-8 before the expansion, and includes pixels 6-11 after the expansion; among them, pixels 6-7 and 9-11 have the same root node.
  • Step S450 Extend the current connected domain to include all pixels in the first adjacent pixel set, and continue to expand the current connected domain in a direction away from the first edge of the text feature image until it cannot continue to expand.
  • For example, in FIG. 6, the connected domain including pixels 1-2 in the basic area (the first connected domain) also includes pixels 6-7 and 9-11 after the first expansion, and the connected domain including pixels 3-5 (the second connected domain) includes pixel 8 after the first expansion.
  • Then, steps S410 to S450 may be repeated to complete the second expansion of the connected domains.
  • For example, starting from the pixels in the first adjacent pixel set obtained during the first expansion (i.e., pixels 6-11), after the second expansion the first connected domain further includes pixels 12-14, and the second connected domain further includes pixels 15-16.
  • After the expansion is completed, the first connected domain also includes pixels 6-7, 9-14, 17, and 19-20 outside the basic area, and the second connected domain also includes pixels 8, 15-16, 18, and 21 outside the basic area. Thus, two final connected domains are obtained.
  • the expansion of the connected domain in the text feature image shown in FIG. 6 is exemplary rather than restrictive.
  • It should be noted that the basic area may also include connected domains whose area does not change after the processing of step S400, that is, connected domains that cannot be expanded outward (cannot be expanded in the direction away from the first edge of the text feature image); such connected domains are also regarded as final connected domains after the processing of step S400.
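  • The expansion of steps S410 to S450 for one connected domain can be sketched as follows, again assuming for illustration that the first edge is the lower edge (so a smaller row index is farther from the first edge) and reusing the hypothetical is_positive and has_positive_link stand-ins; root-node bookkeeping is omitted for brevity.

```python
# Sketch of expanding one connected domain away from the first edge.
def expand_domain(domain, is_positive, has_positive_link, width):
    domain = set(domain)   # set of (row, col) positive pixels
    while True:
        # S410: positive pixels of the domain farthest from the first edge.
        top = min(r for r, _ in domain)
        first_positive = [(r, c) for r, c in domain if r == top]
        # S420/S430: directly adjacent pixels one row farther from the
        # first edge that are positive and positively connected.
        adjacent = set()
        for (r, c) in first_positive:
            n = (r - 1, c)
            if n[0] >= 0 and is_positive(*n) and has_positive_link((r, c), n):
                adjacent.add(n)
        # S440: expand the adjacent-pixel set sideways along the row.
        frontier = list(adjacent)
        while frontier:
            r, c = frontier.pop()
            for n in ((r, c - 1), (r, c + 1)):
                if (0 <= n[1] < width and n not in adjacent
                        and is_positive(*n) and has_positive_link((r, c), n)):
                    adjacent.add(n)
                    frontier.append(n)
        # S450: stop once the domain can no longer grow away from the edge.
        if not adjacent:
            return domain
        domain |= adjacent
```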
  • Step S500 Determine at least one feature box corresponding to the at least one final connected domain, and map the at least one feature box to the text image to obtain at least one text box, where the at least one text box includes a text box of the text to be detected.
  • For example, determining at least one feature box corresponding to at least one final connected domain may include: performing contour detection on the at least one final connected domain using a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain using the minimum bounding rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.
  • For example, the contour detection algorithm may include but is not limited to the OpenCV contour detection (findContours) function, and the minimum bounding rectangle algorithm may include but is not limited to the OpenCV minimum bounding rectangle (minAreaRect) function.
  • the characteristic box may be a rectangular box, and correspondingly, the text box may also be a rectangular box. It should be noted that the embodiments of the present disclosure include but are not limited to this.
  • For example, the mapping includes two processes: scale transformation and projection. For example, taking the case where the size of the text feature image is 1/(2×2) of the size of the text image, in the scale transformation the width and height of the feature box are each doubled; in the projection, the relative position of the text box within the text image is kept consistent with the relative position of the feature box within the text feature image, so that the corresponding text box is obtained.
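  • A sketch of step S500 using the OpenCV functions named above (the OpenCV 4.x return signature of findContours is assumed), with the feature boxes mapped back into the text image by doubling their coordinates; the mask contents are an illustrative assumption.

```python
# Contour detection, minimum bounding rectangles, and 2x scale mapping.
import cv2
import numpy as np

mask = np.zeros((48, 96), dtype=np.uint8)
mask[30:40, 10:40] = 255        # one final connected domain, as an example

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                               cv2.CHAIN_APPROX_SIMPLE)
text_boxes = []
for contour in contours:
    rect = cv2.minAreaRect(contour)   # feature box in the feature image
    box = cv2.boxPoints(rect)         # its 4 corner points
    text_boxes.append(box * 2.0)      # scale 2x into the text image
```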
  • each text box includes one text.
  • In the text detection method provided by the embodiments of the present disclosure, only a partial area of the text image near the text to be detected ("Tecent" in FIG. 3) needs to be detected, so that only the text boxes of part of the text in the text image (including the text box of the text to be detected) are obtained.
  • By contrast, the common text detection method corresponding to the text image shown in FIG. 1 requires traversal detection of the entire area of the text image to obtain the text boxes of all the text in the text image. Therefore, the text detection method provided by the embodiments of the present disclosure can reduce the amount of calculation for text detection (that is, reduce the number of traversals) and reduce the response time of text detection.
  • Step S600 Determine the text box of the text to be detected from at least one text box.
  • For example, the text image is captured by the camera set on the pen body of the translation pen, and the text to be detected is selected by the tip of the translation pen. Since the relative position of the tip of the translation pen and the camera is fixed, the relative position of the pen tip (a virtual pen tip assumed on the plane where the text image is located) and the text image captured by the camera is also stable. Therefore, step S600 can be implemented based on this principle.
  • FIG. 8 is an exemplary flowchart corresponding to step S600 shown in FIG. 2 provided by at least one embodiment of the present disclosure, and FIG. 9 is a schematic diagram of the operation corresponding to step S600 shown in FIG. 2 provided by at least one embodiment of the present disclosure.
  • step S600 shown in FIG. 8 will be described in detail with reference to FIG. 9.
  • The text box of the text to be detected is determined from the at least one text box; that is, step S600 includes steps S610 and S620.
  • Step S610 Construct a virtual detection box in the text image.
  • Step S620 Calculate the overlap area between the virtual detection box and each text box, and use the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.
  • For example, a pen tip of the translation pen may be virtualized in the text image, and the virtual detection box (shown by the gray solid-line box in FIG. 9) may be constructed based on the virtual pen tip.
  • For example, the virtual pen tip (shown as the black dot in FIG. 9) can be set on the first edge of the text image, but is not limited to this; for example, in other examples, the virtual pen tip can be set outside the text image and close to the first edge.
  • For example, the virtual pen tip may generally be set on the perpendicular bisector of the first edge of the text image, or near that perpendicular bisector; the embodiments of the present disclosure do not limit this. It should be understood that the virtual pen tip can be set according to actual application requirements, which is not limited in the embodiments of the present disclosure.
  • For example, the height of the virtual detection box may be set to H = H1 + H2, where H1 represents the smallest distance, in the first direction (i.e., the column direction) perpendicular to the first edge, between the virtual pen tip and the centers of the text boxes in the text image, and H2 is a preset height value; for example, H2 can be set to a height value of, for example, 30 pixels, but is not limited to this.
  • For example, the width W of the virtual detection box is a preset width value; for example, W can be set to a width value of, for example, 60 pixels, but is not limited thereto. It should be understood that H2 and W can be set according to actual application requirements, which are not limited in the embodiments of the present disclosure.
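  • Step S620 can be sketched as an overlap-area comparison; axis-aligned (x, y, width, height) boxes are assumed here for simplicity, whereas the rotated rectangles produced by minAreaRect would require a polygon intersection instead.

```python
# Pick the text box with the largest overlap with the virtual detection box.
def overlap_area(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = min(ax + aw, bx + bw) - max(ax, bx)
    h = min(ay + ah, by + bh) - max(ay, by)
    return max(w, 0) * max(h, 0)   # zero when the boxes do not intersect

def pick_text_box(virtual_box, text_boxes):
    # The text box with the largest overlap area is taken as the text
    # box of the text to be detected.
    return max(text_boxes, key=lambda tb: overlap_area(virtual_box, tb))
```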
  • For example, in some embodiments, after the text box of the text to be detected is determined, the text detection method provided in the embodiments of the present disclosure may further include: performing text recognition processing on the text to be detected based on the text box of the text to be detected.
  • a common text processing method can be used to perform text recognition processing, which is not limited in the embodiments of the present disclosure.
  • For example, commonly used text processing methods may include, but are not limited to, using a neural network (such as the Multi-Object Rectified Attention Network (MORAN), etc.) for text recognition processing.
  • text translation can also be performed based on the result of text recognition processing to obtain and output the translation result of the text to be detected.
  • a dictionary database is used to index the results of text recognition processing to retrieve translation results.
  • the translation result of the text to be detected can be displayed on a display, or can be output via a speaker or the like.
  • It should be noted that the flow of the above text detection method may include more or fewer operations, and these operations may be executed sequentially or in parallel.
  • Although the flow of the text detection method described above includes multiple operations appearing in a specific order, it should be clearly understood that the order of the multiple operations is not limited.
  • The text detection method described above can be executed once, or multiple times according to predetermined conditions.
  • The text detection neural network and the various functional modules and functional layers in the text detection neural network can be implemented by software, hardware, firmware, or any combination thereof, so as to execute the corresponding processing procedures.
  • The text detection method provided by the embodiments of the present disclosure can perform text detection based on a preset base area and the idea of connected domains, thereby reducing the amount of calculation for text detection (that is, reducing the number of traversals) and reducing the response time of text detection.
  • This text detection method is suitable for point translation pens, and can increase the processing speed of the point translation pen and improve the user experience.
  • FIG. 10 is a schematic block diagram of a text detection device provided by at least one embodiment of the present disclosure.
  • the text detection device 1000 includes a memory 1001 and a processor 1002. It should be understood that the components of the text detection device 1000 shown in FIG. 10 are only exemplary and not restrictive. According to actual application requirements, the text detection device 1000 may also include other components.
  • The memory 1001 is used to store a text image and computer-readable instructions; the processor 1002 is used to read the text image and run the computer-readable instructions; when run by the processor 1002, the computer-readable instructions execute one or more steps of the text detection method according to any of the above embodiments.
  • the text detection device may further include an image acquisition element 1003.
  • the image capture element 1003 is used to capture text images.
  • the image capture element 1003 is the image capture device or element described in the embodiment of the text detection method.
  • the image capture element 1003 may be various types of cameras.
  • the text detection device 1000 may be a point translation pen, but it is not limited thereto.
  • the translation pen is used to select the text to be detected.
  • the image acquisition component 1003 may be arranged on a point translation pen, for example, the image acquisition component 1003 may be a camera arranged on a point translation pen.
  • The memory 1001 and the processor 1002 can also be integrated in the point translation pen; that is, the image acquisition element 1003, the memory 1001, and the processor 1002 can all be integrated in the point translation pen. The embodiments of the present disclosure include but are not limited thereto.
  • the text detection device 1000 may further include an output unit configured to output the recognition result and/or the translation result of the text to be detected.
  • the output unit may include a display, a speaker, etc.
  • the display may be used to display the recognition result and/or translation result of the text to be detected, and the speaker may be used to output the recognition result and/or translation result of the text to be detected in the form of voice.
  • the translation pen may also include a communication module, which is used to implement communication between the translation pen and the output unit, for example, to transmit the translation result to the output unit.
  • the processor 1002 may control other components in the text detection apparatus 1000 to perform desired functions.
  • The processor 1002 may be a central processing unit (CPU), a tensor processing unit (TPU), or another device with data processing capability and/or program execution capability.
  • The central processing unit (CPU) may adopt an X86 or ARM architecture, etc.
  • A GPU may be individually and directly integrated onto the motherboard, or built into the north bridge chip of the motherboard; a GPU may also be built into the central processing unit (CPU).
  • the memory 1001 may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include random access memory (RAM) and/or cache memory (cache), for example.
  • the non-volatile memory may include, for example, read only memory (ROM), hard disk, erasable programmable read only memory (EPROM), portable compact disk read only memory (CD-ROM), USB memory, flash memory, and the like.
  • One or more computer-readable instructions may be stored on the computer-readable storage medium, and the processor 1002 may run the computer-readable instructions to implement various functions of the text detection apparatus 1000.
  • For example, components such as the image acquisition element 1003, the memory 1001, the processor 1002, and the output unit may communicate with one another through a network connection. The network may include a wireless network, a wired network, and/or any combination of a wireless network and a wired network.
  • the network may include a local area network, the Internet, a telecommunications network, the Internet of Things (Internet of Things) based on the Internet and/or a telecommunications network, and/or any combination of the above networks, and so on.
  • the wired network may, for example, use twisted pair, coaxial cable, or optical fiber transmission for communication, and the wireless network may use, for example, a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi.
  • the present disclosure does not limit the types and functions of the network here.
  • FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
  • one or more computer-readable instructions 1101 may be stored non-transitory on the storage medium 1100.
  • When the computer-readable instructions 1101 are executed by a computer, one or more steps in the text detection method described above can be executed.
  • the storage medium 1100 can be applied to the above-mentioned text detection device 1000, for example, it can be used as the memory 1001 in the text detection device 1000.
  • For the description of the storage medium 1100, reference may be made to the description of the memory in the embodiments of the text detection device 1000; repetitions are not described again here.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

A text detection method and device, and a storage medium. The text detection method includes: obtaining, based on a text image, a text feature image corresponding to the text image; taking a partial area of the text feature image close to a first edge of the text feature image as a base area, where the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected is close to the first edge of the text image, and at least part of the pixels in the base area are positive pixels; grouping at least part of the positive pixels in the base area to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, where the at least one text box includes the text box of the text to be detected.

Description

Text detection method and device, and storage medium

Technical Field

The embodiments of the present disclosure relate to a text detection method, a text detection device, and a storage medium.

Background

With the development of science and technology, when a user reads a foreign-language article and encounters an unfamiliar word to look up, the user is no longer limited to using a dictionary, an electronic dictionary, a mobile phone APP (application), or the like, and can also use, for example, a translation pen. A dictionary is not easy to carry, and the efficiency of flipping through it is low; mobile phone APPs and electronic dictionaries use keyboard input, which is not only time-consuming and cumbersome to operate, but also tends to interrupt one's train of thought and distract one's attention. In contrast, a translation pen has the advantages of being convenient to use, easy to carry, and closer to the user's reading habits, and can provide the user with a good translation and query experience when reading foreign-language articles.

Summary

At least one embodiment of the present disclosure provides a text detection method, including: obtaining, based on a text image, a text feature image corresponding to the text image; taking a partial area of the text feature image close to a first edge of the text feature image as a base area, where the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the base area are positive pixels; grouping at least part of the positive pixels in the base area to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, where the at least one text box includes the text box of the text to be detected.
For example, in the text detection method provided by some embodiments of the present disclosure, in the case where the text feature image includes h rows and w columns of pixels, the base area includes h_base rows and w columns of pixels, where h, w, and h_base are all positive integers, and h_base/h ≤ 1/2.

For example, in the text detection method provided by some embodiments of the present disclosure, there is a link probability between each pixel in the text feature image and its directly adjacent pixels; grouping at least part of the positive pixels in the base area to obtain the at least one connected domain includes: based on a union-find algorithm, grouping the at least part of the positive pixels in the base area according to the link probability between each positive pixel of the at least part of the positive pixels and its directly adjacent pixels, to obtain the at least one connected domain.

For example, in the text detection method provided by some embodiments of the present disclosure, based on the union-find algorithm, grouping the at least part of the positive pixels in the base area according to the link probability between each positive pixel of the at least part of the positive pixels in the base area and its directly adjacent pixels, to obtain the at least one connected domain, includes: constructing an index set based on the at least part of the positive pixels in the base area, where the index set includes the at least part of the positive pixels in the base area, and in the index set, each positive pixel corresponds to a root node whose initial value is the pixel itself; in response to any directly adjacent pixel of each positive pixel in the index set being a positive pixel and there being a positive link relation between the positive pixel and the directly adjacent pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of the positive pixel; and taking each group of positive pixels having the same root-node value as one connected domain, to obtain the at least one connected domain.

For example, in the text detection method provided by some embodiments of the present disclosure, in the case where the link probability between each positive pixel in the base area and a directly adjacent pixel is greater than a link probability threshold, it is determined that there is the positive link relation between the positive pixel and the directly adjacent pixel.

For example, in the text detection method provided by some embodiments of the present disclosure, the directly adjacent pixels of each positive pixel in the base area include: pixels directly adjacent to the positive pixel in a first direction perpendicular to the first edge of the text feature image, and pixels directly adjacent to the positive pixel in a second direction parallel to the first edge of the text feature image.

For example, in the text detection method provided by some embodiments of the present disclosure, each positive pixel in the base area has four directly adjacent pixels.

For example, in the text detection method provided by some embodiments of the present disclosure, expanding the at least one connected domain in the direction away from the first edge of the text feature image to obtain the at least one final connected domain corresponding to the at least one connected domain includes: extracting, as a first positive pixel, the positive pixel in the current connected domain that is farthest from the first edge of the text feature image in the first direction perpendicular to the first edge of the text feature image; taking, as a first neighboring pixel, the pixel in the text feature image that is on the side of the first positive pixel away from the first edge of the text feature image and directly adjacent to the first positive pixel; in response to the first neighboring pixel being a positive pixel and there being a positive link relation between the first positive pixel and the first neighboring pixel, modifying the value of the root node of the first neighboring pixel to the value of the root node of the first positive pixel, and adding the first neighboring pixel to a first neighboring pixel set; expanding the first neighboring pixel set in the second direction parallel to the first edge of the text feature image; and expanding the current connected domain to include all the pixels in the first neighboring pixel set, and continuing to expand the current connected domain in the direction away from the first edge of the text feature image until no further expansion is possible.

For example, in the text detection method provided by some embodiments of the present disclosure, expanding the first neighboring pixel set in the second direction parallel to the first edge of the text feature image includes: adding, to the first neighboring pixel set, positive pixels that are directly adjacent to any pixel in the first neighboring pixel set in the second direction parallel to the first edge of the text feature image and have a positive link relation with that pixel, until the first neighboring pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.

For example, in the text detection method provided by some embodiments of the present disclosure, the at least one final connected domain includes connected domains in the base area that cannot be expanded in the direction away from the first edge of the text feature image.
For example, in the text detection method provided by some embodiments of the present disclosure, obtaining, based on the text image, the text feature image corresponding to the text image includes: processing the text image with a text detection neural network to obtain the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels.

For example, in the text detection method provided by some embodiments of the present disclosure, the text detection neural network includes first to sixth convolution modules, first to fifth downsampling modules, first to fourth upsampling modules, and a classifier; processing the text image with the text detection neural network to obtain the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels includes: performing convolution processing on the text image with the first convolution module to obtain a first convolution feature map group; performing downsampling processing on the first convolution feature map group with the first downsampling module to obtain a first downsampled feature map group; performing convolution processing on the first downsampled feature map group with the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature map group with the second downsampling module to obtain a second downsampled feature map group, and performing dimension-reduction processing on the second convolution feature map group with the fifth dimension-reduction module to obtain a fifth dimension-reduced feature map group; performing convolution processing on the second downsampled feature map group with the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature map group with the third downsampling module to obtain a third downsampled feature map group, and performing dimension-reduction processing on the third convolution feature map group with the fourth dimension-reduction module to obtain a fourth dimension-reduced feature map group; performing convolution processing on the third downsampled feature map group with the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature map group with the fourth downsampling module to obtain a fourth downsampled feature map group, and performing dimension-reduction processing on the fourth convolution feature map group with the third dimension-reduction module to obtain a third dimension-reduced feature map group; performing convolution processing on the fourth downsampled feature map group with the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature map group with the fifth downsampling module to obtain a fifth downsampled feature map group, and performing dimension-reduction processing on the fifth convolution feature map group with the second dimension-reduction module to obtain a second dimension-reduced feature map group; performing convolution processing on the fifth downsampled feature map group with the sixth convolution module to obtain a sixth convolution feature map group; performing upsampling processing on the sixth convolution feature map group with the first upsampling module to obtain a first upsampled feature map group; performing dimension-reduction processing on the first upsampled feature map group with the first dimension-reduction module to obtain a first dimension-reduced feature map group; performing fusion processing on the first dimension-reduced feature map group and the second dimension-reduced feature map group to obtain a first fused feature map group; performing upsampling processing on the first fused feature map group with the second upsampling module to obtain a second upsampled feature map group; performing fusion processing on the second upsampled feature map group and the third dimension-reduced feature map group to obtain a second fused feature map group; performing upsampling processing on the second fused feature map group with the third upsampling module to obtain a third upsampled feature map group; performing fusion processing on the third upsampled feature map group and the fourth dimension-reduced feature map group to obtain a third fused feature map group; performing upsampling processing on the third fused feature map group with the fourth upsampling module to obtain a fourth upsampled feature map group; performing fusion processing on the fourth upsampled feature map group and the fifth dimension-reduced feature map group to obtain the fourth fused feature map group; performing classification processing on the fourth fused feature map group with the classifier to obtain a text classification prediction image and a link probability prediction image; and obtaining, based on the text classification prediction image and the link probability prediction image, the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels.

For example, in the text detection method provided by some embodiments of the present disclosure, each pixel in the text classification prediction image has a type probability, and each pixel in the link probability prediction image has a link probability between the pixel and its directly adjacent pixels; obtaining the text feature image and the link probability between each pixel in the text feature image and its adjacent pixels based on the text classification prediction image and the link probability prediction image includes: taking pixels in the text classification prediction image whose type probability is greater than or equal to a type probability threshold as positive pixels, and taking pixels in the text classification prediction image whose type probability is less than the type probability threshold as negative pixels, to obtain the text feature image; the link probability between each pixel in the text feature image and its directly adjacent pixels can be correspondingly looked up from the link probability prediction image.
For example, in the text detection method provided by some embodiments of the present disclosure, determining the at least one feature box corresponding to the at least one final connected domain includes: performing contour detection on the at least one final connected domain with a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain with a minimum bounding rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.

For example, the text detection method provided by some embodiments of the present disclosure further includes: determining the text box of the text to be detected from the at least one text box.

For example, in the text detection method provided by some embodiments of the present disclosure, determining the text box of the text to be detected from the at least one text box includes: constructing a virtual detection box in the text image; and calculating the overlap area between the virtual detection box and each text box, and taking the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.

For example, the text detection method provided by some embodiments of the present disclosure further includes: performing recognition processing on the text to be detected based on the text box of the text to be detected.

For example, the text detection method provided by some embodiments of the present disclosure further includes: capturing the text image with an image acquisition element of a point translation pen, where, when the text image is captured, the pen tip of the point translation pen points at the side of the text to be detected close to the first edge of the text image, and the text image includes the text to be detected.

At least one embodiment of the present disclosure further provides a text detection device, including: a memory configured to store a text image and computer-readable instructions; and a processor configured to read the text image and run the computer-readable instructions, where the computer-readable instructions, when run by the processor, execute the text detection method provided by any embodiment of the present disclosure.

For example, the text detection device provided by some embodiments of the present disclosure further includes: an image acquisition element configured to capture the text image.

For example, in the text detection device provided by some embodiments of the present disclosure, the text detection device is a point translation pen, where the image acquisition element is arranged on the point translation pen, and the point translation pen is used to select the text to be detected.

At least one embodiment of the present disclosure further provides a storage medium that non-transitorily stores computer-readable instructions, where the computer-readable instructions, when executed by a computer, can execute the text detection method provided by any embodiment of the present disclosure.
Brief Description of the Drawings

In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments are briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and are not a limitation of the present disclosure.

FIG. 1 is a schematic diagram of the working principle of a point translation pen;

FIG. 2 is an exemplary flowchart of a text detection method provided by at least one embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a text image provided by at least one embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a text detection neural network provided by at least one embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a pixel adjacency relation provided by at least one embodiment of the present disclosure;

FIG. 6 is a schematic diagram of a text feature image provided by at least one embodiment of the present disclosure;

FIG. 7 is an exemplary flowchart corresponding to step S400 shown in FIG. 2 provided by at least one embodiment of the present disclosure;

FIG. 8 is an exemplary flowchart corresponding to step S600 shown in FIG. 2 provided by at least one embodiment of the present disclosure;

FIG. 9 is a schematic diagram of the operation corresponding to step S600 shown in FIG. 2 provided by at least one embodiment of the present disclosure;

FIG. 10 is a schematic block diagram of a text detection device provided by at least one embodiment of the present disclosure; and

FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure.
Detailed Description

In order to make the objectives, technical solutions, and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments of the present disclosure are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are part of the embodiments of the present disclosure, not all of them. Based on the described embodiments of the present disclosure, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present disclosure.

Unless otherwise defined, the technical or scientific terms used in the present disclosure shall have the ordinary meanings understood by those with ordinary skill in the field to which the present disclosure belongs. The words "first", "second", and similar words used in the present disclosure do not indicate any order, quantity, or importance, but are only used to distinguish different components. Words such as "include" or "comprise" mean that the element or item preceding the word covers the elements or items listed after the word and their equivalents, without excluding other elements or items. Words such as "connected" or "coupled" are not limited to physical or mechanical connections, but may include electrical connections, whether direct or indirect. "Up", "down", "left", "right", and the like are only used to indicate relative positional relations; when the absolute position of the described object changes, the relative positional relation may change accordingly.

The present disclosure is described below through several specific embodiments. To keep the following description of the embodiments of the present disclosure clear and concise, detailed descriptions of known functions and known components may be omitted. When any component of an embodiment of the present disclosure appears in more than one drawing, the component is denoted by the same or similar reference numeral in each drawing.
Translation pens generally include scanning translation pens ("scan pens" for short) and point-reading translation pens ("point translation pens" for short). When a scan pen is used, the pen body needs to be held upright and slid over the text to be translated (i.e., scanning); this usage mode differs from ordinary pen-using habits, so users usually need an adaptation process when using a scan pen. Unlike the usage mode of a scan pen, when a point translation pen is used, the user only needs to point the pen tip right below the text to be translated, and with a light tap the corresponding recognition and translation can be performed; the usage is therefore more flexible, closer to the user's pen-using habits, and can provide a better user experience.

The working principle of current point translation pens is mainly as follows: first, the pen tip of the point translation pen is tapped below the text to be detected (for example, an English word, but not limited thereto), and a text image is captured by the camera on the pen body, for example, the text image shown in FIG. 1; then, traversal-type text detection processing is performed on every pixel position of the entire text image to obtain all the text boxes in the text image (shown as the solid-line boxes surrounding the words in FIG. 1); then the text box near the pen tip, i.e., the text box of the text to be detected (the text box surrounding the text to be detected), is found, and the text in it is recognized and translated. During text detection, the entire text image needs to be processed in a traversal manner, but most of the text boxes detected in the text image are redundant (i.e., irrelevant to the text to be detected), which limits the response speed of the point translation pen and reduces its working efficiency.

If the detection and recognition of text could be focused only on the area near the pen tip position (i.e., the lower area of the text image shown in FIG. 1), the processing speed could be greatly increased, and the response time and the occupation of computing resources could be reduced. However, since the point translation pen needs to recognize text of different font sizes, artificially limiting the area to be detected in the text image may cause the following problems: on the one hand, if the artificially limited area is too large, the beneficial effects (i.e., increasing the processing speed, reducing the response time and the occupation of computing resources, etc.) may not be obvious; on the other hand, if the artificially limited area is too small, it may not cover text in a large font, so that large-font text cannot be completely detected and recognized, which would instead limit the application range of the point translation pen.

At least one embodiment of the present disclosure provides a text detection method. The detection method includes: obtaining, based on a text image, a text feature image corresponding to the text image; taking a partial area of the text feature image close to a first edge of the text feature image as a base area, where the first edge of the text feature image corresponds to a first edge of the text image, the text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the base area are positive pixels; grouping at least part of the positive pixels in the base area to obtain at least one connected domain; expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, where the at least one text box includes the text box of the text to be detected.

Some embodiments of the present disclosure further provide a text detection device and a storage medium corresponding to the above text detection method.

The text detection method provided by the embodiments of the present disclosure can perform text detection based on a preset base area and the idea of connected domains, which can reduce the amount of calculation for text detection (i.e., reduce the number of traversals) and reduce the response time of text detection. The text detection method is applicable to point translation pens and the like, and can increase the processing speed of the point translation pen and improve the user experience.

Some embodiments of the present disclosure and examples thereof are described in detail below with reference to the drawings.

FIG. 2 is an exemplary flowchart of a text detection method provided by at least one embodiment of the present disclosure. For example, the text detection method provided by the embodiments of the present disclosure can be applied to a text image acquired by a point translation pen, but is not limited thereto. For example, as shown in FIG. 2, the text detection method includes, but is not limited to, steps S100 to S600.
Step S100: Obtain, based on a text image, a text feature image corresponding to the text image.

For example, in step S100, the text image may include an image captured by an image acquisition device or element. For example, in some embodiments, before step S100, the text detection method further includes step S000: capturing the text image.

For example, in some examples, the text image may be captured with, for example, a point translation pen. For example, the point translation pen may include an image acquisition element, such as a camera; for example, the camera may be arranged on, for example, the body of the point translation pen. Thus, the point translation pen (the camera on the point translation pen) can be used to perform step S000, i.e., to capture the text image. For example, when the image acquisition element of the point translation pen is used to capture the text image, the pen tip of the point translation pen generally points below the text to be detected, so that, relative to the text image, the pen tip points at the side of the text to be detected close to an edge of the text image. To distinguish it from the other edges of the text image, this edge is referred to as the first edge of the text image (shown as the first edge FE of the text image in FIG. 3).

For example, the text image may be a grayscale image or a color image. The shape of the text image may be a rectangle, a rhombus, a circle, or the like, which is not limited by the embodiments of the present disclosure. In the embodiments of the present disclosure, a rectangular text image is taken as an example for description, which should not be regarded as a limitation of the present disclosure.

For example, the text image may be an original image directly captured by the image acquisition device or element, or an image obtained after preprocessing the original image. For example, in order to avoid the influence of the data quality, data imbalance, and the like of the text image on character recognition, before performing text detection on the text image, the text detection method provided by the embodiments of the present disclosure may further include an operation of preprocessing the text image. Preprocessing can eliminate irrelevant or noise information in the text image, so as to better process the text image. The preprocessing may include, for example, scaling, cropping, gamma correction, image enhancement, or noise-reduction filtering of the text image.
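As an illustration of such preprocessing, the following is a minimal OpenCV sketch, not a procedure fixed by the disclosure: it resizes the captured image, applies noise-reduction filtering, and performs gamma correction. The target size, kernel size, and gamma value are illustrative assumptions.

    import cv2
    import numpy as np

    def preprocess(img, target=(512, 512)):
        """Minimal preprocessing sketch for a captured text image.
        target is (width, height), as expected by cv2.resize."""
        img = cv2.resize(img, target, interpolation=cv2.INTER_LINEAR)
        img = cv2.GaussianBlur(img, (3, 3), 0)   # noise-reduction filtering
        gamma = 1.2                              # example gamma-correction value
        lut = np.array([((i / 255.0) ** (1.0 / gamma)) * 255
                        for i in range(256)], dtype=np.uint8)
        return cv2.LUT(img, lut)                 # assumes an 8-bit input image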
For example, the text image includes at least one text, and the at least one text includes the text to be detected. For example, the text to be detected is generally close to the first edge (for example, the lower edge) of the text image. It should be noted that the text to be detected is the text the user wants to detect. A text image refers to a form in which text is presented visually, for example, a picture or a video including text.

For example, the text to be detected may include: a word in one of English, French, German, Spanish, and other languages, or a character or word in one of Chinese, Japanese, Korean, and other languages; but is not limited thereto.

FIG. 3 is a schematic diagram of a text image provided by at least one embodiment of the present disclosure. For example, as shown in FIG. 3, the text image includes a plurality of texts; for example, one text may be an English word (for example, "Tecent", "the", etc. in FIG. 3), one digit or a string of digits (for example, "61622214" in FIG. 3), or the like, but is not limited thereto. For example, in the text image shown in FIG. 3, the text to be detected may be "Tecent"; for example, in some examples, when the point translation pen is used to select "Tecent" as the text to be detected, the pen tip points below "Tecent" (close to the first edge FE), and the camera arranged on the body of the point translation pen captures the text image shown in FIG. 3.
For example, in some embodiments, in step S100, a text detection neural network may be used to process the text image to obtain the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels.

FIG. 4 is a schematic diagram of a text detection neural network provided by at least one embodiment of the present disclosure. For example, as shown in FIG. 4, the text detection neural network includes first to sixth convolution modules, first to fifth downsampling modules, first to fourth upsampling modules, first to fifth dimension-reduction modules, and a classifier.

For example, each of the first to sixth convolution modules may include a convolution layer. The convolution layer is the core layer of a convolutional neural network. The convolution layer may apply several convolution kernels (also called filters) to an input image to extract multiple types of features of the input image. Each convolution kernel can extract one type of feature. A convolution kernel is generally initialized in the form of a matrix of random decimals, and during the training of the convolutional neural network the kernel learns reasonable weights. The result obtained by applying one convolution kernel to the input image is called a feature map, and the number of feature maps is equal to the number of convolution kernels. For example, in the embodiments of the present disclosure, as shown in FIG. 4, the text image serves as the input image. It should be noted that the embodiments of the present disclosure do not limit the number of convolution layers included in the first to sixth convolution modules.

For example, in some embodiments, each of the above convolution modules may further include an activation layer. The activation layer includes an activation function, which is used to introduce nonlinear factors into the convolutional neural network so that the network can better solve relatively complex problems. The activation function may include a rectified linear unit (ReLU) function, a leaky rectified linear unit (LeakyReLU) function, a sigmoid function, a hyperbolic tangent function (tanh), or the like. The ReLU and LeakyReLU functions are unsaturated nonlinear functions, and the sigmoid and tanh functions are saturated nonlinear functions.

For example, in some embodiments, each of the above convolution modules may further include, for example, a batch normalization (BN) layer. For example, the batch normalization layer performs batch normalization on the feature maps of a mini-batch of samples (i.e., input images) so that the gray values of the pixels of each feature map vary within a predetermined range, thereby reducing the calculation difficulty and improving the contrast. For example, the predetermined range may be [-1, 1], but is not limited thereto. For example, the batch normalization layer may perform batch normalization on each feature map according to the mean and variance of the feature maps of each mini-batch of samples.

For example, each of the first to fifth downsampling modules may include a downsampling layer. On the one hand, the downsampling layer can be used to reduce the scale of the input image, simplify the computational complexity, and reduce over-fitting to a certain extent; on the other hand, the downsampling layer can also perform feature compression to extract the main features of the input image. The downsampling layer can reduce the size of the feature maps without changing their number. For example, if an input image of size 12×12 is sampled through a 2×2 downsampling filter, a 6×6 feature map is obtained, which means that 4 pixels of the input image are merged into 1 pixel of the feature map.

For example, the downsampling layer may perform downsampling with methods such as max pooling, average pooling, strided convolution, decimation (for example, selecting fixed pixels), or demuxout (splitting the input image into multiple smaller images). For example, in some embodiments, the downsampling factor of the downsampling layers in the first to fifth downsampling modules is 1/(2×2); the present disclosure includes but is not limited thereto.

For example, each of the first to fourth upsampling modules may include an upsampling layer. For example, the upsampling layer may perform upsampling with methods such as strided transposed convolution or interpolation algorithms. The interpolation algorithms may include, for example, interpolation, bilinear interpolation, bicubic interpolation, and the like. Upsampling is used to increase the size of the feature maps and thus the amount of feature data. For example, in some embodiments, the upsampling factor of the upsampling layers in the first to fourth upsampling modules is 2×2; the present disclosure includes but is not limited thereto.

For example, each of the first to fifth dimension-reduction modules may include a convolution layer using 1×1 convolution kernels. For example, each of the above dimension-reduction modules may use 1×1 convolution kernels to reduce the dimensionality of the data and the number of feature maps, thereby reducing the number of parameters in subsequent processing, lowering the amount of calculation, and increasing the processing speed. For example, in some embodiments, each of the first to fifth dimension-reduction modules may include ten 1×1 convolution kernels, so that each dimension-reduction module can correspondingly output 10 feature maps.

For example, the classifier may include two softmax classifiers, namely a first softmax classifier and a second softmax classifier. The first softmax classifier performs text classification prediction on whether each pixel is a text pixel (i.e., a positive pixel) or a non-text pixel (i.e., a negative pixel), and the second softmax classifier performs link classification prediction on whether there is a link relation between each pixel and its four directly adjacent pixels. It should be noted that in the present disclosure, any other feasible method may also be used for text classification prediction and link classification prediction, including but not limited to the above first and second softmax classifiers.

It should be noted that in the present disclosure, layers such as the convolution layer, the downsampling layer, and the upsampling layer each refer to the corresponding processing operation, i.e., convolution processing, downsampling processing, upsampling processing, etc.; repeated description is omitted below.

For example, processing the text image with the text detection neural network to obtain the corresponding text feature image includes: performing convolution processing on the text image with the first convolution module to obtain a first convolution feature map group; performing downsampling processing on the first convolution feature map group with the first downsampling module to obtain a first downsampled feature map group; performing convolution processing on the first downsampled feature map group with the second convolution module to obtain a second convolution feature map group; performing downsampling processing on the second convolution feature map group with the second downsampling module to obtain a second downsampled feature map group, and performing dimension-reduction processing on the second convolution feature map group with the fifth dimension-reduction module to obtain a fifth dimension-reduced feature map group; performing convolution processing on the second downsampled feature map group with the third convolution module to obtain a third convolution feature map group; performing downsampling processing on the third convolution feature map group with the third downsampling module to obtain a third downsampled feature map group, and performing dimension-reduction processing on the third convolution feature map group with the fourth dimension-reduction module to obtain a fourth dimension-reduced feature map group; performing convolution processing on the third downsampled feature map group with the fourth convolution module to obtain a fourth convolution feature map group; performing downsampling processing on the fourth convolution feature map group with the fourth downsampling module to obtain a fourth downsampled feature map group, and performing dimension-reduction processing on the fourth convolution feature map group with the third dimension-reduction module to obtain a third dimension-reduced feature map group; performing convolution processing on the fourth downsampled feature map group with the fifth convolution module to obtain a fifth convolution feature map group; performing downsampling processing on the fifth convolution feature map group with the fifth downsampling module to obtain a fifth downsampled feature map group, and performing dimension-reduction processing on the fifth convolution feature map group with the second dimension-reduction module to obtain a second dimension-reduced feature map group; performing convolution processing on the fifth downsampled feature map group with the sixth convolution module to obtain a sixth convolution feature map group; performing upsampling processing on the sixth convolution feature map group with the first upsampling module to obtain a first upsampled feature map group; performing dimension-reduction processing on the first upsampled feature map group with the first dimension-reduction module to obtain a first dimension-reduced feature map group; performing fusion processing on the first dimension-reduced feature map group and the second dimension-reduced feature map group to obtain a first fused feature map group; performing upsampling processing on the first fused feature map group with the second upsampling module to obtain a second upsampled feature map group; performing fusion processing on the second upsampled feature map group and the third dimension-reduced feature map group to obtain a second fused feature map group; performing upsampling processing on the second fused feature map group with the third upsampling module to obtain a third upsampled feature map group; performing fusion processing on the third upsampled feature map group and the fourth dimension-reduced feature map group to obtain a third fused feature map group; performing upsampling processing on the third fused feature map group with the fourth upsampling module to obtain a fourth upsampled feature map group; performing fusion processing on the fourth upsampled feature map group and the fifth dimension-reduced feature map group to obtain a fourth fused feature map group; performing classification processing on the fourth fused feature map group with the classifier to obtain a text classification prediction image and a link probability prediction image; and obtaining, based on the text classification prediction image and the link probability prediction image, the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels.
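The following PyTorch sketch illustrates the topology just described, under stated assumptions: the channel widths of the convolution modules are illustrative (the disclosure does not fix them), each convolution module is reduced to a single convolution + batch normalization + ReLU layer, downsampling uses max pooling and upsampling uses bilinear interpolation (two of the options listed above), and the classifier is folded into a single 1×1 convolution producing the 10 prediction maps (2 text classification maps plus 8 link maps, read as four (link, no-link) softmax pairs).

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def conv_block(cin, cout):
        # One convolution module: convolution + batch normalization + ReLU.
        return nn.Sequential(
            nn.Conv2d(cin, cout, kernel_size=3, padding=1),
            nn.BatchNorm2d(cout),
            nn.ReLU(inplace=True),
        )

    class TextDetectionNet(nn.Module):
        def __init__(self, widths=(16, 32, 64, 128, 256, 256)):
            super().__init__()
            ws = widths                       # illustrative channel widths
            self.conv1 = conv_block(3, ws[0])
            self.conv2 = conv_block(ws[0], ws[1])
            self.conv3 = conv_block(ws[1], ws[2])
            self.conv4 = conv_block(ws[2], ws[3])
            self.conv5 = conv_block(ws[3], ws[4])
            self.conv6 = conv_block(ws[4], ws[5])
            # 1x1 dimension-reduction modules, ten kernels (10 output maps) each.
            self.dim1 = nn.Conv2d(ws[5], 10, 1)   # on the first upsampled group
            self.dim2 = nn.Conv2d(ws[4], 10, 1)   # on the fifth conv group
            self.dim3 = nn.Conv2d(ws[3], 10, 1)   # on the fourth conv group
            self.dim4 = nn.Conv2d(ws[2], 10, 1)   # on the third conv group
            self.dim5 = nn.Conv2d(ws[1], 10, 1)   # on the second conv group
            self.classifier = nn.Conv2d(10, 10, 1)  # 2 text maps + 8 link maps

        def forward(self, x):
            up = lambda t: F.interpolate(t, scale_factor=2, mode="bilinear",
                                         align_corners=False)
            c1 = self.conv1(x)
            c2 = self.conv2(F.max_pool2d(c1, 2))   # 1/2 resolution
            c3 = self.conv3(F.max_pool2d(c2, 2))   # 1/4
            c4 = self.conv4(F.max_pool2d(c3, 2))   # 1/8
            c5 = self.conv5(F.max_pool2d(c4, 2))   # 1/16
            c6 = self.conv6(F.max_pool2d(c5, 2))   # 1/32
            f = self.dim1(up(c6)) + self.dim2(c5)  # first fusion (element-wise ADD)
            f = up(f) + self.dim3(c4)              # second fusion
            f = up(f) + self.dim4(c3)              # third fusion
            f = up(f) + self.dim5(c2)              # fourth fusion, 1/2 resolution
            out = self.classifier(f)
            text_prob = F.softmax(out[:, :2], dim=1)     # text / non-text maps
            b, _, h, w = out.shape
            link = out[:, 2:].reshape(b, 4, 2, h, w)     # 4 (link, no-link) pairs
            link_prob = F.softmax(link, dim=2)[:, :, 0]  # P(link) per direction
            return text_prob, link_prob

Note that the fourth fused feature map group, and hence the prediction images, sit at 1/2 the input resolution per side in this sketch, which matches the 1/(2×2) feature-to-text-image size ratio used as an example later in the mapping step.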
For example, in the embodiments of the present disclosure, each feature map group generally includes a plurality of feature maps.

For example, in the embodiments of the present disclosure, as shown in FIG. 4, the fusion processing may include element-wise addition processing ADD. For example, element-wise addition ADD generally means adding the values of each row and each column of the image matrix of each channel of one group of input images to the values of the corresponding rows and columns of the image matrix of the corresponding channel of another group of input images. For example, the two groups of images input to the element-wise addition ADD have the same number of channels, and the output of the ADD also has the same number of channels as either input group. Therefore, "fusion processing" means adding the value of each pixel of each feature map in one feature map group to the value of the corresponding pixel of the corresponding feature map in another feature map group to obtain a new feature map. "Fusion processing" does not change the number or size of the feature maps.

For example, in some embodiments, the text classification prediction image includes 2 feature maps, and the link probability prediction image includes 8 feature maps. It should be noted that the value of each pixel in each feature map of the text classification prediction image and the link probability prediction image is greater than or equal to 0 and less than or equal to 1, and represents a text prediction probability or a link prediction probability. The feature maps in the text classification prediction image are probability maps of whether each pixel is text, and the feature maps in the link probability prediction image are probability maps of whether each pixel is linked to its directly adjacent pixels.

For example, the 2 feature maps in the text classification prediction image include a text probability map and a non-text probability map. The text probability map represents the prediction probability that each pixel belongs to text (i.e., the type probability of each pixel), and the non-text probability map represents the prediction probability that each pixel belongs to non-text; the values of corresponding pixels of the 2 feature maps add up to 1. For example, in some embodiments, a type probability threshold may be set, for example, 0.75; if the prediction probability of a pixel belonging to text is greater than or equal to the type probability threshold, the pixel belongs to text, i.e., the pixel is a positive pixel; if the prediction probability of a pixel belonging to text is less than the type probability threshold, the pixel belongs to non-text, i.e., the pixel is a negative pixel.

FIG. 5 is a schematic diagram of a pixel adjacency relation provided by at least one embodiment of the present disclosure. For example, in some embodiments, as shown in FIG. 5, in the direction R1, the pixels PX3 and PX4 are directly adjacent to the pixel PX0, and in the direction C1, the pixels PX1 and PX2 are directly adjacent to the pixel PX0; that is, the pixels PX1 to PX4 are the four directly adjacent pixels of the pixel PX0, located above, below, to the left of, and to the right of the pixel PX0, respectively. For example, in some embodiments, the pixel array in each feature map is arranged in rows and columns. For example, the direction C1 may represent the first direction perpendicular to the first edge (including the first edge of the text image and the first edge of the text feature image), such as the column direction; the direction R1 may represent the second direction parallel to the first edge (including the first edge of the text image and the first edge of the text feature image), such as the row direction.

For example, the 8 feature maps in the link probability prediction image may include first to eighth link classification maps. For example, as shown in FIG. 5, for the pixel PX0, the value of PX0 in the first link classification map represents the link prediction probability in the direction from PX0 to PX1, and the value of PX0 in the second link classification map represents the no-link prediction probability in that direction; the value of PX0 in the third link classification map represents the link prediction probability in the direction from PX0 to PX2, and the value in the fourth represents the no-link prediction probability in that direction; the value in the fifth represents the link prediction probability in the direction from PX0 to PX3, and the value in the sixth represents the no-link prediction probability in that direction; the value in the seventh represents the link prediction probability in the direction from PX0 to PX4, and the value in the eighth represents the no-link prediction probability in that direction. It should be understood that the values of corresponding pixels of the first and second link classification maps add up to 1, as do those of the third and fourth, the fifth and sixth, and the seventh and eighth link classification maps.

For example, in some embodiments, a link probability threshold may be set, for example, 0.7; when the link prediction probability of two directly adjacent pixels is greater than or equal to the link probability threshold, the two adjacent pixels can be linked to each other; when the link prediction probability of two directly adjacent pixels is less than the link probability threshold, the two directly adjacent pixels cannot be linked to each other.

It should be noted that the above type probability threshold and link probability threshold are merely illustrative, and can be set according to actual application requirements.

For example, in some embodiments, the text feature image is a binary image, but is not limited thereto. For example, in some embodiments, obtaining the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels based on the text classification prediction image and the link probability prediction image may include: binarizing each pixel of the text probability map in the text classification prediction image according to the comparison between its pixel value (the prediction probability of belonging to text, i.e., the type probability) and the type probability threshold, to obtain the text feature image; the link probability between each pixel in the text feature image and its directly adjacent pixels can be correspondingly looked up from the link probability prediction image. For example, in the text probability map, if the prediction probability of a pixel belonging to text is greater than or equal to the type probability threshold, the pixel is taken as a positive pixel; that is, the text prediction probability of a positive pixel is greater than or equal to the type probability threshold; if the prediction probability of a pixel belonging to text is less than the type probability threshold, the pixel is taken as a negative pixel; that is, the text prediction probability of a negative pixel is less than the type probability threshold. In this way, a text feature image including positive and negative pixels can be obtained.

FIG. 6 is a schematic diagram of a text feature image provided by at least one embodiment of the present disclosure. As shown in FIG. 6, the text feature image includes positive pixels (shown as the gray squares in FIG. 6) and negative pixels (shown as the white squares in FIG. 6).

It should be understood that the size of the text feature image is the same as the size of each feature map in the text classification prediction image and the link probability prediction image.
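In code, the thresholding just described reduces to elementwise comparisons. The sketch below assumes the network outputs have already been arranged as a (2, h, w) text probability array and a (4, h, w) per-direction P(link) array (an assumed layout, batch dimension removed), stored as NumPy arrays, and uses the example thresholds of 0.75 and 0.7 given above.

    TYPE_THRESH = 0.75   # example type probability threshold
    LINK_THRESH = 0.7    # example link probability threshold

    def binarize(text_prob, link_prob):
        """text_prob: (2, h, w) text / non-text probability maps;
        link_prob: (4, h, w) link probability maps, one per direction."""
        positive = text_prob[0] >= TYPE_THRESH   # positive (text) pixels
        links = link_prob >= LINK_THRESH         # positive link relations
        return positive, links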
It should be noted that the text detection neural network shown in FIG. 4 is illustrative. In practical applications, neural networks with other structural forms may also be used to perform the operation of step S100; of course, the text detection neural network shown in FIG. 4 may also be partially modified to obtain a new text detection neural network that can likewise perform the operation of step S100. For example, in some examples, the fourth upsampling module, the fifth dimension-reduction module, and the corresponding fusion processing in the text detection neural network shown in FIG. 4 may be omitted, and the classifier may be used to perform classification processing on the third fused feature map group to obtain the text classification prediction image and the link probability prediction image. It should be noted that the embodiments of the present disclosure do not limit this.

It should be understood that in the text detection methods provided by some examples, it may also be set that each pixel in the text feature image is directly adjacent to the 8 pixels above, below, to the left, to the right, and at the upper left, lower left, upper right, and lower right; in this case, the link probability prediction image may correspondingly include 16 feature maps. The embodiments of the present disclosure include but are not limited thereto. For example, compared with a text detection method in which each pixel has 8 directly adjacent pixels, a text detection method in which each pixel has 4 directly adjacent pixels can reduce the amount of calculation and increase the processing speed, and can also alleviate the possible problem of text adhesion in the subsequently obtained text boxes.

Step S200: Take a partial area of the text feature image close to the first edge of the text feature image as a base area, where at least part of the pixels in the base area are positive pixels.

For example, the first edge of the text feature image corresponds to the first edge of the text image, and the text to be detected in the text image is close to the first edge of the text image (refer to the related description of FIG. 3).

For example, in some embodiments, as shown in FIG. 6, the lower partial area of the text feature image (i.e., the partial area close to the first edge of the text feature image, shown as the dashed box in FIG. 6) may be taken as the base area, and at least part of the pixels in the base area are positive pixels (shown as the gray squares in the dashed box in FIG. 6).

For example, in some embodiments, assuming that the size of the text feature image is h*w (i.e., it includes h rows and w columns of pixels), the size of the base area may be set to h_base*w (i.e., it includes h_base rows and w columns of pixels), where h, w, and h_base are all positive integers, and h_base/h ≤ 1. For example, in some examples, h_base/h ≤ 1/2; for example, in some examples, the value range of h_base/h is, for example, 1/10 to 1/2, for example 1/5 to 2/5, for example 1/4 to 1/3, etc. For example, the value of h_base/h can be set according to actual application requirements, for example, according to the range of font sizes to be recognized and the size of the coverage of the text image. It should be noted that if the value of h_base/h is too small, the base area may contain no positive pixels, so that the text detection method provided by the embodiments of the present disclosure cannot be effectively implemented; if the value of h_base/h is too large, the reduction in the amount of calculation for text detection may not be obvious, which weakens the beneficial effects of the embodiments of the present disclosure; therefore, the value of h_base/h should be set reasonably according to actual application requirements.
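Selecting the base area is then just a row slice of the feature image; in the sketch below the ratio h_base/h = 1/4 is one illustrative choice within the range discussed above, and the first edge is taken to be the bottom edge (the last rows of a NumPy array).

    def base_area(feature_img, ratio=4):
        """Return the h_base rows of an (h, w) array nearest the first
        (bottom) edge; ratio=4 gives h_base = h // 4, so h_base/h <= 1/2."""
        h = feature_img.shape[0]
        h_base = h // ratio
        return feature_img[h - h_base:, :]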
For example, since the length of the text to be detected may not be fixed (for example, English words usually vary in length), in the embodiments of the present disclosure, the width of the base area may be set to be the same as the width of the text feature image, i.e., both are w.

Step S300: Group at least part of the positive pixels in the base area to obtain at least one connected domain.

For example, in step S300, based on a union-find algorithm, at least part of the positive pixels in the base area may be grouped according to the link probabilities between each positive pixel in the base area and its directly adjacent pixels, to obtain at least one connected domain (connected components).

For example, in some embodiments, the union-find algorithm may include: first, constructing an index set based on at least part of the positive pixels in the base area, for example, the index set includes at least part of the positive pixels in the base area, and in the index set, each positive pixel corresponds to a root node whose initial value is the pixel itself; then, in response to any directly adjacent pixel of each positive pixel in the index set being a positive pixel and there being a positive link relation between the positive pixel and the directly adjacent pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of the positive pixel; finally, taking each group of positive pixels having the same root-node value as one connected domain, to obtain at least one connected domain. It should be noted that the above specific procedure of the union-find algorithm is illustrative, and the embodiments of the present disclosure do not limit it. For example, in some examples, the at least part of the positive pixels in the base area used to construct the index set include all the positive pixels in the base area; for example, in other examples, they exclude the positive pixels in, for example, the row or rows (which can be set according to actual requirements) closest to the first edge of the text feature image, thereby reducing the amount of calculation and increasing the processing speed. The embodiments of the present disclosure do not limit this.

For example, the directly adjacent pixels of each positive pixel include pixels directly adjacent to the positive pixel in the first direction perpendicular to the first edge of the text feature image and pixels directly adjacent to the positive pixel in the second direction parallel to the first edge of the text feature image. For example, each positive pixel has four directly adjacent pixels.

For example, in the embodiments of the present disclosure, when the link probability between two directly adjacent pixels is greater than the link probability threshold, there is a positive link relation between them.

Exemplarily, in the text feature image shown in FIG. 6, grouping all the positive pixels in the base area yields four connected domains.

For example, in some embodiments, in order to prevent the influence of noise, denoising processing may be performed on the above at least one connected domain. For example, in some examples, connected domains whose area is smaller than T1 pixels, or whose width (or height) is smaller than the width (or height) of T2 pixels, may be removed from the at least one connected domain, and the one or more connected domains remaining after the denoising processing are used to determine the final connected domain corresponding to the text to be detected (refer to the related description in step S400 below). For example, in some examples, T1 may be, for example, 100 to 300, for example 200, but is not limited thereto; for example, in some examples, T2 may be, for example, 5 to 15, for example 10, but is not limited thereto. It should be understood that the values of T1 and T2 can be set according to actual application requirements.
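A minimal union-find sketch of the grouping in step S300 follows. It assumes the binary NumPy arrays from the earlier sketches and a (4, h, w) link array in an assumed channel order of up, down, left, right; it groups all positive pixels in the base area (the simplest of the variants above), and the area-based denoising (threshold T1) can be applied to the returned domains afterwards.

    def group_base_area(positive, links, h_base):
        """Union-find grouping of positive pixels in the base area.
        positive: (h, w) bool; links: (4, h, w) bool (up, down, left, right)."""
        h, w = positive.shape
        parent = {}

        def find(p):
            while parent[p] != p:
                parent[p] = parent[parent[p]]   # path halving
                p = parent[p]
            return p

        def union(p, q):
            parent[find(p)] = find(q)

        base_rows = range(h - h_base, h)
        # Index set: each base-area positive pixel is its own root initially.
        for y in base_rows:
            for x in range(w):
                if positive[y, x]:
                    parent[(y, x)] = (y, x)
        deltas = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
        for y in base_rows:
            for x in range(w):
                if not positive[y, x]:
                    continue
                for d, (dy, dx) in enumerate(deltas):
                    ny, nx = y + dy, x + dx
                    # Neighbor must be a base-area positive pixel with a
                    # positive link relation to the current pixel.
                    if (ny, nx) in parent and links[d, y, x]:
                        union((y, x), (ny, nx))
        # Pixels sharing a root form one connected domain.
        domains = {}
        for p in parent:
            domains.setdefault(find(p), []).append(p)
        return list(domains.values())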
Step S400: Expand the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain.

For example, in step S400, the at least one final connected domain includes one final connected domain corresponding to the text to be detected.

FIG. 7 is an exemplary flowchart corresponding to step S400 shown in FIG. 2 provided by at least one embodiment of the present disclosure. Hereinafter, step S400 shown in FIG. 7 is described in detail with reference to the text feature image shown in FIG. 6.

For example, as shown in FIG. 7, expanding the at least one connected domain in the direction away from the first edge of the text feature image to obtain the at least one final connected domain corresponding to the at least one connected domain, i.e., step S400, includes steps S410 to S450.

Step S410: Extract, as first positive pixels, the positive pixels in the current connected domain that are farthest from the first edge of the text feature image in the first direction perpendicular to the first edge of the text feature image.

For example, in step S410, the current connected domain is the at least one connected domain in the base area. For example, as shown in FIG. 6, the positive pixels in the current connected domains that are farthest from the first edge of the text feature image in the first direction (i.e., the bottom-up column direction) perpendicular to the first edge (i.e., the lower edge of the text feature image shown in FIG. 6) include pixel points 1-5, so pixel points 1-5 are all taken as first positive pixels. For example, as shown in FIG. 6, the first positive pixels (i.e., pixel points 1-5) are located in the same row. For example, as shown in FIG. 6, pixel points 1-2 belong to the same connected domain, so pixel points 1-2 have the same root node; pixel points 3-5 belong to the same connected domain, so pixel points 3-5 have the same root node (different from the root node of pixel points 1-2).

Step S420: Take, as first neighboring pixels, the pixels in the text feature image that are on the side of the first positive pixels away from the first edge of the text feature image and directly adjacent to the first positive pixels.

For example, as shown in FIG. 6, the five pixel points in the row above pixel points 1-5 that are directly adjacent to pixel points 1-5 respectively are taken as first neighboring pixels. For example, as shown in FIG. 6, the first neighboring pixels include pixel points 6-8 and others; pixel point 6 is directly adjacent to pixel point 1, pixel point 7 is directly adjacent to pixel point 2, and pixel point 8 is directly adjacent to pixel point 4; the first neighboring pixels of pixel points 3 and 5 are not given reference numerals.

Step S430: In response to a first neighboring pixel being a positive pixel and there being a positive link relation between the first positive pixel and the first neighboring pixel, modify the value of the root node of the first neighboring pixel to the value of the root node of the first positive pixel, and add the first neighboring pixel to a first neighboring pixel set.

For example, in some embodiments, when the link probability between a first positive pixel and a first neighboring pixel is greater than the link probability threshold, there is a positive link relation between them.

For example, in some embodiments, the first neighboring pixel set has a form similar to the aforementioned index set; that is, each pixel in the first neighboring pixel set also has a corresponding root node. For example, in some examples, as shown in FIG. 6, pixel point 6 is a positive pixel and has a positive link relation with pixel point 1, so pixel point 6 can be added to the first neighboring pixel set, and the value of the root node of pixel point 6 is the same as that of pixel point 1. Similarly, pixel point 7 can also be added to the first neighboring pixel set, and the value of its root node is the same as that of pixel point 2, i.e., the same as that of pixel points 1 and 6; pixel point 8 can also be added to the first neighboring pixel set, and the value of its root node is the same as that of pixel point 3.

Step S440: Expand the first neighboring pixel set in the second direction parallel to the first edge of the text feature image.

For example, in some embodiments, step S440 may include: adding, to the first neighboring pixel set, positive pixels that are directly adjacent to any pixel in the first neighboring pixel set in the second direction parallel to the first edge of the text feature image and have a positive link relation with that pixel, until the first neighboring pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.

For example, in some embodiments, the criterion for the positive link relation in step S440 is the same as that in the aforementioned step S430.

For example, in some examples, as shown in FIG. 6, pixel point 9 is a positive pixel and has a positive link relation with pixel point 6, so pixel point 9 can be added to the first neighboring pixel set, and the value of the root node of pixel point 9 is the same as that of pixel point 6; further, pixel point 10 is a positive pixel and has a positive link relation with pixel point 9, so pixel point 10 can also be added to the first neighboring pixel set, and the value of its root node is the same as that of pixel point 9. For example, as shown in FIG. 6, the first neighboring pixel set includes only pixel points 6-8 before the expansion, and includes pixel points 6-11 after the expansion, where pixel points 6-7 and 9-11 have the same root node.

Step S450: Expand the current connected domain to include all the pixels in the first neighboring pixel set, and continue to expand the current connected domain in the direction away from the first edge of the text feature image until no further expansion is possible.

For example, as shown in FIG. 6, after the first expansion, the connected domain in the base area including pixel points 1-2 (the first connected domain) further includes pixel points 6-11, and the connected domain including pixel points 3-5 (the second connected domain) further includes pixel point 8.

For example, based on the connected domains after the first expansion, the operations of steps S410 to S450 may be repeated to complete the second expansion of the connected domains. For example, in the second expansion, the pixels in the first neighboring pixel set obtained in the first expansion (i.e., pixel points 6-11) may be taken as the first positive pixels. For example, as shown in FIG. 6, after the second expansion, the first connected domain further includes pixel points 12-14, and the second connected domain further includes pixel points 15-16.

By analogy, after multiple expansions, as shown in FIG. 6, the first connected domain further includes pixel points 6-14, 17, and 19-20 outside the base area, and the second connected domain further includes pixel points 8, 15-16, 18, and 21 outside the base area. Thus, two final connected domains can be obtained respectively.

It should be noted that in the embodiments of the present disclosure, the expansion of the connected domains in the text feature image shown in FIG. 6 is exemplary rather than restrictive. For example, in some embodiments, the number of connected domains in the base area that can be expanded outward (beyond the base area) may be one or more, and is not limited to the two shown in FIG. 6. For example, in some embodiments, two or more connected domains in the base area may jointly form one final connected domain after being expanded outward, and it is not limited to the case where each connected domain corresponds to one final connected domain. For example, in some embodiments, the base area further includes connected domains whose area does not change after the processing of step S400, for example, connected domains that cannot be expanded outward (i.e., cannot be expanded in the direction away from the first edge of the text feature image); such connected domains are also taken as final connected domains after the processing of step S400.
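The following sketch condenses steps S410 to S450 for a single connected domain, under the same assumed data layout as in the earlier sketches; it treats the domain as a set of (row, column) coordinates and, for simplicity, does not handle the case where several domains merge into one final connected domain.

    def expand(domain, positive, links):
        """Expand one connected domain away from the first (bottom) edge.
        domain: iterable of (y, x); positive: (h, w) bool;
        links: (4, h, w) bool, channel order up, down, left, right (assumed)."""
        domain = set(domain)
        while True:
            top = min(y for y, x in domain)          # row farthest from the edge
            frontier = {(y, x) for y, x in domain if y == top}
            grown = set()
            for y, x in frontier:                    # S410-S430
                ny = y - 1                           # pixel away from the edge
                if ny >= 0 and positive[ny, x] and links[0, y, x]:  # "up" link
                    grown.add((ny, x))
            changed = True                           # S440: lateral expansion
            while changed:
                changed = False
                for y, x in list(grown):
                    for dx, d in ((-1, 2), (1, 3)):  # 2: "left", 3: "right"
                        nx = x + dx
                        if (0 <= nx < positive.shape[1] and positive[y, nx]
                                and links[d, y, x] and (y, nx) not in grown):
                            grown.add((y, nx))
                            changed = True
            if not grown:                            # S450: stop when stuck
                return domain
            domain |= grown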
Step S500: Determine at least one feature box corresponding to the at least one final connected domain, and map the at least one feature box into the text image to obtain at least one text box, where the at least one text box includes the text box of the text to be detected.

For example, in some embodiments, determining the at least one feature box corresponding to the at least one final connected domain may include: performing contour detection on the at least one final connected domain with a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain with a minimum bounding rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain. For example, the contour detection algorithm may include, but is not limited to, OpenCV's contour detection (findContours) function; for example, the minimum bounding rectangle algorithm may include, but is not limited to, OpenCV's minimum bounding rectangle (minAreaRect) function.

For example, in the embodiments of the present disclosure, the feature box may be a rectangular box, and correspondingly, the text box may also be a rectangular box. It should be noted that the embodiments of the present disclosure include but are not limited thereto.

For example, in some embodiments, as shown in FIG. 3, after the at least one feature box in the text feature image is mapped into the text image, at least one text box can be obtained (shown as the solid-line boxes in FIG. 3). For example, the mapping includes two processes: scale transformation and projection. For example, taking the case where the size of the text feature image is 1/(2×2) of the size of the text image as an example, in the scale transformation process, the width and height of the feature box are each doubled; in the projection process, the relative position of the text box with respect to the text image is kept consistent with the relative position of the feature box with respect to the text feature image, so that the corresponding text box can be obtained. For example, as shown in FIG. 3, each text box includes one text.
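As a sketch of step S500 using the OpenCV functions named above, the following rasterizes one final connected domain into a mask, extracts its contour with findContours, fits a minimum bounding rectangle with minAreaRect, and maps the four corners back into the text image by the scale transformation (a factor of 2 per side in the 1/(2×2) example).

    import cv2
    import numpy as np

    def domain_to_text_box(domain, feat_shape, scale=2):
        """domain: iterable of (y, x) in feature-image coordinates;
        feat_shape: (h, w) of the text feature image;
        scale: text-image / feature-image size ratio per side."""
        mask = np.zeros(feat_shape, dtype=np.uint8)
        for y, x in domain:
            mask[y, x] = 255
        contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                       cv2.CHAIN_APPROX_SIMPLE)
        rect = cv2.minAreaRect(max(contours, key=cv2.contourArea))
        corners = cv2.boxPoints(rect)    # 4 corners in feature-image coords
        return corners * scale           # scale transformation into the text image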
For example, as shown in FIG. 3, in the text detection method provided by the embodiments of the present disclosure, only a partial area near the text to be detected (shown as "Tecent" in FIG. 3) in the text image needs to be detected, so that only the text boxes of part of the text in the text image (including the text box of the text to be detected) are obtained. In contrast, the common text detection method corresponding to the text image shown in FIG. 1 requires traversal detection of the entire area of the text image to obtain the text boxes of all the text in the text image. Therefore, the text detection method provided by the embodiments of the present disclosure can reduce the amount of calculation for text detection (i.e., reduce the number of traversals) and reduce the response time of text detection.

Step S600: Determine the text box of the text to be detected from the at least one text box.

For example, in some embodiments, the text image is captured by the camera arranged on the body of the point translation pen, and the text to be detected is selected by the pen tip of the point translation pen. Since the relative position of the pen tip and the camera is fixed, the relative position of the pen tip (assuming a virtual pen tip on the plane where the text image is located) and the text image captured by the camera is also fixed. Therefore, step S600 can be implemented based on this principle.

FIG. 8 is an exemplary flowchart corresponding to step S600 shown in FIG. 2 provided by at least one embodiment of the present disclosure, and FIG. 9 is a schematic diagram of the operation corresponding to step S600 shown in FIG. 2 provided by at least one embodiment of the present disclosure. Hereinafter, step S600 shown in FIG. 8 is described in detail with reference to FIG. 9.

For example, as shown in FIG. 8, determining the text box of the text to be detected from the at least one text box, i.e., step S600, includes steps S610 and S620.

Step S610: Construct a virtual detection box in the text image.

Step S620: Calculate the overlap area between the virtual detection box and each text box, and take the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.

For example, in some embodiments, as shown in FIG. 9, a virtual pen tip of the point translation pen may first be constructed for the text image (the text image is shown as the gray solid-line box in FIG. 9). For example, in some examples, the virtual pen tip (shown as the black dot in FIG. 9) may be set on the first edge of the text image, but is not limited thereto; for example, in other examples, the virtual pen tip may be set outside the text image and close to the first edge. For example, as shown in FIG. 9, the virtual pen tip may generally be set on the perpendicular bisector of the first edge of the text image, or near the perpendicular bisector of the first edge of the text image; the embodiments of the present disclosure do not limit this. It should be understood that the virtual pen tip can be set according to actual application requirements, which is not limited by the embodiments of the present disclosure.

Then, taking the virtual pen tip as the midpoint of the bottom edge of the virtual detection box, a virtual detection box with a height H and a width W is constructed (shown as the dashed box in FIG. 9). For example, in some embodiments, H = H1 + H2, where H1 represents the minimum value of the distances in the first direction (i.e., the column direction) perpendicular to the first edge between the virtual pen tip and the centers of the text boxes in the text image, and H2 is a preset height value; for example, H2 may be set to a height of, for example, 30 pixels, but is not limited thereto. For example, in some embodiments, the width W is a preset width value; for example, W may be set to a width of, for example, 60 pixels, but is not limited thereto. It should be understood that H2 and W can be set according to actual application requirements, which are not limited by the embodiments of the present disclosure.
For example, in some embodiments, after the text box of the text to be detected is determined, the text detection method provided by the embodiments of the present disclosure may further include: performing text recognition processing on the text to be detected based on the text box of the text to be detected. For example, a common text processing method may be used for the text recognition processing, which is not limited by the embodiments of the present disclosure. For example, common text processing methods may include, but are not limited to, using a neural network (for example, the Multi-Object Rectified Attention Network (MORAN), etc.) for text recognition processing.

For example, in practical applications, text translation may also be performed based on the result of the text recognition processing, to obtain and output the translation result of the text to be detected. For example, a dictionary database is used to index the result of the text recognition processing to retrieve the translation result. For example, the translation result of the text to be detected may be displayed on a display, or output as speech through a speaker or the like.

It should be noted that in the embodiments of the present disclosure, the flow of the above text detection method may include more or fewer operations, and these operations may be executed sequentially or in parallel. Although the flow of the text detection method described above includes multiple operations appearing in a specific order, it should be clearly understood that the order of the multiple operations is not limited. The text detection method described above may be executed once, or multiple times according to predetermined conditions.

It should be noted that in the embodiments of the present disclosure, the text detection neural network and the various functional modules and functional layers in the text detection neural network may be implemented by software, hardware, firmware, or any combination thereof, so as to execute the corresponding processing procedures.

The text detection method provided by the embodiments of the present disclosure can perform text detection based on a preset base area and the idea of connected domains, which can reduce the amount of calculation for text detection (i.e., reduce the number of traversals) and reduce the response time of text detection. The text detection method is applicable to point translation pens, and can increase the processing speed of the point translation pen and improve the user experience.
At least one embodiment of the present disclosure further provides a text detection device. FIG. 10 is a schematic block diagram of a text detection device provided by at least one embodiment of the present disclosure.

For example, as shown in FIG. 10, the text detection device 1000 includes a memory 1001 and a processor 1002. It should be understood that the components of the text detection device 1000 shown in FIG. 10 are only exemplary and not restrictive; according to actual application requirements, the text detection device 1000 may also include other components.

For example, the memory 1001 is configured to store a text image and computer-readable instructions; the processor 1002 is configured to read the text image and run the computer-readable instructions; when run by the processor 1002, the computer-readable instructions execute one or more steps of the text detection method according to any of the above embodiments.

For example, in some embodiments, as shown in FIG. 10, the text detection device may further include an image acquisition element 1003. For example, the image acquisition element 1003 is configured to capture the text image. For example, the image acquisition element 1003 is the image acquisition device or element described in the embodiments of the above text detection method; for example, the image acquisition element 1003 may be any of various types of cameras.

For example, in some embodiments, the text detection device 1000 may be a point translation pen, but is not limited thereto. For example, the point translation pen is used to select the text to be detected. For example, the image acquisition element 1003 may be arranged on the point translation pen; for example, the image acquisition element 1003 may be a camera arranged on the point translation pen.

It should be noted that the memory 1001 and the processor 1002 may also be integrated in the point translation pen; that is, the image acquisition element 1003, the memory 1001, and the processor 1002 may all be integrated in the point translation pen. The embodiments of the present disclosure include but are not limited thereto.

For example, the text detection device 1000 may further include an output unit configured to output the recognition result and/or the translation result of the text to be detected. For example, the output unit may include a display, a speaker, and the like; the display may be used to display the recognition result and/or the translation result of the text to be detected, and the speaker may be used to output the recognition result and/or the translation result in the form of speech. For example, the point translation pen may further include a communication module, which is used to implement communication between the point translation pen and the output unit, for example, to transmit the translation result to the output unit.

For example, the processor 1002 may control other components in the text detection device 1000 to perform desired functions. The processor 1002 may be a device with data processing capability and/or program execution capability, such as a central processing unit (CPU) or a tensor processing unit (TPU). The central processing unit (CPU) may adopt an X86 or ARM architecture, etc. A GPU may be individually and directly integrated onto the motherboard, or built into the north bridge chip of the motherboard; a GPU may also be built into the central processing unit (CPU).

For example, the memory 1001 may include any combination of one or more computer program products, and the computer program products may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, random access memory (RAM) and/or cache memory. Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, and the like. One or more computer-readable instructions may be stored on the computer-readable storage medium, and the processor 1002 may run the computer-readable instructions to implement various functions of the text detection device 1000.

For example, components such as the image acquisition element 1003, the memory 1001, the processor 1002, and the output unit may communicate with one another through a network connection. The network may include a wireless network, a wired network, and/or any combination of a wireless network and a wired network. The network may include a local area network, the Internet, a telecommunications network, the Internet of Things based on the Internet and/or a telecommunications network, and/or any combination of the above networks, and so on. The wired network may, for example, communicate via twisted pair, coaxial cable, or optical fiber transmission; the wireless network may, for example, use a 3G/4G/5G mobile communication network, Bluetooth, Zigbee, or WiFi. The present disclosure does not limit the type and function of the network here.

For example, for a detailed description of the process of text detection performed by the text detection device 1000, reference may be made to the related descriptions in the embodiments of the text detection method, and repetitions are not described again here.

For the technical effects of the text detection device provided by the embodiments of the present disclosure, reference may be made to the corresponding descriptions of the text detection method in the above embodiments, which are not repeated here.
At least one embodiment of the present disclosure further provides a storage medium. FIG. 11 is a schematic diagram of a storage medium provided by at least one embodiment of the present disclosure. For example, as shown in FIG. 11, one or more computer-readable instructions 1101 may be stored non-transitorily on the storage medium 1100. For example, when executed by a computer, the computer-readable instructions 1101 can execute one or more steps of the text detection method described above.

For example, the storage medium 1100 may be applied in the above text detection device 1000; for example, it may serve as the memory 1001 in the text detection device 1000. For a description of the storage medium 1100, reference may be made to the description of the memory in the embodiments of the text detection device 1000, and repetitions are not described again.

For the technical effects of the storage medium provided by the embodiments of the present disclosure, reference may be made to the corresponding descriptions of the text detection method in the above embodiments, which are not repeated here.

For the present disclosure, the following points need to be noted:

(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; other structures may refer to common designs.

(2) For the sake of clarity, in the drawings used to describe the embodiments of the present disclosure, the thickness of a layer or the size of an area is enlarged or reduced; that is, these drawings are not drawn to actual scale. It can be understood that when an element such as a layer, film, area, or substrate is referred to as being "on" or "under" another element, the element may be "directly" on or under the other element, or there may be intermediate elements.

(3) Without conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with one another to obtain new embodiments.

The above are only exemplary implementations of the present disclosure and are not intended to limit the protection scope of the present disclosure. Any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed by the present disclosure shall be covered by the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure is determined by the appended claims.

Claims (22)

  1. A text detection method, comprising:
    obtaining, based on a text image, a text feature image corresponding to the text image;
    taking a partial area of the text feature image close to a first edge of the text feature image as a base area, wherein the first edge of the text feature image corresponds to a first edge of the text image, text to be detected in the text image is close to the first edge of the text image, and at least part of the pixels in the base area are positive pixels;
    grouping at least part of the positive pixels in the base area to obtain at least one connected domain;
    expanding the at least one connected domain in a direction away from the first edge of the text feature image to obtain at least one final connected domain corresponding to the at least one connected domain; and
    determining at least one feature box corresponding to the at least one final connected domain, and mapping the at least one feature box into the text image to obtain at least one text box, wherein the at least one text box comprises a text box of the text to be detected.
  2. The text detection method according to claim 1, wherein, in a case where the text feature image comprises h rows and w columns of pixels, the base area comprises h_base rows and w columns of pixels,
    wherein h, w, and h_base are all positive integers, and h_base/h ≤ 1/2.
  3. The text detection method according to claim 1 or 2, wherein there is a link probability between each pixel in the text feature image and its directly adjacent pixels;
    grouping at least part of the positive pixels in the base area to obtain the at least one connected domain comprises:
    based on a union-find algorithm, grouping the at least part of the positive pixels in the base area according to the link probability between each positive pixel of the at least part of the positive pixels in the base area and its directly adjacent pixels, to obtain the at least one connected domain.
  4. The text detection method according to claim 3, wherein, based on the union-find algorithm, grouping the at least part of the positive pixels in the base area according to the link probability between each positive pixel of the at least part of the positive pixels in the base area and its directly adjacent pixels, to obtain the at least one connected domain, comprises:
    constructing an index set based on the at least part of the positive pixels in the base area, wherein the index set comprises the at least part of the positive pixels in the base area, and in the index set, each positive pixel corresponds to a root node, and the initial value of the root node of each positive pixel is the pixel itself;
    in response to any directly adjacent pixel of each positive pixel in the index set being a positive pixel and there being a positive link relation between the positive pixel and the directly adjacent pixel, setting the value of the root node of the directly adjacent pixel to the value of the root node of the positive pixel; and
    taking each group of positive pixels having the same root-node value as one connected domain, to obtain the at least one connected domain.
  5. The text detection method according to claim 4, wherein, in a case where the link probability between each positive pixel in the base area and a directly adjacent pixel is greater than a link probability threshold, it is determined that there is the positive link relation between the positive pixel and the directly adjacent pixel.
  6. The text detection method according to claim 4 or 5, wherein the directly adjacent pixels of each positive pixel in the base area comprise:
    pixels directly adjacent to the positive pixel in a first direction perpendicular to the first edge of the text feature image, and pixels directly adjacent to the positive pixel in a second direction parallel to the first edge of the text feature image.
  7. The text detection method according to any one of claims 4-6, wherein each positive pixel in the base area has four directly adjacent pixels.
  8. The text detection method according to any one of claims 4-7, wherein expanding the at least one connected domain in the direction away from the first edge of the text feature image to obtain the at least one final connected domain corresponding to the at least one connected domain comprises:
    extracting, as a first positive pixel, the positive pixel in the current connected domain that is farthest from the first edge of the text feature image in a first direction perpendicular to the first edge of the text feature image;
    taking, as a first neighboring pixel, the pixel in the text feature image that is on the side of the first positive pixel away from the first edge of the text feature image and directly adjacent to the first positive pixel;
    in response to the first neighboring pixel being a positive pixel and there being a positive link relation between the first positive pixel and the first neighboring pixel, modifying the value of the root node of the first neighboring pixel to the value of the root node of the first positive pixel, and adding the first neighboring pixel to a first neighboring pixel set;
    expanding the first neighboring pixel set in a second direction parallel to the first edge of the text feature image; and
    expanding the current connected domain to comprise all the pixels in the first neighboring pixel set, and continuing to expand the current connected domain in the direction away from the first edge of the text feature image until no further expansion is possible.
  9. The text detection method according to claim 8, wherein expanding the first neighboring pixel set in the second direction parallel to the first edge of the text feature image comprises:
    adding, to the first neighboring pixel set, positive pixels that are directly adjacent to any pixel in the first neighboring pixel set in the second direction parallel to the first edge of the text feature image and have a positive link relation with that pixel, until the first neighboring pixel set can no longer be expanded in the direction parallel to the first edge of the text feature image.
  10. The text detection method according to claim 8 or 9, wherein the at least one final connected domain comprises connected domains in the base area that cannot be expanded in the direction away from the first edge of the text feature image.
  11. The text detection method according to any one of claims 3-10, wherein obtaining, based on the text image, the text feature image corresponding to the text image comprises:
    processing the text image with a text detection neural network to obtain the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels.
  12. The text detection method according to claim 11, wherein the text detection neural network comprises first to sixth convolution modules, first to fifth downsampling modules, first to fourth upsampling modules, and a classifier;
    processing the text image with the text detection neural network to obtain the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels comprises:
    performing convolution processing on the text image with the first convolution module to obtain a first convolution feature map group;
    performing downsampling processing on the first convolution feature map group with the first downsampling module to obtain a first downsampled feature map group;
    performing convolution processing on the first downsampled feature map group with the second convolution module to obtain a second convolution feature map group;
    performing downsampling processing on the second convolution feature map group with the second downsampling module to obtain a second downsampled feature map group, and performing dimension-reduction processing on the second convolution feature map group with the fifth dimension-reduction module to obtain a fifth dimension-reduced feature map group;
    performing convolution processing on the second downsampled feature map group with the third convolution module to obtain a third convolution feature map group;
    performing downsampling processing on the third convolution feature map group with the third downsampling module to obtain a third downsampled feature map group, and performing dimension-reduction processing on the third convolution feature map group with the fourth dimension-reduction module to obtain a fourth dimension-reduced feature map group;
    performing convolution processing on the third downsampled feature map group with the fourth convolution module to obtain a fourth convolution feature map group;
    performing downsampling processing on the fourth convolution feature map group with the fourth downsampling module to obtain a fourth downsampled feature map group, and performing dimension-reduction processing on the fourth convolution feature map group with the third dimension-reduction module to obtain a third dimension-reduced feature map group;
    performing convolution processing on the fourth downsampled feature map group with the fifth convolution module to obtain a fifth convolution feature map group;
    performing downsampling processing on the fifth convolution feature map group with the fifth downsampling module to obtain a fifth downsampled feature map group, and performing dimension-reduction processing on the fifth convolution feature map group with the second dimension-reduction module to obtain a second dimension-reduced feature map group;
    performing convolution processing on the fifth downsampled feature map group with the sixth convolution module to obtain a sixth convolution feature map group;
    performing upsampling processing on the sixth convolution feature map group with the first upsampling module to obtain a first upsampled feature map group;
    performing dimension-reduction processing on the first upsampled feature map group with the first dimension-reduction module to obtain a first dimension-reduced feature map group;
    performing fusion processing on the first dimension-reduced feature map group and the second dimension-reduced feature map group to obtain a first fused feature map group;
    performing upsampling processing on the first fused feature map group with the second upsampling module to obtain a second upsampled feature map group;
    performing fusion processing on the second upsampled feature map group and the third dimension-reduced feature map group to obtain a second fused feature map group;
    performing upsampling processing on the second fused feature map group with the third upsampling module to obtain a third upsampled feature map group;
    performing fusion processing on the third upsampled feature map group and the fourth dimension-reduced feature map group to obtain a third fused feature map group;
    performing upsampling processing on the third fused feature map group with the fourth upsampling module to obtain a fourth upsampled feature map group;
    performing fusion processing on the fourth upsampled feature map group and the fifth dimension-reduced feature map group to obtain the fourth fused feature map group;
    performing classification processing on the fourth fused feature map group with the classifier to obtain a text classification prediction image and a link probability prediction image; and
    obtaining, based on the text classification prediction image and the link probability prediction image, the text feature image and the link probability between each pixel in the text feature image and its directly adjacent pixels.
  13. The text detection method according to claim 12, wherein each pixel in the text classification prediction image has a type probability, and each pixel in the link probability prediction image has a link probability between the pixel and its directly adjacent pixels;
    obtaining the text feature image and the link probability between each pixel in the text feature image and its adjacent pixels based on the text classification prediction image and the link probability prediction image comprises:
    taking pixels in the text classification prediction image whose type probability is greater than or equal to a type probability threshold as positive pixels, and taking pixels in the text classification prediction image whose type probability is less than the type probability threshold as negative pixels, to obtain the text feature image, wherein the link probability between each pixel in the text feature image and its directly adjacent pixels can be correspondingly looked up from the link probability prediction image.
  14. The text detection method according to any one of claims 1-13, wherein determining the at least one feature box corresponding to the at least one final connected domain comprises:
    performing contour detection on the at least one final connected domain with a contour detection algorithm to obtain the contour of the at least one final connected domain; and processing the contour of the at least one final connected domain with a minimum bounding rectangle algorithm to obtain the at least one feature box corresponding to the at least one final connected domain.
  15. The text detection method according to any one of claims 1-14, further comprising: determining the text box of the text to be detected from the at least one text box.
  16. The text detection method according to claim 15, wherein determining the text box of the text to be detected from the at least one text box comprises:
    constructing a virtual detection box in the text image; and
    calculating the overlap area between the virtual detection box and each text box, and taking the text box having the largest overlap area with the virtual detection box as the text box of the text to be detected.
  17. The text detection method according to claim 15 or 16, further comprising: performing recognition processing on the text to be detected based on the text box of the text to be detected.
  18. The text detection method according to any one of claims 1-17, further comprising: capturing the text image with an image acquisition element of a point translation pen;
    wherein, when the text image is captured, the pen tip of the point translation pen points at the side of the text to be detected close to the first edge of the text image,
    and the text image comprises the text to be detected.
  19. A text detection device, comprising:
    a memory configured to store a text image and computer-readable instructions; and
    a processor configured to read the text image and run the computer-readable instructions, wherein the computer-readable instructions, when run by the processor, execute the text detection method according to any one of claims 1-18.
  20. The text detection device according to claim 19, further comprising:
    an image acquisition element configured to capture the text image.
  21. The text detection device according to claim 20, wherein the text detection device is a point translation pen, wherein
    the image acquisition element is arranged on the point translation pen, and the point translation pen is used to select the text to be detected.
  22. A storage medium that non-transitorily stores computer-readable instructions, wherein, when executed by a computer, the computer-readable instructions can execute the text detection method according to any one of claims 1-18.
PCT/CN2020/073622 2020-01-21 2020-01-21 Text detection method and device, and storage medium WO2021146951A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080000057.5A 2020-01-21 2020-01-21 Text detection method and device, and storage medium (zh)
PCT/CN2020/073622 2020-01-21 2020-01-21 Text detection method and device, and storage medium WO2021146951A1 (zh)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/073622 WO2021146951A1 (zh) 2020-01-21 2020-01-21 文本检测方法及装置、存储介质

Publications (1)

Publication Number Publication Date
WO2021146951A1 true WO2021146951A1 (zh) 2021-07-29

Family

ID=76991755

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/073622 WO2021146951A1 (zh) 2020-01-21 2020-01-21 文本检测方法及装置、存储介质

Country Status (2)

Country Link
CN (1) CN113498521A (zh)
WO (1) WO2021146951A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116092087B (zh) * 2023-04-10 2023-08-08 上海蜜度信息技术有限公司 OCR recognition method, system, storage medium, and electronic device

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050249430A1 (en) * 2004-05-07 2005-11-10 Samsung Electronics Co., Ltd. Image quality improving apparatus and method
US20120250985A1 (en) * 2011-03-30 2012-10-04 Jing Xiao Context Constraints for Correcting Mis-Detection of Text Contents in Scanned Images
CN110222695A * 2019-06-19 2019-09-10 拉扎斯网络科技(上海)有限公司 Certificate image processing method and apparatus, medium, and electronic device
CN110610166A * 2019-09-18 2019-12-24 北京猎户星空科技有限公司 Text region detection model training method and apparatus, electronic device, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, MO ET AL.: "Caption Detection and Text Content Extraction in News Video", DIANSHI-JISHU: YUEKAN - VIDEO ENGINEERING, vol. 8 (278), no. Suppl. 1, 31 December 2005 (2005-12-31), CN, pages 147 - 149, XP009529346, ISSN: 1002-8692 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113807351A (zh) * 2021-09-18 2021-12-17 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device
CN113807351B (zh) * 2021-09-18 2024-01-16 京东鲲鹏(江苏)科技有限公司 Scene text detection method and device
CN116993976A (zh) * 2023-07-17 2023-11-03 中国科学院自动化研究所 Referring image segmentation model training method and referring image segmentation method
CN116916047A (zh) * 2023-09-12 2023-10-20 北京点聚信息技术有限公司 Intelligent storage method for layout file recognition data
CN116916047B (zh) * 2023-09-12 2023-11-10 北京点聚信息技术有限公司 Intelligent storage method for layout file recognition data
CN117894030A (zh) * 2024-01-18 2024-04-16 广州宏途数字科技有限公司 Text recognition method and system for campus smart paper and pen

Also Published As

Publication number Publication date
CN113498521A (zh) 2021-10-12

Similar Documents

Publication Publication Date Title
WO2021146951A1 (zh) Text detection method and device, and storage medium
WO2020200030A1 (zh) Neural network training method, image processing method, image processing device, and storage medium
CN107424159B (zh) Image semantic segmentation method based on superpixel edges and a fully convolutional network
WO2021073493A1 (zh) Image processing method and device, neural network training method, image processing method using a merged neural network model, construction method of a merged neural network model, neural network processor, and storage medium
US11710293B2 (en) Target detection method and apparatus, computer-readable storage medium, and computer device
WO2022148192A1 (zh) Image processing method, image processing apparatus, and non-transitory storage medium
WO2019201035A1 (zh) Method, apparatus, terminal, and computer-readable storage medium for identifying an object node in an image
US11670071B2 (en) Fine-grained image recognition
CN107506761B (zh) Brain image segmentation method and system based on a saliency-learning convolutional neural network
US20210398287A1 (en) Image processing method and image processing device
JP7464752B2 (ja) Image processing method, apparatus, device, and computer program
CN108427924B (zh) Text regression detection method based on rotation-sensitive features
CN113239782B (zh) Pedestrian re-identification system and method fusing multi-scale GAN and label learning
WO2021146937A1 (zh) Character recognition method, character recognition device, and storage medium
US20230222631A1 (en) Method and device for removing handwritten content from text image, and storage medium
CN111626994B (zh) Equipment fault and defect diagnosis method based on an improved U-Net neural network
CN110517270B (zh) Indoor scene semantic segmentation method based on a superpixel deep network
CN111353544A (zh) Target detection method based on improved Mixed Pooling-YOLOV3
JP2023501820A (ja) Face parsing method and related device
Rao et al. Exploring deep learning techniques for Kannada handwritten character recognition: a boon for digitization
CN111985525A (zh) Text recognition method based on multimodal information fusion processing
CN113487610B (zh) Herpes image recognition method, apparatus, computer device, and storage medium
Cui et al. Deep saliency detection via spatial-wise dilated convolutional attention
Zhang et al. A simple and effective static gesture recognition method based on attention mechanism
CN111832390B (zh) Handwritten ancient character detection method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20915773

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20915773

Country of ref document: EP

Kind code of ref document: A1


32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.03.2023)
