WO2022033095A1 - Method and device for locating a text area - Google Patents

Method and device for locating a text area

Info

Publication number
WO2022033095A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
pixel
value
connected domains
color
Prior art date
Application number
PCT/CN2021/093660
Other languages
English (en)
French (fr)
Inventor
费志军
邱雪涛
何朔
Original Assignee
中国银联股份有限公司 (China UnionPay Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 中国银联股份有限公司 (China UnionPay Co., Ltd.)
Publication of WO2022033095A1 publication Critical patent/WO2022033095A1/zh

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/60Type of objects
    • G06V20/62Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/63Scene text, e.g. street names
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/153Segmentation of character regions using recognition of characters or words
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition

Definitions

  • the present invention relates to the field of computer technology, and in particular, to a method and device for locating a text area.
  • Door head refers to the plaques and related facilities set up by enterprises, institutions and individual industrial and commercial households at the door.
  • the door of the merchant generally contains text content such as the name of the merchant and the address of the merchant.
  • the inspectors need to go to the address of the store to take pictures, and then the reviewers will check the information, which is inefficient and prone to errors.
  • it is necessary to locate the text position of the name of the merchant in the image of the door of the merchant photographed on the street.
  • the existing image text recognition generally recognizes all the text in the image, and cannot effectively distinguish the text area of the merchant name in the image of the merchant's door from other text areas, which affects the accuracy of subsequent recognition of the merchant name.
  • Embodiments of the present invention provide a method and device for locating a text area, which are used to improve the accuracy of locating a text area in a door header image of a merchant.
  • an embodiment of the present invention provides a method for locating a text area, including:
  • text pixels are determined from all the pixels of the target image, and a plurality of text connected domains are formed by the text pixels;
  • for any two text connected domains, calculating the difference feature value between the two text connected domains according to the color value of each pixel in each text connected domain, and calculating the adjacency feature value between the two text connected domains according to the distance between the two text connected domains;
  • the target text area in the target image is determined according to the area of the combined text connected domain.
  • determining text pixels from all pixels of the target image according to pixel values including:
  • the target image is input into the trained pixel classification model, and the pixel feature extraction results of all pixels are obtained by alternating convolution operations and pooling operations in the pixel classification model;
  • the classification result of each pixel in the target image is determined, where the classification result indicates whether the pixel is a text pixel or a non-text pixel.
  • the plurality of text connected domains formed by text pixels include:
  • the text pixels are connected to form multiple text connected domains.
  • the method further includes:
  • the calculation of the adjacency feature value between the two text connected domains according to the distance between the two text connected domains includes:
  • the adjacency eigenvalues between the two minimum enclosing rectangles are calculated.
  • the difference feature value between the two minimum circumscribed rectangles is calculated, including:
  • for the minimum circumscribed rectangle of each text connected domain, obtaining the color value of each pixel in the minimum circumscribed rectangle; calculating the mean of the color values of all pixels as the color feature value of the minimum circumscribed rectangle, where the color feature value includes a red component value, a green component value and a blue component value;
  • the color difference component with the largest value is selected as the difference feature value between the two smallest circumscribed rectangles.
  • calculating the adjacency feature value between the two minimum circumscribed rectangles including:
  • combining the multiple text connected domains according to the difference feature value and the adjacent feature value includes:
  • the embodiment of the present invention also provides an image character recognition method, the method includes:
  • the target feature vector is compared with the labeled feature vector of the labeled sample, and the labeled text image with the largest similarity is determined, and the labeled sample includes the labeled text image, the corresponding labeled feature vector and text information;
  • the text information of the marked image with the highest similarity is used as the text information of the target text area.
  • an embodiment of the present invention further provides a device for locating a text area, the device comprising:
  • an acquisition unit for acquiring the pixel value of each pixel in the target image
  • a calculation unit, used for, for any two text connected domains, calculating the difference feature value between the two text connected domains according to the color value of each pixel in each text connected domain, and calculating the adjacency feature value between the two text connected domains according to the distance between the two text connected domains;
  • a merging unit for merging the plurality of text connected domains according to the difference feature value and the adjacent feature value
  • the filtering unit is configured to determine the target text area in the target image according to the area of the merged text connected domain.
  • the connectivity unit is specifically used for:
  • the target image is input into the trained pixel classification model, and the pixel feature extraction results of all pixels are obtained by alternating convolution operations and pooling operations in the pixel classification model;
  • the classification result of each pixel in the target image is determined, where the classification result indicates whether the pixel is a text pixel or a non-text pixel.
  • the connectivity unit is specifically used for:
  • the text pixels are connected to form multiple text connected domains.
  • the computing unit is specifically used for:
  • for any text connected domain, obtaining the color value of each pixel in the text connected domain; calculating the mean of the color values of all pixels as the color feature value of the text connected domain, where the color feature value includes a red component value, a green component value and a blue component value;
  • the color difference component with the largest value is selected as the difference feature value between the two connected domains.
  • the computing unit is specifically used for:
  • the merging unit is specifically used for:
  • the union search algorithm is used to merge all text connected domains.
  • the connectivity unit is also used to determine the minimum circumscribed rectangle of each text connected domain;
  • the computing unit is further configured to calculate the difference feature value between two text connected domains according to the color value of each pixel in the minimum circumscribed rectangle corresponding to each text connected domain, and to calculate the adjacency feature value between the two text connected domains according to the overlapping area between the minimum circumscribed rectangles of the two text connected domains.
  • an embodiment of the present invention also provides an image character recognition device, the device comprising:
  • the positioning unit includes the positioning device for the text area as described above;
  • the labeled sample includes the labeled image, the corresponding labeled feature vector, and text information
  • the text information of the marked image with the highest similarity is used as the text information of the target text area.
  • an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the method for locating a text area of the first aspect is implemented.
  • an embodiment of the present invention further provides an electronic device, including a memory and a processor, where the memory stores a computer program that can run on the processor, and when the computer program is executed by the processor, the processor is made to implement the method for locating a text area of the first aspect.
  • the pixel value of each pixel in the target image is acquired when the text area is located on the target image.
  • the text pixels are determined from all the pixels of the target image, and a plurality of text connected domains are formed by the text pixels.
  • for any two text connected domains, the difference feature value between the two text connected domains is calculated according to the color value of each pixel in each text connected domain, and at the same time, the adjacency feature value between the two text connected domains is calculated according to the distance between the two text connected domains.
  • the multiple text connected domains are merged, and the target text area in the target image is determined according to the area of the merged text connected domain.
  • the difference feature value and the adjacent feature value between the text connected domains are calculated, and multiple text connected domains are merged according to these two conditions, so that the text connected domains with similar colors and similar distances are merged.
  • the text of the name in the image of the merchant's door can be combined by color and distance to form the target text area.
  • the area of the merged text connected domain corresponding to the merchant name is the largest, and the merged text connected domain can be filtered according to the area to determine the target text area.
  • the embodiment of the present invention can effectively distinguish the text area and the picture area in the door header image of the merchant, and effectively distinguish different text areas, thereby improving the accuracy of the target text area positioning and further ensuring the accuracy of subsequent merchant name recognition.
  • FIG. 1 is a schematic diagram of a system architecture of a method for locating a text area according to an embodiment of the present invention
  • FIG. 2 is a flowchart of a method for locating a text area according to an embodiment of the present invention
  • FIG. 3 is a schematic structural diagram of a CNN pixel classification model provided by an embodiment of the present invention.
  • FIG. 4 is a flowchart of another method for locating a text area provided by an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of a device for locating a text area according to an embodiment of the present invention.
  • FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
  • exemplary means “serving as an example, embodiment, or illustration.” Any embodiment described as “exemplary” is not necessarily to be construed as preferred or advantageous over other embodiments.
  • the terms “first” and “second” in the text are only used for the purpose of description, and should not be construed as expressing or implying relative importance or implying the number of indicated technical features. Therefore, a feature defined as “first” or “second” may explicitly or implicitly include one or more of that feature. In the description of the embodiments of the present invention, unless otherwise stated, “multiple” means two or more. Furthermore, the term “comprising” and any variations thereof are intended to cover a non-exclusive inclusion.
  • a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes unlisted steps or units, or optionally also includes other steps or units inherent to the process, method, product or device.
  • CNN (Convolutional Neural Network): a feedforward neural network that includes convolutional computation and has a deep structure. A convolutional neural network has representation-learning capability and can perform shift-invariant classification of input information according to its hierarchical structure, so it is also called a “shift-invariant artificial neural network”.
  • DBN (Deep Belief Network): a neural network that includes fully connected computation and has a deep feedforward structure; it can be used for unsupervised learning, similarly to an autoencoder, or for supervised learning, as a classifier.
  • when used for unsupervised learning, the purpose is to retain the characteristics of the original features as much as possible while reducing their dimensionality; when used for supervised learning, the purpose is to make the classification error rate as small as possible. Whether it is supervised or unsupervised learning, the essence of a DBN is how to obtain a better feature representation.
  • RNN (Recurrent Neural Network): a neural network that contains recurrent connections and has a deep feedforward structure. It is a type of recursive neural network that takes sequence data as input, performs recursion along the evolution direction of the sequence, and connects all nodes (recurrent units) in a chain. Recurrent neural networks have memory, parameter sharing and Turing completeness, so they have certain advantages in learning the nonlinear characteristics of sequences. Recurrent neural networks are applied in natural language processing (NLP), such as speech recognition, language modeling and machine translation, and are also used in various time-series forecasting tasks. Recurrent neural networks constructed with CNNs have been introduced to deal with computer vision problems involving sequence input.
  • NLP natural language processing
  • CRAFT (Character Region Awareness For Text detection): a deep network structure for text localization that proposes single-character segmentation and inter-character segmentation, which is more in line with the core concept of object detection.
  • by taking character-level boxes rather than whole text boxes as the target, a small receptive field can also predict large and long text, since only character-level content needs to be attended to rather than the entire text instance; CRAFT also proposes how to obtain character-level annotations from existing text detection data.
  • CTPN (Connectionist Text Proposal Network): a deep network structure for text localization. CTPN combines a CNN with an LSTM deep network and can effectively detect horizontally distributed text in complex scenes; it is currently one of the better-performing text detection algorithms.
  • PSEnet (Progressive Scale Expansion Network): a deep network structure for text localization; it is a new instance segmentation network with two advantages.
  • PSEnet as a segmentation-based method, is able to localize text of arbitrary shapes;
  • the model proposes a progressive scale expansion algorithm that can successfully identify adjacent text instances.
  • VGG (Very Deep Convolutional Networks For Large-Scale Image Recognition, a deep convolutional network for large-scale image recognition): a feedforward neural network that includes convolutional computation and has a deep structure.
  • VGG uses three 3×3 convolution kernels to replace a 7×7 convolution kernel, and two 3×3 convolution kernels to replace a 5×5 convolution kernel. The main purpose of this is to increase the depth of the network while keeping the same receptive field, which improves the effect of the neural network to a certain extent.
  • Minimum circumscribed rectangle refers to the maximum extent of a two-dimensional shape (such as a point, line, or polygon) expressed in two-dimensional coordinates, that is, the rectangle bounded by the maximum abscissa, minimum abscissa, maximum ordinate and minimum ordinate among the vertices of the given two-dimensional shape. Such a rectangle contains the given two-dimensional shape and has sides parallel to the coordinate axes. The minimum bounding rectangle is the two-dimensional form of the minimum bounding box.
  • Pixel refers to the smallest unit in an image represented by a sequence of numbers, also known as a picture element.
  • a pixel is an indivisible unit or element in the entire image.
  • Each bitmap contains a certain number of pixels that determine the size of the image on the screen.
  • An image consists of many pixels.
  • RGB Red Green Blue, red green blue
  • RGB is the color representing the three channels of red, green and blue. This standard includes almost all colors that human vision can perceive, and is the most widely used.
  • All the colors on the computer screen are made up of the three colors of red, green and blue mixed in different proportions.
  • a set of red, green and blue is the smallest display unit.
  • the color of any pixel on the screen can be recorded and expressed by a set of RGB values.
  • the so-called "how much" of RGB refers to the brightness, and is represented by an integer.
  • Union-find (disjoint-set): a tree-based data structure used to manage the grouping of elements, and used to handle the merging and query problems of disjoint sets. It is often used to represent forests.
  • a union-find structure can efficiently perform the following operations: query whether element a and element b belong to the same group; merge the groups that element a and element b belong to.
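  • As an illustrative sketch only (the following Python code and names are not part of the original disclosure), a minimal union-find structure supporting these two operations might look like this:

```python
class UnionFind:
    """Minimal union-find (disjoint-set) with path compression and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))   # each element starts in its own group
        self.size = [1] * n

    def find(self, a):
        # Walk up to the root, compressing the path along the way.
        while self.parent[a] != a:
            self.parent[a] = self.parent[self.parent[a]]
            a = self.parent[a]
        return a

    def same(self, a, b):
        # Query: do a and b belong to the same group?
        return self.find(a) == self.find(b)

    def union(self, a, b):
        # Merge the groups containing a and b.
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
```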
  • the embodiments of the present invention provide a method and apparatus for locating a text area.
  • the method for locating a text region provided by the embodiment of the present invention can be applied to a locating scenario of a target text region, a text recognition scenario, and the like.
  • an application scenario of the method for locating a text area provided by the embodiment of the present invention includes a terminal device 101 , a server 102 , and a database 103 .
  • the terminal device 101 is an electronic device with a photographing or video recording function, various clients can be installed, and the running interface of the installed client can be displayed, and the electronic device can be mobile or fixed.
  • the client can be a video client or a browser client, etc.
  • Each terminal device 101 is connected to the server 102 through a communication network, and the communication network may be a wired network or a wireless network.
  • the server 102 may be a server corresponding to a client, a server or a server cluster or a cloud computing center composed of several servers, or a virtualization platform.
  • the database 103 exists independently of the server 102 .
  • the database 103 may also be located in the server 102 .
  • the server 102 is connected to the database 103, and the database 103 stores historical images, annotated samples, training text images, etc.
  • the server 102 receives the target image to be located sent by the terminal device 101, determines the text pixels according to the pixel value of each pixel in the target image, and forms multiple text connected domains; it then calculates the difference feature value and the adjacency feature value between any two text connected domains, merges the multiple text connected domains according to the difference feature values and adjacency feature values, and determines the target text area in the target image according to the area of the merged text connected domains, so as to realize the positioning of the text area.
  • the server 102 also inputs the determined target text area into the trained feature extraction model to obtain the target feature vector, compares the similarity between the target feature vector and the labeled feature vectors of the labeled samples, and determines the labeled text image with the greatest similarity.
  • the text information of the marked image with the highest similarity is used as the text information of the target text area, so as to realize the text recognition of the target text area in the image.
  • the method for locating the text area provided by the present invention can be applied to the server 102, and the server executes the method for locating the text area provided by the embodiment of the present invention;
  • the implementation of the method for locating the text area provided by the present invention may also be completed by the server 102 in cooperation with the client in the terminal device 101 .
  • FIG. 2 shows a flowchart of a method for locating a text area provided by an embodiment of the present invention. As shown in Figure 2, the method includes the following steps:
  • Step S201 acquiring the pixel value of each pixel in the target image.
  • the target image may include but is not limited to image files in formats such as jpg, bmp, tif, gif, png, etc., and the target image may also be a screenshot.
  • the target image may be an image uploaded in real time by the terminal device, or the target image may be an image obtained from a network, or the target image may be an image stored locally.
  • the server determines the pixel value of each pixel in the target image.
  • the pixel value is the value assigned by the computer when the image is digitized, it represents the average brightness information of a pixel point, or the average reflection (transmission) density information of the pixel point.
  • the pixel value of a pixel may be the color value of the RGB color model, the color value of the HSV (Hue-Saturation-Value) color model, or the grayscale value of the pixel.
  • Step S202 determine text pixels from all the pixels of the target image, and form a plurality of text connected domains from the text pixels.
  • the pixels in the target image can be divided into text pixels and non-text pixels; all pixels in the target image can be classified according to their pixel values, and each pixel is determined to be either a text pixel or a non-text pixel.
  • an algorithm model can be used to classify the pixels: the target image is input into a CNN network, feature extraction is performed on the target image, and the output results correspond to the pixels one-to-one. For example, if a pixel is a text pixel, the pixel is marked as 1, and if the pixel is a non-text pixel, the pixel is marked as 0.
  • all text pixels are clustered together, adjacent text pixels can form a text connected domain, and all text pixels can form one or more text connected domains.
  • if only one text connected domain is formed, that text connected domain is the target text area, and no subsequent positioning process is required.
  • if multiple text connected domains are formed, the target text area needs to be determined from the multiple text connected domains.
  • the algorithm model for classifying pixel points in the embodiment of the present invention may be a CNN network or other deep learning network models, which are only examples and are not limited here.
  • Step S203: for any two text connected domains, calculate the difference feature value between the two text connected domains according to the color value of each pixel in each text connected domain, and calculate the adjacency feature value between the two text connected domains according to the distance between the two text connected domains.
  • the pixel value of the pixel point may be the color value of the RGB color mode of the pixel point.
  • M_i = {R_i, G_i, B_i} can be used to represent the color value of the i-th pixel, where R_i is the red component value of the pixel, G_i is the green component value of the pixel, and B_i is the blue component value of the pixel.
  • the color value of the text connected domain can be calculated according to the color value of each pixel in the text connected domain, and the difference feature value between two text connected domains can be calculated from the color values of the two text connected domains.
  • the difference feature value represents the degree of color difference between the two text connected domains: the greater the difference feature value, the greater the color difference between the two text connected domains; the smaller the difference feature value, the smaller the color difference between the two text connected domains.
  • the adjacency feature value is calculated according to the distance between the two text connected domains and represents how close the two text connected domains are: the larger the overlapping area between the text connected domains, the closer the two text connected domains are; the smaller the overlapping area, the farther apart the two text connected domains are.
  • Step S204 Combine the multiple text connected domains according to the difference feature value and the adjacent feature value.
  • two text connected domains with small color difference and small distance need to be merged. Therefore, for any two text connected domains, it is determined whether the two text connected domains are merged according to the difference eigenvalue and the adjacent eigenvalue between the two text connected domains. Furthermore, after merging multiple text connected domains, one or more merged text connected domains are obtained.
  • a merged text connected domain corresponds to a text area.
  • a business door header image includes a business name, a business address, a business trademark, etc.
  • the text area of the business name corresponds to a combined text connected domain.
  • the text area of the address corresponds to a merged text connected domain. Since the area of the merchant name in the image of the merchant's door is generally the largest, the merged text connected domains can be filtered according to their areas, and the one or two merged text connected domains left after filtering are used as the target text area.
  • Step S205 Determine the target text area in the target image according to the area of the merged text connected domain.
  • the pixel value of each pixel in the target image is acquired when the text area is located on the target image.
  • the text pixels are determined from all the pixels of the target image, and a plurality of text connected domains are formed by the text pixels.
  • for any two text connected domains, the difference feature value between the two text connected domains is calculated according to the color value of each pixel in each text connected domain, and at the same time, the adjacency feature value between the two text connected domains is calculated according to the distance between the two text connected domains.
  • the multiple text connected domains are merged, and the target text area in the target image is determined according to the area of the merged text connected domain.
  • the difference feature value and the adjacent feature value between the text connected domains are calculated, and multiple text connected domains are merged according to these two conditions, so that the text connected domains with similar colors and similar distances are merged.
  • the text of the name in the image of the merchant's door can be combined by color and distance to form the target text area.
  • the area of the merged text connected domain corresponding to the merchant name is the largest, and the merged text connected domain can be filtered according to the area to determine the target text area.
  • the embodiment of the present invention can effectively distinguish the text area and the picture area in the door header image of the merchant, and effectively distinguish different text areas, thereby improving the accuracy of the target text area positioning and further ensuring the accuracy of subsequent merchant name recognition.
  • step S202 of determining the text pixels from all the pixels of the target image includes:
  • the target image is input into the trained pixel classification model, and the pixel feature extraction results of all pixels are obtained by alternating convolution operations and pooling operations in the pixel classification model;
  • the classification result of each pixel in the target image is determined, where the classification result indicates whether the pixel is a text pixel or a non-text pixel.
  • the pixel classification model may be a CNN network model, a DBN network model, an RNN network model, or the like.
  • the CNN network model in the embodiment of the present invention is taken as an example to introduce how to classify each pixel in the target image.
  • the embodiment of the present invention adopts a Unet-like CNN network structure to perform feature reconstruction on the target image, that is, the pixel value of each pixel in the target image is input into the trained CNN network model, and the feature extraction results correspond one-to-one with the pixels in the target image.
  • the feature extraction results in the embodiments of the present invention are classified into two categories, namely, text pixels or non-text pixels.
  • the text pixel can be set to 1 and the non-text pixel can be set to 0; that is, if the classification result of a certain pixel computed by the CNN network model is a text pixel, the classification result of the pixel is set to 1, and if the classification result is a non-text pixel, it is set to 0.
  • the CNN network structure in this embodiment of the present application includes a 2n+1-level convolutional layer, an n-level pooling layer, and an n-level deconvolutional layer.
  • a pooling layer is set after each of the first n levels of convolutional layers, that is, the first n levels of convolutional layers and the n levels of pooling layers are arranged alternately.
  • each level of convolution layer is used to perform at least one convolution process.
  • a feature map corresponding to the target image is obtained, wherein the number of channels of the feature map is equal to the number of channels of the target image, and the size of the feature map is equal to the size of the target image.
  • the convolution layer is a layer used to extract features, which is divided into two parts: convolution operation and activation operation. Among them, during the convolution operation, the convolution kernel obtained by pre-training is used for feature extraction, and during the activation operation, the activation function is used to activate the feature map obtained by convolution.
  • the commonly used activation functions include the rectified linear unit (ReLU) function, the sigmoid function and the hyperbolic tangent (Tanh) function, etc.
  • the pooling layer located after the convolutional layer, is used to reduce the feature vector output by the convolutional layer, that is, reduce the size of the feature map and improve the overfitting problem.
  • Commonly used pooling methods include mean-pooling, max-pooling, and stochastic-pooling.
  • Deconvolution layer a layer used to upsample the feature vector, that is, used to increase the size of the feature map.
  • the i-1th feature map is convolved and activated through the i-th level convolution layer, and the processed i-1th feature map is input to the i-th level pooling layer, 2 ⁇ i ⁇ n.
  • for the first-level convolutional layer, its input is the target image; for the i-th level convolutional layer, its input is the feature map output by the (i-1)-th level pooling layer.
  • the target image is subjected to a convolution operation through a preset convolution kernel, and then the activation operation is performed through a preset activation function;
  • after the i-th level convolutional layer obtains the (i-1)-th feature map output by the (i-1)-th level pooling layer, the (i-1)-th feature map is convolved through a preset convolution kernel and then activated through a preset activation function, so as to extract features.
  • the number of channels of the feature map increases.
  • the first-level convolutional layer performs two convolution operations on the target image; the second-level convolutional layer performs two convolution operations on the first feature map output by the first pooling layer; the third-level convolutional layer performs two convolution operations on the second feature map output by the second pooling layer; and the fourth-level convolutional layer performs two convolution operations on the third feature map output by the third pooling layer.
  • the height of the multi-channel feature map is used to represent the size
  • the width is used to represent the number of channels.
  • the i-1th feature map after processing is pooled through the i-th level pooling layer to obtain the i-th feature map.
  • the processed (i-1)-th feature map is input into the i-th level pooling layer, which performs pooling processing and outputs the i-th feature map.
  • the pooling layers at all levels are used to reduce the size of the feature map and retain important information in the feature map.
  • each pooling layer performs maximum pooling on the input feature map.
  • the first-level pooling layer processes the output feature map of the first-level convolutional layer to obtain the first feature map; the second-level pooling layer processes the output feature map of the second-level convolutional layer to obtain the second feature map; and the third-level pooling layer processes the output feature map of the third-level convolutional layer to obtain the third feature map.
  • the i-th feature map is fed into the i+1-th convolutional layer.
  • the i-th pooling layer inputs the i-th feature map into the next-level convolutional layer, and the next-level convolutional layer further performs feature extraction.
  • the target image goes through the first-level convolutional layer, the first-level pooling layer, the second-level convolutional layer and the second-level pooling layer, the third-level convolutional layer, and the third-level pooling layer.
  • the third-level pooling layer feeds the third feature map into the fourth-level convolutional layer.
  • the above-mentioned embodiment only takes performing three convolution and pooling operations as an example for description. In other possible implementations, the CNN network structure may perform multiple convolution and pooling operations, which is not limited in this embodiment.
  • the classification result map then needs to be obtained through the deconvolution layers: the intermediate feature map is processed by the (n+1)-th to (2n+1)-th convolutional layers and the n levels of deconvolution layers, that is, convolution and deconvolution are performed on the intermediate feature map to obtain the classification result map, where the size of the classification result map is equal to the size of the target image.
  • the processing through the n+1th to 2n+1st convolutional layers and the nth deconvolutional layers includes the following steps:
  • deconvolution is performed on the feature map output by the j+nth convolution layer through the jth deconvolution layer, 1 ⁇ j ⁇ n.
  • deconvolution is performed on the feature map output by the fourth-level convolutional layer through the first-level deconvolution layer; the feature map output by the fifth-level convolutional layer is deconvolved through the second-level deconvolution layer; and the feature map output by the sixth-level convolutional layer is deconvolved through the third-level deconvolution layer.
  • the deconvolution process, as the inverse process of the convolution process, is used to upsample the feature map, thereby increasing the size of the feature map.
  • the feature map after deconvolution processing is spliced with the feature map output by the (n-j+1)-th level convolutional layer, and the spliced feature map is input into the (j+n+1)-th level convolutional layer; the feature map after deconvolution processing has the same size as the feature map output by the (n-j+1)-th level convolutional layer.
  • the feature map output by the third-level convolutional layer and the feature map output by the first-level deconvolution layer are spliced as the input of the fifth-level convolutional layer; the feature map output by the second-level convolutional layer and the feature map output by the second-level deconvolution layer are spliced as the input of the sixth-level convolutional layer; and the feature map output by the first-level convolutional layer and the feature map output by the third-level deconvolution layer are spliced as the input of the seventh-level convolutional layer.
  • the convolution process is performed on the spliced feature map through the j+n+1th convolutional layer, and the final output is a classification result map that is consistent with the size of the target image.
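  • As an illustrative sketch only, the following PyTorch code shows a Unet-like structure of the kind described above with n = 3 (seven levels of convolutional layers, three pooling layers, three deconvolution layers with spliced skip connections, and a per-pixel text / non-text output map the same size as the input); the channel widths and other hyperparameters are assumptions, since the disclosure does not specify them:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Each "level" of convolution performs two convolution + activation operations.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(c_out, c_out, 3, padding=1), nn.ReLU(inplace=True),
    )

class UnetLikeClassifier(nn.Module):
    """Sketch of a Unet-like pixel classification model (n = 3)."""

    def __init__(self):
        super().__init__()
        self.enc1, self.enc2, self.enc3 = conv_block(3, 32), conv_block(32, 64), conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)                    # max-pooling halves the feature-map size
        self.mid = conv_block(128, 256)                # 4th-level convolution
        self.up1 = nn.ConvTranspose2d(256, 128, 2, 2)  # deconvolution: upsample the feature map
        self.dec1 = conv_block(256, 128)               # 5th level: spliced with enc3 output
        self.up2 = nn.ConvTranspose2d(128, 64, 2, 2)
        self.dec2 = conv_block(128, 64)                # 6th level: spliced with enc2 output
        self.up3 = nn.ConvTranspose2d(64, 32, 2, 2)
        self.dec3 = conv_block(64, 32)                 # 7th level: spliced with enc1 output
        self.head = nn.Conv2d(32, 1, 1)                # per-pixel text / non-text logit

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        m = self.mid(self.pool(e3))
        d1 = self.dec1(torch.cat([self.up1(m), e3], dim=1))
        d2 = self.dec2(torch.cat([self.up2(d1), e2], dim=1))
        d3 = self.dec3(torch.cat([self.up3(d2), e1], dim=1))
        return torch.sigmoid(self.head(d3))   # values above 0.5 can be treated as text pixels (1)
```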
  • the CNN network structure can be trained through the classification results of historical images, and then the classification results can be extracted according to the trained CNN network structure.
  • after each pixel is classified, the text pixels can be formed into text connected domains according to the classification results.
  • multiple text connected domains are formed by text pixels, including:
  • the text pixels are connected to form multiple text connected domains.
  • the classification result of each pixel is obtained through the pixel classification model, and the adjacency relationship between each pixel and its adjacent pixels can be obtained according to the classification results; except for the pixels on the four edges of the target image, each pixel in the target image has 8 adjacent pixels, namely the upper, lower, left, right, upper-right, lower-right, upper-left and lower-left pixels.
  • the relationship between a text pixel and each adjacent pixel can be marked: for example, if the adjacent pixel is also a text pixel, the relationship is marked as 1; if the adjacent pixel is a non-text pixel, it is marked as 0. Each text pixel thus corresponds to 8 adjacency relationships.
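  • For illustration only (the helper name and implementation are assumptions, not the disclosed code), text connected domains can be formed from the 0/1 classification map by an 8-neighbourhood traversal:

```python
from collections import deque

def text_connected_domains(mask):
    """Group text pixels (value 1) of a 2-D 0/1 classification map into
    8-connected domains; returns a list of pixel-coordinate lists."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    neighbours = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
                  (0, 1), (1, -1), (1, 0), (1, 1)]   # the 8 adjacent positions
    domains = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] != 1 or seen[y][x]:
                continue
            queue, domain = deque([(y, x)]), []
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                domain.append((cy, cx))
                for dy, dx in neighbours:
                    ny, nx = cy + dy, cx + dx
                    if 0 <= ny < h and 0 <= nx < w and mask[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            domains.append(domain)
    return domains
```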
  • the minimum circumscribed rectangle of each text connected domain is determined.
  • the embodiment of the present invention determines a minimum circumscribed rectangle for each text connected domain.
  • the minimum circumscribed rectangle is, given a polygon (or a group of points), the circumscribed rectangle with the smallest area.
  • a simple circumscribed rectangle is a circumscribed rectangle whose sides are parallel to the x-axis or the y-axis.
  • the simple circumscribed rectangle is usually not the smallest circumscribed rectangle, but it is very easy to obtain.
  • the subsequent steps can use the corresponding minimum circumscribed rectangle to replace the text connected domain for calculation.
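  • For illustration only, a simple axis-parallel circumscribed rectangle of a text connected domain can be obtained from the extreme coordinates of its pixels (a sketch, not the disclosed implementation):

```python
def bounding_rectangle(domain):
    """Axis-parallel circumscribed rectangle of a list of (y, x) pixel coordinates,
    returned as (y_min, x_min, y_max, x_max) with inclusive bounds."""
    ys = [p[0] for p in domain]
    xs = [p[1] for p in domain]
    return min(ys), min(xs), max(ys), max(xs)
```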
  • the difference eigenvalue between the two minimum enclosing rectangles is calculated.
  • calculating the difference eigenvalue between two text connected domains is to calculate the difference eigenvalue of the minimum circumscribed rectangle corresponding to the two text connected domains, including:
  • for the minimum circumscribed rectangle of each text connected domain, obtain the color value of each pixel in the minimum circumscribed rectangle; calculate the mean of the color values of all pixels as the color feature value of the minimum circumscribed rectangle; the color feature value includes a red component value, a green component value and a blue component value;
  • the color difference component with the largest value is selected as the difference feature value between the two smallest circumscribed rectangles.
  • the color value of the pixel in the embodiment of the present invention may be the color value of the RGB color mode or the color value of the HSV color model.
  • the color value of the RGB color mode is used as an example for introduction.
  • the RGB value of each pixel in the minimum circumscribed rectangle is obtained, and the RGB value includes the red component, green component and blue component of the pixel, which can be represented as M_i = {R_i, G_i, B_i}.
  • the color feature value of the minimum circumscribed rectangle includes the red feature value, green feature value, and blue feature value of the minimum circumscribed rectangle.
  • the red feature value of the minimum circumscribed rectangle is equal to the mean of the red components of all pixels in the minimum circumscribed rectangle, the green feature value is equal to the mean of the green components of all pixels in the minimum circumscribed rectangle, and the blue feature value is equal to the mean of the blue components of all pixels in the minimum circumscribed rectangle.
  • the color feature value of the minimum circumscribed rectangle can thus be written as M_c = {R_c, G_c, B_c}, where R_c is the red feature value of the minimum circumscribed rectangle, G_c is the green feature value, and B_c is the blue feature value.
  • the color difference components of the two minimum circumscribed rectangles are calculated.
  • the color difference components may include a luminance difference value, a hue difference value and a color density difference value; that is, according to the color feature values of the two minimum circumscribed rectangles, the brightness difference, hue difference and color density difference of the two minimum circumscribed rectangles are calculated, and then the color difference component with the largest value is selected as the difference feature value of the two minimum circumscribed rectangles.
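  • As an illustrative sketch only, and assuming (since the exact formulas for the color difference components are not given here) that the brightness, hue and color density differences are computed from the mean colors via an HSV conversion, the difference feature value could look like this:

```python
import colorsys

def color_feature(pixels):
    """Mean RGB color of a region; pixels is a list of (R, G, B) tuples with values in 0..255."""
    n = len(pixels)
    return tuple(sum(p[i] for p in pixels) / n for i in range(3))

def difference_feature(rgb_a, rgb_b):
    """Largest of the assumed color-difference components between two mean colors."""
    h_a, s_a, v_a = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb_a))
    h_b, s_b, v_b = colorsys.rgb_to_hsv(*(c / 255.0 for c in rgb_b))
    # The 0..255 scaling is an assumption, chosen so the result is comparable
    # with a color threshold on the order of 21.
    brightness_diff = abs(v_a - v_b) * 255   # luminance difference
    hue_diff = abs(h_a - h_b) * 255          # hue difference
    density_diff = abs(s_a - s_b) * 255      # color density (saturation) difference
    return max(brightness_diff, hue_diff, density_diff)
```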
  • the adjacency eigenvalues between two text-connected domains are calculated using the minimum circumscribed rectangle of the text-connected domains. According to the distance between the two text connected domains, the adjacent feature value between the two text connected domains is calculated, including:
  • the adjacency eigenvalues between the two minimum enclosing rectangles are calculated.
  • the adjacent feature value between the two minimum circumscribed rectangles is calculated, including:
  • the area of the minimum circumscribed rectangle can be represented by the number of pixels contained in the minimum circumscribed rectangle. For example, the smallest circumscribed rectangle a contains 100 pixels, then the area of the smallest circumscribed rectangle a is 100, and the smallest circumscribed rectangle b contains 80 pixels, then the area of the smallest circumscribed rectangle b is 80. If the minimum enclosing rectangle a and the smallest enclosing rectangle b contain 20 identical pixels, the overlapping area of the smallest enclosing rectangle a and the smallest enclosing rectangle b is marked as 20.
  • the adjacency feature value between the two minimum circumscribed rectangles is equal to the ratio of the overlapping area to the sum of the areas of the two minimum circumscribed rectangles, that is, the adjacency feature value is equal to 20 / (100 + 80), which is 1/9.
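  • For illustration only, the adjacency feature value of two axis-parallel rectangles can be sketched as follows; with the worked example above (overlap 20, areas 100 and 80) it returns 1/9:

```python
def rect_area(rect):
    y0, x0, y1, x1 = rect                     # inclusive pixel bounds
    return (y1 - y0 + 1) * (x1 - x0 + 1)

def adjacency_feature(rect_a, rect_b):
    """Overlapping area of two rectangles divided by the sum of their areas."""
    ay0, ax0, ay1, ax1 = rect_a
    by0, bx0, by1, bx1 = rect_b
    oh = min(ay1, by1) - max(ay0, by0) + 1    # overlap height in pixels
    ow = min(ax1, bx1) - max(ax0, bx0) + 1    # overlap width in pixels
    overlap = max(0, oh) * max(0, ow)
    return overlap / (rect_area(rect_a) + rect_area(rect_b))
```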
  • the combination of the multiple text connected domains according to the difference eigenvalues and the adjacent eigenvalues includes:
  • the difference feature value is compared with the color threshold.
  • the color threshold can be set to 21. If the difference feature value is smaller than the color threshold, it is considered that the colors between the minimum circumscribed rectangles are similar and can be merged; If it is greater than or equal to the color threshold, it is considered that the color difference between the minimum circumscribed rectangles is relatively large and will not be merged.
  • similarly, the adjacency feature value is compared with the area threshold.
  • if the adjacency feature value is greater than the area threshold, it is considered that the minimum circumscribed rectangles are close to each other and can be merged; if the adjacency feature value is less than or equal to the area threshold, it is considered that the minimum circumscribed rectangles are far apart, and no merging is performed.
  • two minimum circumscribed rectangles whose difference feature value is smaller than the color threshold and whose adjacency feature value is greater than the area threshold have an association relationship and can be merged.
  • the union search algorithm can be used to determine all the minimum circumscribed rectangles that need to be merged.
  • the target text area can be determined according to the area of the merged minimum circumscribed rectangle. Specifically, since the merchant name in the image of the merchant's door is generally the area with the largest area, the target image can be noise filtered according to the area, and the smallest circumscribed rectangle with the largest combined area is used as the target text area in the target image.
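  • As an illustrative sketch only, the merging and filtering step can be put together as follows, reusing the UnionFind, difference_feature, adjacency_feature and rect_area sketches given earlier; the color threshold of 21 is the example value above, while the area threshold and helper names are assumptions:

```python
COLOR_THRESHOLD = 21      # example value from the description
AREA_THRESHOLD = 0.0      # illustrative; no concrete value is given here

def merge_and_select(rects, mean_colors):
    """Merge rectangles that are similar in color and close together, then
    return the indices of the merged group with the largest total area."""
    uf = UnionFind(len(rects))
    for i in range(len(rects)):
        for j in range(i + 1, len(rects)):
            diff = difference_feature(mean_colors[i], mean_colors[j])
            adj = adjacency_feature(rects[i], rects[j])
            if diff < COLOR_THRESHOLD and adj > AREA_THRESHOLD:
                uf.union(i, j)            # association relationship: merge the two domains
    groups = {}
    for i in range(len(rects)):
        groups.setdefault(uf.find(i), []).append(i)
    # the merged connected domain with the largest total area is taken as the target text area
    return max(groups.values(), key=lambda g: sum(rect_area(rects[k]) for k in g))
```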
  • the text in the target text area can be recognized.
  • Step S206 Input the target text region into the trained feature extraction model, and obtain the target feature vector of the target text region.
  • the feature extraction model is trained by using training text images and corresponding text information.
  • the feature extraction model may be a deep learning network model, such as CTPN, PSEnet and other models.
  • the feature extraction model is a VGG network as an example.
  • the VGG network here is trained by using the marked image of the door of the merchant and the text information of the corresponding merchant name.
  • the target feature vector of the target text area is obtained through the VGG network, and the target feature vector can be a 1×1024 vector.
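  • As an illustrative sketch only (the description says only that a VGG network trained on annotated door-header images produces the target feature vector; the specific layers below are assumptions), step S206 could look like this:

```python
import torch
import torch.nn as nn
from torchvision import models

class DoorHeadFeatureExtractor(nn.Module):
    """Sketch: VGG-16 backbone followed by a projection to a 1x1024 feature vector."""

    def __init__(self):
        super().__init__()
        backbone = models.vgg16(weights=None)          # would be trained on labeled door-header images
        self.features = backbone.features              # convolutional part of VGG-16
        self.pool = nn.AdaptiveAvgPool2d((7, 7))
        self.project = nn.Linear(512 * 7 * 7, 1024)    # target feature vector of length 1024

    def forward(self, region):
        # region: a batch of cropped target text areas, shape (N, 3, H, W)
        x = self.pool(self.features(region))
        return self.project(torch.flatten(x, 1))
```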
  • Step S207 comparing the similarity between the target feature vector and the labeling feature vector of the labeling sample, and determining the labeling text image with the greatest similarity.
  • the labeling sample includes the labeling text image, the corresponding labeling feature vector and text information.
  • annotation samples are stored in the database, and the annotation samples include annotated text images, annotated feature vectors and corresponding text information. Compare the similarity between the target feature vector obtained above and the labeled feature vector in the database, and select the labeled text image corresponding to the labeled feature vector with the largest similarity.
  • the similarity calculation here can be calculated using the cosine similarity formula.
  • the specific similarity can be calculated according to the following formula: similarity(A, B) = (A·B) / (‖A‖ × ‖B‖), where A is the target feature vector, B is the labeled feature vector, and both are one-dimensional feature vectors.
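  • A minimal sketch of the cosine-similarity comparison in step S207 (the retrieval structure and names are illustrative assumptions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two one-dimensional feature vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def most_similar_label(target_vec, labeled_samples):
    """labeled_samples: list of (labeled_feature_vector, text_information) pairs.
    Returns the text information of the most similar labeled sample."""
    best = max(labeled_samples, key=lambda s: cosine_similarity(target_vec, s[0]))
    return best[1]
```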
  • Step S208 Use the text information of the marked image with the highest similarity as the text information of the target text area.
  • the labeled feature vector with the greatest similarity with the target feature vector is selected, and the text information of the labeled feature vector is used as the text information of the target feature vector, that is, the text information of the target text area.
  • by pre-extracting the target text area during text recognition of the merchant's door header image, the embodiment of the present invention reduces the size of the image input into the feature extraction model, which can reduce the influence of shooting angle and noise on the image retrieval effect, avoid the impact of complex backgrounds on text recognition performance, and improve the accuracy of text recognition.
  • the target image is received, and the pixel value of each pixel in the target image is determined.
  • the pixel value of each pixel is input into the pixel classification model, and the pixel classification model adopts the convolutional neural network of the class Unet.
  • the pixel feature extraction results of all pixels are obtained by alternating convolution and pooling operations in the pixel classification model.
  • the classification result of each pixel in the target image is determined, wherein the classification result of the pixel is that the pixel is a text pixel or a non-text pixel.
  • the adjacency relationship between the text pixel and the adjacent pixel is determined.
  • the adjacency relationship includes top, bottom, left, right, top right, bottom right, top left, bottom left. Connect the text pixels according to the adjacency relationship to form multiple text connected domains, and determine the minimum circumscribed rectangle of each text connected domain.
  • the difference eigenvalue between the two minimum enclosing rectangles is calculated. Specifically, the color value of each pixel in the minimum circumscribed rectangle is acquired, wherein the color feature value includes a red component value, a green component value and a blue component value. Calculate the mean of the color values of all pixel points as the color feature value of the minimum circumscribed rectangle. According to the color eigenvalues of the minimum circumscribed rectangles, multiple color difference components between the two minimum circumscribed rectangles are calculated, and the color difference component with the largest value is selected as the difference eigenvalue between the two minimum circumscribed rectangles.
  • two minimum circumscribed rectangles whose difference feature value is less than the color threshold and whose adjacency feature value is greater than the area threshold have an association relationship.
  • all the minimum circumscribed rectangles are merged according to the association relationship.
  • the text connected region with the largest combined area is taken as the target text area in the target image.
  • the similarity between the target feature vector and the labeled feature vector of the labeled sample is compared, and the labeled text image with the greatest similarity is determined.
  • the annotation samples include annotated text images, corresponding annotation feature vectors, and text information.
  • the text information of the marked image with the highest similarity is used as the text information of the target text area.
  • FIG. 5 shows a block diagram of the structure of an apparatus for locating a text area provided by an embodiment of the present invention.
  • the apparatus includes: an acquisition unit 501 , a communication unit 502 , a calculation unit 503 , a merging unit 504 , and a filtering unit 505 .
  • the obtaining unit 501 is used to obtain the pixel value of each pixel in the target image
  • Connectivity unit 502 for determining text pixels from all pixels of the target image according to pixel values, and forming a plurality of text connected domains by the text pixels;
  • the computing unit 503 is used for, for any two text connected domains, calculating the difference feature value between the two text connected domains according to the color value of each pixel in each text connected domain, and calculating the adjacency feature value between the two text connected domains according to the distance between the two text connected domains;
  • a merging unit 504 configured to merge the plurality of text connected domains according to the difference feature value and the adjacent feature value
  • the filtering unit 505 is configured to determine the target text area in the target image according to the area of the merged text connected domain.
  • the connectivity unit 502 is specifically used for:
  • the target image is input into the trained pixel classification model, and the pixel feature extraction results of all pixels are obtained by alternating convolution operations and pooling operations in the pixel classification model;
  • the classification result of each pixel in the target image is determined, where the classification result indicates whether the pixel is a text pixel or a non-text pixel.
  • the connectivity unit 502 is specifically used for:
  • the text pixels are connected to form multiple text connected domains.
  • the computing unit 503 is specifically configured to:
  • for any text connected domain, obtaining the color value of each pixel in the text connected domain; calculating the mean of the color values of all pixels as the color feature value of the text connected domain, where the color feature value includes a red component value, a green component value and a blue component value;
  • the color difference component with the largest value is selected as the difference feature value between the two connected domains.
  • the computing unit 503 is specifically configured to:
  • the merging unit 504 is specifically configured to:
  • the union search algorithm is used to merge all text connected domains.
  • the connectivity unit 502 is further configured to determine the minimum circumscribed rectangle of each text connected domain;
  • the computing unit is further configured to calculate the difference feature value between two text connected domains according to the color value of each pixel in the minimum circumscribed rectangle corresponding to each text connected domain, and to calculate the adjacency feature value between the two text connected domains according to the overlapping area between the minimum circumscribed rectangles of the two text connected domains.
  • the embodiments of the present invention further provide an electronic device.
  • the electronic device may be a server, such as server 102 shown in FIG. 1 , which includes at least a memory for storing data and a processor for data processing.
  • the processor used for data processing may be implemented by a microprocessor, a CPU, a GPU (Graphics Processing Unit), a DSP or an FPGA.
  • operation instructions are stored in the memory; the operation instructions may be computer-executable code, and the steps in the flow of the method for locating a text area according to the above embodiments of the present invention are implemented through the operation instructions.
  • FIG. 6 is a schematic structural diagram of an electronic device provided by an embodiment of the present invention.
  • the electronic device 60 includes: a processor 61, a display 62, a memory 63, an input device 66, a bus 65 and a communication device 64; the processor 61, memory 63, input device 66, display 62 and communication device 64 are all connected through the bus 65, which is used to transfer data between the processor 61, memory 63, display 62, communication device 64 and input device 66.
  • the memory 63 can be used to store software programs and modules, such as the program instructions/modules corresponding to the method for locating a text area in the embodiments of the present invention; by running the software programs and modules stored in the memory 63, the processor 61 executes the various functional applications and data processing of the electronic device 60, such as the method for locating a text area provided by the embodiments of the present invention.
  • the memory 63 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program of at least one application, and the like; the storage data area may store data created according to the use of the electronic device 60 (such as animation clips and control policy networks), and so on.
  • the memory 63 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the processor 61 is the control center of the electronic device 60; it connects the various parts of the entire electronic device 60 through the bus 65 and various interfaces and lines, and executes the various functions of the electronic device 60 and processes data by running or executing the software programs and/or modules stored in the memory 63 and calling the data stored in the memory 63.
  • the processor 61 may include one or more processing units, such as a CPU, a GPU (Graphics Processing Unit), a digital processing unit, and the like.
  • the processor 61 displays the determined target text area and text information to the user through the display 62.
  • the processor 61 can also be connected to a network through the communication device 64; if the electronic device is a server, the processor 61 can transmit data to and from the terminal device through the communication device 64.
  • the input device 66 is mainly used to obtain the user's input operations; when the electronic device differs, the input device 66 may also differ. For example, when the electronic device is a computer, the input device 66 may be an input device such as a mouse or a keyboard; when the electronic device is a portable device such as a smartphone or a tablet computer, the input device 66 may be a touch screen.
  • An embodiment of the present invention further provides a computer storage medium, where computer-executable instructions are stored in the computer storage medium, and the computer-executable instructions are used to implement the method for locating a text area according to any embodiment of the present invention.
  • in some possible implementations, various aspects of the method for locating a text area provided by the present invention can also be implemented in the form of a program product, which includes program code; when the program product runs on a computer device, the program code is used to make the computer device execute the steps of the method for locating a text area according to the various exemplary embodiments of the present invention described above in this specification.
  • for example, the computer device may execute the text area locating process of steps S201 to S208 as shown in FIG. 2.
  • the program product may employ any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • the readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus or device, or a combination of any of the above. More specific examples (non-exhaustive list) of readable storage media include: electrical connections with one or more wires, portable disks, hard disks, random access memory (RAM), read only memory (ROM), erasable programmable read only memory (EPROM or flash memory), optical fiber, portable compact disk read only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination of the foregoing.
  • a readable signal medium may include a propagated data signal in baseband or as part of a carrier wave, carrying readable program code therein. Such propagated data signals may take a variety of forms including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a readable signal medium can also be any readable medium, other than a readable storage medium, that can transmit, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device.
  • the disclosed apparatus and method may be implemented in other manners.
  • the device embodiments described above are only illustrative.
  • the division of units is only a logical functional division, and in actual implementation there may be other ways of dividing them.
  • for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the coupling, direct coupling, or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between devices or units may be electrical, mechanical or in other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units; some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, each unit may exist separately as a unit, or two or more units may be integrated into one unit; the above integrated unit may be implemented either in the form of hardware or in the form of hardware plus software functional units.


Abstract

A method and apparatus for locating a text area, belonging to the field of computer technology and relating to artificial intelligence and computer vision, for improving the accuracy of locating text areas in pictures of merchant door heads. The method for locating a text area includes: obtaining the pixel value of each pixel in a target image (201); determining text pixels from all pixels of the target image according to the pixel values, and forming a plurality of text connected domains from the text pixels (202); for any two text connected domains, calculating a difference feature value between the two text connected domains according to the color value of each pixel in the text connected domains, and calculating an adjacency feature value between the two text connected domains according to the distance between them (203); merging the plurality of text connected domains according to the difference feature values and adjacency feature values (204); and determining a target text area in the target image according to the areas of the merged text connected domains (205).

Description

一种文本区域的定位方法及装置
相关申请的交叉引用
本申请要求在2020年08月14日提交中国专利局、申请号为202010817763.0、申请名称为“一种文本区域的定位方法及装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本发明涉及计算机技术领域,尤其涉及一种文本区域的定位方法及装置。
背景技术
门头,是指企业、事业单位和个体工商户在门口设置的牌匾及相关设施,是一个商铺店门外的装饰形式,是美化销售场所和装饰店铺、吸引顾客的一种手段。
商户的门头中一般包含有商户名称、商户地址等文字内容,在审核商户真实性时,需要巡检人员前往商铺的地址进行拍照,然后再由审核人员进行信息核对,效率低且易出错。目前,为了实现商户门头图片中自动识别文字,需要在街拍的商户门头图片中定位商户名称的文字位置。
现有的图像文字识别一般是对图像中的全部文字进行识别,不能对商户门头图片中的商户名称文字区域和其他文字区域进行有效区分,影响后续商户名称识别的准确性。
发明内容
本发明实施例提供了一种文本区域的定位方法及装置,用于提高对商户门头图片中文字区域定位的精确性。
一方面,本发明实施例提供了一种文本区域的定位方法,包括:
获取目标图像中各个像素点的像素值;
根据像素值,从所述目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域;
针对任意两个文本连通域,根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域之间的差异特征值,并根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值;
根据差异特征值和邻接特征值,将所述多个文本连通域进行合并;
根据合并后的文本连通域的面积,确定所述目标图像中的目标文本区域。
可选的,所述根据像素值,从所述目标图像的所有像素点中确定文本像素点,包括:
将所述目标图像输入已训练的像素分类模型中,通过像素分类模型中交替的卷积操作和池化操作得到所有像素点的像素特征提取结果;
根据所述像素分类模型学习到的历史图像中像素点的分类结果,确定所述目标图像中每个像素点的分类结果,所述像素点的分类结果为所述像素点为文本像素点或非文本像素点。
可选的,所述由文本像素点形成多个文本连通域,包括:
针对每一个文本像素点,确定所述文本像素点与所述文本像素点相邻的像素点之前的邻接关系;
根据邻接关系,连通文本像素点,形成多个文本连通域。
可选的,所述由文本像素点形成多个文本连通域之后,还包括:
确定每个文本连通域的最小外接矩形;
所述根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域之间的差异特征值,包括:
根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算两个最小外接矩形之间的差异特征值;
所述根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值,包括:
根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个最 小外接矩形之间的邻接特征值。
可选的,所述根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算两个最小外接矩形之间的差异特征值,包括:
针对每一个文本连通域的最小外接矩形,获取所述最小外接矩形中各个像素点的颜色值;计算所有像素点的颜色值的均值,作为所述最小外接矩形的颜色特征值;所述颜色特征值包括红色分量值、绿色分量值和蓝色分量值;
根据最小外接矩形的颜色特征值,计算所述两个最小外接矩形之间的多个颜色差异分量;
选取值最大的颜色差异分量作为所述两个最小外接矩形之间的差异特征值。
可选的,所述根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个最小外接矩形之间的邻接特征值,包括:
将两个最小外接矩形之间的重叠面积与所述两个最小外接矩形的面积之和相比,得到所述两个最小外接矩形之间的邻接特征值。
可选的,所述根据差异特征值和邻接特征值,将所述多个文本连通域进行合并,包括:
确定差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个最小外接矩形存在关联关系;
利用并查集算法,根据关联关系对所有最小外接矩形进行合并。
另一方面,本发明实施例还提供一种图像文字识别方法,所述方法包括:
确定目标图像中的目标文本区域,其中,所述目标图像中的目标文本区域是通过如上述文本区域的定位方法得到的;
将所述目标文本区域输入已训练的特征提取模型中,得到所述目标文本区域的目标特征向量,所述特征提取模型利用训练文本图像以及对应的文字信息进行训练;
将所述目标特征向量与标注样本的标注特征向量进行相似度对比,确定相似度最大的标注文本图像,所述标注样本包括标注文本图像、对应的标注 特征向量以及文字信息;
将所述相似度最大的标注图像的文字信息作为所述目标文本区域的文字信息。
另一方面,本发明实施例还提供一种文本区域的定位装置,所述装置包括:
获取单元,用于获取目标图像中各个像素点的像素值;
连通单元,用于根据像素值,从所述目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域;
计算单元,用于针对任意两个文本连通域,根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域之间的差异特征值,并根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值;
合并单元,用于根据差异特征值和邻接特征值,将所述多个文本连通域进行合并;
过滤单元,用于根据合并后的文本连通域的面积,确定所述目标图像中的目标文本区域。
可选的,所述连通单元,具体用于:
将所述目标图像输入已训练的像素分类模型中,通过像素分类模型中交替的卷积操作和池化操作得到所有像素点的像素特征提取结果;
根据所述像素分类模型学习到的历史图像中像素点的分类结果,确定所述目标图像中每个像素点的分类结果,所述像素点的分类结果为所述像素点为文本像素点或非文本像素点。
可选的,所述连通单元,具体用于:
针对每一个文本像素点,确定所述文本像素点与所述文本像素点相邻的像素点之前的邻接关系;
根据邻接关系,连通文本像素点,形成多个文本连通域。
可选的,所述计算单元,具体用于:
针对任一文本连通域,获取所述文本连通域中各个像素点的颜色值;计 算所有像素点的颜色值的均值,作为所述文本连通域的颜色特征值;所述颜色特征值包括红色分量值、绿色分量值和蓝色分量值;
根据文本连通域的颜色特征值,计算所述两个文本连通域之间的多个颜色差异分量;
选取值最大的颜色差异分量作为所述两个连通域之间的差异特征值。
可选的,所述计算单元,具体用于:
将所述两个文本连通域之间的距离与所述两个文本连通域的面积之和相比,得到所述两个文本连通域之间的邻接特征值;
可选的,所述合并单元,具体用于:
确定差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个文本连通域存在关联关系;
根据关联关系,利用并查集算法对所有文本连通域进行合并。
可选的,所述连通单元,还用于确定每个文本连通域的最小外接矩形;
所述计算单元,还用于根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算所述两个文本连通域之间的差异特征值;根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个文本连通域之间的邻接特征值。
另一方面,本发明实施例还提供一种图像文字识别装置,所述装置包括:
定位单元,所述定位单元包括如上述的文本区域的定位装置;
将所述目标文本区域输入特征提取模型中,得到所述目标文本区域的目标特征向量;
将所述目标特征向量与标注样本的标注特征向量相对比,确定相似度最大的标注图像,所述标注样本包括标注图像、对应的标注特征向量以及文字信息;
将所述相似度最大的标注图像的文字信息作为所述目标文本区域的文字信息。
另一方面,本发明实施例还提供一种计算机可读存储介质,所述计算机 可读存储介质内存储有计算机程序,所述计算机程序被处理器执行时,实现第一方面的文本区域的定位方法。
另一方面,本发明实施例还提供一种电子设备,包括存储器和处理器,所述存储器上存储有可在所述处理器上运行的计算机程序,当所述计算机程序被所述处理器执行时,使得所述处理器实现第一方面的文本区域的定位方法。
本发明实施例在对目标图像进行文本区域定位时,获取目标图像中各个像素点的像素值。根据像素值,从目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域。针对任意两个文本连通域,根据文本连通域中各个像素点的颜色值,计算这两个文本连通域之间的差异特征值,同时,根据两个文本连通域之间的距离,计算这两个文本连通域之间的邻接特征值。之后,根据差异特征值和邻接特征值,将多个文本连通域进行合并,并根据合并后的文本连通域的面积,确定目标图像中的目标文本区域。本发明实施例中,计算文本连通域之间的差异特征值和邻接特征值,根据这两个条件将多个文本连通域进行合并,从而将颜色相近且距离相近的文本连通域合并,这样,通过颜色和距离可将商户门头图片中名称的文字进行合并,形成目标文本区域。且由于商户门头图片中商户名称所占面积最大,因此商户名称对应的合并后的文本连通域的面积最大,可以根据面积对合并后的文本连通域进行筛选,从而确定出目标文本区域。本发明实施例可以对商户门头图片中文字区域与图片区域进行有效区分,且对不同文字区域进行有效区分,从而提高了目标文本区域定位的准确性,进一步保证后续商户名称识别的准确性。
附图说明
为了更清楚地说明本发明实施例中的技术方案,下面将对实施例描述中所需要使用的附图作简要介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例,对于本领域的普通技术人员来讲,在不付出创造性劳动性 的前提下,还可以根据这些附图获得其他的附图。
图1为本发明实施例提供的一种文本区域的定位方法的系统架构示意图;
图2为本发明实施例提供的一种文本区域的定位方法的流程图;
图3为本发明实施例提供的一种CNN像素分类模型的结构示意图;
图4为本发明实施例提供的另一种文本区域的定位方法的流程图;
图5为本发明实施例提供的一种文本区域的定位装置的结构示意图;
图6为本发明实施例提供的一种电子设备的结构示意图。
具体实施方式
为了使本发明的目的、技术方案和优点更加清楚,下面将结合附图对本发明作进一步地详细描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其它实施例,都属于本发明保护的范围。
下文中所用的词语“示例性”的意思为“用作例子、实施例或说明性”。作为“示例性”所说明的任何实施例不必解释为优于或好于其它实施例。
文中的术语“第一”、“第二”仅用于描述目的,而不能理解为明示或暗示相对重要性或者隐含指明所指示的技术特征的数量。由此,限定有“第一”、“第二”的特征可以明示或者隐含地包括一个或者更多个该特征,在本发明实施例的描述中,除非另有说明,“多个”的含义是两个或两个以上。此外,术语“包括”以及它们任何变形,意图在于覆盖不排他的保护。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
以下对本发明实施例中的部分用语进行解释说明,以便于本领域技术人员理解。
CNN:(Convolutional Neural Networks,卷积神经网络)是一类包含卷积 计算且具有深度结构的前馈神经网络(Feedforward Neural Networks),是深度学习(deep learning)的代表算法之一。卷积神经网络具有表征学习(representation learning)能力,能够按其阶层结构对输入信息进行平移不变分类(shift-invariant classification),因此也被称为“平移不变人工神经网络。
DBN:(Deep belief network,深度置信网络)神经网络的一种,包含全连接计算且具有深度结构的前馈神经网络,既可以用于非监督学习,类似于一个自编码机;也可以用于监督学习,作为分类器来使用。从非监督学习来讲,其目的是尽可能地保留原始特征的特点,同时降低特征的维度。从监督学习来讲,其目的在于使得分类错误率尽可能地小。而不论是监督学习还是非监督学习,DBN的本质都是如何得到更好的特征表达。
RNN:(Recurrent neural network,递归神经网络)包含循环链接结构且具有深度结构的前馈神经网络。是一类以序列(sequence)数据为输入,在序列的演进方向进行递归(recursion)且所有节点(循环单元)按链式连接的递归神经网络(recursive neural network)。递归神经网络具有记忆性、参数共享并且图灵完备(Turing completeness),因此在对序列的非线性特征进行学习时具有一定优势。递归神经网络在自然语言处理(Natural Language Processing,NLP),例如语音识别、语言建模、机器翻译等领域有应用,也被用于各类时间序列预报。引入了CNN构筑的递归神经网络可以处理包含序列输入的计算机视觉问题。
CRAFT:(Character Region Awareness For Text detection,文本检测中的字符区域识别)一种用于文本定位的深度网络结构,提出单字分割以及单字间分割的方法,更符合目标检测这一核心概念,不是把文本框当做目标,这样使用小感受野也能预测大文本和长文本,只需要关注字符级别的内容而不需要关注整个文本实例,还提出如何利用现有文本检测数据集合成数据得到真实数据的单字标注的弱监督方法。
CTPN:(Connectionist Text Proposal Network,基于链接关系的文本区域建议网络)一种用于文本定位的深度网络结构,CTPN结合CNN与LSTM深 度网络,能有效的检测出复杂场景的横向分布的文字,是目前效果比较好的文字检测算法。
PSEnet:(Progressive Scale Expansion Network,渐进式规模扩张网络),一种用于文本定位的深度网络结构,是一种新的实例分割网络,有两方面的优势。首先,PSEnet作为一种基于分割的方法,能够对任意形状的文本进行定位;其次,该模型提出了一种渐进的尺度扩展算法,该算法可以成功地识别相邻文本实例。
VGG:(Very Deep Convolutional Networks For Large-scale Image Recognition,面向大规模图像识别的深度卷积网络)包含卷积计算且具有深度结构的前馈神经网络,在VGG中,使用了3个3×3卷积核来代替7×7卷积核,使用了2个3×3卷积核来代替5×5卷积核,这样做的主要目的是在保证具有相同感知野的条件下,提升了网络的深度,在一定程度上提升了神经网络的效果。
最小外接矩形:是指以二维坐标表示的若干二维形状(例如点、直线、多边形)的最大范围,即以给定的二维形状各顶点中的最大横坐标、最小横坐标、最大纵坐标、最小纵坐标定下边界的矩形。这样的一个矩形包含给定的二维形状,且边与坐标轴平行。最小外接矩形是最小外接框(minimum bounding box)的二维形式。
像素点:是指在由一个数字序列表示的图像中的一个最小单位,也称为像素。像素是整个图像中不可分割的单位或者是元素。每一个点阵图像包含了一定量的像素,这些像素决定图像在屏幕上所呈现的大小。一张图片由好多的像素点组成。例如图片尺寸是500×338的,表示图片是由一个500×338的像素点矩阵构成的,这张图片的宽度是500个像素点的长度,高度是338个像素点的长度,共有500×338=149000个像素点。把鼠标放在一个图片上,这个时候会显示尺寸和大小,这里的尺寸就是像素。
颜色值:即RGB(Red Green Blue,红绿蓝)色彩模式,是工业界的一种颜色标准,是通过对红(R)、绿(G)、蓝(B)三个颜色通道的变化以及它们相互 之间的叠加来得到各式各样的颜色的,RGB即是代表红、绿、蓝三个通道的颜色,这个标准几乎包括了人类视力所能感知的所有颜色,是运用最广的颜色系统之一。电脑屏幕上的所有颜色,都由这红色绿色蓝色三种色光按照不同的比例混合而成的。一组红色绿色蓝色就是一个最小的显示单位。屏幕上的任何一个像素点的颜色都可以由一组RGB值来记录和表达。在电脑中,RGB的所谓“多少”就是指亮度,并使用整数来表示。通常情况下,RGB各有256级亮度,用数字表示为从0、1、2...直到255。按照计算,256级的RGB色彩总共能组合出约1678万种色彩,即256×256×256=16777216。
并查集:是一种用来管理元素分组情况的树型的数据结构,用于处理一些不相交集合(Disjoint Sets)的合并及查询问题。常常在使用中以森林来表示。并查集可以高效地进行如下操作:查询元素a和元素b是否属于同一组;合并元素a和元素b所在的组。
为了解决相关技术中的技术问题,本发明实施例提供了一种文本区域的定位方法及装置。本发明实施例提供的文本区域的定位方法可以应用于目标文本区域的定位场景、文本识别场景等。
下面对本申请实施例的技术方案能够适用的应用场景做一些简单介绍,需要说明的是,以下介绍的应用场景仅用于说明本申请实施例而非限定。在具体实施时,可以根据实际需要灵活地应用本申请实施例提供的技术方案。
为进一步说明本申请实施例提供的技术方案,下面结合附图以及具体实施方式对此进行详细的说明。虽然本申请实施例提供了如下述实施例或附图所示的方法操作步骤,但基于常规或者无需创造性的劳动在所述方法中可以包括更多或者更少的操作步骤。在逻辑上不存在必要因果关系的步骤中,这些步骤的执行顺序不限于本申请实施例提供的执行顺序。
本发明实施例提供的文本区域的定位方法的一种应用场景可以参见图1所示,该应用场景中包括终端设备101、服务器102和数据库103。
其中,终端设备101为具有拍照或摄像功能,可以安装各类客户端,并且能够将已安装的客户端的运行界面进行显示的电子设备,该电子设备可以 是移动的,也可以是固定的。例如,手机、平板电脑、笔记本电脑、台式电脑、各类可穿戴设备、智能电视、车载设备或其它能够实现上述功能的电子设备等。客户端可以是视频客户端或浏览器客户端等。各终端设备101通过通信网络与服务器102连接,该通信网络可以是有线网络或无线网络。服务器102可以是客户端对应的服务器,可以是一台服务器或由若干台服务器组成的服务器集群或云计算中心,或者是一个虚拟化平台。
其中,图1是以数据库103独立于所述服务器102存在进行说明的,在其他可能的实现方式中,数据库103也可以位于服务器102中。
服务器102与数据库103连接,数据库103中存储有历史图像、标注样本、训练文本图像等,服务器102接收终端设备101发送的待定位的目标图像,根据目标图像中各个像素点的像素值,确定文本像素点,并形成多个文本连通域,再计算任意两个文本连通域之间的差异特征值和邻接特征值,根据差异特征值和邻接特征值将多个文本连通域合并,并根据合并后的文本连通域的面积,确定目标图像中的目标文本区域,从而实现文本区域的定位。进一步地,服务器102还将确定出的目标文本区域输入已训练的特征提取模型中,得到目标特征向量,并将目标特征向量与标注样本的标注特征向量进行相似度对比,确定相似度最大的标注文本图像,将相似度最大的标注图像的文字信息作为目标文本区域的文字信息,从而实现图像中目标文本区域的文字识别。
需要说明的是,本发明提供的文本区域的定位方法可以应用于服务器102,由服务器执行本发明实施例提供的文本区域的定位方法;也可以应用于终端设备的客户端中,由终端设备101实施本发明提供的文本区域的定位方法,还可以由服务器102与终端设备101中的客户端配合完成。
图2示出了本发明一个实施例提供的文本区域的定位方法的流程图。如图2所示,该方法包括如下步骤:
步骤S201,获取目标图像中各个像素点的像素值。
其中,目标图像可以包括但不限于jpg、bmp、tif、gif、png等格式的图 像文件,目标图像也可以是截图。目标图像可以是终端设备实时拍摄后上传的图像,或者目标图像可以是从网络中获取的图像,或者,目标图像可以是本地存储的图像。
服务器获取目标图像后,确定目标图像中各个像素点的像素值。像素值是图像被数字化时由计算机赋予的值,它代表了一个像素点的平均亮度信息,或者说是该像素点的平均反射(透射)密度信息。本发明实施例中,像素点的像素值可以是RGB色彩模式的颜色值,也可以是HSV(Hue-Saturation-Value,色调-饱和度-明度)颜色模型的颜色值,还可以是像素点的灰度值。
本领域技术人员应能理解,上述几种场景和图像来源仅为举例,基于这些范例进行的适当变化也可适用于本发明,本发明实施例并不对目标图像的来源和场景进行限定。
步骤S202、根据像素值,从所述目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域。
具体实施过程中,目标图像中的像素点可以分为文本像素点和非文本像素点,根据像素点的像素值可以将目标图像中的所有像素点进行分类,确定每一个像素点是文本像素点还是非文本像素点。具体地,可以利用算法模型对像素点进行分类,将目标图像输入CNN网络中,对目标图像进行特征提取,输出的结果与像素点一一对应,例如,若像素点为文本像素点,则对该像素点标记为1,若像素点为非文本像素点,则对该像素点标记为0。
然后,根据像素点的分类,将所有文本像素点聚集在一起,相邻的文本像素点可以形成一个文本连通域,所有文本像素点可以形成一个或多个文本连通域。对于所有文本像素点形成一个文本连通域的情况,该文本连通域即为目标文本区域,无需后续的定位过程。对于所有文本像素点形成多个文本连通域的情况,需要从这多个文本连通域中确定出目标文本区域。
本发明实施例中对像素点进行分类的算法模型,可以是CNN网络,也可以是其它深度学习网络模型,这里仅为举例,不做限制。
步骤S203、针对任意两个文本连通域,根据文本连通域中各个像素点的 颜色值,计算所述两个文本连通域之间的差异特征值,并根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值。
其中，像素点的像素值可以是该像素点的RGB色彩模式的颜色值，具体可以用 $M_i=\{R_i,G_i,B_i\}$ 表示第i个像素点的颜色值，其中，$R_i$ 为该像素点的红色分量值，$G_i$ 为像素点的绿色分量值，$B_i$ 为像素点的蓝色分量值。
根据文本连通域中各个像素点的颜色值可以计算出该文本连通域的颜色值,两个文本连通域之间的差异特征值可以根据两个文本连通域的颜色值计算得出。差异特征值表征了两个文本连通域之间颜色的差异程度,文本连通域之间的差异特征值越大,表明两个文本连通域之间的颜色差异越大,文本连通域之间的差异特征值越小,表明两个文本连通域之间的颜色差异越小。
另一方面,还需要计算两个文本连通域之间的邻接特征值,这里的邻接特征值为根据两个文本连通域之间的距离计算得出,表征了两个文本连通域之间的距离,文本连通域之间的重叠面积越大,表明两个文本连通域之间的距离越近,文本连通域之间的重叠面积越小,表明两个文本连通域之间的距离越远。
步骤S204、根据差异特征值和邻接特征值,将所述多个文本连通域进行合并。
具体实施过程中,需要将颜色差异较小、相距较小的两个文本连通域进行合并。因此,针对任意两个文本连通域,根据两个文本连通域之间的差异特征值和邻接特征值,确定两个文本连通域是否合并。进而,多个文本连通域之间进行合并后,得到一个或多个合并后的文本连通域。
一般来说,一个合并后的文本连通域对应一个文本区域,例如商户门头图片中包括商户名称、商户地址、商户商标等,其中,商户名称的文本区域对应一个合并后的文本连通域,商户地址的文本区域对应一个合并后的文本连通域。由于商户门头图片中商户名称的面积最大,因此,可以根据合并后的文本连通域的面积,对合并后的文本连通域进行过滤,将过滤后留下的一个或两个合并后的文本连通域作为目标文本区域。
步骤S205、根据合并后的文本连通域的面积,确定所述目标图像中的目标文本区域。
本发明实施例在对目标图像进行文本区域定位时,获取目标图像中各个像素点的像素值。根据像素值,从目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域。针对任意两个文本连通域,根据文本连通域中各个像素点的颜色值,计算这两个文本连通域之间的差异特征值,同时,根据两个文本连通域之间的距离,计算这两个文本连通域之间的邻接特征值。之后,根据差异特征值和邻接特征值,将多个文本连通域进行合并,并根据合并后的文本连通域的面积,确定目标图像中的目标文本区域。本发明实施例中,计算文本连通域之间的差异特征值和邻接特征值,根据这两个条件将多个文本连通域进行合并,从而将颜色相近且距离相近的文本连通域合并,这样,通过颜色和距离可将商户门头图片中名称的文字进行合并,形成目标文本区域。且由于商户门头图片中商户名称所占面积最大,因此商户名称对应的合并后的文本连通域的面积最大,可以根据面积对合并后的文本连通域进行筛选,从而确定出目标文本区域。本发明实施例可以对商户门头图片中文字区域与图片区域进行有效区分,且对不同文字区域进行有效区分,从而提高了目标文本区域定位的准确性,进一步保证后续商户名称识别的准确性。
进一步地,上述步骤S202、根据像素值,从所述目标图像的所有像素点中确定文本像素点,包括:
将所述目标图像输入已训练的像素分类模型中,通过像素分类模型中交替的卷积操作和池化操作得到所有像素点的像素特征提取结果;
根据所述像素分类模型学习到的历史图像中像素点的分类结果,确定所述目标图像中每个像素点的分类结果,所述像素点的分类结果为所述像素点为文本像素点或非文本像素点。
具体实施过程中,像素分类模型可以为CNN网络模型,也可以为DBN网络模型,或者RNN网络模型等。本发明实施例CNN网络模型为例,介绍 如何目标图像中各个像素点的分类过程。
本发明实施例采用类Unet的CNN网络结构,对目标图像进行特征重构,即将目标图像中每一个像素点的像素值输入已训练的CNN网络模型中,特征提取结果与目标图像中的像素点一一对应。本发明实施例中的特征提取结果分为两类,即文本像素点或非文本像素点。具体实施过程中,可以将文本像素点设置为1,非文本像素点设置为0,即若经过CNN网络模型计算得出某像素点的分类结果为文本像素点,则将该像素点的分类结果设置为1,若经过CNN网络模型计算出该像素点的分类结果为非文本像素点,则将该像素点的分类结果设置为0。
可选的,本申请实施例中的CNN网络结构包括2n+1级卷积层、n级池化层和n级反卷积层,其中,第1至第n级卷积层中,每级卷积层之后设置有一级池化层,即前n级卷积层与n级池化层交替设置。可选的,每级卷积层用于进行至少一次卷积处理。相应的,目标图像经过n级卷积层和n即池化层处理后,即得到目标图像对应的特征图,其中,特征图的通道数等于目标图像的通道数,且特征图的尺寸等于目标图像的尺寸。
下面以CNN像素分类模型为7级卷积层、3级池化层和3级反卷积层构成的U型的网络结构为例进行说明。卷积层用于提取特征的层,分为卷积操作和激活操作两部分。其中,进行卷积操作时,使用预先经过训练学习得到的卷积核进行特征提取,进行激活操作时,使用激活函数对卷积得到的特征图进行激活处理,常用的激活函数包括线性整流(Rectified Linear Unit,ReLU)函数、S型(Sigmoid)函数和双曲正切(Tanh)函数等。
池化(pooling)层,位于卷积层之后,用于降低卷积层输出的特征向量,即缩小特征图的尺寸,同时改善过拟合问题。常用的池化方式包括平均池化(mean-pooling)、最大池化(max-pooling)和随机池化(stochastic-pooling)等。
反卷积层(deconvolution),用于对特征向量进行上采样的层,即用于增大特征图的尺寸。
如图3所示,首先通过第i级卷积层对第i-1特征图进行卷积以及激活处 理,并将处理后的第i-1特征图输入第i级池化层,2≤i≤n。对于第一级卷积层,其输入为目标图像;而对于第i级卷积层,其输入则为第i-1级池化层输出的特征图。可选的,第一级卷积层获取到目标图像后,通过预设卷积核对目标图像进行卷积操作,再通过预设激活函数进行激活操作;第i级卷积层获取第i-1池化层输出的第i-1特征图后,通过预设卷积核对第i-1特征图进行卷积操作,再通过预设激活函数进行激活操作,从而起到提取特征的作用,其中,进行卷积处理后,特征图的通道数增加。如图3所示,第一级卷积层对目标图像进行两次卷积处理;第二级卷积层对第一池化层输出的第一特征图进行两次卷积处理,第三级卷积层对第二池化层输出的第二特征图进行两次卷积处理,第四级卷积层对第三池化层输出的第三特征图进行两次卷积处理。其中,多通道特征图的高度用于表示尺寸,而宽度则用于表示通道数。
其次,通过第i级池化层对处理后的第i-1特征图进行池化处理,得到第i特征图。第i级卷积层完成卷积处理后,将处理后的第i-1特征图输入第i-1级池化层,由第i-1级池化层进行池化处理,从而输出第i特征图。其中,各级池化层用于缩小特征图的尺寸,并保留特征图中的重要信息。可选的,各级池化层对输入的特征图进行最大池化处理。示意性的,如图3所示,第一级池化层对第一级卷积层输出特征图进行处理,得到第一特征图,第二级池化层对第二级卷积层输出特征图进行处理,得到第二特征图,第三级池化层对第三级卷积层输出特征图进行处理,得到第三特征图。
最后,将第i特征图输入第i+1级卷积层。完成池化处理后,第i级池化层将第i特征图输入下一级卷积层,由下一级卷积层进一步进行特征提取。如图3所示,目标图像依次经过第一级卷积层、第一级池化层、第二级卷积层和第二级池化层、第三卷积层以及第三池化层后,由第三级池化层将第三特征图输入第四级卷积层。上述实施例仅以进行三次卷积、池化操作为例进行说明,在其他可能的实施方式中,CNN网络结构可以进行多次卷积、池化操作,本实施例并不对此构成限定。
在进行了交替的卷积层和池化层的处理操作后,还需要通过反卷积层得 到分类结果图,通过第n+1至第2n+1级卷积层和n级反卷积层,对中间特征图进行卷积以及反卷积处理,得到分类结果图。其中,分类结果图的尺寸等于目标图像的尺寸。
在一种可能的实施方式中,通过第n+1至第2n+1级卷积层和n级反卷积层进行处理时包括如下步骤:
首先,通过第j级反卷积层对第j+n级卷积层输出的特征图进行反卷积处理,1≤j≤n。示意性的,如图3所示,通过第一级反卷积层对第四级卷积层输出的特征图进行反卷积处理;通过第二级反卷积层对第五级卷积层输出的特征图进行反卷积处理;通过第三级反卷积层对第六级卷积层输出的特征图进行反卷积处理。其中,反卷积处理作为卷积处理的逆过程,用于对特征图进行上采样,从而缩小特征图的尺寸。如图3所示,经过反卷积层处理后,特征图的尺寸减小。
其次,对反卷积处理后的特征图与第n-j+1级卷积层输出的特征图进行拼接,并将拼接后的特征图输入第j+n+1级卷积层,反卷积处理后的特征图与第n-j+1级卷积层输出的特征图的尺寸相同。示意性的,如图3所示,将第三级卷积层输出的特征图以及第一级反卷积层输出的特征图拼接,作为第五级卷积层的输入;将第二级卷积层输出的特征图以及第二级反卷积层输出的特征图拼接,作为第六级卷积层的输入,将第一级卷积层输出的特征图以及第三级反卷积层输出的特征图拼接,作为第七级卷积层的输入。
最后,通过第j+n+1级卷积层对拼接后的特征图进行卷积处理,最终输出与目标图像尺寸一致的分类结果图。
在确定了CNN网络结构和处理过程后,就可以通过历史图像的分类结果训练CNN网络结构,然后根据训练完成的CNN网络结构提取出分类结果。
将每一个像素点分类后,可根据分类结果,将文本像素点形成文本连通域。其中,由文本像素点形成多个文本连通域,包括:
针对每一个文本像素点,确定所述文本像素点与所述文本像素点相邻的像素点之前的邻接关系;
根据邻接关系,连通文本像素点,形成多个文本连通域。
具体实施过程中,通过像素分类模型得到每一个像素点的分类结果,根据分类结果可以得出每个像素点与相邻像素点之间的邻接关系,其中,除了目标图像四边上的像素点,目标图像内部的每个像素点存在8个相邻的像素点,即上、下、左、右,右上、右下、左上、左下8个像素点。针对每一个文本像素点,可以对该文本像素点与任一个相邻像素点之间的关系进行标记,例如,若相邻像素点也为文本像素点,标记为1,若相邻像素点为非文本像素点,标记为0,则每一个文本像素点对应8个邻接关系。
进而，根据邻接关系，可以将相邻的文本像素点连通，形成文本连通域，其中，一个文本连通域可以用一个集合CC标记，则 $CC=\{C_1,C_2,\ldots,C_n\}$，$C_n$ 为文本连通域集合CC中的第n个文本像素点。
进一步地,为了便于计算,本发明实施例中,针对每个文本连通域,确定每个文本连通域的最小外接矩形。
由于文本连通域的形状不确定,不同形状不便于后续计算,因此,为了减少计算难度,本发明实施例对每个文本连通域均确定最小外接矩形。最小外接矩形即为在给出一个多边形(或一群点),求出面积最小且外接多边形的矩形。
以直角坐标系为例,其求解方法如下:
(1)先确定文本连通域的简单外接矩形。简单外接矩形是指边平行于x轴或y轴的外接矩形。简单外接矩形很有可能不是最小外接矩形,却是非常容易求得的外接矩形。
(2)将文本连通域在平面上绕某一固定点旋转某一角度。数学基础是，设平面上点 $(x_1,y_1)$ 绕另一点 $(x_0,y_0)$ 逆时针旋转A角度后的点为 $(x_2,y_2)$，则有：

$x_2=(x_1-x_0)\cos A-(y_1-y_0)\sin A+x_0$ ……公式1

$y_2=(x_1-x_0)\sin A+(y_1-y_0)\cos A+y_0$ ……公式2
顺时针时,A改写成-A即可。
(3)旋转文本连通域(循环,0-90°,间距设为1°),求旋转每个度数后的文本连通域的简单外接矩形,记录简单外接矩形的面积、顶点坐标以及此时旋转的度数。
(4)比较在旋转过程中文本连通域求得的所有简单外接矩形,得到面积最小的简单外接矩形,获取该简单外接矩形的顶点坐标和旋转的角度。
(5)旋转外接矩形。将上一步获得面积最小的简单外接矩形反方向(与第3步方向相反)旋转相同的角度,即得最小外接矩形。
得到文本连通域的最小外接矩形后,后续步骤均可利用对应的最小外接矩形代替文本连通域进行计算。
所述根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域之间的差异特征值,包括:
根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算两个最小外接矩形之间的差异特征值。
具体实施过程中,计算两个文本连通域之间的差异特征值即计算这两个文本连通域对应的最小外接矩形的差异特征值,包括:
针对每一个文本连通域的最小外接矩形,获取所述最小外接矩形中各个像素点的颜色值;计算所有像素点的颜色值的均值,作为所述最小外接矩形的颜色特征值;所述颜色特征值包括红色分量值、绿色分量值和蓝色分量值;
根据最小外接矩形的颜色特征值,计算所述两个最小外接矩形之间的多个颜色差异分量;
选取值最大的颜色差异分量作为所述两个最小外接矩形之间的差异特征值。
具体来说,本发明实施例中像素点的颜色值可以是RGB色彩模式的颜色值,也可以是HSV颜色模型的颜色值,这里以RGB色彩模式的颜色值为例进行介绍。针对一个文本连通域对应的最小外接矩形,获取该最小外接矩形中各个像素点的RGB值,RGB值中包括该像素点的红色分量、绿色分量、蓝色分量,可以用M i={R i,G i,B i}表示。
根据所有像素点的RGB值计算该最小外接矩形的颜色特征值，最小外接矩形的颜色特征值包括最小外接矩形的红色特征值、绿色特征值、蓝色特征值，其中，最小外接矩形的红色特征值等于该最小外接矩形中所有像素点的红色分量的均值，最小外接矩形的绿色特征值等于该最小外接矩形中所有像素点的绿色分量的均值，最小外接矩形的蓝色特征值等于该最小外接矩形中所有像素点的蓝色分量的均值。最小外接矩形C的颜色特征值用 $M_C=\{R_C,G_C,B_C\}$ 表示，则：

$$R_C=\frac{1}{n}\sum_{i=1}^{n}R_i,\qquad G_C=\frac{1}{n}\sum_{i=1}^{n}G_i,\qquad B_C=\frac{1}{n}\sum_{i=1}^{n}B_i$$

其中，n为该最小外接矩形中像素点的个数，$R_C$ 为最小外接矩形的红色特征值，$G_C$ 为最小外接矩形的绿色特征值，$B_C$ 为最小外接矩形的蓝色特征值。
之后,根据颜色特征值,计算两个最小外接矩形的颜色差异分量。一种具体的实施例中,颜色差异分量可以包括亮度差异、色调差异值、色彩浓度差异值。即根据两个最小外接矩形的颜色特征值,计算得出这两个最小外接矩形的亮度差异、色调差异值和色彩浓度差异值。再从中选取值最大的颜色差异分量作为这两个最小外接矩形的差异特征值。
另一方面,利用文本连通域的最小外接矩形计算两个文本连通域之间的邻接特征值。根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值,包括:
根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个最小外接矩形之间的邻接特征值。
具体地,根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个最小外接矩形之间的邻接特征值,包括:
将两个最小外接矩形之间的重叠面积与所述两个最小外接矩形的面积之和相比,得到所述两个最小外接矩形之间的邻接特征值。
具体实施过程中,最小外接矩形的面积可以用最小外接矩形中包含的像 素点的个数表示。例如最小外接矩形a包含100个像素点,则最小外接矩形a的面积为100,最小外接矩形b包含80个像素点,则最小外接矩形b的面积为80。最小外接矩形a和最小外接矩形b中包含20个相同的像素点,则将最小外接矩形a和最小外接矩形b的重叠面积标记为20。则两个最小外接矩形之间的邻接特征值等于最小外接矩形之间的重叠面积与最小外接矩形的面积之和的比值,即邻接特征值等于20与100加80之和的比值,等于1/9。
计算得到文本连通域之间的差异特征值和邻接特征值之后,可以根据差异特征值和邻接特征值确定不同文本连通域之间是否合并。
所述根据差异特征值和邻接特征值,将所述多个文本连通域进行合并,包括:
确定差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个最小外接矩形存在关联关系;
利用并查集算法,根据关联关系对所有最小外接矩形进行合并。
具体实施过程中,将差异特征值与颜色阈值相对比,例如,颜色阈值可以设置为21,若差异特征值小于颜色阈值,则认为最小外接矩形之间的颜色相近,可以合并;若差异特征值大于或等于颜色阈值,则认为最小外接矩形之间的颜色差异较大,不进行合并。对于邻接特征值,将邻接特征值与面积阈值相对比,若邻接特征值大于面积阈值,则认为最小外接矩形之间的距离较近,可以合并;若邻接特征值小于或等于面积阈值,则认为最小外接矩形之间的距离较远,不进行合并。本发明实施例中,认为差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个最小外接矩形存在关联关系,可以进行合并。
将互相存在关联关系的最小外接矩形进行合并,具体可以利用并查集算法,确定需要合并的所有最小外接矩形。
最小外接矩形合并之后,可以根据合并后的最小外接矩形的面积,确定目标文本区域。具体来说,由于商户门头图片中的商户名称一般为面积最大的区域,因此,可以根据面积对目标图像进行噪声过滤,将合并后面积最大 的最小外接矩形作为目标图像中的目标文本区域。
进一步地,一种可选的实施例中,本发明实施例确定目标图像中的目标文本区域之后,可以对目标文本区域中的文本识别,如图4所示,上述步骤S205、根据合并后的文本连通域的面积,确定目标图像中的目标文本区域之后,还包括:
步骤S206、将所述目标文本区域输入已训练的特征提取模型中,得到所述目标文本区域的目标特征向量。其中,特征提取模型利用训练文本图像以及对应的文字信息进行训练。
具体地,特征提取模型可以为深度学习网络模型,如CTPN、PSEnet等模型,本发明实施例中以特征提取模型为VGG网络为例。这里的VGG网络利用标注的商户门头图片以及对应的商户名称的文字信息进行训练。通过VGG网络得到目标文本区域的目标特征向量,该目标特征向量可以是一个1×1024的向量。
步骤S207、将所述目标特征向量与标注样本的标注特征向量进行相似度对比,确定相似度最大的标注文本图像,所述标注样本包括标注文本图像、对应的标注特征向量以及文字信息。
具体实施过程中,数据库中存储有大量的标注样本,标注样本包括标注文本图像、标注特征向量以及对应的文字信息。将上述得到的目标特征向量与数据库中的标注特征向量进行相似度对比,选取相似度最大的标注特征向量对应的标注文本图像。
这里的相似度计算可以利用余弦相似度公式进行计算。具体的相似度可以根据以下公式计算:
$$\mathrm{similarity}=\cos\theta=\frac{A\cdot B}{\|A\|\,\|B\|}=\frac{\sum_{i=1}^{n}A_iB_i}{\sqrt{\sum_{i=1}^{n}A_i^{2}}\;\sqrt{\sum_{i=1}^{n}B_i^{2}}}$$
其中,A为目标特征向量,B为标注特征向量,两者均为一维特征向量。
步骤S208、将所述相似度最大的标注图像的文字信息作为所述目标文本区域的文字信息。
最后,选取与目标特征向量相似度最大的标注特征向量,将该标注特征向量的文字信息作为目标特征向量的文字信息,即目标文本区域的文字信息。
本发明实施例在商户门头图片的文本识别过程中,通过预先提取出目标文本区域,缩小了输入特征提取模型的图像大小,能够降低拍摄角度、噪声对图像检索效果的影响,同时避免了复杂背景对文字识别性能的影响,提升文字识别准确率。
以下通过具体实例说明本发明实施例提供的文本区域的定位方法以及文本识别的实现过程。
首先接收目标图像,确定目标图像中各个像素点的像素值。将各个像素点的像素值输入像素分类模型中,像素分类模型采用类Unet的卷积神经网络。通过像素分类模型中交替的卷积操作和池化操作得到所有像素点的像素特征提取结果。
根据像素分类模型学习到的历史图像中像素点的分类结果,确定目标图像中每个像素点的分类结果,其中,像素点的分类结果为所述像素点为文本像素点或非文本像素点。
针对每一个文本像素点,确定该文本像素点与相邻的像素点之前的邻接关系。邻接关系包括上、下、左、右、右上、右下、左上、左下。根据邻接关系连通文本像素点,形成多个文本连通域,并确定每个文本连通域的最小外接矩形。
接下来,计算文本连通域之间的差异特征值以及邻接特征值。
根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算两个最小外接矩形之间的差异特征值。具体的,获取最小外接矩形中各个像素点的颜色值,其中,颜色特征值包括红色分量值、绿色分量值和蓝色分量值。计算所有像素点的颜色值的均值,作为最小外接矩形的颜色特征值。根据最小外接矩形的颜色特征值,计算两个最小外接矩形之间的多个颜色差异分量, 选取值最大的颜色差异分量作为两个最小外接矩形之间的差异特征值。
将两个最小外接矩形之间的重叠面积与所述两个最小外接矩形的面积之和相比,得到两个最小外接矩形之间的邻接特征值。
确定差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个最小外接矩形存在关联关系。利用并查集算法,根据关联关系对所有最小外接矩形进行合并。将合并后面积最大的文本连通域作为目标图像中的目标文本区域。
将目标文本区域输入已训练的特征提取模型中,得到所述目标文本区域的目标特征向量。
将目标特征向量与标注样本的标注特征向量进行相似度对比,确定相似度最大的标注文本图像。其中,标注样本包括标注文本图像、对应的标注特征向量以及文字信息。
将所述相似度最大的标注图像的文字信息作为目标文本区域的文字信息。
下述为本发明装置实施例,对于装置实施例中未详尽描述的细节,可以参考上述一一对应的方法实施例。
请参考图5,其示出了本发明一个实施例提供的文本区域的定位装置的结构方框图。该装置包括:获取单元501、连通单元502、计算单元503、合并单元504、过滤单元505。
其中,获取单元501,用于获取目标图像中各个像素点的像素值;
连通单元502,用于根据像素值,从所述目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域;
计算单元503,用于针对任意两个文本连通域,根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域之间的差异特征值,并根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值;
合并单元504,用于根据差异特征值和邻接特征值,将所述多个文本连通域进行合并;
过滤单元505,用于根据合并后的文本连通域的面积,确定所述目标图像 中的目标文本区域。
一种可选的实施例中,所述连通单元502,具体用于:
将所述目标图像输入已训练的像素分类模型中,通过像素分类模型中交替的卷积操作和池化操作得到所有像素点的像素特征提取结果;
根据所述像素分类模型学习到的历史图像中像素点的分类结果,确定所述目标图像中每个像素点的分类结果,所述像素点的分类结果为所述像素点为文本像素点或非文本像素点。
一种可选的实施例中,所述连通单元502,具体用于:
针对每一个文本像素点,确定所述文本像素点与所述文本像素点相邻的像素点之前的邻接关系;
根据邻接关系,连通文本像素点,形成多个文本连通域。
一种可选的实施例中,所述计算单元503,具体用于:
针对任一文本连通域,获取所述文本连通域中各个像素点的颜色值;计算所有像素点的颜色值的均值,作为所述文本连通域的颜色特征值;所述颜色特征值包括红色分量值、绿色分量值和蓝色分量值;
根据文本连通域的颜色特征值,计算所述两个文本连通域之间的多个颜色差异分量;
选取值最大的颜色差异分量作为所述两个连通域之间的差异特征值。
一种可选的实施例中,所述计算单元503,具体用于:
将所述两个文本连通域之间的距离与所述两个文本连通域的面积之和相比,得到所述两个文本连通域之间的邻接特征值;
一种可选的实施例中,所述合并单元504,具体用于:
确定差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个文本连通域存在关联关系;
根据关联关系,利用并查集算法对所有文本连通域进行合并。
一种可选的实施例中,所述连通单元502,还用于确定每个文本连通域的最小外接矩形;
所述计算单元,还用于根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算所述两个文本连通域之间的差异特征值;根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个文本连通域之间的邻接特征值。
与上述方法实施例相对应地,本发明实施例还提供了一种电子设备。该电子设备可以是服务器,如图1中所示的服务器102,该电子设备至少包括用于存储数据的存储器和用于数据处理的处理器。其中,对于用于数据处理的处理器而言,在执行处理时,可以采用微处理器、CPU、GPU(Graphics Processing Unit,图形处理单元)、DSP或FPGA实现。对于存储器来说,存储器中存储有操作指令,该操作指令可以为计算机可执行代码,通过该操作指令来实现上述本发明实施例的视频筛选方法的流程中的各个步骤。
图6为本发明实施例提供的一种电子设备的结构示意图;如图6所示,本发明实施例中该电子设备60包括:处理器61、显示器62、存储器63、输入设备66、总线65和通讯设备64;该处理器61、存储器63、输入设备66、显示器62和通讯设备64均通过总线65连接,该总线65用于该处理器61、存储器63、显示器62、通讯设备64和输入设备66之间传输数据。
其中,存储器63可用于存储软件程序以及模块,如本发明实施例中的文本区域的定位方法对应的程序指令/模块,处理器61通过运行存储在存储器63中的软件程序以及模块,从而执行电子设备60的各种功能应用以及数据处理,如本发明实施例提供的文本区域的定位方法。存储器63可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个应用的应用程序等;存储数据区可存储根据电子设备60的使用所创建的数据(比如动画片段、控制策略网络)等。此外,存储器63可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
处理器61是电子设备60的控制中心,利用总线65以及各种接口和线路连接整个电子设备60的各个部分,通过运行或执行存储在存储器63内的软 件程序和/或模块,以及调用存储在存储器63内的数据,执行电子设备60的各种功能和处理数据。可选的,处理器61可包括一个或多个处理单元,如CPU、GPU(Graphics Processing Unit,图形处理单元)、数字处理单元等。
本发明实施例中,处理器61将确定的目标文本区域以及文字信息通过显示器62展示给用户。
处理器61还可以通过通讯设备64连接网络,如果电子设备是服务器,则处理器61可以通过通讯设备64与终端设备之间传输数据。
该输入设备66主要用于获得用户的输入操作,当该电子设备不同时,该输入设备66也可能不同。例如,当该电子设备为计算机时,该输入设备66可以为鼠标、键盘等输入设备;当该电子设备为智能手机、平板电脑等便携设备时,该输入设备66可以为触控屏。
本发明实施例还提供了一种计算机存储介质,该计算机存储介质中存储有计算机可执行指令,该计算机可执行指令用于实现本发明任一实施例的文本区域的定位方法。
在一些可能的实施方式中,本发明提供的文本区域的定位方法的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当程序产品在计算机设备上运行时,程序代码用于使计算机设备执行本说明书上述描述的根据本发明各种示例性实施方式的文本区域的定位方法的步骤,例如,计算机设备可以执行如图2所示的步骤S201至S208中的文本区域的定位流程。
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以是——但不限于——电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括——但不限于——电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
在本发明所提供的几个实施例中,应该理解到,所揭露的设备和方法,可以通过其它的方式实现。以上所描述的设备实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,如:多个单元或组件可以结合,或可以集成到另一个系统,或一些特征可以忽略,或不执行。另外,所显示或讨论的各组成部分相互之间的耦合、或直接耦合、或通信连接可以是通过一些接口,设备或单元的间接耦合或通信连接,可以是电性的、机械的或其它形式的。
上述作为分离部件说明的单元可以是、或也可以不是物理上分开的,作为单元显示的部件可以是、或也可以不是物理单元,即可以位于一个地方,也可以分布到多个网络单元上;可以根据实际的需要选择其中的部分或全部单元来实现本实施例方案的目的。
另外,在本发明各实施例中的各功能单元可以全部集成在一个处理单元中,也可以是各单元分别单独作为一个单元,也可以两个或两个以上单元集成在一个单元中;上述集成的单元既可以采用硬件的形式实现,也可以采用硬件加软件功能单元的形式实现。
以上所述,仅为本发明的具体实施方式,但本发明的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本发明揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本发明的保护范围之内。

Claims (18)

  1. 一种文本区域的定位方法,其特征在于,所述方法包括:
    获取目标图像中各个像素点的像素值;
    根据像素值,从所述目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域;
    针对任意两个文本连通域,根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域之间的差异特征值,并根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值;
    根据差异特征值和邻接特征值,将所述多个文本连通域进行合并;
    根据合并后的文本连通域的面积,确定所述目标图像中的目标文本区域。
  2. 根据权利要求1所述的方法,其特征在于,所述根据像素值,从所述目标图像的所有像素点中确定文本像素点,包括:
    将所述目标图像输入已训练的像素分类模型中,通过像素分类模型中交替的卷积操作和池化操作得到所有像素点的像素特征提取结果;
    根据所述像素分类模型学习到的历史图像中像素点的分类结果,确定所述目标图像中每个像素点的分类结果,所述像素点的分类结果为所述像素点为文本像素点或非文本像素点。
  3. 根据权利要求1所述的方法,其特征在于,所述由文本像素点形成多个文本连通域,包括:
    针对每一个文本像素点,确定所述文本像素点与所述文本像素点相邻的像素点之前的邻接关系;
    根据邻接关系,连通文本像素点,形成多个文本连通域。
  4. 根据权利要求1至3任一项所述的方法,其特征在于,所述由文本像素点形成多个文本连通域之后,还包括:
    确定每个文本连通域的最小外接矩形;
    所述根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域 之间的差异特征值,包括:
    根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算两个最小外接矩形之间的差异特征值;
    所述根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值,包括:
    根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个最小外接矩形之间的邻接特征值。
  5. 根据权利要求4所述的方法,其特征在于,所述根据每个文本连通域对应的最小外接矩形中各个像素的颜色值,计算两个最小外接矩形之间的差异特征值,包括:
    针对每一个文本连通域的最小外接矩形,获取所述最小外接矩形中各个像素点的颜色值;计算所有像素点的颜色值的均值,作为所述最小外接矩形的颜色特征值;所述颜色特征值包括红色分量值、绿色分量值和蓝色分量值;
    根据最小外接矩形的颜色特征值,计算所述两个最小外接矩形之间的多个颜色差异分量;
    选取值最大的颜色差异分量作为所述两个最小外接矩形之间的差异特征值。
  6. 根据权利要求4所述的方法,其特征在于,所述根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个最小外接矩形之间的邻接特征值,包括:
    将两个最小外接矩形之间的重叠面积与所述两个最小外接矩形的面积之和相比,得到所述两个最小外接矩形之间的邻接特征值。
  7. 根据权利要求5或6所述的方法,其特征在于,所述根据差异特征值和邻接特征值,将所述多个文本连通域进行合并,包括:
    确定差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个最小外接矩形存在关联关系;
    利用并查集算法,根据关联关系对所有最小外接矩形进行合并。
  8. 一种图像文字识别方法,其特征在于,所述方法包括:
    确定目标图像中的目标文本区域,其中,所述目标图像中的目标文本区域是通过如权利要求1-7中任一项所述的方法得到的;
    将所述目标文本区域输入已训练的特征提取模型中,得到所述目标文本区域的目标特征向量,所述特征提取模型利用训练文本图像以及对应的文字信息进行训练;
    将所述目标特征向量与标注样本的标注特征向量进行相似度对比,确定相似度最大的标注文本图像,所述标注样本包括标注文本图像、对应的标注特征向量以及文字信息;
    将所述相似度最大的标注图像的文字信息作为所述目标文本区域的文字信息。
  9. 一种文本区域的定位装置,其特征在于,所述装置包括:
    获取单元,用于获取目标图像中各个像素点的像素值;
    连通单元,用于根据像素值,从所述目标图像的所有像素点中确定文本像素点,并由文本像素点形成多个文本连通域;
    计算单元,用于针对任意两个文本连通域,根据文本连通域中各个像素点的颜色值,计算所述两个文本连通域之间的差异特征值,并根据所述两个文本连通域之间的距离,计算所述两个文本连通域之间的邻接特征值;
    合并单元,用于根据差异特征值和邻接特征值,将所述多个文本连通域进行合并;
    过滤单元,用于根据合并后的文本连通域的面积,确定所述目标图像中的目标文本区域。
  10. 根据权利要求9所述的装置,其特征在于,所述连通单元,具体用于:
    将所述目标图像输入已训练的像素分类模型中,通过像素分类模型中交替的卷积操作和池化操作得到所有像素点的像素特征提取结果;
    根据所述像素分类模型学习到的历史图像中像素点的分类结果,确定所 述目标图像中每个像素点的分类结果,所述像素点的分类结果为所述像素点为文本像素点或非文本像素点。
  11. 根据权利要求9所述的装置,其特征在于,所述连通单元,具体用于:
    针对每一个文本像素点,确定所述文本像素点与所述文本像素点相邻的像素点之前的邻接关系;
    根据邻接关系,连通文本像素点,形成多个文本连通域。
  12. 根据权利要求9所述的装置,其特征在于,所述计算单元,具体用于:
    针对任一文本连通域,获取所述文本连通域中各个像素点的颜色值;计算所有像素点的颜色值的均值,作为所述文本连通域的颜色特征值;所述颜色特征值包括红色分量值、绿色分量值和蓝色分量值;
    根据文本连通域的颜色特征值,计算所述两个文本连通域之间的多个颜色差异分量;
    选取值最大的颜色差异分量作为所述两个连通域之间的差异特征值。
  13. 根据权利要求9所述的装置,其特征在于,所述计算单元,具体用于:
    将所述两个文本连通域之间的距离与所述两个文本连通域的面积之和相比,得到所述两个文本连通域之间的邻接特征值。
  14. 根据权利要求12或13所述的装置,其特征在于,所述合并单元,具体用于:
    确定差异特征值小于颜色阈值,并且邻接特征值大于面积阈值的两个文本连通域存在关联关系;
    根据关联关系,利用并查集算法对所有文本连通域进行合并。
  15. 根据权利要求9至13任一项所述的装置,其特征在于,所述连通单元,还用于确定每个文本连通域的最小外接矩形;
    所述计算单元,还用于根据每个文本连通域对应的最小外接矩形中各个 像素的颜色值,计算所述两个文本连通域之间的差异特征值;根据两个文本连通域的最小外接矩形之间的重叠面积,计算所述两个文本连通域之间的邻接特征值。
  16. 一种图像文字识别装置,其特征在于,所述装置包括:
    定位单元,所述定位单元包括如权利要求9-15所述的文本区域的定位装置;
    将所述目标文本区域输入特征提取模型中,得到所述目标文本区域的目标特征向量;
    将所述目标特征向量与标注样本的标注特征向量相对比,确定相似度最大的标注图像,所述标注样本包括标注图像、对应的标注特征向量以及文字信息;
    将所述相似度最大的标注图像的文字信息作为所述目标文本区域的文字信息。
  17. 一种计算机可读存储介质,所述计算机可读存储介质内存储有计算机程序,其特征在于:所述计算机程序被处理器执行时,实现权利要求1~7任一项所述的方法。
  18. 一种电子设备,其特征在于,包括存储器和处理器,所述存储器上存储有可在所述处理器上运行的计算机程序,当所述计算机程序被所述处理器执行时,使得所述处理器实现权利要求1~7任一项所述的方法。
PCT/CN2021/093660 2020-08-14 2021-05-13 一种文本区域的定位方法及装置 WO2022033095A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010817763.0A CN112016546A (zh) 2020-08-14 2020-08-14 一种文本区域的定位方法及装置
CN202010817763.0 2020-08-14

Publications (1)

Publication Number Publication Date
WO2022033095A1 true WO2022033095A1 (zh) 2022-02-17

Family

ID=73504461

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/093660 WO2022033095A1 (zh) 2020-08-14 2021-05-13 一种文本区域的定位方法及装置

Country Status (3)

Country Link
CN (1) CN112016546A (zh)
TW (1) TWI821671B (zh)
WO (1) WO2022033095A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049649A (zh) * 2022-08-12 2022-09-13 山东振鹏建筑钢品科技有限公司 基于锈蚀程度的钢筋打磨除锈控制方法
CN115995080A (zh) * 2023-03-22 2023-04-21 曲阜市检验检测中心 基于ocr识别的档案智能管理系统
CN116453030A (zh) * 2023-04-07 2023-07-18 郑州工程技术学院 一种基于计算机视觉的建筑材料回收方法

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112016546A (zh) * 2020-08-14 2020-12-01 中国银联股份有限公司 一种文本区域的定位方法及装置
CN112528827B (zh) * 2020-12-03 2023-04-07 和远智能科技股份有限公司 一种高铁接触网供电设备裂损自动检测方法
CN112766073B (zh) * 2020-12-31 2022-06-10 贝壳找房(北京)科技有限公司 表格提取方法、装置、电子设备及可读存储介质
CN112801030B (zh) * 2021-02-10 2023-09-01 中国银联股份有限公司 一种目标文本区域的定位方法及装置
CN113780098B (zh) * 2021-08-17 2024-02-06 北京百度网讯科技有限公司 文字识别方法、装置、电子设备以及存储介质
CN116993133B (zh) * 2023-09-27 2024-01-26 尚云(广州)信息科技有限公司 一种基于人脸识别的智能工单系统
CN117593527A (zh) * 2024-01-18 2024-02-23 厦门大学 一种基于链式感知的指向性3d实例分割方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090003700A1 (en) * 2007-06-27 2009-01-01 Jing Xiao Precise Identification of Text Pixels from Scanned Document Images
CN103093228A (zh) * 2013-01-17 2013-05-08 上海交通大学 一种在自然场景图像中基于连通域的中文检测方法
CN106529380A (zh) * 2015-09-15 2017-03-22 阿里巴巴集团控股有限公司 图像的识别方法及装置
CN107784301A (zh) * 2016-08-31 2018-03-09 百度在线网络技术(北京)有限公司 用于识别图像中文字区域的方法和装置
CN112016546A (zh) * 2020-08-14 2020-12-01 中国银联股份有限公司 一种文本区域的定位方法及装置

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB0814468D0 (en) * 2008-08-07 2008-09-10 Rugg Gordon Methdo of and apparatus for analysing data files
TW201039149A (en) * 2009-04-17 2010-11-01 Yu-Chieh Wu Robust algorithms for video text information extraction and question-answer retrieval

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090003700A1 (en) * 2007-06-27 2009-01-01 Jing Xiao Precise Identification of Text Pixels from Scanned Document Images
CN103093228A (zh) * 2013-01-17 2013-05-08 上海交通大学 一种在自然场景图像中基于连通域的中文检测方法
CN106529380A (zh) * 2015-09-15 2017-03-22 阿里巴巴集团控股有限公司 图像的识别方法及装置
CN107784301A (zh) * 2016-08-31 2018-03-09 百度在线网络技术(北京)有限公司 用于识别图像中文字区域的方法和装置
CN112016546A (zh) * 2020-08-14 2020-12-01 中国银联股份有限公司 一种文本区域的定位方法及装置

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115049649A (zh) * 2022-08-12 2022-09-13 山东振鹏建筑钢品科技有限公司 基于锈蚀程度的钢筋打磨除锈控制方法
CN115049649B (zh) * 2022-08-12 2022-11-11 山东振鹏建筑钢品科技有限公司 基于锈蚀程度的钢筋打磨除锈控制方法
CN115995080A (zh) * 2023-03-22 2023-04-21 曲阜市检验检测中心 基于ocr识别的档案智能管理系统
CN116453030A (zh) * 2023-04-07 2023-07-18 郑州工程技术学院 一种基于计算机视觉的建筑材料回收方法

Also Published As

Publication number Publication date
CN112016546A (zh) 2020-12-01
TW202207077A (zh) 2022-02-16
TWI821671B (zh) 2023-11-11

Similar Documents

Publication Publication Date Title
WO2022033095A1 (zh) 一种文本区域的定位方法及装置
US10740647B2 (en) Detecting objects using a weakly supervised model
CN111797893B (zh) 一种神经网络的训练方法、图像分类系统及相关设备
CN111488826B (zh) 一种文本识别方法、装置、电子设备和存储介质
CN106547880B (zh) 一种融合地理区域知识的多维度地理场景识别方法
US20190385054A1 (en) Text field detection using neural networks
WO2020182121A1 (zh) 表情识别方法及相关装置
US11900611B2 (en) Generating object masks of object parts utlizing deep learning
WO2019075130A1 (en) IMAGE PROCESSING DEVICE AND METHOD
CN108734210B (zh) 一种基于跨模态多尺度特征融合的对象检测方法
US7653244B2 (en) Intelligent importation of information from foreign applications user interface
US10572760B1 (en) Image text localization
US11875512B2 (en) Attributionally robust training for weakly supervised localization and segmentation
CN114677565B (zh) 特征提取网络的训练方法和图像处理方法、装置
CN114120349B (zh) 基于深度学习的试卷识别方法及系统
CN109740135A (zh) 图表生成方法及装置、电子设备和存储介质
WO2023284608A1 (zh) 字符识别模型生成方法、装置、计算机设备和存储介质
US20210073530A1 (en) Handwritten Diagram Recognition Using Deep Learning Models
CN111899203A (zh) 基于标注图在无监督训练下的真实图像生成方法及存储介质
CN113487610B (zh) 疱疹图像识别方法、装置、计算机设备和存储介质
CN117593752A (zh) 一种pdf文档录入方法、系统、存储介质及电子设备
WO2023246912A1 (zh) 图像文字结构化输出方法、装置、电子设备和存储介质
Evangelou et al. PU learning-based recognition of structural elements in architectural floor plans
CN113192085A (zh) 三维器官图像分割方法、装置及计算机设备
Sun et al. Contextual models for automatic building extraction in high resolution remote sensing image using object-based boosting method

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21855151

Country of ref document: EP

Kind code of ref document: A1