CN112801030A - Method and device for positioning target text area - Google Patents


Info

Publication number
CN112801030A
CN112801030A (application CN202110185262.XA); granted publication CN112801030B
Authority
CN
China
Prior art keywords
text
target
image
area
template image
Prior art date
Legal status
Granted
Application number
CN202110185262.XA
Other languages
Chinese (zh)
Other versions
CN112801030B (en)
Inventor
费志军 (Fei Zhijun)
邱雪涛 (Qiu Xuetao)
高鹏飞 (Gao Pengfei)
Current Assignee
China Unionpay Co Ltd
Original Assignee
China Unionpay Co Ltd
Priority date
Filing date
Publication date
Application filed by China Unionpay Co Ltd filed Critical China Unionpay Co Ltd
Priority to CN202110185262.XA
Publication of CN112801030A
Application granted
Publication of CN112801030B
Legal status: Active
Anticipated expiration


Classifications

    • G06V30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06F18/22: Matching criteria, e.g. proximity measures
    • G06N20/00: Machine learning
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a method and a device for positioning a target text region. It belongs to the field of computer technology, relates to artificial intelligence and computer vision, and is used to improve the accuracy of locating text regions in merchant door-head pictures. The method determines at least one text initial-selection region in a target image and acquires the text template image corresponding to the target image; performs feature extraction on the at least one text initial-selection region to obtain initial-selection-region features; compares the initial-selection-region features with the template-image features of the text template image to determine at least one text fine-selection region from the at least one text initial-selection region; performs text recognition on the at least one text fine-selection region and determines a target text region from among them according to the text recognition results; and, if the text recognition result of the target text region is inconsistent with the label of the text template image, expands the range of the target text region according to the label.

Description

Method and device for positioning target text area
Technical Field
The invention relates to the technical field of computers, in particular to a method and a device for positioning a target text region.
Background
A door head (storefront sign) refers to the plaque and related fittings mounted at the entrance of an enterprise, institution, or individually owned business. It is a form of exterior shop decoration and a means of beautifying the place of sale and attracting customers.
A merchant door head generally contains text such as the merchant name and address. When the authenticity of a merchant is verified, an inspector must travel to the merchant's address to take a picture, after which an auditor checks the information; this process is inefficient and error-prone. To recognize the text in a door-head picture automatically, the position of the merchant-name text must first be located in the street-level photograph.
Existing image text recognition generally recognizes all characters in an image and cannot effectively distinguish the merchant-name text region from the other text regions in a merchant door-head picture, which degrades the accuracy of subsequent merchant-name recognition.
Disclosure of Invention
The embodiment of the invention provides a method and a device for positioning a target text region, which improve the accuracy of locating the text region in a merchant door-head picture.
In one aspect, an embodiment of the present invention provides a method for locating a target text region, including:
determining at least one text initial-selection region in a target image, and acquiring a text template image corresponding to the target image;
performing feature extraction on the at least one text initial-selection region to obtain initial-selection-region features;
comparing the initial-selection-region features with the template-image features of the text template image, and determining at least one text fine-selection region from the at least one text initial-selection region;
performing text recognition on the at least one text fine-selection region, and determining a target text region from the at least one text fine-selection region according to the text recognition results;
and comparing the text recognition result of the target text region with the label of the text template image, and, if the text recognition result of the target text region is inconsistent with the label, expanding the range of the target text region according to the label to obtain the final target text region of the target image.
Optionally, expanding the range of the target text region according to the label of the text template image includes:
determining the expansion direction of the target text region according to the label of the text template image and the text recognition result;
and expanding the target text region in that direction until the text recognition result of the target text region is consistent with the label of the text template image.
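The expansion step above can be sketched in a few lines. This is an illustrative sketch, not the patent's implementation: it assumes the template label and the OCR result are plain strings and infers the growth direction from where the recognized text falls within the label; the function name is hypothetical.

```python
def expansion_direction(label: str, recognized: str) -> str:
    """Decide which way to grow the target text box.

    If the recognized text is a substring of the template label,
    missing characters before it mean the box should grow left,
    and missing characters after it mean it should grow right.
    """
    idx = label.find(recognized)
    if idx < 0:
        return "both"   # no overlap found: grow in both directions
    if idx > 0 and idx + len(recognized) < len(label):
        return "both"   # characters missing on both sides
    if idx > 0:
        return "left"
    if idx + len(recognized) < len(label):
        return "right"
    return "none"       # recognized text already matches the label
```

For example, `expansion_direction("China Unionpay", "Unionpay")` returns `"left"`, since the missing characters precede the recognized text.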
Optionally, before the feature extraction is performed on the at least one text initial-selection region, the method further includes:
performing feature extraction on the text template image by using an image feature-point extraction model to obtain a template-image feature set of the text template image.
In that case, performing feature extraction on the at least one text initial-selection region to obtain the initial-selection-region features includes:
performing feature extraction on the at least one text initial-selection region by using the image feature-point extraction model to obtain an initial-selection-region feature set of each text initial-selection region;
and comparing the initial-selection-region features with the template-image features of the text template image to determine at least one text fine-selection region includes:
matching the initial-selection-region feature set of the at least one text initial-selection region against the template-image feature set of the text template image;
and taking each text initial-selection region whose number of matched feature points is greater than a feature-point threshold as a text fine-selection region.
Optionally, performing text recognition on the at least one text fine-selection region and determining a target text region according to the text recognition results includes:
performing text recognition on the at least one text fine-selection region by using a text recognition model to obtain a text recognition result;
and taking the text fine-selection region whose recognition result contains the largest number of target characters as the target text region.
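The selection rule above (keep the fine-selection region whose recognition result contains the most target characters) can be sketched as follows, assuming each region's OCR output is available as a string and counting "target characters" as those that appear in the template label; the dictionary layout and helper names are hypothetical.

```python
def pick_target_region(regions, label):
    """Pick the fine-selection region whose OCR result contains the
    most characters of the template label (the 'target word number').

    `regions` is a list of dicts with a "text" key holding the OCR
    result for that region (an assumed, illustrative layout).
    """
    def target_chars(text):
        # count recognized characters that occur in the template label
        return sum(1 for ch in text if ch in label)

    return max(regions, key=lambda r: target_chars(r["text"]))
```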
The embodiment of the invention further provides an image text positioning network training method, which includes:
acquiring a training image;
inputting the training image into a merchant text positioning network to obtain a merchant text position in the training image;
determining a target text region in the training image, the target text region being obtained by the method described above;
and calculating a loss function from the merchant text position and the target text region, optimizing the parameters of the merchant text positioning network according to the loss function, and, when the loss function is smaller than a preset threshold, taking the current parameters as the parameters of the trained merchant text positioning network.
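The text does not state which loss function is used in the training step above, so the sketch below is only one plausible choice: a 1 - IoU loss between the box predicted by the merchant text positioning network and the target text region produced by the template-matching pipeline.

```python
def box_iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def location_loss(predicted_box, target_box):
    """Illustrative loss: 1 - IoU between the network's predicted
    merchant-text box and the pipeline's target text region.
    Perfect overlap gives 0; no overlap gives 1."""
    return 1.0 - box_iou(predicted_box, target_box)
```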
An embodiment of the present invention further provides a device for positioning a target text region, the device including:
an acquisition unit, configured to determine at least one text initial-selection region in a target image and acquire a text template image corresponding to the target image;
an extraction unit, configured to perform feature extraction on the at least one text initial-selection region to obtain initial-selection-region features;
a selection unit, configured to compare the initial-selection-region features with the template-image features of the text template image and determine at least one text fine-selection region from the at least one text initial-selection region;
a determining unit, configured to perform text recognition on the at least one text fine-selection region and determine a target text region from the at least one text fine-selection region according to the text recognition results;
and an expansion unit, configured to compare the text recognition result of the target text region with the label of the text template image and, if the two are inconsistent, expand the range of the target text region according to the label to obtain the final target text region of the target image.
Optionally, the expansion unit is specifically configured to:
determine the expansion direction of the target text region according to the label of the text template image and the text recognition result;
and expand the target text region in that direction until the text recognition result of the target text region is consistent with the label of the text template image.
Optionally, the extraction unit is specifically configured to perform feature extraction on the text template image by using an image feature-point extraction model to obtain a template-image feature set of the text template image, and to perform feature extraction on the at least one text initial-selection region by using the image feature-point extraction model to obtain an initial-selection-region feature set of each text initial-selection region;
the selection unit is configured to match the initial-selection-region feature set of the at least one text initial-selection region against the template-image feature set of the text template image, and to take each text initial-selection region whose number of matched feature points is greater than a feature-point threshold as a text fine-selection region.
Optionally, the determining unit is configured to:
perform text recognition on the at least one text fine-selection region by using a text recognition model to obtain a text recognition result;
and take the text fine-selection region whose recognition result contains the largest number of target characters as the target text region.
The embodiment of the invention further provides an image text positioning network training device, including:
an acquisition unit, configured to acquire a training image;
an input unit, configured to input the training image into a merchant text positioning network to obtain a merchant text position in the training image;
a positioning unit, configured to determine a target text region in the training image, the target text region being obtained by the method described above;
and an optimization unit, configured to calculate a loss function from the merchant text position and the target text region, optimize the parameters of the merchant text positioning network according to the loss function, and, when the loss function is smaller than a preset threshold, take the current parameters as the parameters of the trained merchant text positioning network.
In another aspect, an embodiment of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method for positioning a target text region described in the first aspect.
In another aspect, an embodiment of the present invention further provides an electronic device including a memory and a processor, the memory storing a computer program executable on the processor; when the processor executes the program, it implements the method for positioning a target text region described in the first aspect.
When the text region of a merchant target image is located, at least one text initial-selection region in the target image is determined, and the text template image corresponding to the target image is acquired. Feature extraction is performed on the at least one text initial-selection region to obtain initial-selection-region features, and these are compared with the template-image features of the text template image to determine at least one text fine-selection region from the at least one text initial-selection region. Text recognition is then performed on the at least one text fine-selection region, and a target text region is determined from among them according to the text recognition results. Finally, the text recognition result of the target text region is compared with the label of the text template image; if the two are inconsistent, the range of the target text region is expanded according to the label to obtain the final target text region of the target image. In this way, the embodiment of the invention can accurately locate the merchant door-head text against the complex background of a door-head picture, effectively suppress interfering text regions in the picture, and merge widely spaced door-head characters into a single target text region, thereby improving the positioning accuracy of the target text region and, in turn, the accuracy of subsequent merchant-name recognition.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a schematic system architecture diagram of a method for locating a target text region according to an embodiment of the present invention;
FIG. 2 is a flowchart illustrating a method for locating a target text region according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a merchant door-head template image according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a system framework corresponding to a method for locating a target text region according to an embodiment of the present invention;
FIG. 5 is a block diagram of a text recognition module according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of an apparatus for locating a target text region according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person skilled in the art from the embodiments given herein without creative effort fall within the protection scope of the present invention.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The terms "first" and "second" are used herein for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first" and "second" may explicitly or implicitly include one or more of the features, and in the description of embodiments of the invention, "plurality" means two or more unless indicated otherwise. Furthermore, the term "comprises" and any variations thereof are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to those steps or elements, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Merchant door-head recognition belongs to the field of text positioning and recognition under natural-scene conditions: the door-head text of a merchant must be recognized in an ordinarily captured photograph. Deep network models are currently the best-performing approach in this field. Natural-scene text recognition generally involves three steps: 1) text positioning based on a deep network; 2) text recognition based on a deep network; 3) verification of the text recognition result.
Text recognition based on a deep network first requires a large-scale text image database containing labeled character samples in a wide variety of colors and fonts, which is difficult to construct. Moreover, current text positioning models are prone to two errors when locating a merchant name: splitting the same merchant name into two text boxes, and placing unrelated text and the merchant name in the same text box.
To solve these technical problems in the related art, embodiments of the present invention provide a method and an apparatus for positioning a target text region. The method can be applied to scenarios such as target-text-region positioning and text recognition.
Some application scenarios to which the technical solution of the embodiments of the present application can be applied are briefly described below; they are intended to illustrate, not limit, the embodiments. In a specific implementation, the technical scheme provided by the embodiments can be applied flexibly according to actual needs.
To further illustrate the technical solutions provided by the embodiments of the present application, a detailed description follows with reference to the accompanying drawings. Although the embodiments provide the method steps shown below or in the figures, the method may include more or fewer steps based on routine or non-inventive effort. For steps with no logically necessary causal relationship, the order of execution is not limited to that given in the embodiments.
An application scenario of the method for locating a target text region according to the embodiment of the present invention can be seen in fig. 1, where the application scenario includes a terminal device 101, a server 102, and a database 103.
The terminal device 101 is an electronic device with a photographing or video-capture capability, on which various clients can be installed and which can display the operation interface of an installed client; the device may be mobile or fixed. Examples include a mobile phone, tablet computer, notebook computer, desktop computer, wearable device, smart television, vehicle-mounted device, or another electronic device with these capabilities. The client may be a video client, a browser client, or the like. Each terminal device 101 is connected to the server 102 through a communication network, which may be wired or wireless. The server 102 may be the server corresponding to the client, a single server, a server cluster, a cloud computing center, or a virtualization platform.
Fig. 1 shows the database 103 as existing independently of the server 102; in other possible implementations, the database 103 may be located within the server 102.
The server 102 is connected to the database 103, which stores historical images, labeled samples, training text images, and the like. After receiving the target image to be processed from the terminal device 101, the server 102 determines at least one text initial-selection region in the target image and acquires the text template image corresponding to the target image; performs feature extraction on the at least one text initial-selection region to obtain initial-selection-region features; compares these features with the template-image features of the text template image to determine at least one text fine-selection region; performs text recognition on the fine-selection regions and determines a target text region according to the recognition results; and compares the text recognition result of the target text region with the label of the text template image, expanding the range of the target text region according to the label when the two are inconsistent, to obtain the final target text region of the target image. Positioning of the target text region is thereby realized.
Further, the server 102 can also train the network: it acquires a training image; inputs the training image into the merchant text positioning network to obtain the merchant text position in the training image; determines the target text region in the training image in the manner described above; calculates a loss function from the merchant text position and the target text region; and optimizes the parameters of the merchant text positioning network according to the loss function until the loss function is smaller than a preset threshold, at which point the current parameters are taken as the parameters of the trained merchant text positioning network.
It should be noted that the method for positioning a target text region provided by the present invention may be executed by the server 102; alternatively, it may be executed by a client on the terminal device 101, with the server 102 cooperating with that client.
Fig. 2 is a flowchart illustrating a method for locating a target text region according to an embodiment of the present invention. As shown in fig. 2, the method comprises the steps of:
step S201, at least one text initial selection area in a target image is determined, and a text template image corresponding to the target image is obtained.
The target image may include, but is not limited to, image files in the formats of jpg, bmp, tif, gif, png, and the like, and the target image may also be a screenshot. The target image may be an image uploaded after being photographed by the terminal device in real time, or the target image may be an image acquired from a network, or the target image may be a locally stored image.
The text template image is shown in fig. 3, which is a text template image of a merchant door of a bank.
After the server obtains the target image, a text positioning model (TDN), such as a CRAFT model, may be used to determine the text regions in the target image: the target image is input into the text positioning model, which outputs the regions containing text. Generally, if the merchant door-head image contains a plurality of text regions, those regions together form the set of text initial-selection regions.
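As a simplified, self-contained stand-in for the deep text detector (the text names CRAFT as one option), the sketch below groups the foreground pixels of a binary text mask into connected components and returns one bounding box per component; a real system would obtain the mask, or the boxes directly, from the detection model.

```python
def text_candidate_boxes(mask):
    """Group foreground pixels of a binary mask (list of lists of 0/1)
    into 4-connected components and return one bounding box
    (x1, y1, x2, y2) per component -- a toy stand-in for a deep text
    detector that outputs one box per text region."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                # flood-fill this component, tracking its extent
                stack, x1, y1, x2, y2 = [(x, y)], x, y, x, y
                seen[y][x] = True
                while stack:
                    cx, cy = stack.pop()
                    x1, y1 = min(x1, cx), min(y1, cy)
                    x2, y2 = max(x2, cx), max(y2, cy)
                    for nx, ny in ((cx + 1, cy), (cx - 1, cy),
                                   (cx, cy + 1), (cx, cy - 1)):
                        if (0 <= nx < w and 0 <= ny < h
                                and mask[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((nx, ny))
                boxes.append((x1, y1, x2, y2))
    return boxes
```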
Step S202: feature extraction is performed on the at least one text initial-selection region to obtain the initial-selection-region features.
Specifically, feature extraction may be performed on each pixel point of a text initial-selection region.
A pixel (pixel point) is the minimum unit of an image represented as a sequence of numbers; it is an indivisible element of the image. Each bitmap image contains a fixed number of pixels, which determine the size at which the image is presented on screen. For example, a picture of size 500 × 338 is represented by a 500 × 338 pixel matrix: its width is 500 pixels, its height is 338 pixels, and it contains 500 × 338 = 169,000 pixels in total. Hovering the mouse over such a picture displays its dimensions, expressed in pixels.
For each text initial-selection region, feature extraction is performed on every pixel in the region to obtain that region's initial-selection-region features.
Step S203: the initial-selection-region features are compared with the template-image features of the text template image, and at least one text fine-selection region is determined from the at least one text initial-selection region.
Specifically, feature extraction may be performed on each pixel of the text template image; the initial-selection-region features of each text initial-selection region are then compared with the template-image features, and the text fine-selection regions are chosen from all the text initial-selection regions according to the comparison result.
Step S204: text recognition is performed on the at least one text fine-selection region, and a target text region is determined from the at least one text fine-selection region according to the text recognition results.
Step S205: the text recognition result of the target text region is compared with the label of the text template image; if the two are inconsistent, the range of the target text region is expanded according to the label, yielding the final target text region of the target image.
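Steps S201 through S205 can be orchestrated as in the sketch below. Every stage is passed in as a callable, because the concrete models (detector, feature extractor, matcher, recognizer, expander) are only described abstractly here; all names are illustrative and the sketch assumes at least one region survives the matching stage.

```python
def locate_target_text_region(image, template_feats, label,
                              detect, extract, match, recognize, expand):
    """End-to-end flow of steps S201-S205 with each stage injected
    as a callable (hypothetical stand-ins for the described models)."""
    candidates = detect(image)                           # S201: initial-selection regions
    refined = [r for r in candidates
               if match(extract(r), template_feats)]     # S202-S203: fine selection
    # S204: region whose recognition result has the most label characters
    target = max(refined, key=lambda r: sum(ch in label
                                            for ch in recognize(r)))
    if recognize(target) != label:                       # S205: expand if inconsistent
        target = expand(target, label)
    return target
```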
When the text region of a merchant target image is located, at least one text initial-selection region in the target image is determined, and the text template image corresponding to the target image is acquired. Feature extraction is performed on the at least one text initial-selection region to obtain initial-selection-region features, and these are compared with the template-image features of the text template image to determine at least one text fine-selection region from the at least one text initial-selection region. Text recognition is then performed on the at least one text fine-selection region, and a target text region is determined from among them according to the text recognition results. Finally, the text recognition result of the target text region is compared with the label of the text template image; if the two are inconsistent, the range of the target text region is expanded according to the label to obtain the final target text region of the target image. In this way, the embodiment of the invention can accurately locate the merchant door-head text against the complex background of a door-head picture, effectively suppress interfering text regions in the picture, and merge widely spaced door-head characters into a single target text region, thereby improving the positioning accuracy of the target text region and, in turn, the accuracy of subsequent merchant-name recognition.
Further, before the feature extraction is performed on the at least one initially selected text region to obtain the initially selected region features, the method further includes:
performing feature extraction on the text template image by using an image feature point extraction model, to obtain a template image feature set of the text template image;
the performing feature extraction on the at least one initially selected text region to obtain the initially selected region features of the at least one initially selected text region includes:
performing feature extraction on the at least one initially selected text region by using the image feature point extraction model, to obtain an initially selected region feature set of the at least one initially selected text region;
the comparing the initially selected region features with the template image features of the text template image to determine at least one finely selected text region from the at least one initially selected text region includes:
matching the initially selected region feature set of the at least one initially selected text region against the template image feature set of the text template image; and
taking each initially selected text region whose number of matching points is greater than a feature point threshold as a finely selected text region.
Specifically, a feature point extraction algorithm (e.g., SIFT) may be used to generate the template image feature set of the text template image: the text template image is input into an image feature point extraction model (FS), which performs feature extraction on the pixel points of the text template image to obtain the template image feature set.
Likewise, the initially selected text regions are input into the image feature point extraction model in turn, to obtain the initially selected region feature set of each initially selected text region.
Then, for each initially selected text region, its feature set is matched against the template image feature set of the text template image, and the number of matched feature points is determined. If the number of matched points is greater than the feature point threshold, the initially selected text region is kept as a finely selected text region; otherwise, it is discarded.
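The screening step above can be sketched as follows; the brute-force descriptor matching, the Lowe ratio test, and the concrete threshold value are illustrative assumptions rather than details fixed by the patent:

```python
import numpy as np

def match_count(desc_region, desc_template, ratio=0.75):
    """Count ratio-test matches between region and template descriptors.

    desc_region, desc_template: (n, d) arrays of SIFT-like descriptors.
    A region point matches when its nearest template descriptor is clearly
    closer than the second nearest (Lowe's ratio test).
    """
    # pairwise Euclidean distances, shape (n_region, n_template)
    d = np.linalg.norm(desc_region[:, None, :] - desc_template[None, :, :], axis=2)
    count = 0
    for row in d:
        order = np.argsort(row)
        best, second = row[order[0]], row[order[1]]
        if best < ratio * second:  # best match clearly better than runner-up
            count += 1
    return count

def select_fine_regions(region_descs, template_desc, point_threshold=10):
    """Keep initially selected regions whose match count exceeds the threshold."""
    return [i for i, desc in enumerate(region_descs)
            if match_count(desc, template_desc) > point_threshold]
```

A region whose descriptors nearly coincide with the template's passes the threshold; a region of unrelated texture yields few ratio-test matches and is discarded.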
Further, the performing text recognition on the at least one finely selected text region and determining a target text region from the at least one finely selected text region according to a text recognition result includes:
performing text recognition on the at least one finely selected text region by using a text recognition model, to obtain a text recognition result; and
taking the finely selected text region whose text recognition result contains the largest number of target characters as the target text region.
In a specific implementation, a text recognition model (TR) performs character recognition on each finely selected text region to obtain its text recognition result. Each result is compared with the label of the text template image, and the finely selected text region whose recognition result contains the most characters of the merchant name is selected as the target text region.
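A minimal sketch of this selection rule, with hypothetical region identifiers and "most characters of the merchant name" interpreted as simple per-character containment (both assumptions for illustration):

```python
def count_label_chars(text, label):
    """Number of label characters that also appear in the recognized text."""
    return sum(1 for ch in label if ch in text)

def pick_target_region(recognized, label):
    """recognized: list of (region_id, recognized_text) pairs for the finely
    selected regions; return the id of the region covering most label chars."""
    return max(recognized, key=lambda pair: count_label_chars(pair[1], label))[0]
```

For a label such as "招商银行", a region recognized as "招商银" (3 of 4 label characters) wins over one recognized as "招商".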
Further, the expanding the range of the target text region according to the label of the text template image includes:
determining the expansion direction of the target text region according to the label of the text template image and the text recognition result; and
expanding the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
In a specific implementation, assume that the feature point set extracted by the image feature point extraction model (FS) from the text template image Ti is Xt = {(xt1, yt1), …, (xtm, ytm)}, and the feature point set extracted by FS from the target text region is Xta = {(xta1, yta1), …, (xtam, ytam)}.
If the obtained merchant gate head information, i.e., the text recognition result of the target text region, covers less than the label of the text template image, the target text region is supplemented in the direction of the missing characters. For example, if the label is "招商银行" (China Merchants Bank) and the recognition result is "招商", the missing direction is to the right, so one minimum unit lx is added to the right of max(xtam); if the obtained merchant gate head information contains none of the merchant name, one minimum unit ly is added above and below, based on max(ytam) and min(ytam) respectively, to obtain a new target text region.
Text recognition is then performed again on the new target text region, the recognition result is compared with the label of the text template image, and the supplementing process is repeated until the recognition result is consistent with the label or the number of supplements exceeds a supplement threshold.
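The recognize-compare-supplement loop can be sketched as follows; the `recognize` and `expand` callbacks and the attempt budget are hypothetical stand-ins for the text recognition model and the region-supplementing step, not APIs defined by the patent:

```python
def expand_until_match(region, label, recognize, expand, max_attempts=5):
    """Iteratively expand the target region until the recognized text matches
    the label or the supplement budget is exhausted."""
    for _ in range(max_attempts):
        text = recognize(region)
        if text == label:
            return region
        # label characters missing on the right -> grow rightward;
        # nothing of the label recognized -> grow vertically instead
        if text and label.startswith(text):
            region = expand(region, "right")
        else:
            region = expand(region, "vertical")
    return region
```

With a toy setup where the region is just a character count and recognition returns that many label characters, a region covering "招商" grows rightward twice until it covers all of "招商银行".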
Further, in an optional embodiment, after the target text region in the target image is determined, the target image may be used for model training. That is, after the final target text region of the target image is obtained, the method further includes:
acquiring a training image;
inputting the training image into a merchant text positioning network to obtain a merchant text position in the training image;
determining a target text region in the training image, wherein the target text region in the training image is obtained by the method for locating a target text region described above;
calculating a loss function according to the merchant text position and the target text region, and optimizing the parameters of the merchant text positioning network according to the loss function; when the loss function is smaller than a preset threshold, the current parameters are taken as the parameters of the merchant text positioning network, yielding the trained merchant text positioning network.
In a specific embodiment, a training image may be input into the merchant text positioning network (MDN) (e.g., a CRAFT model), which predicts the merchant gate head position and outputs a predicted position Loc. A positioning loss function is calculated by comparing Loc with the merchant gate head position obtained by the above method for locating the target text region, and the parameters of the merchant text positioning network are optimized by back-propagation according to the loss function until it converges.
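As an illustrative sketch of this training procedure (the smooth-L1 box loss and the injected `step` update are assumptions for the sketch; the patent does not fix a particular loss form or optimizer):

```python
import numpy as np

def box_loss(pred, target):
    """Smooth-L1 localization loss between a predicted box and a pseudo-label
    box, each given as coordinates such as (x0, y0, x1, y1)."""
    d = np.abs(np.asarray(pred, float) - np.asarray(target, float))
    return float(np.where(d < 1.0, 0.5 * d ** 2, d - 0.5).mean())

def train(network, images, pseudo_labels, step, threshold=1e-3, max_epochs=100):
    """Optimize the positioning network until the mean loss over the training
    set falls below the preset threshold (or the epoch budget runs out)."""
    for _ in range(max_epochs):
        losses = [box_loss(network(img), box)
                  for img, box in zip(images, pseudo_labels)]
        loss = sum(losses) / len(losses)
        if loss < threshold:
            break
        network = step(network, loss)  # one optimizer update (hypothetical)
    return network
```

Here `pseudo_labels` are the target text regions produced by the positioning method above, so the MDN is supervised without manual position annotation.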
The following describes an implementation process of the method for locating a target text region according to an embodiment of the present invention by using specific examples.
Fig. 4 shows the system framework corresponding to the method for locating a target text region in one embodiment. The text positioning module locates the position of the merchant gate head text and sends the position information to the subsequent recognition module. The character recognition module recognizes the image containing the merchant gate head to obtain the merchant name. The structure of the character recognition module, shown in fig. 5, includes:
Merchant gate head positioning network (MDN): responsible for locating the position of the gate head text.
Text positioning network (TDN): performs text detection on the input image to obtain the positions of all text in the image, forming a merchant gate head position candidate set H.
Image feature point extraction model (FS): extracts feature points from the merchant gate head template image and a feature point set from the candidate set H, removes all candidate regions that are not merchant gate head positions by feature point matching, dynamically adjusts the gate head position recognition result in combination with later feedback from the text positioning network, and feeds the gate head position back to the merchant gate head positioning network to supervise MDN training.
Text recognition network (TR): extracts the character information from the image.
In a specific embodiment, the method for locating the target text region includes the following steps:
1. merchant gate head information labeling
For each merchant gate head, a merchant gate head image template Ti is set and labeled, e.g., "招商银行" (China Merchants Bank). Other gate head images of the same bank are labeled only with the name "招商银行"; no position labeling is performed on them.
The image feature point extraction model (FS) generates a feature point set TVi for each image template Ti using a feature point extraction algorithm (e.g., SIFT).
2. Merchant gate head positioning Model (MDN) training
(1) Merchant gate head position candidate set generation
A text positioning network (TDN), such as a CRAFT model, captures all text regions in the merchant image to form a candidate set Ac.
(2) Merchant gate candidate set screening
The image feature point extraction model (FS) performs feature extraction on the merchant gate head position candidate set Ac generated by the TDN, and the extraction result is matched against the corresponding template feature point set TVi. Any candidate region whose number of matching points is less than the threshold L is deleted, finally forming a reduced candidate set As.
(3) Obtaining text information
A text recognition network (TR) performs character recognition on each region in As, and the region Asx whose recognition result contains the most characters of the merchant name is selected. If the TR recognition result for Asx is inconsistent with the merchant name label, the text information region must be adjusted.
(4) Text message region adjustment
Assume that the feature point set extracted by the image feature point extraction model (FS) from the merchant image template Ti is Xt = {(xt1, yt1), …, (xtm, ytm)}, and the feature point set extracted by FS from Asx is Xta = {(xta1, yta1), …, (xtam, ytam)}.
If the merchant gate head information obtained in step (3) covers less than the actual label, Asx is supplemented in the direction of the missing characters. For example, if the label is "招商银行" and the recognition result is "招商", the missing direction is to the right, and one minimum unit lx is added to the right of max(xtam); if the information obtained in step (3) contains none of the merchant name, one minimum unit ly is added above and below, based on max(ytam) and min(ytam) respectively, to obtain a new Asx.
Step (3) is then performed on the new Asx. Steps (3) and (4) are executed in a loop until the result of step (3) contains the actual label or the number of supplements exceeds the maximum threshold.
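The coordinate arithmetic of this supplementing step can be sketched as follows, assuming an axis-aligned bounding box derived from the region's feature point set Xta (the function name and the direction labels are illustrative):

```python
def expand_box(points, direction, lx=1.0, ly=1.0):
    """Grow a region's bounding box by one minimum unit in the given direction.

    points: the feature point set Xta = [(x, y), ...] of the region Asx.
    Returns the expanded box as (x0, y0, x1, y1).
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1, y0, y1 = min(xs), max(xs), min(ys), max(ys)
    if direction == "right":
        x1 += lx              # supplement lx to the right of max(xtam)
    elif direction == "vertical":
        y0 -= ly              # supplement ly below min(ytam)
        y1 += ly              # and above max(ytam)
    return (x0, y0, x1, y1)
```

Each call produces the new Asx on which step (3) is re-run.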
(5) Merchant gate head positioning network (MDN) parameter optimization
The merchant gate head positioning network (MDN) (e.g., a CRAFT model) predicts the merchant gate head position of the input sample and outputs the predicted position Loc. A positioning loss function Loss is calculated against the merchant gate head position obtained in step (3) (or step (4)), and the parameters of the MDN are optimized by back-propagation according to Loss until Loss converges.
The following are apparatus embodiments of the present invention; for details not described here, reference may be made to the corresponding method embodiments above.
Referring to fig. 6, a block diagram of a device for locating a target text region according to an embodiment of the present invention is shown. The device includes:
an obtaining unit 601, configured to determine at least one initially selected text region in a target image and obtain a text template image corresponding to the target image;
an extracting unit 602, configured to perform feature extraction on the at least one initially selected text region to obtain initially selected region features;
a selecting unit 603, configured to compare the initially selected region features with the template image features of the text template image and determine at least one finely selected text region from the at least one initially selected text region;
a determining unit 604, configured to perform text recognition on the at least one finely selected text region and determine a target text region from the at least one finely selected text region according to a text recognition result; and
an expanding unit 605, configured to compare the text recognition result of the target text region with the label of the text template image, and if the two are inconsistent, expand the range of the target text region according to the label of the text template image to obtain the final target text region of the target image.
Optionally, the expanding unit is specifically configured to:
determine the expansion direction of the target text region according to the label of the text template image and the text recognition result; and
expand the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
Optionally, the extracting unit is specifically configured to perform feature extraction on the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image, and to perform feature extraction on the at least one initially selected text region by using the image feature point extraction model to obtain an initially selected region feature set of the at least one initially selected text region;
the selecting unit is configured to match the initially selected region feature set of the at least one initially selected text region against the template image feature set of the text template image, and to take each initially selected text region whose number of matching points is greater than the feature point threshold as a finely selected text region.
Optionally, the determining unit is configured to:
perform text recognition on the at least one finely selected text region by using a text recognition model to obtain a text recognition result; and
take the finely selected text region whose text recognition result contains the largest number of target characters as the target text region.
Corresponding to the method embodiments, an embodiment of the present invention further provides an electronic device. The electronic device may be a server, such as the server 102 shown in fig. 1, which includes at least a memory for storing data and a processor for data processing. The processor may be implemented by a microprocessor, a CPU, a GPU (Graphics Processing Unit), a DSP, or an FPGA. The memory stores operation instructions, which may be computer-executable code; the operation instructions implement the steps in the flow of the method for locating a target text region according to the embodiments of the present invention.
Fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; as shown in fig. 7, the electronic device 70 in the embodiment of the present invention includes: a processor 71, a display 72, a memory 73, an input device 76, a bus 75, and a communication device 74; the processor 71, the memory 73, the input device 76, the display 72 and the communication device 74 are all connected by a bus 75, the bus 75 being used for data transmission between the processor 71, the memory 73, the display 72, the communication device 74 and the input device 76.
The memory 73 may be configured to store software programs and modules, such as the program instructions/modules corresponding to the method for locating a target text region in the embodiments of the present invention; by running the software programs and modules stored in the memory 73, the processor 71 executes the various functional applications and data processing of the electronic device 70, such as the method for locating a target text region provided by the embodiments of the present invention. The memory 73 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program of at least one application, and the like, and the data storage area may store data created according to the use of the electronic device 70. Further, the memory 73 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
The processor 71 is a control center of the electronic device 70, connects various parts of the entire electronic device 70 by using the bus 75 and various interfaces and lines, and performs various functions of the electronic device 70 and processes data by running or executing software programs and/or modules stored in the memory 73 and calling data stored in the memory 73. Alternatively, the processor 71 may include one or more Processing units, such as a CPU, a GPU (Graphics Processing Unit), a digital Processing Unit, and the like.
In the embodiment of the present invention, the processor 71 displays the determined target text area and the text information to the user through the display 72.
The processor 71 may also be connected to a network via a communication device 74, and if the electronic device is a server, the processor 71 may transmit data between the communication device 74 and the terminal device.
The input device 76 is mainly used for obtaining input operations of a user, and when the electronic devices are different, the input device 76 may be different. For example, when the electronic device is a computer, the input device 76 may be a mouse, a keyboard, or other input device; when the electronic device is a portable device such as a smart phone or a tablet computer, the input device 76 may be a touch screen.
The embodiment of the invention also provides a computer storage medium, wherein the computer storage medium stores computer executable instructions, and the computer executable instructions are used for realizing the positioning method of the target text region in any embodiment of the invention.
In some possible embodiments, various aspects of the method for locating a target text region provided by the present invention may also be implemented in the form of a program product including program code for causing a computer device to perform the steps of the method for locating a target text region according to various exemplary embodiments of the present invention described above in this specification when the program product is run on the computer device, for example, the computer device may perform the locating procedure of a target text region in steps S201 to S205 shown in fig. 2.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
A readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described device embodiments are merely illustrative, for example, the division of a unit is only one logical function division, and there may be other division ways in actual implementation, such as: multiple units or components may be combined, or may be integrated into another system, or some features may be omitted, or not implemented. In addition, the coupling, direct coupling or communication connection between the components shown or discussed may be through some interfaces, and the indirect coupling or communication connection between the devices or units may be electrical, mechanical or other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed on a plurality of network units; some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, all the functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may be separately regarded as one unit, or two or more units may be integrated into one unit; the integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional unit.
The above description is only for the specific embodiments of the present invention, but the scope of the present invention is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present invention, and all the changes or substitutions should be covered within the scope of the present invention.

Claims (12)

1. A method for locating a target text region, the method comprising:
determining at least one initially selected text region in a target image, and obtaining a text template image corresponding to the target image;
performing feature extraction on the at least one initially selected text region to obtain initially selected region features;
comparing the initially selected region features with template image features of the text template image, and determining at least one finely selected text region from the at least one initially selected text region;
performing text recognition on the at least one finely selected text region, and determining a target text region from the at least one finely selected text region according to a text recognition result; and
comparing the text recognition result of the target text region with a label of the text template image, and if the text recognition result of the target text region is inconsistent with the label of the text template image, expanding the range of the target text region according to the label of the text template image to obtain a final target text region of the target image.
2. The method according to claim 1, wherein the expanding the range of the target text region according to the label of the text template image comprises:
determining the expansion direction of the target text region according to the label of the text template image and the text recognition result; and
expanding the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
3. The method according to claim 1, wherein before the performing feature extraction on the at least one initially selected text region to obtain the initially selected region features, the method further comprises:
performing feature extraction on the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image;
the performing feature extraction on the at least one initially selected text region to obtain the initially selected region features of the at least one initially selected text region comprises:
performing feature extraction on the at least one initially selected text region by using the image feature point extraction model to obtain an initially selected region feature set of the at least one initially selected text region;
the comparing the initially selected region features with the template image features of the text template image to determine at least one finely selected text region from the at least one initially selected text region comprises:
matching the initially selected region feature set of the at least one initially selected text region against the template image feature set of the text template image; and
taking each initially selected text region whose number of matching points is greater than a feature point threshold as a finely selected text region.
4. The method according to claim 1, wherein the performing text recognition on the at least one finely selected text region and determining a target text region from the at least one finely selected text region according to a text recognition result comprises:
performing text recognition on the at least one finely selected text region by using a text recognition model to obtain a text recognition result; and
taking the finely selected text region whose text recognition result contains the largest number of target characters as the target text region.
5. An image text positioning network training method is characterized by comprising the following steps:
acquiring a training image;
inputting the training image into a merchant text positioning network to obtain a merchant text position in the training image;
determining a target text region in the training image, wherein the target text region in the training image is obtained by the method according to any one of claims 1-4;
calculating a loss function according to the merchant text position and the target text region, and optimizing parameters of the merchant text positioning network according to the loss function; when the loss function is smaller than a preset threshold, taking the current parameters as the parameters of the merchant text positioning network, to obtain the trained merchant text positioning network.
6. An apparatus for locating a target text region, the apparatus comprising:
an obtaining unit, configured to determine at least one initially selected text region in a target image and obtain a text template image corresponding to the target image;
an extracting unit, configured to perform feature extraction on the at least one initially selected text region to obtain initially selected region features;
a selecting unit, configured to compare the initially selected region features with template image features of the text template image and determine at least one finely selected text region from the at least one initially selected text region;
a determining unit, configured to perform text recognition on the at least one finely selected text region and determine a target text region from the at least one finely selected text region according to a text recognition result; and
an expanding unit, configured to compare the text recognition result of the target text region with a label of the text template image, and if the text recognition result of the target text region is inconsistent with the label of the text template image, expand the range of the target text region according to the label of the text template image to obtain a final target text region of the target image.
7. The apparatus according to claim 6, wherein the expanding unit is specifically configured to:
determine the expansion direction of the target text region according to the label of the text template image and the text recognition result; and
expand the target text region in the expansion direction until the text recognition result of the target text region is consistent with the label of the text template image.
8. The apparatus according to claim 6, wherein the extracting unit is specifically configured to perform feature extraction on the text template image by using an image feature point extraction model to obtain a template image feature set of the text template image, and to perform feature extraction on the at least one initially selected text region by using the image feature point extraction model to obtain an initially selected region feature set of the at least one initially selected text region;
the selecting unit is configured to match the initially selected region feature set of the at least one initially selected text region against the template image feature set of the text template image, and to take each initially selected text region whose number of matching points is greater than the feature point threshold as a finely selected text region.
9. The apparatus according to claim 6, wherein the determining unit is configured to:
perform text recognition on the at least one finely selected text region by using a text recognition model to obtain a text recognition result; and
take the finely selected text region whose text recognition result contains the largest number of target characters as the target text region.
10. An image text positioning network training device, characterized in that the device comprises:
an acquisition unit configured to acquire a training image;
the input unit is used for inputting the training image into a merchant text positioning network to obtain a merchant text position in the training image;
a positioning unit for determining a target text region in the training image, wherein the target text region in the training image is obtained by the method according to any one of claims 1-4;
an optimization unit, configured to calculate a loss function according to the merchant text position and the target text region and optimize the parameters of the merchant text positioning network according to the loss function; when the loss function is smaller than a preset threshold, the current parameters are taken as the parameters of the merchant text positioning network, to obtain the trained merchant text positioning network.
11. A computer-readable storage medium having a computer program stored therein, the computer program characterized by: the computer program, when executed by a processor, implements the method of any of claims 1 to 4.
12. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program operable on the processor, the computer program, when executed by the processor, causing the processor to carry out the method of any one of claims 1 to 4.
CN202110185262.XA 2021-02-10 2021-02-10 Target text region positioning method and device Active CN112801030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110185262.XA CN112801030B (en) 2021-02-10 2021-02-10 Target text region positioning method and device

Publications (2)

Publication Number Publication Date
CN112801030A true CN112801030A (en) 2021-05-14
CN112801030B CN112801030B (en) 2023-09-01

Family

ID=75815110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110185262.XA Active CN112801030B (en) 2021-02-10 2021-02-10 Target text region positioning method and device

Country Status (1)

Country Link
CN (1) CN112801030B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107622266A (en) * 2017-09-21 2018-01-23 平安科技(深圳)有限公司 OCR recognition processing method, storage medium, and server
US10032072B1 (en) * 2016-06-21 2018-07-24 A9.Com, Inc. Text recognition and localization with deep learning
CN111814794A (en) * 2020-09-15 2020-10-23 北京易真学思教育科技有限公司 Text detection method and device, electronic equipment and storage medium
CN112016546A (en) * 2020-08-14 2020-12-01 中国银联股份有限公司 Text region positioning method and device
CN112241739A (en) * 2020-12-17 2021-01-19 北京沃东天骏信息技术有限公司 Method, device, equipment and computer readable medium for identifying text errors
CN112308046A (en) * 2020-12-02 2021-02-02 龙马智芯(珠海横琴)科技有限公司 Method, device, server and readable storage medium for positioning text region of image

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SONI R. et al.: "Text detection and localization in natural scene images based on text awareness score", APPL INTELL 49 *
LIN Hanyang et al.: "Fast detection and recognition of motor vehicle licenses in complex scenes", Journal of Chinese Computer Systems (小型微型计算机系统), no. 05 *
WANG Yi: "Research on text region localization methods in natural scenes", China Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CN112801030B (en) 2023-09-01

Similar Documents

Publication Publication Date Title
CN108898186B (en) Method and device for extracting image
CN108229322B (en) Video-based face recognition method and device, electronic equipment and storage medium
CN108229341B (en) Classification method and device, electronic equipment and computer storage medium
CN112580623B (en) Image generation method, model training method, related device and electronic equipment
CN109344762B (en) Image processing method and device
CN109858333B (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN112016546A (en) Text region positioning method and device
US20210271872A1 (en) Machine Learned Structured Data Extraction From Document Image
US11164004B2 (en) Keyframe scheduling method and apparatus, electronic device, program and medium
CN112016545A (en) Image generation method and device containing text
CN113763249A (en) Text image super-resolution reconstruction method and related equipment thereof
CN113627439A (en) Text structuring method, processing device, electronic device and storage medium
CN108597034B (en) Method and apparatus for generating information
CN109389096A (en) Detection method and device
CN112580666A (en) Image feature extraction method, training method, device, electronic equipment and medium
CN113657518B (en) Training method, target image detection method, device, electronic device, and medium
CN108230332B (en) Character image processing method and device, electronic equipment and computer storage medium
CN113378958A (en) Automatic labeling method, device, equipment, storage medium and computer program product
CN113642481A (en) Recognition method, training method, device, electronic equipment and storage medium
CN113657396A (en) Training method, translation display method, device, electronic equipment and storage medium
CN108664948B (en) Method and apparatus for generating information
US20230005171A1 (en) Visual positioning method, related apparatus and computer program product
CN112801030B (en) Target text region positioning method and device
CN115564976A (en) Image processing method, apparatus, medium, and device
CN114093006A (en) Training method, device and equipment of living human face detection model and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant