CN113743394B - Method, device, equipment and readable medium for detecting characters in tag - Google Patents

Method, device, equipment and readable medium for detecting characters in tag

Info

Publication number
CN113743394B
CN113743394B (application CN202110904891.3A)
Authority
CN
China
Prior art keywords
characters
text box
picture
tag
box information
Prior art date
Legal status
Active
Application number
CN202110904891.3A
Other languages
Chinese (zh)
Other versions
CN113743394A (en)
Inventor
许博
Current Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Original Assignee
Suzhou Inspur Intelligent Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Suzhou Inspur Intelligent Technology Co Ltd filed Critical Suzhou Inspur Intelligent Technology Co Ltd
Priority to CN202110904891.3A priority Critical patent/CN113743394B/en
Publication of CN113743394A publication Critical patent/CN113743394A/en
Application granted granted Critical
Publication of CN113743394B publication Critical patent/CN113743394B/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a method for detecting characters in a tag, comprising the following steps: detecting the position of the tag in a captured picture and cropping it to obtain a tag picture; performing position correction based on the shape of the character region in the tag picture, then scaling and normalizing the corrected tag picture to obtain a character region picture of a preset size; performing text box detection on the character region picture with a neural network model to obtain text box information and a character count; and judging whether the character count reaches a preset character count, and if so, outputting the text box information. The invention also discloses a device for detecting characters in a tag, a computer device, and a readable storage medium. By combining tag picture extraction, feature fusion, and text box merging to extract the tag characters, the invention achieves output text box information that can be applied directly on terminal equipment.

Description

Method, device, equipment and readable medium for detecting characters in tag
Technical Field
The present invention relates to the field of image recognition technologies, and in particular, to a method, an apparatus, a device, and a readable medium for detecting characters in a tag.
Background
In the field of computer vision, characters are conventionally detected with morphological methods. Character detection in simple scenes, such as locating the character region in a photographed page of a book, can be achieved with basic image-morphology operations such as dilation and erosion.
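As an illustrative sketch (not part of the patented implementation), the basic morphological operations mentioned above, dilation ("expansion") and erosion ("corrosion"), can be expressed on a binary image in plain NumPy; the 3×3 square structuring element is an assumption:

```python
import numpy as np

def dilate(img, k=3):
    """Binary dilation: max filter over a k x k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=0)
    out = np.zeros_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].max()
    return out

def erode(img, k=3):
    """Binary erosion: min filter over a k x k neighbourhood."""
    pad = k // 2
    padded = np.pad(img, pad, mode="constant", constant_values=1)
    out = np.zeros_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + k, j:j + k].min()
    return out

# A lone foreground pixel grows into a 3x3 block under dilation,
# and eroding that block shrinks it back to the single pixel.
img = np.zeros((5, 5), dtype=np.uint8)
img[2, 2] = 1
grown = dilate(img)
shrunk = erode(grown)
```

In practice a library routine such as OpenCV's `cv2.dilate`/`cv2.erode` would replace these loops; the sketch only illustrates the max/min-filter semantics.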
However, these conventional methods have unavoidable drawbacks. They generally scan the whole image, so many interfering pixels from the natural scene, such as symbols or objects that resemble characters, are introduced, which makes it harder to locate the target with hand-crafted features. In addition, most conventional detection methods rely on a manually set threshold to adjust detection sensitivity: too high a threshold lets in so many interfering elements that hand-crafted feature matching becomes infeasible, while too low a threshold may cause the desired characters to be missed.
With the development of artificial intelligence, deep-learning-based character detection has been studied extensively over the past few years, and several character detection networks based on CNNs (Convolutional Neural Networks) have emerged, such as the classic CTPN (Connectionist Text Proposal Network, introduced in "Detecting Text in Natural Image with Connectionist Text Proposal Network"), which detects text by connecting pre-selected anchor boxes. Deep-learning-based detection has achieved good results, and its detection performance keeps improving along with deep convolutional network architectures.
However, the network models used by these deep-learning character detection methods are large and computationally expensive, so they are difficult to deploy directly on terminal devices, whose storage and compute capacity are limited.
Disclosure of Invention
Accordingly, an object of the embodiments of the present invention is to provide a method, apparatus, device, and readable medium for detecting characters in a tag, which extract the tag characters through tag picture extraction, feature fusion, and text box merging, so that the output text box information can be applied directly on a terminal device.
Based on the above object, an aspect of the embodiments of the present invention provides a method for detecting characters in a tag, including the following steps: detecting the positions of the labels in the shot pictures and cutting to obtain the label pictures; performing position correction based on the shape of the character region in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character region picture with a preset size; performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters; and judging whether the number of characters reaches the preset number of characters, and if so, outputting the text box information.
In some embodiments, further comprising: if the number of the characters does not reach the preset number of the characters, carrying out text box merging or non-maximum value suppression processing to obtain final text box information.
In some embodiments, performing text box merging or non-maximum suppression processing to obtain final text box information further includes: generating a score based on the text box information and the number of characters; judging whether the midpoint distance in the text box information is smaller than a preset midpoint distance or not; if the midpoint distance in the text box information is smaller than the preset midpoint distance, text box combination is carried out according to the score to obtain final text box information; and if the midpoint distance in the text box information is not smaller than the preset midpoint distance, performing non-maximum suppression processing to obtain final text box information.
In some embodiments, detecting the tag location in the captured picture and cropping to obtain the tag picture includes: acquiring a shot picture, and detecting the position of a label in the shot picture based on an edge detection algorithm; and cutting based on the label position to obtain a label picture.
In some embodiments, performing position correction based on the shape of the character region in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character region picture with a preset size includes: extracting four vertexes of a character area in the tag picture, and correcting the tag picture based on the four vertexes; and uniformly scaling the corrected tag pictures to a preset size, scaling the long sides to the preset size, and filling the short sides in gray scale based on the preset size to obtain the character region pictures.
In some embodiments, performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters includes: performing feature extraction and feature fusion on the character region picture based on a neural network model; and outputting text box information and the number of characters in an up-sampling mode by passing the character region picture through a convolution layer and a maximum pooling layer.
In some embodiments, further comprising: and generating a score based on the text box information and the number of characters, and optimizing parameters of the neural network model based on the score.
In another aspect of the embodiment of the present invention, there is also provided a device for detecting characters in a tag, including: the first module is configured to detect the positions of the labels in the shot pictures and cut the labels so as to obtain the label pictures; the second module is configured to perform position correction based on the shape of the character area in the tag picture, and perform scaling and normalization processing on the corrected tag picture to obtain a character area picture with a preset size; the third module is configured to perform text box detection on the character area picture based on the neural network model so as to obtain text box information and the number of characters; and a fourth module configured to determine whether the number of characters reaches a preset number of characters, and if so, output the text box information.
In still another aspect of the embodiment of the present invention, there is also provided a computer apparatus, including: at least one processor; and a memory storing computer instructions executable on the processor, the instructions when executed by the processor performing steps of a method comprising: detecting the positions of the labels in the shot pictures and cutting to obtain the label pictures; performing position correction based on the shape of the character region in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character region picture with a preset size; performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters; and judging whether the number of characters reaches the preset number of characters, and if so, outputting the text box information.
In some embodiments, further comprising: if the number of the characters does not reach the preset number of the characters, carrying out text box merging or non-maximum value suppression processing to obtain final text box information.
In some embodiments, performing text box merging or non-maximum suppression processing to obtain final text box information further includes: generating a score based on the text box information and the number of characters; judging whether the midpoint distance in the text box information is smaller than a preset midpoint distance or not; if the midpoint distance in the text box information is smaller than the preset midpoint distance, text box combination is carried out according to the score to obtain final text box information; and if the midpoint distance in the text box information is not smaller than the preset midpoint distance, performing non-maximum suppression processing to obtain final text box information.
In some embodiments, detecting the tag location in the captured picture and cropping to obtain the tag picture includes: acquiring a shot picture, and detecting the position of a label in the shot picture based on an edge detection algorithm; and cutting based on the label position to obtain a label picture.
In some embodiments, performing position correction based on the shape of the character region in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character region picture with a preset size includes: extracting four vertexes of a character area in the tag picture, and correcting the tag picture based on the four vertexes; and uniformly scaling the corrected tag pictures to a preset size, scaling the long sides to the preset size, and filling the short sides in gray scale based on the preset size to obtain the character region pictures.
In some embodiments, performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters includes: performing feature extraction and feature fusion on the character region picture based on a neural network model; and outputting text box information and the number of characters in an up-sampling mode by passing the character region picture through a convolution layer and a maximum pooling layer.
In some embodiments, further comprising: and generating a score based on the text box information and the number of characters, and optimizing parameters of the neural network model based on the score.
In yet another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method steps as described above.
The invention has the following beneficial technical effect: the tag characters are extracted by combining tag picture extraction, feature fusion, and text box merging, so that the output text box information can be applied directly on terminal equipment.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings that are necessary for the description of the embodiments or the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the invention and that other embodiments may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an embodiment of a method for detecting characters in a tag according to the present invention;
FIG. 2 is a schematic diagram of an embodiment of a device for detecting characters in a tag according to the present invention;
FIG. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention;
fig. 4 is a schematic diagram of an embodiment of a computer readable storage medium provided by the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the following embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
It should be noted that, in the embodiments of the present invention, all the expressions "first" and "second" are used to distinguish two entities with the same name but different entities or different parameters, and it is noted that the "first" and "second" are only used for convenience of expression, and should not be construed as limiting the embodiments of the present invention, and the following embodiments are not described one by one.
Based on the above object, in a first aspect of the embodiments of the present invention, an embodiment of a method for detecting characters in a tag is provided. Fig. 1 is a schematic diagram of an embodiment of a method for detecting characters in a tag according to the present invention. As shown in fig. 1, the embodiment of the present invention includes the following steps:
s01, detecting the position of a tag in a shot picture and cutting to obtain the tag picture;
s02, performing position correction based on the shape of a character area in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character area picture with a preset size;
s03, carrying out text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters; and
s04, judging whether the number of characters reaches the preset number of characters, and if the number of characters reaches the preset number of characters, outputting text box information.
In this embodiment, the problem to be solved is calibrating the character positions on a tag. The tag background is uniform and the character layout is regular, with no rotation or tilt, so the corresponding positions could be detected with a traditional OCR character detection model, but only inefficiently. The invention instead extracts the tag characters through tag picture extraction, feature fusion, text box merging, and similar techniques, so that the result can be applied directly on the terminal.
In some embodiments of the invention, further comprising: if the number of characters does not reach the preset number of characters, text box merging or non-maximum value suppression processing is carried out to obtain final text box information.
In this embodiment, text boxes are first merged according to the score values and the deviation of the center points, and non-maximum suppression merging is then applied.
In some embodiments of the present invention, performing text box merging or non-maximum suppression processing to obtain final text box information further includes: generating a score based on the text box information and the number of characters; judging whether the midpoint distance in the text box information is smaller than a preset midpoint distance or not; if the midpoint distance in the text box information is smaller than the preset midpoint distance, merging the text boxes according to the score to obtain final text box information; and if the midpoint distance in the text box information is not smaller than the preset midpoint distance, performing non-maximum value inhibition processing to obtain the final text box information.
In this embodiment, the text boxes are sorted by score and processed in order from the highest score to the lowest. If the distance between the center points of two boxes is below the configured center-point distance threshold, the boxes are merged with score-based weights; if the center-point distance threshold is not met, NMS processing is applied to the text detection boxes. The result is a set of text detection boxes with their corresponding scores.
NMS (Non-Maximum Suppression) processing is widely used in many computer vision tasks, such as edge detection and object detection. Its purpose is to remove redundant detection boxes, leaving only the best one.
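A minimal sketch of the greedy NMS procedure described above, keeping the highest-scoring box and discarding overlapping ones; the corner-coordinate box format and the IoU threshold of 0.5 are illustrative assumptions, not values from the patent:

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy NMS: repeatedly keep the best-scoring box and drop
    all remaining boxes that overlap it above the threshold."""
    order = np.argsort(scores)[::-1]        # indices, best score first
    keep = []
    while len(order) > 0:
        best = order[0]
        keep.append(best)
        order = np.array([i for i in order[1:]
                          if iou(boxes[best], boxes[i]) < iou_thresh])
    return keep

boxes = np.array([[0, 0, 10, 10], [1, 1, 11, 11], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # box 1 overlaps box 0 heavily and is suppressed
```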
In some embodiments of the present invention, detecting the tag position in the captured image and cropping to obtain the tag image includes: acquiring a shot picture, and detecting the position of a label in the shot picture based on an edge detection algorithm; and cutting based on the label position to obtain a label picture.
In this embodiment, detecting the tag position in the captured picture with an edge detection algorithm involves extracting the tag contour with the Canny operator and binarizing the picture according to the color difference between the tag and the box, for example mapping tag pixels to 1 and box pixels to 0. Edge detection is a classic CV (computer vision) problem, and the Canny algorithm is a classic CV solution. The Canny edge operator is designed around three objectives: first, optimal detection with no spurious responses, i.e., no important edge is lost and no false edge is reported; second, minimal deviation between the detected edge position and the actual edge; third, suppression of multiple responses to a single edge so that each edge yields exactly one response. The first objective amounts to reducing the noise response; the second is correctness, i.e., edges are detected at the correct position; and the third is localizing each single edge point to one brightness change. Canny showed that the Gaussian operator is optimal for image smoothing. Picture binarization sets the gray value of each pixel to 0 or 255, so that the whole image shows a stark black-and-white effect.
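The binarization step can be sketched as a simple threshold map; a real pipeline would likely use OpenCV (`cv2.Canny` for the contour, `cv2.threshold` for binarization), but plain NumPy keeps the sketch self-contained, and the threshold value 128 is an assumption:

```python
import numpy as np

def binarize(gray, thresh=128):
    """Set each pixel to 255 if above the threshold, else 0,
    producing the stark black-and-white image described above."""
    return np.where(gray > thresh, 255, 0).astype(np.uint8)

gray = np.array([[10, 200],
                 [130, 90]], dtype=np.uint8)
bw = binarize(gray)
```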
In some embodiments of the present invention, performing position correction based on a shape of a character region in a tag image, and performing scaling and normalization processing on the corrected tag image to obtain a character region image with a preset size includes: extracting four vertexes of a character area in the tag picture, and correcting the tag picture based on the four vertexes; and uniformly scaling the corrected label pictures to a preset size, wherein the long sides are scaled to the preset size, and gray filling is carried out on the short sides based on the preset size to obtain character region pictures.
In this embodiment, working on the binary image, the picture is first eroded so that the barcode area merges into one connected region, a closing operation is then applied, the four vertex positions of the barcode are extracted, and the picture is rectified according to these located positions.
In this embodiment, taking scaling of the corrected tag picture to 512×512 as an example, the picture is scaled without changing its aspect ratio: the longer side is scaled to 512 and the shorter side is padded with gray.
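A sketch of this aspect-ratio-preserving scale-and-pad ("letterbox") step; the nearest-neighbour resize stands in for whatever interpolation the real pipeline uses, and the gray padding value of 128 is an assumption:

```python
import numpy as np

def resize_nn(img, new_h, new_w):
    """Nearest-neighbour resize via index mapping (stand-in for real interpolation)."""
    h, w = img.shape[:2]
    rows = np.arange(new_h) * h // new_h
    cols = np.arange(new_w) * w // new_w
    return img[rows][:, cols]

def letterbox(img, size=512, pad_value=128):
    """Scale the longer side to `size`, keep the aspect ratio,
    and pad the shorter side with gray so the output is size x size."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    nh, nw = round(h * scale), round(w * scale)
    resized = resize_nn(img, nh, nw)
    canvas = np.full((size, size) + img.shape[2:], pad_value, dtype=img.dtype)
    canvas[:nh, :nw] = resized
    return canvas

img = np.zeros((256, 512, 3), dtype=np.uint8)   # a 2:1 tag crop
out = letterbox(img)                            # top half image, bottom half gray
```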
In some embodiments of the present invention, performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters includes: performing feature extraction and feature fusion on the character region picture based on the neural network model; and outputting text box information and the number of characters in an up-sampling mode by passing the character region picture through a convolution layer and a maximum pooling layer.
In this embodiment, take an input of shape (1,512,512,3) as an example, where the first dimension 1 is the batch size (one picture), the second and third dimensions give the 512×512 picture size, and the fourth dimension 3 is the number of channels. After the first convolution layer and max-pooling layer the output becomes (1,128,128,64); after the second convolution layer and max-pooling layer, (1,64,64,128); after the third, (1,32,32,256); after the fourth, (1,16,16,512). That output is then upsampled (the fourth upsampling) and merged with the (1,32,32,256) output of the third stage, becoming (1,32,32,256) after the fourth convolution fusion layer; upsampled again (the third upsampling) and merged with the (1,64,64,128) output of the second stage, becoming (1,64,64,128) after the third convolution fusion layer; and upsampled once more and merged with the (1,128,128,64) output of the first stage, becoming (1,128,128,64) after the second convolution fusion layer.
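The shape bookkeeping of this encoder-decoder walkthrough can be checked with a small sketch; this only tracks tensor shapes, not actual convolutions, and the assumption that each fusion step restores the matching encoder shape is taken from the paragraph above:

```python
def stage(shape, out_ch, pool=2):
    """One 'convolution + max-pooling' stage: the channel count changes
    and pooling divides height and width by `pool`."""
    n, h, w, _ = shape
    return (n, h // pool, w // pool, out_ch)

def upsample_merge(shape_in, skip):
    """2x upsampling followed by merging with the matching encoder output;
    per the walkthrough, the convolution fusion layer restores the skip shape."""
    return skip

x0 = (1, 512, 512, 3)
e1 = stage(x0, 64, pool=4)    # first conv + max-pool: (1, 128, 128, 64)
e2 = stage(e1, 128)           # second stage:          (1, 64, 64, 128)
e3 = stage(e2, 256)           # third stage:           (1, 32, 32, 256)
e4 = stage(e3, 512)           # fourth stage:          (1, 16, 16, 512)
d3 = upsample_merge(e4, e3)   # fourth upsample + fusion -> (1, 32, 32, 256)
d2 = upsample_merge(d3, e2)   # third upsample + fusion  -> (1, 64, 64, 128)
d1 = upsample_merge(d2, e1)   # final upsample + fusion  -> (1, 128, 128, 64)
```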
In some embodiments of the invention, further comprising: and generating a score based on the text box information and the number of characters, and optimizing parameters of the neural network model based on the score.
In this embodiment, the text box information consists of a midpoint and side lengths, stored as a float-format array such as [center point, side length 1, side length 2]; the score is a float number.
In this embodiment, during back-propagation, the model's loss function is built from the score loss Loss_sco, the midpoint and side-length loss Loss_g, and the character-count loss Loss_n, according to the formula:

Loss = α·Loss_sco + β·Loss_g + λ·Loss_n

where the score loss Loss_sco is computed with class-balanced cross entropy, and the values of α, β, and λ are tuned according to training conditions.
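The weighted loss above can be sketched as follows; the class-balanced cross entropy shown is one common formulation (positive/negative terms re-weighted by class frequency) and its exact form in the patent, like the α, β, λ values, is not specified:

```python
import math

def class_balanced_bce(y_true, y_pred, eps=1e-7):
    """Cross entropy with the positive/negative terms re-weighted by class
    frequency, so the sparse text pixels are not drowned out by background."""
    pos = sum(y_true)
    beta = 1.0 - pos / len(y_true)          # weight for the positive class
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)     # clamp to avoid log(0)
        total += -(beta * t * math.log(p)
                   + (1.0 - beta) * (1 - t) * math.log(1.0 - p))
    return total / len(y_true)

def total_loss(loss_sco, loss_g, loss_n, alpha=1.0, beta=1.0, lam=1.0):
    """Weighted sum from the formula: Loss = a*Loss_sco + b*Loss_g + l*Loss_n."""
    return alpha * loss_sco + beta * loss_g + lam * loss_n

# Near-perfect predictions on a sparse positive map give a small score loss.
l_sco = class_balanced_bce([1, 0, 0, 0], [0.99, 0.01, 0.01, 0.01])
```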
It should be noted that, as those skilled in the art will understand, the steps in the above embodiments of the method for detecting characters in a tag may be interchanged, replaced, added, or deleted; methods for detecting characters in a tag obtained by such reasonable permutations and combinations also fall within the protection scope of the present invention, which should not be limited to the embodiments described.
Based on the above object, a second aspect of the embodiments of the present invention provides a device for detecting characters in a tag. Fig. 2 is a schematic diagram of an embodiment of a device for detecting characters in a tag according to the present invention. As shown in fig. 2, the embodiment of the invention includes the following modules: the first module S11 is configured to detect the position of the tag in the shot picture and cut the tag to obtain a tag picture; a second module S12, configured to perform position correction based on the shape of the character area in the tag picture, and perform scaling and normalization processing on the corrected tag picture to obtain a character area picture with a preset size; a third module S13 configured to perform text box detection on the character area picture based on the neural network model, so as to obtain text box information and the number of characters; and a fourth module S14 configured to determine whether the number of characters reaches a preset number of characters, and if the number of characters reaches the preset number of characters, output text box information.
Based on the above object, a third aspect of the embodiments of the present invention proposes a computer device. Fig. 3 is a schematic diagram of an embodiment of a computer device provided by the present invention. As shown in fig. 3, an embodiment of the present invention includes the following means: at least one processor S21; and a memory S22, the memory S22 storing computer instructions S23 executable on the processor, the instructions when executed by the processor performing the steps of the method comprising: detecting the positions of the labels in the shot pictures and cutting to obtain the label pictures; performing position correction based on the shape of the character region in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character region picture with a preset size; performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters; and judging whether the number of characters reaches the preset number of characters, and if the number of characters reaches the preset number of characters, outputting text box information.
In some embodiments of the invention, further comprising: if the number of characters does not reach the preset number of characters, text box merging or non-maximum value suppression processing is carried out to obtain final text box information.
In some embodiments of the present invention, performing text box merging or non-maximum suppression processing to obtain final text box information further includes: generating a score based on the text box information and the number of characters; judging whether the midpoint distance in the text box information is smaller than a preset midpoint distance or not; if the midpoint distance in the text box information is smaller than the preset midpoint distance, merging the text boxes according to the score to obtain final text box information; and if the midpoint distance in the text box information is not smaller than the preset midpoint distance, performing non-maximum value inhibition processing to obtain the final text box information.
In some embodiments of the present invention, detecting the tag position in the captured image and cropping to obtain the tag image includes: acquiring a shot picture, and detecting the position of a label in the shot picture based on an edge detection algorithm; and cutting based on the label position to obtain a label picture.
In some embodiments of the present invention, performing position correction based on a shape of a character region in a tag image, and performing scaling and normalization processing on the corrected tag image to obtain a character region image with a preset size includes: extracting four vertexes of a character area in the tag picture, and correcting the tag picture based on the four vertexes; and uniformly scaling the corrected label pictures to a preset size, wherein the long sides are scaled to the preset size, and gray filling is carried out on the short sides based on the preset size to obtain character region pictures.
In some embodiments of the present invention, performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters includes: performing feature extraction and feature fusion on the character region picture based on the neural network model; and outputting text box information and the number of characters in an up-sampling mode by passing the character region picture through a convolution layer and a maximum pooling layer.
In some embodiments of the invention, further comprising: and generating a score based on the text box information and the number of characters, and optimizing parameters of the neural network model based on the score.
The invention also provides a computer readable storage medium. Fig. 4 is a schematic diagram of an embodiment of a computer-readable storage medium provided by the present invention. As shown in fig. 4, the computer-readable storage medium S31 stores a computer program S32 that, when executed by a processor, performs the method as described above.
Finally, it should be noted that, as those skilled in the art will understand, all or part of the methods in the above embodiments may be implemented by a computer program instructing related hardware. The program of the method for detecting characters in a tag may be stored in a computer-readable storage medium and, when executed, may include the steps of the method embodiments described above. The storage medium of the program may be a magnetic disk, an optical disk, a read-only memory (ROM), a random-access memory (RAM), or the like. The computer program embodiments described above may achieve the same or similar effects as any of the method embodiments described above.
Furthermore, the method disclosed according to the embodiment of the present invention may also be implemented as a computer program executed by a processor, which may be stored in a computer-readable storage medium. The above-described functions defined in the methods disclosed in the embodiments of the present invention are performed when the computer program is executed by a processor.
Furthermore, the above-described method steps and system units may also be implemented using a controller and a computer-readable storage medium storing a computer program for causing the controller to implement the above-described steps or unit functions.
Those of skill would further appreciate that the various illustrative logical blocks, modules, circuits, and algorithm steps described in connection with the disclosure herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate this interchangeability of hardware and software, various illustrative components, blocks, modules, circuits, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as software or hardware depends upon the particular application and design constraints imposed on the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present disclosure.
In one or more exemplary designs, the functions may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over as one or more instructions or code on a computer-readable medium. Computer-readable media includes both computer storage media and communication media, including any medium that facilitates transfer of a computer program from one location to another. A storage medium may be any available medium that can be accessed by a general purpose or special purpose computer. By way of example, and not limitation, such computer-readable media can comprise RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general purpose or special purpose computer or a general purpose or special purpose processor. Further, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, Digital Versatile Disc (DVD), floppy disk, and Blu-ray disc, where disks usually reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
The foregoing is an exemplary embodiment of the present disclosure, but it should be noted that various changes and modifications could be made herein without departing from the scope of the disclosure as defined by the appended claims. The functions, steps and/or actions of the method claims in accordance with the disclosed embodiments described herein need not be performed in any particular order. Furthermore, although elements of the disclosed embodiments may be described or claimed in the singular, the plural is contemplated unless limitation to the singular is explicitly stated.
It should be understood that as used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly supports the exception. It should also be understood that "and/or" as used herein is meant to include any and all possible combinations of one or more of the associated listed items.
The serial numbers of the foregoing embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, and the program may be stored in a computer readable storage medium, where the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
Those of ordinary skill in the art will appreciate that: the above discussion of any embodiment is merely exemplary and is not intended to imply that the scope of the disclosure of embodiments of the invention, including the claims, is limited to such examples; combinations of features of the above embodiments or in different embodiments are also possible within the idea of an embodiment of the invention, and many other variations of the different aspects of the embodiments of the invention as described above exist, which are not provided in detail for the sake of brevity. Therefore, any omission, modification, equivalent replacement, improvement, etc. of the embodiments should be included in the protection scope of the embodiments of the present invention.

Claims (9)

1. A method for detecting characters in a tag, characterized by comprising the following steps:
detecting the position of a tag in a captured picture and cropping to obtain a tag picture;
performing position correction based on the shape of the character region in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character region picture with a preset size;
performing text box detection on the character region picture based on the neural network model to obtain text box information and the number of characters;
judging whether the number of characters reaches a preset number of characters, and if so, outputting the text box information; and
if the number of characters does not reach the preset number of characters, performing text box merging or non-maximum suppression processing to obtain final text box information.
2. The method of claim 1, wherein performing text box merging or non-maximum suppression processing to obtain final text box information further comprises:
generating a score based on the text box information and the number of characters;
judging whether the midpoint distance in the text box information is smaller than a preset midpoint distance;
if the midpoint distance in the text box information is smaller than the preset midpoint distance, performing text box merging according to the score to obtain final text box information;
and if the midpoint distance in the text box information is not smaller than the preset midpoint distance, performing non-maximum suppression processing to obtain final text box information.
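One possible reading of the merge-or-suppress branch described above is sketched below. The greedy score ordering, the enclosing-box merge, and the IoU-based suppression for boxes with distant midpoints are illustrative assumptions; the patent leaves these details open.

```python
import math

def center(box):
    """Midpoint of an axis-aligned box (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def merge_or_suppress(boxes, scores, midpoint_threshold=20.0, iou_threshold=0.5):
    """Process boxes in descending score order: merge a box into an already-kept
    box when their midpoints are close; otherwise suppress it if it overlaps a
    kept box heavily (classic non-maximum suppression)."""
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    kept = []
    for i in order:
        box = boxes[i]
        cx, cy = center(box)
        absorbed = False
        for k, kbox in enumerate(kept):
            kcx, kcy = center(kbox)
            if math.hypot(cx - kcx, cy - kcy) < midpoint_threshold:
                # Midpoints are close: merge into one enclosing box
                kept[k] = (min(box[0], kbox[0]), min(box[1], kbox[1]),
                           max(box[2], kbox[2]), max(box[3], kbox[3]))
                absorbed = True
                break
            if iou(box, kbox) > iou_threshold:
                # Midpoints far apart but boxes overlap heavily: suppress
                absorbed = True
                break
        if not absorbed:
            kept.append(box)
    return kept
```

The thresholds (20 pixels, 0.5 IoU) are placeholders for the "preset midpoint distance" and suppression criterion, whose actual values the claims do not fix.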
3. The method for detecting characters in a tag according to claim 1, wherein detecting the position of the tag in the captured picture and cropping to obtain the tag picture comprises:
acquiring the captured picture, and detecting the tag position in it based on an edge detection algorithm;
and cropping at the tag position to obtain the tag picture.
4. The method for detecting characters in a tag according to claim 1, wherein performing position correction based on the shape of a character region in the tag picture, and performing scaling and normalization processing on the corrected tag picture to obtain a character region picture of a preset size comprises:
extracting the four vertices of the character region in the tag picture, and correcting the tag picture based on the four vertices;
and uniformly scaling the corrected tag picture to a preset size, scaling the long side to the preset size, and filling the short side with gray based on the preset size to obtain the character region picture.
5. The method for detecting characters in a tag according to claim 1, wherein performing text box detection on the character region picture based on a neural network model to obtain text box information and the number of characters comprises:
performing feature extraction and feature fusion on the character region picture based on a neural network model;
and outputting text box information and the number of characters in an up-sampling mode by passing the character region picture through a convolution layer and a maximum pooling layer.
6. The method for detecting characters in a tag of claim 5, further comprising:
and generating a score based on the text box information and the number of characters, and optimizing parameters of the neural network model based on the score.
7. A device for detecting characters in a tag, characterized by comprising:
a first module configured to detect the position of a tag in a captured picture and crop it to obtain a tag picture;
a second module configured to perform position correction based on the shape of the character region in the tag picture, and to scale and normalize the corrected tag picture to obtain a character region picture of a preset size;
a third module configured to perform text box detection on the character region picture based on a neural network model to obtain text box information and the number of characters;
a fourth module configured to judge whether the number of characters reaches a preset number of characters, and if so, output the text box information; and
a fifth module configured to perform text box merging or non-maximum suppression processing to obtain final text box information if the number of characters does not reach the preset number of characters.
8. A computer device, comprising:
at least one processor; and
a memory storing computer instructions executable on the processor, which when executed by the processor, perform the steps of the method of any one of claims 1-6.
9. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method of any one of claims 1-6.
CN202110904891.3A 2021-08-07 2021-08-07 Method, device, equipment and readable medium for detecting characters in tag Active CN113743394B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110904891.3A CN113743394B (en) 2021-08-07 2021-08-07 Method, device, equipment and readable medium for detecting characters in tag

Publications (2)

Publication Number Publication Date
CN113743394A CN113743394A (en) 2021-12-03
CN113743394B true CN113743394B (en) 2023-08-11

Family

ID=78730505

Country Status (1)

Country Link
CN (1) CN113743394B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107016387A (en) * 2016-01-28 2017-08-04 苏宁云商集团股份有限公司 A kind of method and device for recognizing label
CN110070090A (en) * 2019-04-25 2019-07-30 上海大学 A kind of logistic label information detecting method and system based on handwriting identification
CN112651353A (en) * 2020-12-30 2021-04-13 南京红松信息技术有限公司 Target mental calculation positioning and identifying method based on custom label
CN112966691A (en) * 2021-04-14 2021-06-15 重庆邮电大学 Multi-scale text detection method and device based on semantic segmentation and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant