CN111079749B

CN111079749B - End-to-end commodity price tag character recognition method and system with gesture correction

Info

Publication number: CN111079749B
Application number: CN201911273581.5A
Authority: CN
Inventors: 秦永强; 张发恩; 高达辉
Original assignee: Ainnovation Chongqing Technology Co ltd
Current assignee: Ainnovation Chongqing Technology Co ltd
Priority date: 2019-12-12
Filing date: 2019-12-12
Publication date: 2023-12-22
Anticipated expiration: 2039-12-12
Also published as: CN111079749A

Abstract

The invention provides an end-to-end commodity price tag character recognition method and system with gesture correction, belonging to the technical field of computer vision. The method comprises the following steps: acquiring a commodity price tag image and extracting features to obtain a corresponding feature map; performing region selection processing on the feature map to obtain a text suggestion region; performing segmentation processing on the text suggestion region to obtain a processed text suggestion region, and performing graphic expansion processing on the processed text suggestion region to obtain a character feature map; performing key point detection processing on the character feature map to obtain a plurality of key points surrounding the character feature map; performing gesture correction processing on the character feature map according to the plurality of key points by using thin plate spline interpolation to obtain a horizontal feature map to be processed with a fixed size; and performing literal processing on the feature map to be processed to obtain the corresponding text. The beneficial effects of the invention are that the robustness and efficiency of character recognition in complex scenes can be improved.

Description

End-to-end commodity price tag character recognition method and system with gesture correction

Technical Field

The invention relates to the field of computer vision, in particular to an end-to-end commodity price tag character recognition method and system with gesture correction.

Background

Recognizing commodity price tags in channel display images through computer vision technology, and thereby obtaining commodity price information, has become an important means for fast-moving consumer goods brand owners to manage and control terminal prices at their distribution terminals. In this scheme, fast and accurate recognition of commodity prices depends chiefly on accurate recognition of the characters on the price tag.

Because of the image shooting angle, commodity price tags in images appear in arbitrary postures, and the orientation and posture of the characters on the price tags are uncertain, which makes accurate character recognition very difficult. In addition, commodity price recognition based on computer vision technology generally has strict timeliness requirements, and near real-time recognition speed is required. However, the number of price tags in a single channel display image is usually large (typically up to tens), and a single tag may also contain up to tens of text fields, which poses a significant challenge for recognition speed.

Most existing character recognition schemes follow a three-stage pipeline of character detection, gesture correction and character recognition: a character detection algorithm first locates the characters, the character image region is then cropped and gesture-corrected by image processing techniques (affine transformation, perspective transformation and the like), and a character recognition algorithm finally performs recognition. This approach performs character recognition step by step through several stages and has two main defects:

1) Low recognition efficiency

The character detection stage and the character recognition stage both perform feature extraction on the same image region, which causes repeated calculation. Feature extraction often accounts for most of the total calculation, so commodity price recognition for a single channel display image takes particularly long, usually tens of seconds to several minutes, which makes it difficult to meet real-time requirements.

2) Insufficient algorithm robustness

Character recognition is typically performed after gesture correction. Existing gesture correction algorithms are basically applied after a strict character region (such as an arbitrary quadrilateral or a rotated rectangular frame) has been determined, and after gesture correction the whole region of the input character image (including interference information) takes part in character recognition. Loss of character information (the frame covers too little of the character region) and added interference information (the frame covers too much) caused by inaccurate character regions cannot be corrected. In other words, these algorithms are sensitive to the positioning accuracy of the character frame, and their robustness is insufficient.

In order to improve the robustness of character recognition to gesture, the prior art provides a character recognition algorithm with gesture correction in which a spatial transformation module is added to the algorithm model. Based on a plurality of key points predicted by the model, an effective character region in the input image is selected for gesture correction, so that characters with different gestures can be recognized; the algorithm is insensitive to redundant interference information in the input character image and achieves better results. However, it still requires cropped text segment images as input, so text features are extracted repeatedly, and it cannot be trained end to end together with text detection.

In the aspect of end-to-end character recognition, a great deal of work has been reported in the literature, most of which still adopts multi-stage joint training. One character recognition algorithm proposed in the prior art directly crops the character region of interest on the feature map for character recognition, which avoids repeated feature extraction and allows the tasks to promote each other through multi-task training, but it does not consider character gesture correction. Another prior-art approach performs gesture correction by applying an affine transformation to the segmented character feature region of interest; this cannot correct more complicated gestures such as perspective distortion, and it cannot solve the loss of character region information (the frame covers too little of the effective character region).

Disclosure of Invention

The invention aims to provide an end-to-end commodity price tag character recognition method with gesture correction, which is applied to channel display, scene character recognition and similar scenes and can improve the robustness and efficiency of complex scene character recognition.

To achieve the purpose, the invention adopts the following technical scheme:

Provided is an end-to-end commodity price tag character recognition method with gesture correction, comprising the following steps:

S1, acquiring a commodity price tag image and extracting features to obtain a corresponding feature map;

S2, performing region selection processing on the feature map to obtain a text suggestion region;

S3, performing segmentation processing on the text suggestion region to obtain a processed text suggestion region, and performing graphic expansion processing on the processed text suggestion region to obtain a character feature map;

S4, performing key point detection processing on the character feature map to obtain a plurality of key points surrounding the character feature map;

S5, performing gesture correction processing on the character feature map according to the plurality of key points by using thin plate spline interpolation to obtain a horizontal feature map to be processed with a fixed size;

S6, performing literal processing on the feature map to be processed to obtain the corresponding text.

In the step S1, feature extraction is performed on the commodity price tag image by using a deep learning network to extract character features and obtain the multidimensional feature map.

In the step S2, the character suggestion area and the position of the circumscribed rectangle frame are obtained by carrying out the area selection processing on the feature map by utilizing an RPN network.

As a preferred solution of the end-to-end commodity price tag text recognition method with gesture correction, in the step S3, the specific steps of the segmentation process include:

step S31, carrying out de-duplication processing and up-sampling processing on the text suggestion region to obtain at least one high-resolution region, wherein the resolution of the high-resolution region is higher than that of the text suggestion region;

step S32, respectively carrying out pixel-by-pixel segmentation processing on each high-resolution area to obtain a segmentation probability image and attribute probability information of each pixel point in the segmentation probability image, wherein the attribute probability information is used for indicating whether the pixel point is a character and a probability value of the character;

step S33, performing region score calculation processing on each of the segmentation probability images to obtain an average value of the probability values of all the pixel points with characters as attributes in the segmentation probability images, and judging whether the average value corresponding to each of the segmentation probability images is greater than a preset threshold value or not:

if the judgment result is yes, reserving the segmentation probability image;

and if the judgment result is negative, deleting the segmentation probability image.
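The de-duplication in step S31 is not detailed at this point; the system embodiment names an NMS unit, so greedy non-maximum suppression over the circumscribed rectangular frames is one plausible reading. The Python sketch below assumes axis-aligned boxes given as `[x1, y1, x2, y2]`; the function names and the `iou_threshold` value are illustrative assumptions and do not come from the patent.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes [x1, y1, x2, y2]."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.5):
    """Greedy NMS: visit boxes by descending score and keep a box only if it
    does not overlap an already kept box by more than iou_threshold.
    Returns the indices of the kept boxes (threshold value is an assumption)."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep
```

Two heavily overlapping suggestions for the same text field collapse to the higher-scoring one, while a distant suggestion is left untouched.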

As a preferred scheme of the character recognition method for the end-to-end commodity price tag with gesture correction, in the step S3, the specific steps of the graphic expansion processing include:

and step S34, performing outward expansion on the segmentation probability image according to the length and width dimensions of the segmentation probability image and a preset proportion to obtain the segmentation probability image after outward expansion and a peripheral part image surrounding the segmentation probability image after outward expansion as the character feature map.

As a preferred embodiment of the method for recognizing the characters of the end-to-end commodity price tag with gesture correction, in the step S4, the key point detection processing is performed on the character feature map by using key point detection with an attention mechanism to obtain a plurality of key points surrounding the character feature map of interest.

In the step S5, according to the plurality of key points and by using thin plate spline interpolation, the feature area that actually needs to be used in the character feature map is constrained, and irrelevant interference feature information is filtered out to obtain the feature map to be processed. The feature area that actually needs to be used is the valid text field attended to by the attention mechanism, the irrelevant interference feature information is the invalid text fields surrounding the valid text field, and the feature map to be processed is a horizontal feature area with a fixed size.

As a preferred embodiment of the method for recognizing characters of end-to-end commodity price tags with gesture correction, in the step S6, the specific steps of the literal processing include:

step S61, performing code conversion processing on the feature image to be processed to obtain a feature sequence with a fixed length;

step S62, calculating output features of a feature sequence with a fixed length by using an attention mechanism and the BLSTM;

and step S63, decoding the output characteristics to obtain the intelligible characters.
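Steps S61 and S63 can be illustrated with a small NumPy sketch. It is only a sketch under stated assumptions: the encoding is taken to be a mean over the height axis so that each column becomes one sequence element, the decoding is taken to be a greedy argmax that stops at an end-of-sequence symbol, and the attention + BLSTM computation of step S62 is deliberately left out. The function names, the `charset` layout and the `eos_index` convention are hypothetical.

```python
import numpy as np


def encode_to_sequence(feature_map):
    """S61 (assumed form): turn a fixed-size (C, H, W) feature map into a
    fixed-length sequence of W vectors with C channels each by averaging
    over the height axis."""
    feature_map = np.asarray(feature_map, dtype=float)
    return feature_map.mean(axis=1).T  # shape (W, C)


def greedy_decode(logits, charset, eos_index=0):
    """S63 (assumed form): pick the most likely symbol at every step of the
    (T, V) score matrix and stop at the end-of-sequence symbol."""
    chars = []
    for idx in np.asarray(logits).argmax(axis=1):
        if idx == eos_index:
            break
        chars.append(charset[idx])
    return "".join(chars)
```

Because the rectified feature map has a fixed size, the sequence length equals its width, which is what makes the fixed-length assumption in step S61 hold.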

The invention also provides an end-to-end commodity price tag character recognition system with gesture correction, which can implement the end-to-end commodity price tag character recognition method and which comprises:

the feature extraction module is used for acquiring commodity price tag images and extracting features to obtain corresponding feature images;

the character region cutting module is used for carrying out region selection processing on the feature map to obtain a character suggestion region, carrying out segmentation processing on the character suggestion region to obtain a processed character suggestion region, and carrying out graphic expansion processing on the processed character suggestion region to obtain a character feature map;

the key point detection module is used for carrying out key point detection processing on the character feature map to obtain a plurality of key points surrounding the character feature map;

the gesture correction module is used for carrying out gesture correction processing on the character feature map to obtain a feature map to be processed according to the plurality of key points and by utilizing thin plate spline interpolation;

and the literal module is used for performing literal processing on the feature map to be processed to obtain the corresponding text.

As a preferable scheme of the end-to-end commodity price tag character recognition system with gesture correction, the system performs commodity price tag character recognition based on a preset processing model, and updates and optimizes the processing model according to a recognition process and a recognition result.

The invention has the beneficial effects that: after extracting the feature map from the commodity price tag image, directly processing the feature map to obtain a processed text suggestion region for subsequent text processing, and only carrying out feature extraction once, thereby effectively improving the text recognition efficiency;

after the character suggestion area is obtained, character segmentation processing is carried out to obtain a processed character suggestion area containing effective character fields, and graphic expansion processing is carried out to obtain a character feature map, so that the problem that recognition results are affected due to the fact that part of character features are lost is solved, and the robustness and the efficiency of character recognition of complex scenes are improved;

key point detection is performed on the character feature map to obtain a plurality of key points surrounding the character feature map, and the character gesture corresponding to the character feature map is adjusted to the horizontal direction based on the key points by using thin plate spline interpolation to obtain a horizontal feature map to be processed with a fixed size, so that characters in different directions and curved characters can be recognized, which improves the robustness and efficiency of character recognition in complex scenes.

Drawings

In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are required to be used in the embodiments of the present invention will be briefly described below. It is evident that the drawings described below are only some embodiments of the present invention and that other drawings may be obtained from these drawings without inventive effort for a person of ordinary skill in the art.

FIG. 1 is a flow chart of a method for end-to-end commodity price tag text identification with gesture correction according to an embodiment of the present invention.

FIG. 2 is a flow chart of step S3 according to another embodiment of the present invention;

FIG. 3 is a flowchart of step S6 according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of functional modules of an end-to-end commodity price tag text recognition system with gesture correction according to an embodiment of the present invention.

Detailed Description

The technical scheme of the invention is further described below by the specific embodiments with reference to the accompanying drawings.

Wherein the drawings are for illustrative purposes only and are shown in schematic, non-physical, and not intended to be limiting of the present patent; for the purpose of better illustrating embodiments of the invention, certain elements of the drawings may be omitted, enlarged or reduced and do not represent the size of the actual product; it will be appreciated by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The same or similar reference numbers in the drawings of embodiments of the invention correspond to the same or similar components; in the description of the present invention, it should be understood that, if the terms "upper", "lower", "left", "right", "inner", "outer", etc. indicate orientations or positional relationships based on the orientations or positional relationships shown in the drawings, only for convenience in describing the present invention and simplifying the description, rather than indicating or implying that the apparatus or elements being referred to must have a specific orientation, be constructed and operated in a specific orientation, so that the terms describing the positional relationships in the drawings are merely for exemplary illustration and should not be construed as limiting the present patent, and that the specific meaning of the terms described above may be understood by those of ordinary skill in the art according to specific circumstances.

In the description of the present invention, unless explicitly stated and limited otherwise, the term "coupled" or the like should be interpreted broadly, as it may be fixedly coupled, detachably coupled, or integrally formed, as indicating the relationship of components; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communication between the two parts or interaction relationship between the two parts. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.

As shown in FIG. 1, the end-to-end commodity price tag character recognition method with gesture correction provided by the embodiment of the invention comprises the following steps:

In this embodiment, after the feature map is extracted from the commodity price tag image, the feature map is processed directly to obtain the processed text suggestion region for subsequent literal processing, so feature extraction is needed only once, which effectively improves character recognition efficiency.

Key point detection is performed on the character feature map to obtain a plurality of key points surrounding the character feature map, and the character gesture corresponding to the character feature map is adjusted to the horizontal direction based on the key points by using thin plate spline interpolation, yielding a horizontal feature map to be processed with a fixed size. Characters in different directions and curved characters can thus be recognized, which improves the robustness and efficiency of character recognition in complex scenes.

Further, in the step S1, feature extraction is performed on the commodity price tag image by using a deep learning network to extract text features so as to obtain the multi-dimensional feature map.

Further, in the step S2, the region selection process is performed on the feature map by using an RPN network to obtain the text suggestion region and the position of the circumscribed rectangular frame thereof.

Specifically, a regression branch is used to obtain the position of the circumscribed rectangular frame of the text suggestion region.

As shown in FIG. 2, in the step S3, the specific steps of the segmentation process include:

step S32, respectively carrying out pixel-by-pixel segmentation processing on each high-resolution area to obtain a segmentation probability image and attribute probability information of each pixel point in the segmentation probability image, wherein the attribute probability information is used for indicating whether the pixel point is a character or not and a probability value of the character;

step S33, performing region score calculation processing on each of the segmentation probability images to obtain an average value of the probability values of all the pixels with characters as attributes in the segmentation probability images, and determining whether the average value corresponding to each of the segmentation probability images is greater than a preset threshold value or not:

if the judgment result is yes, the segmentation probability image is reserved;

if the judgment result is negative, deleting the segmentation probability image.

Specifically, another segmentation branch is used to obtain a segmentation map indicating whether each pixel is a character, together with the corresponding probability map (the segmentation map and the probability map are collectively called the segmentation probability image);

the average score of each text suggestion region is then calculated from the probability values of the pixels belonging to characters in that region, and the text suggestion regions whose scores are higher than a certain threshold are retained.
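The region score of step S33 can be sketched in a few lines of NumPy. The sketch assumes that a pixel counts as a character when its probability exceeds `text_cutoff`; this cutoff, the `score_threshold` value and the function names are illustrative assumptions, since the text only speaks of "a preset threshold".

```python
import numpy as np


def region_score(prob_map, text_cutoff=0.5):
    """Mean probability over the pixels classified as characters.
    text_cutoff is an assumed rule for deciding the pixel attribute."""
    prob_map = np.asarray(prob_map, dtype=float)
    text_mask = prob_map > text_cutoff
    if not text_mask.any():
        return 0.0  # no character pixels at all
    return float(prob_map[text_mask].mean())


def filter_regions(prob_maps, score_threshold=0.7):
    """Keep only the segmentation probability images whose region score
    exceeds the preset threshold (value chosen for illustration)."""
    return [p for p in prob_maps if region_score(p) > score_threshold]
```

Averaging only over character pixels keeps the score independent of how much background the suggestion region happens to contain.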

Further, in the step S3, the specific steps of the graphic expansion processing include:

As shown in FIG. 2, in step S34, the segmentation probability image is expanded outward by a preset proportion of its length and width, and the expanded segmentation probability image together with the peripheral part image surrounding it is taken as the character feature map.

Specifically, according to the length and width dimensions of the text suggestion region, a certain proportion of expansion is performed, and then the expanded text suggestion region (i.e. the text feature map) is cut and input to the next stage.

Further, in the step S4, the key point detection processing is performed on the character feature map by using key point detection with an attention mechanism to obtain the plurality of key points surrounding the character feature map of interest.

Specifically, according to the features of the cropped text suggestion region (i.e. the character feature map), a key point detection network with an attention mechanism is used to detect k key points surrounding the character feature map of interest.
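The text does not say how the network turns attention into key point coordinates. One common choice, shown here purely as an assumption, is a spatial soft-argmax: each of the k attention maps is normalised with a softmax over all positions and the key point is the attention-weighted mean coordinate. The function name and the normalised `[0, 1]` coordinate convention are hypothetical.

```python
import numpy as np


def soft_argmax_keypoints(heatmaps):
    """Convert k raw attention maps of shape (k, H, W) into k key points.

    Returns a (k, 2) array of (x, y) coordinates normalised to [0, 1].
    This is an illustrative stand-in for the patent's key point head.
    """
    heatmaps = np.asarray(heatmaps, dtype=float)
    k, h, w = heatmaps.shape
    flat = heatmaps.reshape(k, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # numerical stability
    attn = np.exp(flat)
    attn /= attn.sum(axis=1, keepdims=True)        # softmax over positions
    attn = attn.reshape(k, h, w)
    xs = np.linspace(0.0, 1.0, w)
    ys = np.linspace(0.0, 1.0, h)
    x = (attn.sum(axis=1) * xs).sum(axis=1)        # expected column
    y = (attn.sum(axis=2) * ys).sum(axis=1)        # expected row
    return np.stack([x, y], axis=1)
```

A soft-argmax is differentiable, which fits the end-to-end training described for the system, but other regression heads would serve equally well.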

Further, in the step S5, according to the plurality of key points and by using thin-plate spline interpolation, the feature area that actually needs to be used in the character feature map is constrained, and irrelevant interference feature information is filtered out to obtain the feature map to be processed. The feature area that actually needs to be used is the valid text field attended to by the attention mechanism, the irrelevant interference feature information is the invalid text fields surrounding the valid text field, and the feature map to be processed is a horizontal feature area with a fixed size.

Specifically, according to the k key points, the feature map area of interest (namely the character feature map) is transformed into a horizontal feature area with a fixed size by using thin-plate spline interpolation.
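The thin-plate spline step can be sketched as follows. The sketch assumes that the k key points are ordered as k/2 points along the top edge of the text from left to right followed by k/2 points along the bottom edge, and that they are matched to evenly spaced points on the top and bottom borders of the output; neither the ordering nor the target layout is stated in the text. Nearest-neighbour sampling is used for brevity, whereas a trainable model would normally use bilinear sampling. All function names and default sizes are hypothetical.

```python
import numpy as np


def _tps_kernel(d2):
    # U(r) = r^2 * log(r^2), with U(0) = 0; d2 holds squared distances.
    return d2 * np.log(np.where(d2 == 0, 1.0, d2))


def fit_tps(ctrl, values):
    """Fit a thin-plate spline that maps control points ctrl (k, 2)
    onto values (k, 2). Returns the (k + 3, 2) parameter matrix."""
    k = ctrl.shape[0]
    d2 = ((ctrl[:, None, :] - ctrl[None, :, :]) ** 2).sum(-1)
    P = np.hstack([np.ones((k, 1)), ctrl])
    L = np.zeros((k + 3, k + 3))
    L[:k, :k] = _tps_kernel(d2)
    L[:k, k:] = P
    L[k:, :k] = P.T
    rhs = np.vstack([values, np.zeros((3, 2))])
    return np.linalg.solve(L, rhs)


def apply_tps(params, ctrl, pts):
    """Evaluate the fitted spline at pts (n, 2)."""
    d2 = ((pts[:, None, :] - ctrl[None, :, :]) ** 2).sum(-1)
    basis = np.hstack([_tps_kernel(d2), np.ones((len(pts), 1)), pts])
    return basis @ params


def rectify(feature_map, keypoints, out_h=8, out_w=32):
    """Warp a (C, H, W) feature map to a horizontal (C, out_h, out_w) map.
    keypoints: (k, 2) pixel (x, y) positions, top edge first, then bottom."""
    keypoints = np.asarray(keypoints, dtype=float)
    half = len(keypoints) // 2
    xs = np.linspace(0, out_w - 1, half)
    target = np.vstack([np.stack([xs, np.zeros(half)], 1),
                        np.stack([xs, np.full(half, out_h - 1.0)], 1)])
    # Map output coordinates back to source coordinates (backward warping).
    params = fit_tps(target, keypoints)
    gx, gy = np.meshgrid(np.arange(out_w), np.arange(out_h))
    grid = np.stack([gx.ravel(), gy.ravel()], 1).astype(float)
    src = apply_tps(params, target, grid)
    c, h, w = feature_map.shape
    sx = np.clip(np.rint(src[:, 0]), 0, w - 1).astype(int)
    sy = np.clip(np.rint(src[:, 1]), 0, h - 1).astype(int)
    return feature_map[:, sy, sx].reshape(c, out_h, out_w)
```

Only the area enclosed by the key points is sampled, so features of neighbouring text fields that fall outside them do not reach the recognition stage.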

As shown in FIG. 3, in the step S6, the specific steps of the literal processing include:

Specifically, an encoder + LSTM + attention structure is then used to recognize the corresponding text.

As shown in fig. 4, an end-to-end commodity price tag text recognition system with gesture correction, comprising:

the feature extraction module 1 is used for acquiring commodity price tag images and extracting features to obtain corresponding feature images, mainly based on the input commodity price tag images, extracting character features by using a convolutional neural network, and outputting a multidimensional feature image;

the text region cutting module 2 is used for carrying out region selection processing on the feature map to obtain a text suggestion region, carrying out segmentation processing on the text suggestion region to obtain a processed text suggestion region, and carrying out graphic expansion processing on the processed text suggestion region to obtain a text feature map;

the key point detection module 3 is used for performing key point detection processing on the character feature map to obtain a plurality of key points surrounding the character feature map;

the gesture correction module 4 is used for carrying out gesture correction processing on the character feature map to obtain a feature map to be processed according to the plurality of key points and by utilizing thin plate spline interpolation;

and the literal module 5 is used for literaling the feature map to be processed to obtain corresponding words.

Further, the text region cutting module 2 includes:

a text region suggesting unit 21 for obtaining, based on the extracted feature map, the text suggestion region and the position of its circumscribed rectangular frame by using an RPN network;

an NMS (non-maximum suppression) unit 22 for performing de-duplication processing on the obtained text suggestion regions;

an up-sampling unit 23 for transforming the low-resolution features into high-resolution features to facilitate the subsequent segmentation of the text region;

a segmentation unit 24 for performing pixel-by-pixel segmentation on the feature map obtained by the up-sampling unit and determining whether each pixel belongs to a text region and the corresponding probability;

a score calculating unit 25 that calculates, for each text suggestion region, an average probability of all pixels belonging to the text contained therein as a score of the text suggestion region;

a character region cutting unit 26 for performing outward expansion according to a certain proportion of the length and width of each character suggestion region with the score higher than a certain threshold value obtained in the previous process, and cutting a feature map containing the character suggestion region and the peripheral part region thereof as a character feature map input to the next stage; wherein the expansion scale factor is inversely proportional to the size of the text suggestion region.
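The adaptive expansion performed by the character region cutting unit 26 can be sketched in plain Python. The text only states that the expansion scale factor is inversely proportional to the size of the text suggestion region; taking "size" to be the square root of the box area, the constant `k` and the cap `max_ratio` are assumptions made for illustration, as are the function names.

```python
def expansion_ratio(w, h, k=4.0, max_ratio=0.5):
    """Expansion ratio that shrinks as the region grows.
    size = sqrt(w * h), k and max_ratio are illustrative assumptions."""
    size = (w * h) ** 0.5
    if size <= 0:
        raise ValueError("region must have positive width and height")
    return min(max_ratio, k / size)


def expand_box(box, feat_w, feat_h, k=4.0, max_ratio=0.5):
    """Expand [x1, y1, x2, y2] outward on every side and clip the result
    to the feature map bounds (feat_w, feat_h)."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    ratio = expansion_ratio(w, h, k, max_ratio)
    dx, dy = w * ratio / 2, h * ratio / 2
    return (max(0.0, x1 - dx), max(0.0, y1 - dy),
            min(float(feat_w), x2 + dx), min(float(feat_h), y2 + dy))
```

Small regions, where a localisation error of a few pixels removes a large share of the text, therefore receive proportionally more margin than large ones.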

Further, the key point detection module 3 detects peripheral key points of the text region of interest in the input character feature map so as to constrain the feature region that actually needs to be used. This mainly serves to filter irrelevant interference feature information, because the input character feature map may contain partial feature information of other text fields surrounding the text field of interest. The key point detection module 3 includes:

a first attention unit 31 that calculates an attention parameter for controlling a region of interest at the time of keypoint prediction;

a key point detection unit 32 for converting the input feature map to an output feature map of a fixed size by thin-plate spline interpolation based on the obtained key points;

Further, the literal module 5 includes:

a coding unit 51 for coding and converting the feature map with fixed size into a feature sequence with fixed length;

a second attention unit 52 and a BLSTM unit 53, with which output features are calculated;

the decoding unit 54 transcribes the output features into intelligible text.

Further, the system carries out commodity price tag character recognition based on a preset processing model, and updates and optimizes the processing model according to the recognition process and the recognition result. In the model training process, character rectangular frame detection, character segmentation detection and character recognition all participate in loss calculation, and the performance is improved through multitasking training.
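The multi-task training described above can be written as a single objective. The weighted-sum form and the weights $\lambda_{\text{seg}}$ and $\lambda_{\text{rec}}$ are assumptions for illustration; the text only states that the three tasks all take part in the loss calculation.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{box}}
  + \lambda_{\text{seg}} \, \mathcal{L}_{\text{seg}}
  + \lambda_{\text{rec}} \, \mathcal{L}_{\text{rec}}
```

Here $\mathcal{L}_{\text{box}}$ stands for the character rectangular frame detection loss, $\mathcal{L}_{\text{seg}}$ for the character segmentation loss and $\mathcal{L}_{\text{rec}}$ for the character recognition loss.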

Character detection and character recognition share (multiplex) a single feature extractor, which effectively improves recognition efficiency;

the problem that the recognition result is affected due to the loss of character part characteristics can be solved by utilizing a character region cutting module with a self-adaptive expansion function;

the influence of redundant character areas in the cut character feature area of interest can be relieved by utilizing a character key point detection module with an attention mechanism;

based on the detected text key points, the text gesture is corrected to the horizontal direction by utilizing the thin plate spline interpolation, and the recognition effect is improved.

It should be understood that the above description is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be apparent to those skilled in the art that various modifications, equivalents, variations, and the like can be made to the present invention. However, such modifications are intended to fall within the scope of the present invention without departing from the spirit of the present invention. In addition, some terms used in the specification and claims of the present application are not limiting, but are merely for convenience of description.

Claims

1. An end-to-end commodity price tag character recognition method with gesture correction is characterized by comprising the following steps:

S6, performing literal processing on the feature map to be processed to obtain corresponding words;

in the step S4, the key point detection processing is performed on the text feature map by using the key point detection with the attention mechanism to obtain a plurality of key points surrounding the text feature map of interest;

in step S5, according to the plurality of key points and by using thin-plate spline interpolation, a feature area actually required to be used in the text feature map is constrained, irrelevant disturbance feature information is filtered to obtain the feature map to be processed, the feature area actually required to be used is a valid text field concerned by an attention mechanism, irrelevant disturbance feature information is an invalid text field surrounding the valid text field, and the feature map to be processed is a horizontal feature area with a fixed size.

2. The method for recognizing the characters of the end-to-end commodity price tag with gesture correction according to claim 1, wherein in the step S1, feature extraction is performed on the commodity price tag image by using a deep learning network to extract character features so as to obtain the multi-dimensional feature map.

3. The end-to-end commodity price tag character recognition method with gesture correction according to claim 1, wherein in the step S2, the character suggestion region and the circumscribed rectangular frame position thereof are obtained by performing the region selection processing on the feature map by using an RPN network.

4. The method for recognizing end-to-end commodity price tag text with gesture correction according to claim 1, wherein in said step S3, the specific steps of said segmentation processing include:

step S32, respectively carrying out pixel-by-pixel segmentation processing on each high-resolution area to obtain a segmentation probability image and attribute probability information of each pixel point in the segmentation probability image, wherein the attribute probability information is used for indicating whether the pixel point is a character and a probability value of the character;

if the judgment result is yes, reserving the segmentation probability image;

5. The method for recognizing end-to-end commodity price tag text with gesture correction according to claim 4, wherein in said step S3, the specific step of said graphic expansion process comprises:

6. The method for recognizing end-to-end commodity price tag text with gesture correction according to claim 1, wherein in said step S6, the specific steps of said literal processing include:

7. An end-to-end commodity price tag word recognition system with gesture correction, capable of implementing the end-to-end commodity price tag word recognition method according to any one of claims 1 to 6, comprising:

8. The end-to-end commodity price tag word recognition system with gesture correction according to claim 7, wherein said system performs commodity price tag word recognition based on a preset process model, and updates and optimizes said process model according to the recognition process and recognition result.