CN114495132A - Character recognition method, device, equipment and storage medium - Google Patents

Character recognition method, device, equipment and storage medium

Info

Publication number
CN114495132A
Authority
CN
China
Prior art keywords
image
text
processed
different scales
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111535285.5A
Other languages
Chinese (zh)
Inventor
文玉茹
卢道和
杨军
程志峰
李勋棋
罗海湾
何勇彬
陈鉴镔
胡仲臣
陈刚
周佳振
朱嘉伟
郭英亚
李兴龙
周琪
熊思清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WeBank Co Ltd
Original Assignee
WeBank Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WeBank Co Ltd filed Critical WeBank Co Ltd
Priority to CN202111535285.5A
Publication of CN114495132A
Priority to PCT/CN2022/102163 (published as WO2023109086A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Character Discrimination (AREA)
  • Character Input (AREA)

Abstract

The method comprises: acquiring an image to be processed, where the image to be processed carries one or more characters; performing feature extraction on the image to be processed to obtain image features; obtaining, according to the image features, a plurality of text boxes of different scales in the image to be processed, and performing text-box regression processing on the text boxes of different scales, which solves the problem of image deformation or angular movement; determining the positions of the characters in the image to be processed according to the plurality of regressed text boxes of different scales; and performing character recognition on the image to be processed based on those positions. The character recognition rate is thereby improved, and a better character recognition effect is achieved.

Description

Character recognition method, device, equipment and storage medium
Technical Field
The present application relates to image recognition technology in the field of financial technology (Fintech), and in particular, to a character recognition method, apparatus, device, and storage medium.
Background
With the development of information technology, more and more technologies are being applied in the financial field, and the traditional financial industry is gradually shifting toward financial technology. Image recognition technology is no exception; at the same time, the financial industry's requirements for security and real-time performance place higher demands on it.
In the related art, image recognition technology mainly refers to using a computer to process pictures captured at the front end of a system according to a set target. In the field of artificial intelligence, neural networks are the most widely applied approach to image recognition. Neural network models can implement functions such as face recognition, image detection, image classification, target tracking, and character recognition. Among these, face recognition, image classification, and character recognition have been developed over a long period and achieve fairly good recognition results.
Character recognition generally refers to a technique for automatically recognizing characters using various devices, including computers, and has important applications in many fields of today's society. However, once an image is deformed or angularly shifted, conventional image recognition techniques lack equivariance to such transformations, so the character recognition rate decreases and an ideal recognition effect cannot be achieved.
Disclosure of Invention
In order to solve the above problems in the prior art, the present application provides a character recognition method, apparatus, device, and storage medium.
In a first aspect, an embodiment of the present application provides a text recognition method, where the text recognition method includes:
acquiring an image to be processed, wherein the image to be processed carries one or more characters;
extracting the features of the image to be processed to obtain the image features corresponding to the image to be processed;
obtaining a plurality of text frames with different scales in the image to be processed according to the image characteristics, and performing text frame regression processing on the text frames with different scales;
and determining the positions of the one or more characters in the image to be processed according to a plurality of character frames with different scales after the character frame regression processing, and performing character recognition on the image to be processed based on the positions of the one or more characters.
In a possible implementation manner, the performing feature extraction on the image to be processed to obtain an image feature corresponding to the image to be processed includes:
performing feature extraction on the image to be processed based on a dense connection network to obtain the image features corresponding to the image to be processed, wherein the dense connection network comprises one or more dense blocks, any two dense blocks in the dense connection network are directly connected, and the input of each dense block is the union of the outputs of all previous dense blocks.
In one possible implementation, the dense connection network further comprises one or more transition connection layers, each transition connection layer comprises a 1 × 1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers;
the performing feature extraction on the image to be processed based on the dense connection network to obtain the image features corresponding to the image to be processed includes:
and performing feature extraction on the image to be processed based on the one or more dense blocks and the one or more transitional connection layers to obtain the image features corresponding to the image to be processed.
In a possible implementation manner, the obtaining, according to the image feature, a plurality of text frames with different scales in the image to be processed, and performing text frame regression processing on the text frames with different scales includes:
obtaining a plurality of text frames with different scales in the image to be processed according to the image characteristics, and determining offset data of the text frames with different scales;
performing text box regression processing on the text boxes with different scales based on the offset data.
In a possible implementation manner, the obtaining, according to the image feature, a plurality of text boxes of different scales in the image to be processed and determining offset data of the text boxes of the different scales includes:
carrying out down-sampling processing on the image features, and carrying out down-sampling and convolution processing on the image features after the down-sampling processing;
and taking the image features after the downsampling and convolution processing as new downsampled image features, and re-executing the downsampling-and-convolution step on them until the plurality of text boxes of different scales in the image to be processed are obtained and the offset data of the text boxes of different scales are determined.
In a possible implementation manner, the determining, according to a plurality of text boxes with different scales after the text box regression processing, the position of the one or more texts in the image to be processed includes:
obtaining scores of the plurality of regressed text boxes of different scales according to the regressed text boxes of different scales and a preset score model, wherein the preset score model is used for determining the scores of the text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the text boxes of different scales and each of the text boxes of different scales;
and calculating the positions of the text frames with different scales after the text frame regression processing according to the scores of the text frames with different scales after the text frame regression processing, and determining the positions of the one or more characters in the image to be processed based on the positions of the text frames with different scales after the text frame regression processing.
In a possible implementation manner, the calculating, according to the scores of the text boxes of the plurality of different scales after the text box regression processing, the positions of the text boxes of the plurality of different scales after the text box regression processing includes:
calculating the ratio of the intersection to the union of the highest-scoring text box among the plurality of regressed text boxes of different scales and a regressed text box i, wherein the regressed text box i is any one of the plurality of regressed text boxes of different scales, i = 1, …, n, and n is an integer determined according to the number of the regressed text boxes of different scales;
and if the calculated ratio is smaller than a preset threshold value, calculating the position of the text box i subjected to the text box regression processing according to the score of the text box i subjected to the text box regression processing.
In a possible implementation manner, before the performing the feature extraction on the image to be processed to obtain the image feature corresponding to the image to be processed, the method further includes:
performing parameter reduction processing on the image to be processed;
the feature extraction of the image to be processed to obtain the image features corresponding to the image to be processed includes:
and performing feature extraction on the image to be processed after parameter reduction processing to obtain image features corresponding to the image to be processed.
In a possible implementation manner, the performing parameter reduction processing on the image to be processed includes:
performing parameter reduction processing on the image to be processed by using three 3 × 3 convolutional layers and one 2 × 2 pooling layer, wherein the three 3 × 3 convolutional layers are connected in sequence and then connected to the 2 × 2 pooling layer.
In a possible implementation manner, the performing text recognition on the image to be processed based on the position of the one or more texts includes:
and identifying the characters in the image to be processed based on the positions of the one or more characters and a preset identification model, wherein the preset identification model is used for identifying the characters in the image according to the positions of the characters in the image.
In a second aspect, an embodiment of the present application provides a text recognition apparatus, where the apparatus includes:
the image acquisition module is used for acquiring an image to be processed, and the image to be processed carries one or more characters;
the characteristic extraction module is used for extracting the characteristics of the image to be processed to obtain the image characteristics corresponding to the image to be processed;
the text frame processing module is used for obtaining a plurality of text frames with different scales in the image to be processed according to the image characteristics and performing text frame regression processing on the text frames with different scales;
and the character recognition module is used for determining the positions of the one or more characters in the image to be processed according to a plurality of character frames with different scales after the character frame regression processing, and performing character recognition on the image to be processed based on the positions of the one or more characters.
In a possible implementation manner, the feature extraction module is specifically configured to:
performing feature extraction on the image to be processed based on a dense connection network to obtain the image features corresponding to the image to be processed, wherein the dense connection network comprises one or more dense blocks, any two dense blocks in the dense connection network are directly connected, and the input of each dense block is the union of the outputs of all previous dense blocks.
In one possible implementation, the dense connection network further comprises one or more transition connection layers, each transition connection layer comprises a 1 × 1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers.
The feature extraction module is specifically configured to:
and performing feature extraction on the image to be processed based on the one or more dense blocks and the one or more transitional connection layers to obtain the image features corresponding to the image to be processed.
In a possible implementation manner, the text box processing module is specifically configured to:
obtaining a plurality of text frames with different scales in the image to be processed according to the image characteristics, and determining offset data of the text frames with different scales;
performing text box regression processing on the text boxes with different scales based on the offset data.
In a possible implementation manner, the text box processing module is specifically configured to:
carrying out down-sampling processing on the image features, and carrying out down-sampling and convolution processing on the image features after the down-sampling processing;
and taking the image features subjected to downsampling and convolution processing as new image features subjected to downsampling processing, and re-executing the steps of downsampling and convolution processing on the image features subjected to downsampling processing until the character frames with different scales in the image to be processed are obtained, and determining offset data of the character frames with different scales.
In a possible implementation manner, the text recognition module is specifically configured to:
obtaining scores of the plurality of regressed text boxes of different scales according to the regressed text boxes of different scales and a preset score model, wherein the preset score model is used for determining the scores of the text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the text boxes of different scales and each of the text boxes of different scales;
and calculating the positions of the text frames with different scales after the text frame regression processing according to the scores of the text frames with different scales after the text frame regression processing, and determining the positions of the one or more characters in the image to be processed based on the positions of the text frames with different scales after the text frame regression processing.
In a possible implementation manner, the text recognition module is specifically configured to:
calculating the ratio of the intersection to the union of the highest-scoring text box among the plurality of regressed text boxes of different scales and a regressed text box i, wherein the regressed text box i is any one of the plurality of regressed text boxes of different scales, i = 1, …, n, and n is an integer determined according to the number of the regressed text boxes of different scales;
and if the calculated ratio is smaller than a preset threshold value, calculating the position of the text box i subjected to the text box regression processing according to the score of the text box i subjected to the text box regression processing.
In a possible implementation manner, the feature extraction module is specifically configured to:
performing parameter reduction processing on the image to be processed;
and performing feature extraction on the image to be processed after parameter reduction processing to obtain image features corresponding to the image to be processed.
In a possible implementation manner, the feature extraction module is specifically configured to:
performing parameter reduction processing on the image to be processed by using three 3 × 3 convolutional layers and one 2 × 2 pooling layer, wherein the three 3 × 3 convolutional layers are connected in sequence and then connected to the 2 × 2 pooling layer.
In a possible implementation manner, the text recognition module is specifically configured to:
and identifying the characters in the image to be processed based on the positions of the one or more characters and a preset identification model, wherein the preset identification model is used for identifying the characters in the image according to the positions of the characters in the image.
In a third aspect, an embodiment of the present application provides a text recognition apparatus, including:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium, where a computer program is stored, and the computer program causes a server to execute the method in the first aspect.
In a fifth aspect, the present application provides a computer program product, which includes computer instructions for executing the method of the first aspect by a processor.
According to the character recognition method, apparatus, device, and storage medium provided by the application, an image to be processed carrying one or more characters is acquired, and feature extraction is performed on the image to be processed to obtain image features; a plurality of text boxes of different scales in the image to be processed are then obtained according to the image features, and text-box regression processing is performed on the text boxes of different scales, which solves the problem of image deformation or angular movement; the positions of the characters in the image to be processed are then determined according to the plurality of regressed text boxes of different scales, and character recognition is performed on the image to be processed based on those positions. The character recognition rate is thereby improved, and a better character recognition effect is achieved.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a schematic diagram of a text recognition system according to an embodiment of the present application;
fig. 2 is a schematic flow chart of a text recognition method according to an embodiment of the present application;
fig. 3 is a schematic flow chart of another text recognition method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a downsampling and convolution process provided by an embodiment of the present application;
FIG. 5 is a schematic diagram illustrating an offset of a text box according to an embodiment of the present application;
fig. 6 is a schematic flowchart of another character recognition method according to an embodiment of the present application;
fig. 7 is a schematic structural diagram of a character recognition device according to an embodiment of the present application;
fig. 8 shows a schematic diagram of a possible structure of the text recognition device of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The terms "first," "second," "third," and "fourth," if any, in the description and claims of this application and the above-described figures are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Character recognition is currently studied relatively extensively in computer image and vision research; it is in extremely high demand in scenarios such as license plate recognition, bill recognition, and book text recognition, and the relevant technologies are mature and perform fairly well. However, if an image is deformed or angularly shifted, conventional image recognition techniques lack equivariance to such transformations, so the character recognition rate decreases and an ideal recognition effect cannot be achieved.
Therefore, an embodiment of the present application provides a character recognition method in which, after an image to be processed carrying one or more characters is acquired, feature extraction is performed on the image to be processed to obtain image features; a plurality of text boxes of different scales in the image to be processed are then obtained according to the image features, and text-box regression processing is performed on the text boxes of different scales. This solves the problem of image deformation or angular movement, improves the recognition rate of the subsequent character recognition performed on the image to be processed based on the plurality of regressed text boxes of different scales, and achieves a better character recognition effect.
Optionally, the text recognition method provided by the present application may be applied to the schematic architecture of the text recognition system shown in fig. 1, and as shown in fig. 1, the system may include a receiving device 101, a processing device 102, and a display device 103.
In a specific implementation process, the receiving device 101 may be an input/output interface, and may also be a communication interface, and may be configured to receive an image to be processed that carries one or more characters.
The processing device 102 may obtain the image to be processed through the receiving device 101 and perform feature extraction on it to obtain image features. It may then obtain a plurality of text boxes of different scales in the image to be processed according to the image features and perform text-box regression processing on the text boxes of different scales, solving the problem of image deformation or angular movement, and finally perform character recognition on the image to be processed according to the regressed text boxes of different scales, thereby improving the character recognition rate and achieving a better recognition effect.
In addition, the display device 103 may be used to display the image to be processed and a plurality of text boxes and the like with different scales.
The display device may also be a touch display screen for receiving user instructions while displaying the above-mentioned content to enable interaction with a user.
The processing device 102 may also send the result of character recognition on the image to be processed to a decoder, and the decoder decodes the result and outputs corresponding characters.
It should be understood that the processing device may be implemented by a processor reading instructions in a memory and executing the instructions, or may be implemented by a chip circuit.
The system is only an exemplary system, and when the system is implemented, the system can be set according to application requirements.
In addition, the system architecture described in the embodiments of the present application is intended to illustrate the technical solutions of the embodiments more clearly and does not constitute a limitation on them. A person skilled in the art will appreciate that, as the system architecture evolves and new service scenarios emerge, the technical solutions provided in the embodiments of the present application remain applicable to similar technical problems.
The technical solutions of the present application are described below with several embodiments as examples, and the same or similar concepts or processes may not be described in detail in some embodiments.
Fig. 2 is a schematic flow chart of a character recognition method according to an embodiment of the present application, where an execution subject of the embodiment may be a processing device in the embodiment shown in fig. 1, and may be determined specifically according to an actual situation. As shown in fig. 2, the character recognition method provided in the embodiment of the present application includes the following steps:
s201: acquiring an image to be processed, wherein the image to be processed carries one or more characters.
The image to be processed can be set according to actual conditions, such as images obtained in scenes of license plate recognition, bill recognition, book text recognition and the like.
S202: and performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed.
Before the processing device extracts the features of the image to be processed, the processing device may further perform parameter reduction processing on the image to be processed, so as to reduce parameters and calculation amount and improve efficiency of subsequent character recognition.
For example, the processing device may perform parameter reduction processing on the image to be processed by using three 3 × 3 convolutional layers and one 2 × 2 pooling layer, where the three 3 × 3 convolutional layers are connected in sequence and then connected to the 2 × 2 pooling layer. The convolution kernel size (kernel_size), convolution stride (stride), and feature-map padding width (padding) of the three convolutional layers and the one 2 × 2 pooling layer may be as shown in Table 1:
TABLE 1 [image: kernel_size, stride and padding of the three 3 × 3 convolutional layers and the 2 × 2 pooling layer]
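As a concrete illustration of such a stem, the following PyTorch sketch wires three 3 × 3 convolutions in sequence into a 2 × 2 pooling layer. Because Table 1 survives only as an image, the stride, padding, and channel values here are assumptions, not values from the filing:

```python
import torch
import torch.nn as nn

class ParamReductionStem(nn.Module):
    """Three 3x3 convolutions connected in sequence, followed by one 2x2
    pooling layer. Strides, padding and channel counts are assumed values,
    since the patent's Table 1 is only available as an image."""

    def __init__(self, in_channels: int = 3, out_channels: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, stride=1, padding=1),
            nn.ReLU(inplace=True),
            nn.MaxPool2d(kernel_size=2, stride=2),  # halves the spatial resolution
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.stem(x)
```

Downsampling early in the network is what cuts the parameter count and computation for everything that follows.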
In addition, when the processing device performs feature extraction on the image to be processed, the processing device may perform feature extraction on the image to be processed based on a dense connection network to obtain image features corresponding to the image to be processed, where the dense connection network includes one or more dense blocks, any two dense blocks in the dense connection network are directly connected, and an input of each dense block is a union of outputs of all previous dense blocks.
Here, the processing device uses the dense connection network as a feature extraction network, and the network can take the outputs of all previous layers as the inputs of the current layer, so that the gradient and information propagation are more accurate, and the accuracy of performing subsequent character recognition based on the features of the image to be processed extracted by the dense connection network is higher.
In the embodiment of the present application, in order to increase the depth of feature extraction, the dense connection network may further include one or more transition connection layers. The transition connection layers are used to increase the number of dense blocks in the dense connection network without changing the resolution of the original feature map. Each transition connection layer comprises a 1 × 1 convolutional layer, which both increases the depth of feature extraction in the dense connection network and removes the limit on the total number of dense blocks; the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers. The processing device can extract features of the image to be processed based on the one or more dense blocks and the one or more transition connection layers, so that the extracted features are richer and the accuracy of character recognition based on them is improved.
TABLE 2 [image: kernel_size, stride and padding of the 4 dense blocks and 2 transition connection layers]
For example, the number of dense blocks and transition connection layers may be set according to the actual situation. As shown in Table 2, there are 4 dense blocks and 2 transition connection layers; the 1st transition connection layer is placed between the 3rd and 4th dense blocks, and the 2nd transition connection layer is placed after the 4th dense block. Table 2 lists the kernel_size, stride, and padding parameters of the 4 dense blocks and 2 transition connection layers.
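A minimal PyTorch sketch of these two building blocks follows. The growth rate, layer count, and channel widths are illustrative assumptions; what the sketch preserves from the description is that every layer in a dense block consumes the concatenation (the "union") of all previous outputs, and that a transition layer is a 1 × 1 convolution that leaves the feature-map resolution unchanged:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    """A dense block: each layer receives the concatenation of the outputs
    of all preceding layers. The growth rate and layer count are
    illustrative assumptions, not values from the filing."""

    def __init__(self, in_channels: int, growth_rate: int = 32, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList()
        channels = in_channels
        for _ in range(num_layers):
            self.layers.append(nn.Sequential(
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, growth_rate, kernel_size=3, padding=1),
            ))
            channels += growth_rate
        self.out_channels = channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for layer in self.layers:
            # each layer consumes the union of all previous outputs
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

class TransitionLayer(nn.Module):
    """A transition connection layer: a 1x1 convolution that adjusts the
    channel count without changing the feature-map resolution."""

    def __init__(self, in_channels: int, out_channels: int):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)
```

Chaining three dense blocks, a transition layer, a fourth dense block, and a final transition layer mirrors the arrangement described for Table 2.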
S203: and obtaining a plurality of character frames with different scales in the image to be processed according to the image characteristics, and performing character frame regression processing on the character frames with different scales.
Here, the processing device may obtain a plurality of text frames with different scales in the image to be processed according to the image feature by using a preset dense layer, and perform text frame regression processing on the plurality of text frames with different scales.
The preset dense layer may include two blocks: one is used for obtaining a plurality of text boxes of different scales in the image to be processed, and the other is used for performing text-box regression processing on the plurality of text boxes of different scales.
In the embodiment of the present application, the processing device performs text-box regression processing on the plurality of text boxes of different scales in the image to be processed, which solves the problem of image deformation or angular movement and improves the recognition rate of the subsequent character recognition performed on the image to be processed based on the plurality of regressed text boxes of different scales.
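One way to realize the two blocks of the preset dense layer is a pair of parallel convolutional branches over each feature map, one scoring default boxes as text or non-text and one predicting regression offsets. This is a sketch under that assumption; the filing does not fix the branch shapes or the number of default boxes per location:

```python
import torch
import torch.nn as nn

class TextBoxHead(nn.Module):
    """Two parallel branches over one feature map: the score branch rates
    each default box as text / non-text, and the offset branch predicts
    the 4 regression offsets (dx, dy, dw, dh) per default box. The number
    of default boxes per location is an assumed value."""

    def __init__(self, in_channels: int, boxes_per_location: int = 6):
        super().__init__()
        self.score_branch = nn.Conv2d(in_channels, boxes_per_location * 2,
                                      kernel_size=3, padding=1)
        self.offset_branch = nn.Conv2d(in_channels, boxes_per_location * 4,
                                       kernel_size=3, padding=1)

    def forward(self, feature_map: torch.Tensor):
        # returns (scores, offsets); one such head is applied per scale
        return self.score_branch(feature_map), self.offset_branch(feature_map)
```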
S204: and determining the positions of the one or more characters in the image to be processed according to a plurality of character frames with different scales after the regression processing of the character frames, and performing character recognition on the image to be processed based on the positions of the one or more characters.
For example, the processing device may obtain scores of the text frames with different scales after the text frame regression processing according to the text frames with different scales after the text frame regression processing and a preset score model, further calculate positions of the text frames with different scales after the text frame regression processing according to the scores, and determine the position of the one or more characters in the image to be processed based on the positions.
The preset score model is used for determining the scores of the text boxes of different scales according to the ratio of the intersection to the union of the highest-scoring text box among the text boxes of different scales and each of the text boxes of different scales.
For example, the preset score model includes the expression:

s_i = s_i, if iou(T, c_i) < N
s_i = s_i × (1 − iou(T, c_i)), if iou(T, c_i) ≥ N

where s_i denotes the score of the i-th text box; iou denotes Intersection over Union, i.e., the ratio of the intersection to the union of one text box and another; T denotes the highest-scoring text box computed so far; c_i denotes a candidate box; and N denotes a threshold that can be set according to the actual situation. Here, the processing device may take the plurality of regressed text boxes of different scales as the candidate boxes, compute the scores of all candidate boxes to obtain the highest-scoring box T, and obtain the scores of the regressed text boxes of different scales from the above expression.
Further, the processing device may calculate the positions of the plurality of regressed text boxes of different scales based on the scores using the expression:

t' = (Σ_i s_i · t_i) / (Σ_i s_i)

where t' denotes the fused position of the regressed text boxes of different scales and t_i denotes the coordinates of the i-th text box.
In addition, when calculating the positions of the plurality of regressed text boxes of different scales based on the scores, the processing device may calculate the ratio of the intersection to the union of the highest-scoring text box among the regressed text boxes of different scales and a regressed text box i. If the calculated ratio is smaller than the preset threshold, the processing device may calculate the position of the regressed text box i according to its score. The regressed text box i is any one of the plurality of regressed text boxes of different scales, i = 1, …, n, where n is an integer determined according to the number of the regressed text boxes of different scales. That is, the processing device may use a Non-Maximum Suppression (NMS) algorithm to calculate the positions of the regressed text boxes of different scales, so that the calculation result is more accurate.
For example, the processing device may list all candidate boxes A, i.e., the plurality of regressed text boxes of different scales together with their calculated scores s_i, and initialize a detection set B_i to be empty. The processing device may then evaluate all text boxes among the candidate boxes A to obtain the highest-scoring text box T and put it into the set B_i, where i denotes the i-th round of box selection. Further, the processing device may set a threshold N, traverse all remaining text boxes, calculate the iou of each with the highest-scoring detection box, and put a box into the set B_i if the result is greater than or equal to the threshold. The processing device repeats these operations until A is empty, yielding the collected sets B_i. Finally, the processing device may calculate the position of each text box based on its score s_i, so that the positions calculated on this basis are more accurate.
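The grouping-and-fusion procedure above can be sketched in a few lines of NumPy. The iou helper is standard; the score-weighted fusion t' = Σ s_i·t_i / Σ s_i is the formula reconstructed in the preceding paragraphs and should be read as an assumption rather than the filing's exact computation:

```python
import numpy as np

def iou(box_a: np.ndarray, box_b: np.ndarray) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return float(inter / union) if union > 0 else 0.0

def fuse_text_boxes(boxes: np.ndarray, scores: np.ndarray,
                    n_threshold: float = 0.5) -> np.ndarray:
    """Repeatedly take the highest-scoring remaining box T, group every
    remaining box whose iou with T is >= n_threshold into one set B_i,
    and fuse each set into a single position by score-weighted averaging,
    t' = sum(s_i * t_i) / sum(s_i)."""
    remaining = list(range(len(boxes)))
    fused = []
    while remaining:
        top = max(remaining, key=lambda i: scores[i])
        group = [i for i in remaining
                 if i == top or iou(boxes[top], boxes[i]) >= n_threshold]
        weights = scores[group]
        fused.append((boxes[group] * weights[:, None]).sum(axis=0) / weights.sum())
        remaining = [i for i in remaining if i not in group]
    return np.array(fused)

# Example: two overlapping detections of one word plus a separate word.
boxes = np.array([[10, 10, 50, 30], [12, 11, 52, 31], [100, 40, 160, 60]], float)
scores = np.array([0.9, 0.8, 0.7])
print(fuse_text_boxes(boxes, scores))  # -> two fused box positions
```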
In this embodiment of the application, when the processing device performs character recognition on the image to be processed based on the position of the one or more characters, the processing device may further recognize the characters in the image to be processed based on the position of the one or more characters and a preset recognition model.
The preset recognition model is used for recognizing characters in the image according to the positions of the characters in the image.
According to the embodiment of the present application, an image to be processed carrying one or more characters is acquired, and feature extraction is performed on it to obtain image features; a plurality of text boxes of different scales in the image to be processed are obtained according to the image features, and text-box regression processing is performed on them, which solves the problem of image deformation or angular movement; the positions of the characters in the image to be processed are then determined according to the plurality of regressed text boxes of different scales, and character recognition is performed on the image based on those positions, so the character recognition rate is improved and a good recognition effect is achieved. In addition, the embodiment of the present application performs parameter reduction processing on the image to be processed, which reduces the parameters and the amount of computation and improves the efficiency of subsequent character recognition. Moreover, the dense connection network is used as the feature extraction network; this network can take the outputs of all previous layers as the input of the current layer, making gradient and information propagation more accurate, so subsequent character recognition based on the features extracted by the dense connection network is also more accurate. The embodiment of the present application may further use the NMS algorithm to calculate the positions of the regressed text boxes of different scales, making the calculation more accurate.
Here, before the processing device recognizes the characters in the image to be processed based on the positions of the one or more characters and a preset recognition model, it needs to train the preset recognition model so that the model can be used to recognize the characters in the image to be processed. During training, the processing device may input an image carrying characters into the preset recognition model, where the input image also carries the positions of the characters in the image, and then determine the output accuracy by comparing the characters output by the preset recognition model with the characters actually corresponding to the input image. If the output accuracy is lower than a preset accuracy threshold, the processing device may adjust the preset recognition model according to the output accuracy so as to improve it, take the adjusted preset recognition model as the new preset recognition model, and re-execute the step of inputting the image carrying characters into the model.
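The train-evaluate-adjust loop described above might look like the following sketch. The recognize and adjust callables are hypothetical stand-ins; the filing does not name a training interface or an optimizer:

```python
from typing import Callable, List, Tuple

# A labeled sample: (image bytes, character positions, expected text).
Sample = Tuple[bytes, list, str]

def train_until_accurate(recognize: Callable[[bytes, list], str],
                         adjust: Callable[[float], None],
                         samples: List[Sample],
                         threshold: float = 0.95,
                         max_rounds: int = 100) -> float:
    """Repeatedly measure the model's output accuracy on the labeled
    samples and invoke the (hypothetical) adjust step until the accuracy
    reaches the preset threshold or the round budget runs out."""
    accuracy = 0.0
    for _ in range(max_rounds):
        correct = sum(recognize(image, positions) == text
                      for image, positions, text in samples)
        accuracy = correct / len(samples)
        if accuracy >= threshold:
            break
        adjust(accuracy)  # e.g. a gradient update driven by the error
    return accuracy
```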
In addition, when the processing device obtains a plurality of text boxes of different scales in the image to be processed according to the image features and performs text-box regression processing on them, it may also determine offset data of the plurality of text boxes of different scales and then perform the text-box regression processing based on the offset data. This solves the problem of image deformation or angular movement; character recognition is then performed on the image to be processed according to the regressed text boxes of different scales, improving the character recognition rate. Fig. 3 is a flowchart illustrating another text recognition method according to an embodiment of the present application. As shown in fig. 3, the method includes:
s301: acquiring an image to be processed, wherein the image to be processed carries one or more characters.
S302: and performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed.
The steps S301 to S302 are the same as the steps S201 to S202, and are not described herein again.
S303: and obtaining a plurality of character frames with different scales in the image to be processed according to the image characteristics, and determining offset data of the character frames with different scales.
Here, the processing device may perform downsampling processing on the image features, then perform downsampling and convolution processing on the downsampled image features, take the result as the new downsampled image features, and repeat the downsampling-and-convolution step until the plurality of text boxes of different scales in the image to be processed are obtained, determining the offset data of the text boxes of different scales.
The processing device may perform the downsampling with a downsampling module, which may include a 1 × 1 convolution and a 2 × 2 pooling layer. Here, the 2 × 2 pooling layer makes the feature-map sizes match, and the 1 × 1 convolution halves the number of channels; at each scale the module combines the features of the current feature map with those of the previous one, so fewer parameters are needed and the result is more accurate.
In addition, the processing device may perform the convolution processing on the image features with a convolution module, which may include a 1 × 1 convolutional layer and a 3 × 3 convolutional layer, performing two convolution operations in which the feature map of the previous layer is passed on to the next.
The embodiment of the present application takes obtaining text boxes at 6 different scales as an example. As shown in fig. 4, the 6 scales comprise text boxes of scale 1 through scale 6. The processing device determines the scale-1 text boxes from the image features, downsamples the scale-1 features to obtain scale 2, and performs downsampling and convolution on scale 2 to obtain scale 3; repeating this step, it performs downsampling and convolution on scale 3 to obtain scale 4, on scale 4 to obtain scale 5, and on scale 5 to obtain scale 6.
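The six-scale pyramid can be sketched as follows in PyTorch. The channel widths and the use of max-pooling are assumptions, while the 1 × 1-conv-plus-2 × 2-pool downsampling module and the 1 × 1-plus-3 × 3 convolution module follow the description above:

```python
import torch
import torch.nn as nn

class Downsample(nn.Module):
    """Downsampling module: a 1x1 convolution that halves the channel
    count, followed by a 2x2 pooling layer that halves the spatial size."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels // 2, kernel_size=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.pool(self.conv(x))

class ConvBlock(nn.Module):
    """Convolution module: a 1x1 convolution followed by a 3x3 convolution,
    passing the previous feature map on to the next layer."""

    def __init__(self, channels: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

class SixScalePyramid(nn.Module):
    """Builds 6 feature scales: scale 2 by downsampling scale 1, and
    scales 3-6 by repeated downsampling + convolution."""

    def __init__(self, channels: int = 512):
        super().__init__()
        self.down2 = Downsample(channels)  # scale 1 -> scale 2
        self.stages = nn.ModuleList()
        c = channels // 2
        for _ in range(4):                 # scales 3 through 6
            self.stages.append(nn.Sequential(Downsample(c), ConvBlock(c // 2)))
            c //= 2

    def forward(self, features: torch.Tensor):
        scales = [features, self.down2(features)]
        x = scales[-1]
        for stage in self.stages:
            x = stage(x)
            scales.append(x)
        return scales  # feature maps at scales 1..6
```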
The processing device determines the offset data of the plurality of text boxes of different scales during this process, and performs text-box regression processing on the plurality of text boxes of different scales based on the offset data. For a better understanding of text-box offsets, fig. 5 shows an offset diagram of a text box: b0 denotes the default box; the 4 arrows leading out from b0 toward Gq denote the regression learning process from the default box to the actual text box; Gb denotes the minimum bounding rectangle of the actual target Gq; G* denotes the ground-truth rectangle, i.e., the smallest bounding rectangle of G; (x_b, y_b) denotes the center point of Gb; w_b denotes its width; and h_b denotes its height (the symbols G*, (x_b, y_b), w_b and h_b stand in for formula images in the original filing).
Here, the processing device determines offset data of a character frame, and then performs character frame regression processing on the character frame based on the offset data, thereby solving the problem of image deformation or angular movement and improving the accuracy of subsequent character recognition.
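A common concrete form of such regression, shown here as an assumption since the filing gives the relation only pictorially, decodes predicted offsets (dx, dy, dw, dh) against a default box in center-size form:

```python
import numpy as np

def apply_offsets(default_box: np.ndarray, offsets: np.ndarray) -> np.ndarray:
    """Apply learned offsets (dx, dy, dw, dh) to a default box given as
    (cx, cy, w, h), yielding the regressed text box. This SSD-style
    decoding is an assumed concrete form, not taken from the filing."""
    cx, cy, w, h = default_box
    dx, dy, dw, dh = offsets
    return np.array([
        cx + dx * w,     # shift the center in proportion to the box size
        cy + dy * h,
        w * np.exp(dw),  # rescale width and height multiplicatively
        h * np.exp(dh),
    ])
```

Because the shift and rescaling are expressed relative to the box itself, this is one plausible way a learned regression can compensate for deformation and angular movement.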
S304: and performing text-box regression processing on the text boxes with different scales based on the offset data.
S305: and determining the positions of the one or more characters in the image to be processed according to a plurality of character frames with different scales after the character frame regression processing, and performing character recognition on the image to be processed based on the positions of the one or more characters.
Step S305 is the same as the implementation of step S204, and is not described herein again.
According to the text box regression processing method and device, after the offset data of the text box are determined, the text box regression processing is carried out on the text box based on the offset data, the problem that an image is deformed or an image moves in an angle mode is solved, then text recognition is carried out according to the text boxes with different scales after the text box regression processing, and the text recognition rate is improved.
Here, fig. 6 is a schematic flowchart of another character recognition method proposed in an embodiment of the present application. After acquiring an image to be processed carrying one or more characters, the processing device may perform parameter reduction processing on it. Specifically, the processing device may use a parameter reduction module, which may include three 3 × 3 convolutional layers and one 2 × 2 pooling layer, where the three 3 × 3 convolutional layers are connected in sequence and then connected to the 2 × 2 pooling layer. Further, the processing device may perform feature extraction on the parameter-reduced image, for example based on a dense connection network. The dense connection network may include one or more dense blocks and, further, one or more transition connection layers; the figure shows 4 dense blocks and 2 transition connection layers, with the 1st transition connection layer placed between the 3rd and 4th dense blocks and the 2nd placed after the 4th dense block. After feature extraction, the processing device may obtain a plurality of text boxes of different scales in the image to be processed based on the extracted image features and determine their offset data, so as to perform text-box regression processing on the text boxes of different scales based on the offset data. Here, the processing device may use a preset dense layer comprising two blocks: one for obtaining the plurality of text boxes of different scales in the image to be processed, and the other for performing the text-box regression processing on them. Finally, the processing device determines the positions of the one or more characters in the image to be processed according to the plurality of regressed text boxes of different scales and performs character recognition on the image based on those positions. The processing device may use the NMS algorithm to calculate the positions of the regressed text boxes of different scales, making the calculation more accurate.
In addition, the processing device can also send the result of character recognition of the image to be processed to a decoder, and the decoder decodes the result and outputs corresponding characters.
In the embodiment of the application, the processing device performs text-box regression processing on the text boxes of different scales in the image to be processed, which solves the problem of image deformation or angular movement; it then performs character recognition on the image to be processed according to the regressed text boxes of different scales, improving the character recognition rate and achieving a better recognition effect. The processing device also performs parameter reduction processing on the image to be processed, reducing the parameters and the amount of computation and improving the efficiency of subsequent character recognition. In addition, the processing device uses the dense connection network as the feature extraction network; this network can take the outputs of all previous layers as the input of the current layer, making gradient and information propagation more accurate, so subsequent character recognition based on the extracted features is also more accurate. The processing device may further use the NMS algorithm to calculate the positions of the regressed text boxes of different scales, making the calculation more accurate.
Fig. 7 is a schematic structural diagram of a character recognition apparatus according to an embodiment of the present application, corresponding to the character recognition method of the foregoing embodiments. For ease of explanation, only the portions related to the embodiments of the present application are shown. The character recognition apparatus 70 includes: an image acquisition module 701, a feature extraction module 702, a text box processing module 703, and a text recognition module 704. The character recognition apparatus may be the processing device itself, or a chip or integrated circuit that implements the functions of the processing device. It should be noted that the division into the image acquisition, feature extraction, text box processing, and text recognition modules is only a division of logical functions; physically, the modules may be integrated or kept separate.
The image obtaining module 701 is configured to obtain an image to be processed, where the image to be processed carries one or more characters.
A feature extraction module 702, configured to perform feature extraction on the image to be processed, so as to obtain an image feature corresponding to the image to be processed.
A text box processing module 703, configured to obtain, according to the image features, a plurality of text boxes with different scales in the image to be processed, and perform text box regression processing on the plurality of text boxes with different scales.
And the character recognition module 704 is configured to determine positions of the one or more characters in the image to be processed according to the plurality of character boxes with different scales after the character box regression processing, and perform character recognition on the image to be processed based on the positions of the one or more characters.
In one possible design, the feature extraction module 702 is specifically configured to:
performing feature extraction on the image to be processed based on a dense connection network to obtain the image features corresponding to the image to be processed, wherein the dense connection network comprises one or more dense blocks, any two dense blocks in the dense connection network are directly connected, and the input of each dense block is the union of the outputs of all previous dense blocks.
In one possible implementation, the dense connection network further comprises one or more transition connection layers, each transition connection layer comprises a 1 × 1 convolutional layer, and the input of each transition connection layer is the union of the outputs of all preceding dense blocks and transition connection layers.
The feature extraction module 702 is specifically configured to:
and performing feature extraction on the image to be processed based on the one or more dense blocks and the one or more transitional connection layers to obtain the image features corresponding to the image to be processed.
In a possible implementation manner, the text box processing module 703 is specifically configured to:
obtaining a plurality of text frames with different scales in the image to be processed according to the image characteristics, and determining offset data of the text frames with different scales;
performing text box regression processing on the text boxes with different scales based on the offset data.
In a possible implementation manner, the text box processing module 703 is specifically configured to:
down-sample the image features, and perform down-sampling and convolution processing on the down-sampled image features; and

take the image features after the down-sampling and convolution processing as the new down-sampled image features, and re-execute the step of performing down-sampling and convolution processing on the down-sampled image features until the text boxes with different scales in the image to be processed are obtained, and determine the offset data of the text boxes with different scales.
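One way to realize this loop is sketched below: the features are repeatedly down-sampled and convolved, and offset data are predicted at each resulting scale. The number of scales, the stride-2 convolution, and the boxes-per-cell count are illustrative assumptions.

```python
import torch.nn as nn

class MultiScaleOffsetHead(nn.Module):
    """Illustrative multi-scale loop: repeatedly down-sample + convolve,
    predicting offset data (4 values per default box) at every scale."""
    def __init__(self, channels: int, num_scales: int = 4, boxes_per_cell: int = 6):
        super().__init__()
        self.downsamples = nn.ModuleList(
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1)
            for _ in range(num_scales)
        )
        self.offset_heads = nn.ModuleList(
            nn.Conv2d(channels, boxes_per_cell * 4, kernel_size=3, padding=1)
            for _ in range(num_scales)
        )

    def forward(self, features):
        offsets = []
        x = features
        for down, head in zip(self.downsamples, self.offset_heads):
            x = down(x)              # down-sampling + convolution
            offsets.append(head(x))  # offset data at this scale
        return offsets
```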
In a possible implementation manner, the text recognition module 704 is specifically configured to:
obtain scores of the text boxes with different scales after the text box regression processing according to the text boxes with different scales after the text box regression processing and a preset score model, where the preset score model is used to determine the scores of the text boxes with different scales according to the ratio of the intersection to the union of the highest-scoring text box among the text boxes with different scales and each of the text boxes with different scales; and

calculate the positions of the text boxes with different scales after the text box regression processing according to the scores of the text boxes with different scales after the text box regression processing, and determine the positions of the one or more characters in the image to be processed based on the positions of the text boxes with different scales after the text box regression processing.
In a possible implementation manner, the text recognition module 704 is specifically configured to:
calculate the ratio of the intersection to the union of the highest-scoring text box among the text boxes with different scales after the text box regression processing and a text box i after the text box regression processing, where the text box i after the text box regression processing is any one of the text boxes with different scales after the text box regression processing, i = 1, …, n, and n is an integer determined according to the number of the text boxes with different scales after the text box regression processing; and

if the calculated ratio is smaller than a preset threshold, calculate the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
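This ratio-and-threshold procedure behaves like classic non-maximum suppression. A minimal sketch, assuming axis-aligned (x1, y1, x2, y2) boxes and an example threshold of 0.5 (the application leaves the preset threshold unspecified), is:

```python
def iou(a, b):
    """Ratio of intersection to union for two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def filter_boxes(boxes, scores, threshold=0.5):
    """Keep the highest-scoring box, plus every box whose IoU with it
    falls below the preset threshold (NMS-style filtering)."""
    best = boxes[max(range(len(scores)), key=scores.__getitem__)]
    return [b for b in boxes if b is best or iou(best, b) < threshold]
```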
In a possible implementation manner, the feature extraction module 702 is specifically configured to:
perform parameter reduction processing on the image to be processed; and

perform feature extraction on the image to be processed after the parameter reduction processing to obtain the image features corresponding to the image to be processed.
In a possible implementation manner, the feature extraction module 702 is specifically configured to:
perform parameter reduction processing on the image to be processed by using 3 × 3 convolutional layers and one 2 × 2 pooling layer, where the 3 × 3 convolutional layers and the 2 × 2 pooling layer are connected in sequence.
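Such a parameter-reducing stem might be assembled as in the sketch below; the use of two stacked 3 × 3 convolutions and the channel widths are assumptions, since the application specifies only 3 × 3 convolutional layers followed in sequence by one 2 × 2 pooling layer.

```python
import torch.nn as nn

# Illustrative parameter-reducing stem: 3x3 convolutions connected in
# sequence with one 2x2 pooling layer. Using exactly two convolutions
# and the channel widths (3 -> 32 -> 64) are assumptions.
stem = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(kernel_size=2, stride=2),  # halves the spatial resolution
)
```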
In a possible implementation manner, the text recognition module 704 is specifically configured to:
recognize the characters in the image to be processed based on the positions of the one or more characters and a preset recognition model, where the preset recognition model is used to recognize the characters in an image according to the positions of the characters in the image.
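Conceptually, this final step crops each located region and hands it to the preset recognition model, as in the hedged sketch below; recognize_characters and recognition_model are hypothetical names, and the crop convention assumes a NumPy-style image array.

```python
def recognize_characters(image, positions, recognition_model):
    """Crop each located region and run the preset recognition model on it.
    `image` is assumed to be a NumPy-style HxWxC array; `positions` holds
    (x1, y1, x2, y2) boxes; `recognition_model` is a hypothetical callable."""
    results = []
    for (x1, y1, x2, y2) in positions:
        crop = image[y1:y2, x1:x2]
        results.append(recognition_model(crop))
    return results
```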
The apparatus provided in this embodiment of the present application may be configured to implement the technical solutions of the foregoing method embodiments; the implementation principles and technical effects are similar and are not described again here.
Optionally, Fig. 8 schematically shows a possible basic hardware architecture of the character recognition device described in the present application.

Referring to Fig. 8, the character recognition device 800 includes at least one processor 801 and a communication interface 803. Further optionally, a memory 802 and a bus 804 may also be included.
The character recognition device 800 may be the foregoing processing device; this is not limited in the present application. The number of processors 801 in the character recognition device 800 may be one or more, and Fig. 8 illustrates only one of the processors 801. Optionally, the processor 801 may be a central processing unit (CPU), a graphics processing unit (GPU), or a digital signal processor (DSP). If the character recognition device 800 has a plurality of processors 801, the types of the processors 801 may be different or the same. Optionally, the processors 801 of the character recognition device 800 may also be integrated as a multi-core processor.
The memory 802 stores computer instructions and data; for example, the memory 802 may store the computer instructions and data required to implement the character recognition methods provided herein, such as instructions for implementing the steps of those methods. The memory 802 may be any one or any combination of the following storage media: non-volatile memory (for example, read-only memory (ROM), a solid-state drive (SSD), a hard disk drive (HDD), or an optical disc) and volatile memory.
The communication interface 803 may provide information input/output for the at least one processor, and may further include any one or any combination of the following devices with network access capability: a network interface (for example, an Ethernet interface), a wireless network card, and the like.

Optionally, the communication interface 803 may also be used for data communication between the character recognition device 800 and other computing devices or terminals.
Further optionally, a bus 804, shown as a thick line in Fig. 8, may connect the processor 801 with the memory 802 and the communication interface 803. Thus, via the bus 804, the processor 801 may access the memory 802 and may also interact with other computing devices or terminals through the communication interface 803.
In the present application, the character recognition device 800 executes the computer instructions in the memory 802, so that the character recognition device 800 implements the character recognition method provided herein, or so that the character recognition apparatus described above is deployed on the character recognition device 800.
From the viewpoint of logical functional division, as shown in Fig. 8, the memory 802 may include the image acquisition module 701, the feature extraction module 702, the text box processing module 703, and the text recognition module 704. Here, "include" merely means that, when executed, the instructions stored in the memory can implement the functions of these modules; it does not limit the physical structure.
In addition, the character recognition apparatus may be implemented in software, as shown in Fig. 8, or in hardware, as a hardware module or a circuit unit.
The present application provides a computer-readable storage medium and a computer program product, each comprising computer instructions that instruct a computing device to perform the character recognition method provided herein.
The present application provides a chip, including at least one processor and a communication interface, where the communication interface provides information input and/or output for the at least one processor. Further, the chip may also include at least one memory for storing computer instructions. The at least one processor is configured to call and run the computer instructions to perform the character recognition method provided herein.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is only a division of logical functions, and other division manners may be used in actual implementation; for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be in electrical, mechanical, or other forms.

The units described as separate parts may or may not be physically separate, and the parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.

In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus a software functional unit.

Claims (14)

1. A method for recognizing a character, comprising:
acquiring an image to be processed, wherein the image to be processed carries one or more characters;
extracting the features of the image to be processed to obtain the image features corresponding to the image to be processed;
obtaining a plurality of text boxes with different scales in the image to be processed according to the image features, and performing text box regression processing on the text boxes with different scales; and

determining the positions of the one or more characters in the image to be processed according to the plurality of text boxes with different scales after the text box regression processing, and performing character recognition on the image to be processed based on the positions of the one or more characters.
2. The method according to claim 1, wherein the performing feature extraction on the image to be processed to obtain image features corresponding to the image to be processed comprises:
performing feature extraction on the image to be processed based on a dense connection network to obtain the image features corresponding to the image to be processed, wherein the dense connection network comprises one or more dense blocks, any two dense blocks in the dense connection network are directly connected, and the input of each dense block is the union of the outputs of all preceding dense blocks.
3. The method of claim 2, wherein the dense connection network further comprises one or more transitional connection layers, each transitional connection layer comprising a 1 × 1 convolutional layer, and the input of each transitional connection layer is the union of the outputs of all preceding dense blocks and transitional connection layers;
the performing feature extraction on the image to be processed based on the dense connection network to obtain the image features corresponding to the image to be processed includes:
and performing feature extraction on the image to be processed based on the one or more dense blocks and the one or more transitional connection layers to obtain the image features corresponding to the image to be processed.
4. The method according to any one of claims 1 to 3, wherein the obtaining a plurality of text boxes with different scales in the image to be processed according to the image features and performing text box regression processing on the text boxes with different scales comprises:
obtaining a plurality of text boxes with different scales in the image to be processed according to the image features, and determining offset data of the text boxes with different scales; and
performing text box regression processing on the text boxes with different scales based on the offset data.
5. The method according to claim 4, wherein the obtaining a plurality of text boxes with different scales in the image to be processed according to the image features and determining offset data of the text boxes with different scales comprises:
carrying out down-sampling processing on the image features, and carrying out down-sampling and convolution processing on the down-sampled image features; and

taking the image features after the down-sampling and convolution processing as the new down-sampled image features, and re-executing the step of carrying out down-sampling and convolution processing on the down-sampled image features until the text boxes with different scales in the image to be processed are obtained, and determining the offset data of the text boxes with different scales.
6. The method of any one of claims 1 to 3, wherein determining the position of the one or more words in the image to be processed according to a plurality of text boxes with different scales after text box regression processing comprises:
obtaining scores of the text boxes with different scales after the text box regression processing according to the text boxes with different scales after the text box regression processing and a preset score model, wherein the preset score model is used for determining the scores of the text boxes with different scales according to the ratio of the intersection to the union of the highest-scoring text box among the text boxes with different scales and each of the text boxes with different scales; and

calculating the positions of the text boxes with different scales after the text box regression processing according to the scores of the text boxes with different scales after the text box regression processing, and determining the positions of the one or more characters in the image to be processed based on the positions of the text boxes with different scales after the text box regression processing.
7. The method of claim 6, wherein calculating the positions of the text boxes of different scales after the text box regression processing according to the scores of the text boxes of different scales after the text box regression processing comprises:
calculating the ratio of the intersection to the union of the highest-scoring text box among the text boxes with different scales after the text box regression processing and a text box i after the text box regression processing, wherein the text box i after the text box regression processing is any one of the text boxes with different scales after the text box regression processing, i = 1, …, n, and n is an integer determined according to the number of the text boxes with different scales after the text box regression processing; and

if the calculated ratio is smaller than a preset threshold, calculating the position of the text box i after the text box regression processing according to the score of the text box i after the text box regression processing.
8. The method according to any one of claims 1 to 3, wherein before the performing the feature extraction on the image to be processed to obtain the image feature corresponding to the image to be processed, the method further comprises:
performing parameter reduction processing on the image to be processed;
the feature extraction of the image to be processed to obtain the image features corresponding to the image to be processed includes:
and performing feature extraction on the image to be processed after parameter reduction processing to obtain image features corresponding to the image to be processed.
9. The method according to claim 8, wherein the performing parameter reduction processing on the image to be processed comprises:
performing parameter reduction processing on the image to be processed by using 3 × 3 convolutional layers and one 2 × 2 pooling layer, wherein the 3 × 3 convolutional layers and the 2 × 2 pooling layer are connected in sequence.
10. The method of any one of claims 1 to 3, wherein the performing text recognition on the image to be processed based on the position of the one or more texts comprises:
recognizing the characters in the image to be processed based on the positions of the one or more characters and a preset recognition model, wherein the preset recognition model is used for recognizing the characters in an image according to the positions of the characters in the image.
11. A character recognition apparatus, comprising:
the image acquisition module is used for acquiring an image to be processed, and the image to be processed carries one or more characters;
the feature extraction module is used for performing feature extraction on the image to be processed to obtain the image features corresponding to the image to be processed;

the text box processing module is used for obtaining a plurality of text boxes with different scales in the image to be processed according to the image features and performing text box regression processing on the text boxes with different scales; and

the text recognition module is used for determining the positions of the one or more characters in the image to be processed according to the plurality of text boxes with different scales after the text box regression processing, and performing character recognition on the image to be processed based on the positions of the one or more characters.
12. A character recognition apparatus, comprising:
a processor;
a memory; and
a computer program;
wherein the computer program is stored in the memory and configured to be executed by the processor, the computer program comprising instructions for performing the method of any of claims 1-10.
13. A computer-readable storage medium, characterized in that it stores a computer program that causes a server to execute the method of any one of claims 1-10.
14. A computer program product, comprising computer instructions that, when executed by a processor, perform the method of any one of claims 1-10.
CN202111535285.5A 2021-12-15 2021-12-15 Character recognition method, device, equipment and storage medium Pending CN114495132A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202111535285.5A CN114495132A (en) 2021-12-15 2021-12-15 Character recognition method, device, equipment and storage medium
PCT/CN2022/102163 WO2023109086A1 (en) 2021-12-15 2022-06-29 Character recognition method, apparatus and device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111535285.5A CN114495132A (en) 2021-12-15 2021-12-15 Character recognition method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114495132A 2022-05-13

Family

ID=81493740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111535285.5A Pending CN114495132A (en) 2021-12-15 2021-12-15 Character recognition method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114495132A (en)
WO (1) WO2023109086A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023109086A1 (en) * 2021-12-15 2023-06-22 深圳前海微众银行股份有限公司 Character recognition method, apparatus and device, and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109583449A (en) * 2018-10-29 2019-04-05 深圳市华尊科技股份有限公司 Character identifying method and Related product
CN111476067B (en) * 2019-01-23 2023-04-07 腾讯科技(深圳)有限公司 Character recognition method and device for image, electronic equipment and readable storage medium
CN110443258B (en) * 2019-07-08 2021-03-02 北京三快在线科技有限公司 Character detection method and device, electronic equipment and storage medium
CN112364873A (en) * 2020-11-20 2021-02-12 深圳壹账通智能科技有限公司 Character recognition method and device for curved text image and computer equipment
CN114495132A (en) * 2021-12-15 2022-05-13 深圳前海微众银行股份有限公司 Character recognition method, device, equipment and storage medium


Also Published As

Publication number Publication date
WO2023109086A1 (en) 2023-06-22


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination