CN114220107A - Image processing method and device - Google Patents

Image processing method and device

Info

Publication number
CN114220107A
Authority
CN
China
Prior art keywords: image, features, training, processed, recognition model
Prior art date
Legal status
Pending
Application number
CN202111529035.0A
Other languages
Chinese (zh)
Inventor
张家鑫
黄灿
Current Assignee
Beijing Youzhuju Network Technology Co Ltd
Original Assignee
Beijing Youzhuju Network Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Youzhuju Network Technology Co Ltd
Priority to CN202111529035.0A
Publication of CN114220107A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Abstract

The application discloses an image processing method comprising: acquiring an image to be processed that includes characters; and inputting the image to be processed into a character recognition model to obtain the characters included in the image to be processed, where the character recognition model is used to extract global features and local features of the image to be processed and to obtain the characters included in the image to be processed according to the global features and the local features. In this scheme, when the character recognition model recognizes the characters in the image to be processed, it considers not only the global features of the image but also the associations between adjacent characters (namely, the local features), so that the character recognition model can accurately recognize the characters in the image to be processed.

Description

Image processing method and device
Technical Field
The present application relates to the field of image processing, and in particular, to an image processing method and apparatus.
Background
In some scenarios, it is desirable to recognize characters in an image. However, current methods for recognizing characters in images cannot do so accurately.
Therefore, a solution that can accurately recognize the characters in an image is urgently needed.
Disclosure of Invention
The technical problem to be solved by the present application is how to accurately recognize characters in an image; to this end, an image processing method and apparatus are provided.
In a first aspect, an embodiment of the present application provides an image processing method, where the method includes:
acquiring an image to be processed including characters;
inputting the image to be processed into a character recognition model to obtain characters included in the image to be processed; wherein:
the character recognition model is used for extracting the global features and the local features of the image to be processed and obtaining the characters included in the image to be processed according to the global features and the local features.
Optionally, the character recognition model includes an encoder and a decoder, where the encoder is configured to extract global features and local features of the image to be processed, and the decoder is configured to obtain a character prediction result according to the global features and the local features.
Optionally, the encoder includes a multi-head attention module and a convolution module, the multi-head attention module and the convolution module are connected in parallel, the multi-head attention module is configured to extract a global feature of the image to be processed, and the convolution module is configured to extract a local feature of the image to be processed.
Optionally, the encoder further includes a feature preprocessing module, where the feature preprocessing module is configured to obtain a first feature sequence; the multi-head attention module is configured to process part or all of the features in the first feature sequence to obtain the global features, and the convolution module is configured to process part or all of the features in the first feature sequence to obtain the local features, where any feature in the first feature sequence is processed by the multi-head attention module and/or the convolution module.
Optionally, the multi-head attention module is configured to process half of the features in the first feature sequence, and the convolution module is configured to process the other half of the features in the first feature sequence.
Optionally, the character recognition model is obtained by training in the following way:
acquiring a training image and a label corresponding to the training image, wherein the label corresponding to the training image is used for indicating characters included in the training image;
training a character recognition model based on the training image and a label corresponding to the training image, wherein the character recognition model is used for recognizing characters in the image; wherein:
the training of the character recognition model based on the training images and the labels corresponding to the training images comprises:
extracting global features and local features of the training images;
obtaining a character prediction result according to the global features and the local features of the training images;
and updating the parameters of the character recognition model based on the character prediction result and the label corresponding to the training image.
In a second aspect, an embodiment of the present application provides an image processing apparatus, including:
an acquisition unit configured to acquire an image to be processed including characters;
the processing unit is used for inputting the image to be processed into a character recognition model to obtain characters included in the image to be processed; wherein:
the character recognition model is used for extracting the global features and the local features of the image to be processed and obtaining the characters included in the image to be processed according to the global features and the local features.
Optionally, the character recognition model includes an encoder and a decoder, where the encoder is configured to extract global features and local features of the image to be processed, and the decoder is configured to obtain a character prediction result according to the global features and the local features.
Optionally, the encoder includes a multi-head attention module and a convolution module, the multi-head attention module and the convolution module are connected in parallel, the multi-head attention module is configured to extract a global feature of the image to be processed, and the convolution module is configured to extract a local feature of the image to be processed.
Optionally, the encoder further includes a feature preprocessing module, where the feature preprocessing module is configured to obtain a first feature sequence; the multi-head attention module is configured to process part or all of the features in the first feature sequence to obtain the global features, and the convolution module is configured to process part or all of the features in the first feature sequence to obtain the local features, where any feature in the first feature sequence is processed by the multi-head attention module and/or the convolution module.
Optionally, the multi-head attention module is configured to process half of the features in the first feature sequence, and the convolution module is configured to process the other half of the features in the first feature sequence.
Optionally, the character recognition model is obtained by training in the following way:
acquiring a training image and a label corresponding to the training image, wherein the label corresponding to the training image is used for indicating characters included in the training image;
training a character recognition model based on the training image and a label corresponding to the training image, wherein the character recognition model is used for recognizing characters in the image; wherein:
the training of the character recognition model based on the training images and the labels corresponding to the training images comprises:
extracting global features and local features of the training images;
obtaining a character prediction result according to the global features and the local features of the training images;
and updating the parameters of the character recognition model based on the character prediction result and the label corresponding to the training image.
In a third aspect, an embodiment of the present application provides an apparatus, which includes a processor and a memory;
the processor is configured to execute instructions stored in the memory to cause the apparatus to perform the method of any of the first aspects above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium comprising instructions that instruct a device to perform the method according to any one of the above first aspects.
In a fifth aspect, embodiments of the present application provide a computer program product, which when run on a computer, causes the computer to perform the method of any of the above first aspects.
Compared with the prior art, the embodiment of the application has the following advantages:
the embodiment of the application provides an image processing method, which comprises the following steps: acquiring an image to be processed including characters; inputting the image to be processed into the character recognition model to obtain characters included in the image to be processed; wherein: the character recognition model is used for extracting the global features and the local features of the image to be processed and obtaining the characters included in the image to be processed according to the global features and the local features. Therefore, in the scheme, when the character recognition model recognizes the characters in the image to be processed, the global features of the image can be considered, and the association (namely the local features) between the similar characters can be considered, so that the character recognition model can accurately recognize the characters in the image to be processed.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings can be obtained from them by those skilled in the art without creative effort.
Fig. 1 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a character recognition model according to an embodiment of the present application;
fig. 3 is a schematic flowchart of an image processing method according to an embodiment of the present application;
fig. 4 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
Detailed Description
In order to make the technical solutions of the present application better understood, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. It is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of them. All other embodiments obtained by a person skilled in the art based on the embodiments given herein without creative effort shall fall within the protection scope of the present application.
The inventors of the present application have found through research that, at present, a machine learning model may be trained in advance and the trained machine learning model used to recognize an image, so as to determine the characters included in the image.
In one example, the machine learning model may be a Transformer model. The Transformer model includes an encoder and a decoder: the encoder encodes an image to obtain image features, and the decoder decodes the features output by the encoder to obtain the characters included in the image.
The encoder of the Transformer model includes a multi-head attention module, which is used to capture the global features of the image. For character recognition, however, closely arranged characters often carry stronger joint semantics; that is, the role of local features is not negligible. Because this encoder cannot capture the local features of the image, the Transformer model cannot accurately recognize the characters in the image.
In order to solve the above problem, embodiments of the present application provide an image processing method and apparatus.
Various non-limiting embodiments of the present application are described in detail below with reference to the accompanying drawings.
Exemplary method
An embodiment of the present application provides an image processing method in which the characters in an image to be processed are recognized by a character recognition model obtained through pre-training. When the character recognition model recognizes the characters in the image to be processed, it considers not only the global features of the image but also the associations between adjacent characters (namely, the local features), so that the character recognition model can accurately recognize the characters in the image to be processed.
Next, the training process of the character recognition model will be described first.
Refer to fig. 1, which is a schematic flowchart of a model training method provided in an embodiment of the present application. In this embodiment, the method may be executed by a terminal or by a server, which is not specifically limited in the embodiments of the present application.
The method shown in fig. 1, for example, may comprise the steps of: S101-S102.
It should be noted that the process of model training is a process of multiple iterative computations, each iteration can adjust the parameters of the model, and the adjusted parameters participate in the next iterative computation.
Fig. 1 illustrates one iteration in the training of the character recognition model, taking a particular training image as an example. It will be appreciated that many groups of training images are used to train the character recognition model, and each group is processed in a similar way when the character recognition model is trained. After training on multiple groups of training images, a character recognition model whose accuracy meets the requirement can be obtained.
S101: the method comprises the steps of obtaining a training image and a label corresponding to the training image, wherein the label corresponding to the training image is used for indicating characters included in the training image.
In one example, the training image may be an image that includes characters. The training image may be captured with a photographing device, obtained from a network resource, or obtained in other ways, which is not specifically limited in the embodiments of the present application.
In one example, a raw image may be acquired and then processed to obtain the training image. The processing may include, for example, resizing the raw image; in one example, the width and height of the raw image are scaled proportionally so that the height of the processed image equals a preset height (for example, 32 pixels).
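As an illustration of the proportional scaling just described, the following is a minimal sketch assuming a PIL-based pipeline; the helper name and the default height of 32 pixels are taken from the example in this paragraph, not from a prescribed implementation.

```python
# Minimal sketch, assuming a PIL-based pipeline; the helper name and the
# default height of 32 pixels follow the example above, not a fixed specification.
from PIL import Image

def resize_to_height(image: Image.Image, target_height: int = 32) -> Image.Image:
    """Scale width and height proportionally so the result has the preset height."""
    width, height = image.size
    new_width = max(1, round(width * target_height / height))  # keep the aspect ratio
    return image.resize((new_width, target_height), Image.BILINEAR)
```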
In one example, the labels corresponding to the training images may be manually labeled.
S102: and training a character recognition model based on the training image and the label corresponding to the training image, wherein the character recognition model is used for recognizing characters in the image.
S102, in a particular implementation, may include the following S1021-S1023.
S1021: and extracting global features and local features of the training images.
In one example, the character recognition model may include a feature extraction module to extract global and local features of the training image.
In one example, the character recognition model may be a Transformer model, and in this case, the feature extraction module may be an encoder. Different from the traditional Transformer model, the encoder included in the character recognition model in the embodiment of the application can extract not only the global features of the training image, but also the local features of the training image.
In one example, a convolution module can be added to the encoder of the conventional Transformer model, so as to achieve the purpose of extracting both global features and local features. In other words, the encoder may comprise a multi-head attention module for extracting global features of the training image and a convolution module for extracting local features of the training image.
In one example, the multi-head attention module and the convolution module may be arranged in parallel. The advantage of the parallel arrangement is that the multi-head attention module can process features that have not undergone local feature extraction and the convolution module can process features that have not undergone global feature extraction, so that the global features extracted by the multi-head attention module and the local features extracted by the convolution module are both more comprehensive.
In one example, the character recognition model may include N encoders arranged in series, where the output of the i-th encoder is the input of the (i+1)-th encoder and the output of the N-th encoder is the input of the decoder. The value of N is not specifically limited in the embodiments of the present application; in one example, N is 7. In one example, in order to extract as many local features as possible, every encoder has the same structure; in other words, each encoder may include a multi-head attention module and a convolution module arranged in parallel. In yet another example, some of the N encoders may instead follow the native encoder of the Transformer model, namely: some of the N encoders may not include a convolution module connected in parallel with the multi-head attention module, and the embodiments of the present application are not specifically limited in this respect. In the following description, it is assumed that each encoder includes a multi-head attention module and a convolution module arranged in parallel; unless otherwise specified, the encoder mentioned below is any one of the aforementioned N encoders.
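As a rough illustration of the serial arrangement, the sketch below chains the encoders so that the output of the i-th feeds the (i+1)-th; the class name and the way the layers are constructed are assumptions standing in for whatever encoder block the model actually uses.

```python
import torch.nn as nn

class EncoderStack(nn.Module):
    """Hypothetical sketch: N encoders in series; the last output feeds the decoder."""
    def __init__(self, encoder_layers):
        super().__init__()
        self.layers = nn.ModuleList(encoder_layers)  # e.g. 7 identical encoder blocks

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)  # output of encoder i is the input of encoder i + 1
        return x          # passed on to the first decoder
```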
In one example, the encoder further includes a feature preprocessing module configured to obtain a first feature sequence, and the multi-head attention module and the convolution module of the encoder are configured to process the first feature sequence. In one example, the multi-head attention module processes part or all of the features in the first feature sequence to obtain the global features, and the convolution module processes part or all of the features in the first feature sequence to obtain the local features, where any feature in the first feature sequence is processed by the multi-head attention module and/or the convolution module, so that no feature is wasted in a way that would impair the training of the model.
It is understood that, for the first encoder of the N encoders, the feature preprocessing module of the first encoder obtains a first feature sequence according to the training image, while for the j-th encoder (j greater than 1), the feature preprocessing module of the j-th encoder obtains a first feature sequence according to the output of the (j-1)-th encoder. As regards the first feature sequence, reference may be made to the description of fig. 2 below, which is not detailed here.
In one example, to reduce the computational cost of the character recognition model, the first feature sequence may be divided into two parts: the multi-head attention module processes half of the features in the first feature sequence, and the convolution module processes the other half. In this way, every feature in the first feature sequence is processed while the amount of computation is effectively reduced.
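The parallel split could look roughly like the sketch below: half of the first feature sequence goes through multi-head attention for global features, and the other half through a one-dimensional convolution for local features. The split rule (first half versus second half), the kernel size, and the head count are assumptions for illustration; the text does not fix these values, and how the two outputs are merged afterwards is left to the feature processing module.

```python
import torch
import torch.nn as nn

class ParallelAttentionConvSketch(nn.Module):
    """Illustrative only: parallel multi-head attention and convolution branches,
    each handling one half of the feature sequence (assumed split rule)."""
    def __init__(self, dim: int, num_heads: int = 8, kernel_size: int = 3):
        super().__init__()
        # dim is assumed to be divisible by num_heads
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, length, dim); every feature goes through exactly one branch
        half = seq.size(1) // 2
        attn_in, conv_in = seq[:, :half], seq[:, half:]
        global_feats, _ = self.attn(attn_in, attn_in, attn_in)            # global features
        local_feats = self.conv(conv_in.transpose(1, 2)).transpose(1, 2)  # local features
        return torch.cat([global_feats, local_feats], dim=1)
```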
S1022: and obtaining a character prediction result according to the global features and the local features.
In one example, the character recognition model may include a character recognition module configured to obtain a character prediction result according to the global features and the local features. In one example, the character recognition module may be a decoder, which may be the native decoder of a conventional Transformer model and is not described in detail here.
It should be noted that when the decoder is said to obtain the character prediction result according to the global features and the local features, this means that the decoder decodes features obtained by further processing the global features and the local features. For example: the multi-head attention module of the encoder outputs the global features, the convolution module outputs the local features, the feature processing module of the encoder (e.g., 214 shown in fig. 2) further processes the global and local features, and the decoder decodes the processed features to obtain the character prediction result.
S1023: and updating the parameters of the character recognition model based on the character prediction result and the label corresponding to the training image.
Since the label corresponding to the training image indicates the characters included in the training image, and the character prediction result is the characters in the training image as recognized by the character recognition model, the parameters of the character recognition model can be updated based on the character prediction result and the label corresponding to the training image. In subsequent training iterations, the character prediction result of the character recognition model with the adjusted parameters then moves closer to the label corresponding to the training image.
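As a hedged illustration of this parameter update, the sketch below assumes a per-character cross-entropy loss between the prediction and the label; the text does not name a specific loss, and the model and optimizer interfaces shown are hypothetical.

```python
import torch.nn as nn

criterion = nn.CrossEntropyLoss()  # assumed loss; the text does not name one

def training_step(model, optimizer, image, label_ids):
    """One parameter update: compare the character prediction with the label."""
    logits = model(image)                   # assumed shape: (batch, steps, vocab_size)
    loss = criterion(logits.flatten(0, 1),  # merge batch and step dimensions
                     label_ids.flatten())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                        # prediction moves closer to the label
    return loss.item()
```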
The character recognition model will now be described with reference to fig. 2. Fig. 2 is a schematic structural diagram of a character recognition model according to an embodiment of the present application.
As shown in fig. 2, the character recognition model includes N encoders 210 and M decoders 220.
Wherein:
N is an integer greater than or equal to 1; in one example, N = 7. M is also an integer greater than or equal to 1.
The encoder 210 includes a feature preprocessing module 211, a multi-head attention module 212, a convolution module 213, and a feature processing module 214.
In one example, for the first encoder, the inputs to the feature preprocessing module 211 are a position vector and the initial image features of the image to be processed, where the initial image features are obtained through a convolutional neural network and the position vector indicates the position of each feature within the initial image features. The first feature sequence obtained by the feature preprocessing module 211 includes the initial image features, the position vector, and a partial output of the feature preprocessing module; as shown in fig. 2, this partial output is half of the features output by the feature preprocessing module. The convolutional neural network may comprise a plurality of (e.g., 4) residual blocks, because residual blocks are effective at preventing vanishing gradients.
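As a hedged illustration of the residual blocks mentioned above, the sketch below shows one plausible block; the channel count, normalization choice, and layer layout are assumptions, and only the skip connection that counters vanishing gradients is taken from the text.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Illustrative residual block; a small stack of these (e.g. 4) could serve as
    the convolutional network that produces the initial image features."""
    def __init__(self, channels: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.relu(self.body(x) + x)  # skip connection counters vanishing gradients
```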
In one example, for the j-th encoder, the input to the feature preprocessing module 211 is the output of the (j-1)-th encoder (indicated by the dashed line in fig. 2). The first feature sequence obtained by the feature preprocessing module 211 includes the output of the (j-1)-th encoder and a partial output of the feature preprocessing module; as shown in fig. 2, this partial output is half of the features output by the feature preprocessing module.
Feature processing module 214 is configured to further process the output of multi-head attention module 212 and the output of convolution module 213 to obtain the output of the encoder.
The output of the last of the N encoders may be used as the input to the first of the N decoders.
With respect to the feature processing module 214, reference may be made to a native encoder of a conventional Transformer model, which will not be described in detail herein.
As for the decoder, reference may be made to the native decoder of a conventional Transformer model, which is not described in detail here.
After the character recognition model is obtained through training, it can be used to recognize characters in an image. In one example, an image to be processed that includes characters is acquired; the image to be processed may be captured with a photographing device, obtained from a network resource, or obtained in other ways, which is not specifically limited in the embodiments of the present application.
After the image to be processed is obtained, it may be input into the trained character recognition model, and the character recognition model outputs the characters included in the image to be processed. Because both global features and local features are extracted when the character recognition model is trained, the model considers not only the global features of the image to be processed but also the associations between adjacent characters (namely, the local features) when recognizing the characters, and can therefore recognize the characters in the image to be processed accurately.
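For illustration, inference with the trained model could look like the sketch below; the greedy argmax decoding and the `char_map` lookup (mapping predicted ids back to characters) are assumptions made for the example, not details taken from the text.

```python
import torch

def recognize(model, char_map, image_tensor):
    """Run the trained character recognition model on one preprocessed image."""
    model.eval()
    with torch.no_grad():
        logits = model(image_tensor.unsqueeze(0))  # add a batch dimension
        ids = logits.argmax(dim=-1).squeeze(0)     # most likely character id per step
    return "".join(char_map[int(i)] for i in ids)  # map ids back to characters
```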
Next, an image processing method provided in an embodiment of the present application is described. Referring to fig. 3, which is a schematic flowchart of an image processing method provided in an embodiment of the present application, the method shown in fig. 3 may be implemented, for example, by the following steps S301 to S302.
S301: an image to be processed including characters is acquired.
S302: inputting the image to be processed into the character recognition model to obtain characters included in the image to be processed, wherein: the character recognition model is used for extracting the global features and the local features of the image to be processed and obtaining the characters included in the image to be processed according to the global features and the local features.
With regard to the image to be processed, reference may be made to the relevant description section above, which is not detailed here.
The character recognition model mentioned in S302 refers to a character recognition model obtained by training using the method shown in fig. 1.
As can be seen from the method shown in fig. 1:
in one example, the character recognition model includes an encoder configured to extract global features and local features of the image to be processed, and a decoder configured to obtain a character prediction result according to the global features and the local features.
In one example, the encoder includes a multi-head attention module and a convolution module, the multi-head attention module and the convolution module are connected in parallel, the multi-head attention module is used for extracting global features of the image to be processed, and the convolution module is used for extracting local features of the image to be processed.
In one example, the encoder further comprises a feature pre-processing module for deriving a first sequence of features; the multi-head attention module is configured to process part or all of the features in the first feature sequence to obtain the global features, and the convolution module is configured to process part or all of the features in the first feature sequence to obtain the local features, where any feature in the first feature sequence is processed by the multi-head attention module and/or the convolution module.
In one example, the multi-head attention module is configured to process half of the features in the first feature sequence, and the convolution module is configured to process the other half of the features in the first feature sequence.
Exemplary device
Based on the method provided by the above embodiment, the embodiment of the present application further provides an apparatus, which is described below with reference to the accompanying drawings.
Referring to fig. 4, the figure is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application. The apparatus 400 may specifically include, for example: an acquisition unit 401 and a processing unit 402.
An acquisition unit 401 configured to acquire an image to be processed including characters;
a processing unit 402, configured to input the image to be processed into the character recognition model, so as to obtain characters included in the image to be processed; wherein:
the character recognition model is used for extracting the global features and the local features of the image to be processed and obtaining the characters included in the image to be processed according to the global features and the local features.
Optionally, the character recognition model includes an encoder and a decoder, where the encoder is configured to extract global features and local features of the image to be processed, and the decoder is configured to obtain a character prediction result according to the global features and the local features.
Optionally, the encoder includes a multi-head attention module and a convolution module, the multi-head attention module and the convolution module are connected in parallel, the multi-head attention module is configured to extract a global feature of the image to be processed, and the convolution module is configured to extract a local feature of the image to be processed.
Optionally, the encoder further includes a feature preprocessing module, where the feature preprocessing module is configured to obtain a first feature sequence; the multi-head attention module is configured to process part or all of the features in the first feature sequence to obtain the global features, and the convolution module is configured to process part or all of the features in the first feature sequence to obtain the local features, where any feature in the first feature sequence is processed by the multi-head attention module and/or the convolution module.
Optionally, the multi-head attention module is configured to process half of the features in the first feature sequence, and the convolution module is configured to process the other half of the features in the first feature sequence.
Optionally, the character recognition model is obtained by training in the following way:
acquiring a training image and a label corresponding to the training image, wherein the label corresponding to the training image is used for indicating characters included in the training image;
training a character recognition model based on the training image and a label corresponding to the training image, wherein the character recognition model is used for recognizing characters in the image; wherein:
the training of the character recognition model based on the training images and the labels corresponding to the training images comprises:
extracting global features and local features of the training images;
obtaining a character prediction result according to the global features and the local features of the training images;
and updating the parameters of the character recognition model based on the character prediction result and the label corresponding to the training image.
Since the apparatus 400 is a device corresponding to the image processing method provided in the above method embodiment, and the specific implementation of each unit of the apparatus 400 is the same as the image processing method described in the above method embodiment, reference may be made to the relevant description part of the above method embodiment for the specific implementation of each unit of the apparatus 400, and details are not repeated here.
An embodiment of the present application further provides an apparatus, which includes a processor and a memory;
the processor is used for executing the instructions stored in the memory so as to cause the equipment to execute the image processing method provided by the above method embodiment.
The embodiment of the application provides a computer-readable storage medium which comprises instructions for instructing equipment to execute the image processing method provided by the method embodiment.
The embodiment of the present application further provides a computer program product, which when running on a computer, causes the computer to execute the image processing method provided by the above method embodiment.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the invention disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice in the art to which the invention pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the application being indicated by the following claims.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (10)

1. An image processing method, characterized in that the method comprises:
acquiring an image to be processed including characters;
inputting the image to be processed into a character recognition model to obtain characters included in the image to be processed; wherein:
the character recognition model is used for extracting the global features and the local features of the image to be processed and obtaining the characters included in the image to be processed according to the global features and the local features.
2. The method according to claim 1, wherein the character recognition model comprises an encoder and a decoder, the encoder is used for extracting global features and local features of the image to be processed, and the decoder is used for obtaining character prediction results according to the global features and the local features.
3. The method according to claim 2, wherein the encoder comprises a multi-head attention module and a convolution module, the multi-head attention module and the convolution module are connected in parallel, the multi-head attention module is used for extracting global features of the image to be processed, and the convolution module is used for extracting local features of the image to be processed.
4. The method of claim 3, wherein the encoder further comprises a feature pre-processing module configured to obtain a first sequence of features; the multi-head attention module is configured to process part or all of the features in the first feature sequence to obtain the global features, and the convolution module is configured to process part or all of the features in the first feature sequence to obtain the local features, where any feature in the first feature sequence is processed by the multi-head attention module and/or the convolution module.
5. The method of claim 4, wherein the multi-head attention module is configured to process half of the features in the first sequence of features, and wherein the convolution module is configured to process the other half of the features in the first sequence of features.
6. The method according to any one of claims 1 to 4, wherein the character recognition model is trained by:
acquiring a training image and a label corresponding to the training image, wherein the label corresponding to the training image is used for indicating characters included in the training image;
training a character recognition model based on the training image and a label corresponding to the training image, wherein the character recognition model is used for recognizing characters in the image; wherein:
the training of the character recognition model based on the training images and the labels corresponding to the training images comprises:
extracting global features and local features of the training images;
obtaining a character prediction result according to the global features and the local features of the training images;
and updating the parameters of the character recognition model based on the character prediction result and the label corresponding to the training image.
7. An image processing apparatus, characterized in that the apparatus comprises:
an acquisition unit configured to acquire an image to be processed including characters;
the processing unit is used for inputting the image to be processed into a character recognition model to obtain characters included in the image to be processed; wherein:
the character recognition model is used for extracting the global features and the local features of the image to be processed and obtaining the characters included in the image to be processed according to the global features and the local features.
8. The apparatus of claim 7, wherein the character recognition model is trained by:
acquiring a training image and a label corresponding to the training image, wherein the label corresponding to the training image is used for indicating characters included in the training image;
training a character recognition model based on the training image and a label corresponding to the training image, wherein the character recognition model is used for recognizing characters in the image; wherein:
the training of the character recognition model based on the training images and the labels corresponding to the training images comprises:
extracting global features and local features of the training images;
obtaining a character prediction result according to the global features and the local features of the training images;
and updating the parameters of the character recognition model based on the character prediction result and the label corresponding to the training image.
9. An apparatus, comprising a processor and a memory;
the processor is to execute instructions stored in the memory to cause the device to perform the method of any of claims 1 to 6.
10. A computer-readable storage medium comprising instructions that direct a device to perform the method of any of claims 1-6.
CN202111529035.0A 2021-12-14 2021-12-14 Image processing method and device Pending CN114220107A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111529035.0A CN114220107A (en) 2021-12-14 2021-12-14 Image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111529035.0A CN114220107A (en) 2021-12-14 2021-12-14 Image processing method and device

Publications (1)

Publication Number Publication Date
CN114220107A true CN114220107A (en) 2022-03-22

Family

ID=80702032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111529035.0A Pending CN114220107A (en) 2021-12-14 2021-12-14 Image processing method and device

Country Status (1)

Country Link
CN (1) CN114220107A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination