CN114708581A - Image processing method and device, electronic equipment and storage medium - Google Patents

Image processing method and device, electronic equipment and storage medium

Info

Publication number
CN114708581A
CN114708581A (Application CN202210376081.XA)
Authority
CN
China
Prior art keywords
feature
mapping
layer
target image
mapping information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210376081.XA
Other languages
Chinese (zh)
Inventor
秦勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Xingtong Technology Co ltd
Original Assignee
Shenzhen Xingtong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Xingtong Technology Co ltd filed Critical Shenzhen Xingtong Technology Co ltd
Priority to CN202210376081.XA priority Critical patent/CN114708581A/en
Publication of CN114708581A publication Critical patent/CN114708581A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/23Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Character Discrimination (AREA)

Abstract

The present disclosure relates to an image processing method and apparatus, an electronic device, and a storage medium. The method acquires a target image containing multiple lines of text, each line containing at least one character, and recognizes the characters in the target image with a pre-trained text recognition model. The model comprises a feature mapping layer, a feature extraction layer, an adaptive clustering layer, and an output layer: the feature mapping layer performs feature mapping on the target image to obtain mapping information corresponding to the target image; the feature extraction layer performs feature extraction on the target image to obtain feature information corresponding to the target image; the adaptive clustering layer obtains a feature matrix based on the mapping information and the feature information; and the output layer obtains a probability matrix corresponding to the target image based on the feature matrix. The recognition result of the characters in the target image is then obtained from the probability matrix. The method can accurately recognize text in an image, with a fast recognition speed and high accuracy.

Description

Image processing method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method and apparatus, an electronic device, and a storage medium.
Background
Image-based text recognition is now widely used. Typically, before recognition is performed, the lines of text in the image must first be detected, and recognition is then carried out line by line.
However, this detect-then-recognize approach can miss characters or recognize them more than once. For an image with a complex text layout or background, for example one containing straight, oblique, and curved text at once, the approach can introduce additional errors, and its recognition speed and accuracy are relatively low.
Disclosure of Invention
To solve this technical problem, the present disclosure provides an image processing method that can accurately recognize text in an image, with a fast recognition speed and high accuracy.
In a first aspect, an embodiment of the present disclosure provides an image processing method, including:
acquiring a target image, wherein the target image comprises a plurality of lines of texts, and each line of text in the plurality of lines of texts comprises at least one character;
recognizing characters in the target image through a pre-trained text recognition model, wherein the text recognition model comprises a feature mapping layer, a feature extraction layer, an adaptive clustering layer, and an output layer; the feature mapping layer performs feature mapping on the target image to obtain mapping information corresponding to the target image; the feature extraction layer performs feature extraction on the target image to obtain feature information corresponding to the target image; the adaptive clustering layer obtains a feature matrix based on the mapping information and the feature information; and the output layer obtains a probability matrix corresponding to the target image based on the feature matrix;
and obtaining the recognition result of the characters in the target image according to the probability matrix corresponding to the target image.
In a second aspect, an embodiment of the present disclosure provides an image processing apparatus, including:
an acquisition unit, configured to acquire a target image, wherein the target image comprises multiple lines of text and each line of text comprises at least one character;
the system comprises a first identification unit, a second identification unit and a probability matrix generation unit, wherein the first identification unit is used for identifying characters in a target image through a pre-trained text identification model, the text identification model comprises a feature mapping layer, a feature extraction layer, a self-adaptive clustering layer and an output layer, the feature mapping layer is used for performing feature mapping on the target image to obtain mapping information corresponding to the target image, the feature extraction layer is used for performing feature extraction on the target image to obtain feature information corresponding to the target image, the self-adaptive clustering layer is used for obtaining a feature matrix based on the mapping information and the feature information, and the output layer is used for obtaining a probability matrix corresponding to the target image based on the feature matrix;
and the second identification unit is used for obtaining the identification result of the characters in the target image according to the probability matrix corresponding to the target image.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform the image processing method described above.
In a fourth aspect, the disclosed embodiments provide a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the image processing method described above.
In a fifth aspect, an embodiment of the present disclosure provides a computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the image processing method described above.
The present disclosure provides an image processing method and apparatus, an electronic device, and a storage medium. The image processing method specifically comprises: acquiring a target image, wherein the target image comprises multiple lines of text and each line comprises at least one character; recognizing characters in the target image through a pre-trained text recognition model, wherein the text recognition model comprises a feature mapping layer, a feature extraction layer, an adaptive clustering layer, and an output layer, the feature mapping layer obtaining mapping information corresponding to the target image by feature mapping, the feature extraction layer obtaining feature information corresponding to the target image by feature extraction, the adaptive clustering layer obtaining a feature matrix based on the mapping information and the feature information, and the output layer obtaining a probability matrix corresponding to the target image based on the feature matrix; and obtaining the recognition result of the characters in the target image according to the probability matrix. The method can accurately recognize text in an image, with a fast recognition speed and high accuracy.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
To illustrate the embodiments of the present disclosure or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly introduced below; those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of an application scenario provided in an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a text recognition model training method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of a text recognition model according to an embodiment of the present disclosure;
fig. 4 is a schematic flowchart of an image processing method according to an embodiment of the present disclosure;
fig. 5 is a schematic flowchart of an image processing method according to an embodiment of the disclosure;
fig. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
In order that the above objects, features and advantages of the present disclosure may be more clearly understood, aspects of the present disclosure will be further described below. It should be noted that the embodiments and features of the embodiments of the present disclosure may be combined with each other without conflict.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present disclosure, but the present disclosure may be practiced in other ways than those described herein; it is to be understood that the embodiments disclosed in the specification are only a few embodiments of the present disclosure, and not all embodiments.
Character recognition in natural scenes is a very challenging problem. It is the process of recognizing the character sequence in a picture that contains characters, that is, recognizing all the characters the picture includes. Beyond factors such as complex backgrounds and illumination changes, the complexity of the output space is itself a difficulty: because text consists of a variable number of letters or characters, recognition must produce a sequence of unfixed length. Two recognition approaches are common. The first follows a bottom-up strategy, splitting the problem into character detection, character recognition, and character combination, solved one by one. The second follows a holistic strategy, namely a sequence-to-sequence method, which first encodes the image and then decodes a sequence to obtain the whole character string directly. Both approaches have problems: the first is effective but requires character-level labeling, that is, manually annotating the position and identity of every character in the input image, which is labor-intensive; the second requires only transcribed character strings, so labeling is simple, but its output may contain extra or missing characters.
Secondly, text images can be roughly divided into three types by writing direction and trend: normal (straight) text, oblique text with an angle, and curved text. Normal text is written from left to right, with all characters roughly on a straight line that nearly coincides with the horizontal direction. Oblique text is also written from left to right with the characters roughly on a straight line, but that line forms an angle with the horizontal, giving a slanted line of characters. Curved text is arch-shaped: written from left to right, but the characters do not lie on one straight line, and connecting their center points yields roughly a curve. As noted above, common text recognition operates on single-line text images, so a text detection step is generally required beforehand to obtain individual lines. For a conventional text image such as normal text, the line layout is relatively standard, detection works well, and the detect-plus-recognize pipeline introduces little error. But a complex text image may have an obvious hierarchical structure. Take an image containing a page of multiple problems as an example: the problems themselves can be detected and separated well by detection boxes, but each problem spans several lines of differing lengths, and some lines are far longer than they are tall. This increases the difficulty of detection, so the detect-plus-recognize pipeline can introduce errors and yield inaccurate recognition results.
In summary, text recognition is widely applied, but both existing approaches have problems: character-based methods have high labeling cost, while sequence-based methods may miss characters or recognize extra ones. Text in images may be straight, oblique, or curved, and most recognition methods target a single line of text and require detection as a preceding operation; for images with complex layouts, the detect-plus-recognize pipeline can introduce additional errors and lower recognition accuracy.
In view of the above technical problem, the embodiments of the present disclosure provide an image processing method, which is specifically described in detail by one or more embodiments below.
Specifically, the image processing method may be executed by a terminal or a server; the terminal or server recognizes characters in the target image through a text recognition model. The execution subject of the text recognition model training method and the execution subject of the character recognition method may be the same or different.
For example, in one application scenario, as shown in FIG. 1, the server 12 trains a text recognition model. The terminal 11 obtains the trained text recognition model from the server 12, and the terminal 11 recognizes the characters in the target image through the trained text recognition model. The target image may be captured by the terminal 11. Alternatively, the target image is acquired by the terminal 11 from another device. Still alternatively, the target image is an image obtained by image processing of a preset image by the terminal 11, where the preset image may be obtained by shooting by the terminal 11, or the preset image may be obtained by the terminal 11 from another device. Here, the other devices are not particularly limited.
In another application scenario, the server 12 trains a text recognition model. Further, the server 12 recognizes the characters in the target image through the trained text recognition model. The manner in which the server 12 acquires the target image may be similar to the manner in which the terminal 11 acquires the target image as described above, and will not be described herein again.
In yet another application scenario, the terminal 11 trains a text recognition model. Further, the terminal 11 recognizes the characters in the target image through the trained text recognition model.
It can be understood that the text recognition model training method and the image processing method provided by the embodiments of the present disclosure are not limited to the scenarios described above. Since the trained text recognition model is applied in the image processing method, the training method is introduced first.
Taking the example of training the text recognition model by the server 12, a text recognition model training method, i.e., a training process of the text recognition model, is introduced below. It is understood that the text recognition model training method is also applicable to the scenario in which the terminal 11 trains the text recognition model.
Fig. 2 is a schematic flow chart of a text recognition model training method provided in an embodiment of the present disclosure, which specifically includes the following steps S210 to S230 shown in fig. 2:
s210, acquiring a sample image and characters in the sample image.
Optionally, step S210 specifically includes: acquiring a sample image, wherein the sample image comprises multiple lines of text; and labeling the characters included in each line of text, thereby determining the characters in the sample image, wherein the characters of different lines belonging to the same sample image are separated by a preset character.
It can be understood that multiple sample images are obtained, where a sample image is a text image to be recognized. Each sample image includes multiple text lines, which may be straight, oblique, or curved; for example, a sample image may include several lines of curved text. After the sample images are obtained, they are scaled to the same size and all characters in each sample image are labeled. The labeled characters serve as the ground-truth characters: they are the reference against which the character recognition results output during training are compared, and the network parameters of the text recognition model are updated accordingly. If a sample image includes only a single line of text, its characters can be labeled directly; if it includes multiple lines, the labeled characters of different lines are separated by a preset character, which may be "\n", so that the lines of the sample image are concatenated in series. For example, if sample image 1 includes 2 lines of text, the first labeled "AAA" and the second labeled "BBBBB", then the character sequence of sample image 1 is "AAA\nBBBBB". A dictionary is then built from the character annotations of all sample images: it contains every distinct character appearing in them, obtained by taking the union of the characters in the labeled sequences, so that each character appears exactly once. This dictionary serves as the character database for character recognition with the text recognition model; specific characters are selected from it according to the recognition probabilities.
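As a concrete reading of the labeling scheme above, the following Python sketch joins per-line transcriptions with the preset character "\n" and builds the character dictionary by taking the union of the labeled sequences. The function names and the index assignment are illustrative assumptions, not the patent's implementation.

```python
# Minimal sketch of the labeling scheme described above (illustrative only):
# lines of one sample image are joined with the preset character "\n",
# and a dictionary of unique characters is built from all label sequences.

def serialize_labels(text_lines):
    """Join per-line transcriptions with the preset separator '\\n'."""
    return "\n".join(text_lines)

def build_dictionary(label_sequences):
    """Collect the set of unique characters across all labeled samples."""
    chars = set()
    for seq in label_sequences:
        chars.update(seq)
    # Fixed ordering so each character maps to a stable class index.
    return {c: i for i, c in enumerate(sorted(chars))}

labels = [serialize_labels(["AAA", "BBBBB"])]  # sample image 1 -> "AAA\nBBBBB"
dictionary = build_dictionary(labels)          # {'\n': 0, 'A': 1, 'B': 2}
```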
S220, inputting the sample image into the constructed text recognition model to obtain a preset character corresponding to the sample image.
It can be understood that, after the sample image and its characters are acquired in S210, the sample image is input into the constructed text recognition model to obtain the preset characters corresponding to the sample image. The preset characters are the characters in the sample image as recognized by the text recognition model, and may not be an accurate recognition result.
For example, referring to fig. 3, which is a schematic structural diagram of a text recognition model provided in an embodiment of the present disclosure, the constructed text recognition model 300 includes a feature mapping layer 310, a feature extraction layer 320, an adaptive clustering layer 330, and an output layer 340. The feature mapping layer 310 includes a plurality of sub-feature mapping layers connected in sequence. The feature extraction layer 320 may specifically be a network layer built on a feature point extraction algorithm, which may specifically be Scale-Invariant Feature Transform (SIFT). The adaptive clustering layer 330 includes an encoder, a first convolution layer, and a second convolution layer, and the output layer 340 may be a transcription layer based on the attention mechanism. The data flow inside the text recognition model is as follows: the feature mapping layer 310 performs feature mapping on the sample image to obtain mapping information corresponding to the sample image; the feature extraction layer 320 performs feature extraction on the sample image to obtain feature information corresponding to the sample image; the adaptive clustering layer 330 obtains a feature matrix based on the mapping information output by the feature mapping layer 310 and the feature information output by the feature extraction layer 320; and finally the output layer 340 computes, from the feature matrix, the similarity probability between each character position in the feature matrix and the characters in the established dictionary, yielding a probability matrix corresponding to the sample image. After the model outputs the probability matrix, a final recognition result is determined from it, specifically with a greedy algorithm or a beam search algorithm; the final recognition result is the preset characters, and the character corresponding to the maximum probability value in the probability matrix can be determined as the preset character.
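The data flow just described can be summarized with a minimal PyTorch skeleton. All class and argument names are illustrative, the four layers are treated as opaque modules, and the shapes in the comments follow the sizes given later in this disclosure; this is a sketch of the structure in fig. 3, not its definitive implementation.

```python
# Hedged skeleton of the four-layer structure in fig. 3; layer internals and
# all names here are assumptions, not taken from the patent text.
import torch.nn as nn

class TextRecognitionModel(nn.Module):
    def __init__(self, feature_mapping, feature_extraction,
                 adaptive_clustering, output_layer):
        super().__init__()
        self.feature_mapping = feature_mapping        # sub-feature mapping blocks
        self.feature_extraction = feature_extraction  # e.g. SIFT keypoints
        self.adaptive_clustering = adaptive_clustering
        self.output_layer = output_layer              # attention-based transcription

    def forward(self, image):
        mapping_info = self.feature_mapping(image)      # H x W x 512 map
        feature_info = self.feature_extraction(image)   # feature point coordinates
        feature_matrix = self.adaptive_clustering(mapping_info, feature_info)  # N x 512
        prob_matrix = self.output_layer(feature_matrix)  # N x |dictionary|
        return prob_matrix
```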
And S230, updating the network parameters of the text recognition model according to the preset characters, the characters in the sample image and the preset loss function.
It can be understood that, on the basis of S220, after the sample image is input into the text recognition model and the preset characters are obtained, the network parameters of the model are updated according to the preset characters, the characters labeled in the sample image, and a preset loss function. For example, if the character labeled in sample image 1 is "peng", the preset character output by the model might be "peng", "bird", or "punt"; if the output is "peng", the result recognized by the text recognition model is accurate. The preset loss function may be a multi-class cross-entropy loss function; the type of loss function is not limited here and may be determined according to actual use requirements.
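A hedged sketch of one parameter update with the multi-class cross-entropy loss mentioned above; the batch layout and the empty-character padding class are assumptions layered on the skeleton shown earlier.

```python
# Sketch of one training step with multi-class cross-entropy (assumptions:
# model outputs (B, N, num_classes) logits; targets are dictionary indices
# with lines joined by the '\n' class and padded with an empty-character class).
import torch.nn.functional as F

def train_step(model, optimizer, image, target_ids):
    """image: (B, C, 4H, 4W) batch; target_ids: (B, N) class indices."""
    optimizer.zero_grad()
    logits = model(image)                               # (B, N, num_classes)
    loss = F.cross_entropy(logits.permute(0, 2, 1), target_ids)
    loss.backward()
    optimizer.step()
    return loss.item()
```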
According to the training method provided by the embodiment of the present disclosure, a sample image is obtained and its characters are labeled, the sample image is input into the constructed text recognition model to obtain the corresponding preset characters, and the network parameters of the model are then updated according to the preset characters, the labeled characters, and the preset loss function, yielding a trained text recognition model.
On the basis of the foregoing embodiment, fig. 4 is a flowchart illustrating an image processing method according to an embodiment of the present disclosure, taking an example that a terminal recognizes characters in a target image according to a text recognition model, specifically including steps S410 to S430 shown in fig. 4:
s410, obtaining a target image, wherein the target image comprises a plurality of lines of texts, and each line of text in the plurality of lines of texts comprises at least one character.
It can be understood that, after obtaining the text recognition model, the terminal obtains a target image containing multiple lines of text, which may include straight, oblique, and curved lines. Each line of text includes at least one character; that is, only lines containing characters count as text lines.
S420, recognizing characters in the target image through a pre-trained text recognition model, wherein the text recognition model comprises a feature mapping layer, a feature extraction layer, a self-adaptive clustering layer and an output layer, the feature mapping layer is used for performing feature mapping on the target image to obtain mapping information corresponding to the target image, the feature extraction layer is used for performing feature extraction on the target image to obtain feature information corresponding to the target image, the self-adaptive clustering layer is used for obtaining a feature matrix based on the mapping information and the feature information, and the output layer is used for obtaining a probability matrix corresponding to the target image based on the feature matrix.
It can be understood that the obtained target image is input into the trained text recognition model, which recognizes the characters in all the lines of text directly: there is no need to detect the lines first and then recognize them one line at a time. In other words, the model provided by the present disclosure performs a single recognition step, whereas the conventional pipeline performs both detection and recognition. Detection introduces new errors, such as duplicate or missed detections, that then degrade the subsequent recognition; moreover, recognizing one line at a time discards the context between lines, which also hurts accuracy. By taking the whole multi-line image as input, the model avoids the detection step, effectively reduces recognition errors, makes full use of the context across lines, lowers the difficulty of recognizing curved text to some extent, and improves the accuracy of the recognition result.

The text recognition model comprises a feature mapping layer, a feature extraction layer, an adaptive clustering layer, and an output layer, and its internal flow is as follows. The feature mapping layer performs feature mapping on the target image to obtain mapping information, which represents the global feature information of the image: the information relevant to the characters is extracted and the background information is discarded, so that character recognition can rely on the extracted character features. For example, for a target image of size 4H × 4W (H for height, W for width, both greater than 1), the feature mapping layer outputs mapping information of size H × W × 512, where 512 is the number of channels.

The feature extraction layer performs feature extraction on the target image to obtain feature information. It may be a network layer corresponding to the SIFT feature extraction algorithm, and the feature information it produces is the coordinates of a series of feature points, specifically the most prominent feature points within a local range. A preset number of prominent points can be retained according to user requirements; for example, the preset number may be 1024, i.e., 1024 feature points are extracted from the image.

The adaptive clustering layer then obtains a feature matrix from the mapping information and the feature information. The feature matrix records, for each character, its category, its coordinate information, and its mapping information: the category is the target text line the character belongs to, the coordinate information is the character's coordinates within that line, and the mapping information is the character's feature mapping vector extracted by the feature mapping layer. The feature matrix may be of size N × 512, where N is the maximum possible number of characters in an image. N may be a fixed preset value, computed as the maximum number of lines times the maximum number of characters per line; for example, if an input target image has at most 20 lines with at most 10 characters each, then N = 20 × 10 = 200.

Finally, the output layer obtains from the feature matrix a probability matrix corresponding to the target image, which can be understood as a matrix of the similarity probabilities between each character in the image and every character in the dictionary. The output layer may specifically be a transcription layer based on the attention mechanism.
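For the feature extraction layer described above, one concrete reading is SIFT keypoint detection with the most prominent points retained. The sketch below uses OpenCV's SIFT; selecting the top points by detector response is an assumption about how "prominent" points are chosen.

```python
# Sketch of the feature extraction layer: SIFT keypoints, keeping a preset
# number of the most salient points. OpenCV is one possible backend; the
# top-K-by-response selection is an assumption.
import cv2
import numpy as np

def extract_keypoints(gray_image, num_points=1024):
    sift = cv2.SIFT_create()
    keypoints = sift.detect(gray_image, None)
    # Keep the most prominent points by detector response.
    keypoints = sorted(keypoints, key=lambda k: k.response, reverse=True)[:num_points]
    return np.array([k.pt for k in keypoints], dtype=np.float32)  # (K, 2) coords
```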
And S430, obtaining the recognition result of the characters in the target image according to the probability matrix corresponding to the target image.
Understandably, after the probability matrix corresponding to the target image is determined in S420, the final recognition result is obtained from the probability matrix with a greedy algorithm or a beam search algorithm, i.e., all characters included in the target image are output.
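A minimal sketch of the greedy decoding option: take the highest-probability dictionary character at each of the N positions of the probability matrix. The `index_to_char` mapping and the empty-character convention are assumptions consistent with the training description above.

```python
# Greedy decoding of the probability matrix (a sketch; beam search is the
# other option the text names). index_to_char inverts the dictionary.
import numpy as np

def greedy_decode(prob_matrix, index_to_char):
    """prob_matrix: (N, num_classes); returns the recognized string, with
    '\\n' separating lines and empty-character padding stripped."""
    ids = prob_matrix.argmax(axis=-1)
    chars = [index_to_char[int(i)] for i in ids]
    return "".join(c for c in chars if c != "")  # drop empty-character padding
```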
The image processing method provided by the embodiment of the present disclosure takes a target image containing multiple lines of text directly as the input of the text recognition model, without a detection step: the multi-line text is not split into single lines but recognized directly. This effectively reduces recognition errors, makes full use of the context of the multi-line text, lowers the difficulty of recognizing curved text to some extent, suits complex application scenarios, and improves the accuracy of the recognition result; characters in the image are recognized accurately, with a fast recognition speed and high accuracy.
Optionally, the feature mapping layer includes a plurality of sub-feature mapping layers, and the plurality of sub-feature mapping layers are sequentially connected.
For example, referring to fig. 3, the feature mapping layer 310 includes 4 sub-feature mapping layers: a first, a second, a third, and a fourth sub-feature mapping layer. The feature mapping layer 310 may be a residual network layer, specifically a ResNet-18 network layer, with each sub-feature mapping layer corresponding to a Block of ResNet-18. Each Block is composed of several convolutional layers and performs a convolution operation; the output of each Block is the input of the next, i.e., the Blocks are connected in sequence, and the stride of the convolution corresponding to each Block is set within the feature mapping layer 310.
Optionally, performing feature mapping on the target image through the feature mapping layer to obtain mapping information corresponding to the target image, where the obtaining includes: performing feature mapping on the target image through a first sub-feature mapping layer to obtain first mapping information, wherein the size of the first mapping information is the same as that of the target image; performing feature mapping on the first mapping information through a second sub-feature mapping layer to obtain second mapping information, wherein the size of the second mapping information is half of that of the first mapping information; performing feature mapping on the second mapping information through a third sub-feature mapping layer to obtain third mapping information, wherein the size of the third mapping information is the same as that of the second mapping information; and performing feature mapping on the third mapping information through a fourth sub-feature mapping layer to obtain fourth mapping information, wherein the size of the fourth mapping information is half of that of the third mapping information, and the fourth mapping information is mapping information corresponding to the target image.
Understandably, the feature mapping layer obtains the mapping information corresponding to the target image as follows. The first sub-feature mapping layer performs feature mapping on the target image to obtain the first mapping information (mapping information can also be understood as a mapping feature map), whose size equals that of the target image: for a target image of size 4H × 4W, with the convolution stride in the first sub-feature mapping layer set to 1, the first mapping information has size 4H × 4W × 128, i.e., a feature map of size 4H × 4W with 128 channels. The second sub-feature mapping layer maps the first mapping information to the second mapping information at half the size: with stride 2, the second mapping information has size 2H × 2W × 256, i.e., width and height are halved and the number of channels is doubled. The third sub-feature mapping layer maps the second mapping information to the third mapping information at the same size: with stride 1, the third mapping information is also 2H × 2W × 256, i.e., height, width, and channel count are unchanged. The fourth sub-feature mapping layer maps the third mapping information to the fourth mapping information at half the size: with stride 2, the fourth mapping information has size H × W × 512. The fourth mapping information is the mapping information corresponding to the target image output by the feature mapping layer, and its height and width are 1/4 of those of the target image. Because the constructed feature mapping layer comprises multiple levels, it can extract deep features of the target image: the fourth mapping information retains the character information of the target image to the largest extent while removing the background information, so the layer can be applied to scenes with complex backgrounds, characters can be recognized from their feature information, and recognition accuracy is improved.
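The stride and channel settings just described can be illustrated with plain convolution blocks. A real ResNet-18 Block contains residual connections, so the following is only a shape-level sketch of the four sub-feature mapping layers, not the patent's network.

```python
# Illustrative stand-in for the four sub-feature mapping layers and their
# strides/channels (1->128, 2->256, 1->256, 2->512); a sketch, not ResNet-18.
import torch
import torch.nn as nn

def block(in_ch, out_ch, stride):
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
    )

feature_mapping = nn.Sequential(
    block(3, 128, stride=1),    # 4H x 4W x 128 (size unchanged)
    block(128, 256, stride=2),  # 2H x 2W x 256 (halved)
    block(256, 256, stride=1),  # 2H x 2W x 256 (unchanged)
    block(256, 512, stride=2),  # H  x W  x 512 (halved again)
)

x = torch.randn(1, 3, 64, 64)    # a 4H x 4W input with H = W = 16
print(feature_mapping(x).shape)  # torch.Size([1, 512, 16, 16])
```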
On the basis of the foregoing embodiment, fig. 5 is a schematic flow chart of an image processing method provided in the embodiment of the present disclosure, and optionally, the obtaining of the feature matrix by using the adaptive clustering layer based on the mapping information and the feature information specifically includes the following steps S510 to S540 shown in fig. 5:
optionally, the adaptive clustering layer includes an encoder, a first convolutional layer, and a second convolutional layer.
For example, referring to fig. 3, the adaptive clustering layer 330 includes the encoder, the first convolution layer, and the second convolution layer. The encoder may specifically be the encoder of a Transformer structure, but without the Transformer's positional encoder: positional encoding is not adopted because the feature information fed into the encoder already contains explicit feature point coordinates. The encoder thus performs one feature transformation and outputs a vector; for example, the input may be 1024 × 512 and the output 1024 × 512, where the 1024 × 512 input is composed from the coordinates of the 1024 feature points in the feature information and the H × W × 512 mapping information (the fourth mapping information).
And S510, obtaining a first vector through an encoder based on the mapping information and the scaled characteristic information.
It can be understood that, after the adaptive clustering layer receives the mapping information from the feature mapping layer and the feature information from the feature extraction layer, the encoder obtains a first vector based on the mapping information and the scaled feature information. The scaled feature information is obtained by scaling the feature point coordinates down to 1/4: since the height and width of the mapping information are 1/4 of those of the target image, the feature information obtained from the target image must be scaled down correspondingly to keep the data consistent and avoid errors. The size of the first vector finally output may be 1024 × 512.
S520, performing feature transformation on the first vector through the first convolution layer to obtain a second vector, wherein the second vector comprises a plurality of feature points.
It can be understood that, on the basis of S510, after the encoder produces the first vector, the first convolution layer performs a feature transformation to obtain the second vector. The first convolution layer may be a 1 × 1 convolution layer that compresses the number of channels: the 512 channels of the first vector are compressed to 10, so the size of the second vector may be 1024 × 10.
S530, classifying each feature point in the plurality of feature points in the second vector through the second convolution layer, and determining the category of the feature point, wherein the category of the feature point is a target line of the feature point in the multi-line text.
It can be understood that, after the second vector is obtained in S520, the second convolution layer, which may also be a 1 × 1 convolution layer, classifies each of the feature points in the second vector to obtain the category of each point. One feature point may correspond to one character in the target image, or several feature points may correspond to one character, so classifying the feature points amounts to classifying the characters. The category refers to the line number of the text: the category of a feature point is the target line, among the multiple lines of text, to which the character corresponding to that point belongs. If the maximum number of lines in an image is set to 20, then there are 20 categories. For example, if the target image includes 10 lines and character a is on the second line, the second convolution layer assigns the feature points corresponding to character a the category "second line", and so on, until the text line of every character is determined. The number of text lines in the image can also be read off the categories: if feature points of 5 categories are identified, the image can be determined to include 5 lines of text.
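A sketch of the two 1 × 1 convolutions described in S520 and S530, treating the 1024 × 512 encoder output as a (batch, channels, points) tensor; that layout, and mapping the 10 compressed channels directly to the 20 line categories, are assumptions about details the text leaves open.

```python
# Sketch of the adaptive clustering layer's convolutions: the first 1x1 conv
# compresses 512 channels to 10, the second scores each feature point against
# the 20 line categories (data layout is an assumption).
import torch
import torch.nn as nn

first_conv = nn.Conv1d(512, 10, kernel_size=1)   # channel compression
second_conv = nn.Conv1d(10, 20, kernel_size=1)   # one score per line category

encoded = torch.randn(1, 512, 1024)              # encoder output, 1024 points
line_logits = second_conv(first_conv(encoded))   # (1, 20, 1024)
line_of_point = line_logits.argmax(dim=1)        # predicted line index per point
```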
And S540, obtaining a feature matrix according to the category of the feature points, the mapping information and the scaled feature information.
Understandably, on the basis of S530, an N × 512 feature matrix is formed from the category of each feature point produced by the second convolution layer, the mapping information output by the feature mapping layer, and the scaled feature information.
Optionally, step S540 specifically includes: dividing target feature points with the same category among the plurality of feature points into a group, with the groups separated by a preset character; determining the position of each target feature point in each group within the scaled feature information, and determining the vector of the target feature point in the mapping information according to its position; and forming a feature matrix from the positions and vectors of the target feature points.
Optionally, determining the vector of the target feature point in the mapping information according to its position specifically includes: sorting the target feature points in each group by position; and determining the vector of each target feature point in the mapping information according to its sorted position.
It can be understood that feature points with the same category are divided into one group, i.e., all characters in the same text line form a group, and groups are separated by a preset character, which may be "\n". With the maximum number of text lines set to 20, there are 20 groups; if the sample image includes only 5 text lines, only 5 categories of feature points occur, and the remaining 15 groups are automatically filled with empty characters. Next, the position information and feature mapping vector of each feature point in each group are determined: within a group, the positions of all feature points are read from the scaled feature information, the points are sorted by position within the group, and then, for each target feature point (the current point) in order, the corresponding feature mapping vector is looked up in the mapping information. With the maximum number of characters per line set to 10, a group (one line) with fewer than 10 characters is automatically padded with empty characters up to 10. Once the feature mapping vectors of all feature points in all groups have been determined, the N × 512 feature matrix is formed from the categories, position information, and feature mapping vectors of all the points.
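The grouping, sorting, and padding steps above can be sketched as follows. Sorting points left-to-right by their x coordinate and padding with zero vectors for empty characters are assumptions; the text only states that points are sorted by position and empty characters are filled in.

```python
# Sketch of assembling the N x 512 feature matrix: group points by predicted
# line, sort each group by position, look up each point's mapping vector, and
# pad to max_lines x max_chars (zero-vector padding is an assumption).
import numpy as np

def build_feature_matrix(points, lines, mapping, max_lines=20, max_chars=10):
    """points: (K, 2) coords in the scaled H x W frame; lines: (K,) predicted
    line index per point; mapping: (H, W, 512) feature map."""
    out = np.zeros((max_lines * max_chars, 512), dtype=np.float32)
    for line in range(max_lines):
        group = [p for p, l in zip(points, lines) if l == line]
        group.sort(key=lambda p: p[0])  # left-to-right order (assumed sort key)
        for i, (x, y) in enumerate(group[:max_chars]):
            out[line * max_chars + i] = mapping[int(y), int(x)]  # mapping vector
    return out  # (N, 512) with N = 200 under the default settings
```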
The image processing method provided by the embodiment of the present disclosure directly identifies the information of the feature points corresponding to all characters in the multi-line text: after the encoder determines the first vector, the first convolution layer performs a further feature transformation that reduces the number of channels, the second convolution layer determines the category of each feature point, i.e., the text line of the character corresponding to each point, and the feature mapping vector of each point is then determined, yielding the feature matrix. Recognizing the multi-line text directly thus only involves determining the categories of the feature points, without a separate detection step.
Fig. 6 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present disclosure. The image processing apparatus provided by the embodiment of the present disclosure may execute the processing flow provided by the embodiment of the image processing method, as shown in fig. 6, the image processing apparatus 600 includes:
an obtaining unit 610, configured to obtain a target image, where the target image includes multiple lines of text, and each line of text in the multiple lines of text includes at least one character;
the first recognition unit 620 is configured to recognize characters in a target image through a pre-trained text recognition model, where the text recognition model includes a feature mapping layer, a feature extraction layer, an adaptive clustering layer, and an output layer, where the feature mapping layer performs feature mapping on the target image to obtain mapping information corresponding to the target image, the feature extraction layer performs feature extraction on the target image to obtain feature information corresponding to the target image, the adaptive clustering layer obtains a feature matrix based on the mapping information and the feature information, and the output layer obtains a probability matrix corresponding to the target image based on the feature matrix;
the second identifying unit 630 is configured to obtain an identification result of the character in the target image according to the probability matrix corresponding to the target image.
Optionally, the feature mapping layer in the first recognition unit 620 includes a plurality of sub-feature mapping layers, and the plurality of sub-feature mapping layers are sequentially connected.
Optionally, the first recognition unit 620 performs feature mapping on the target image through the feature mapping layer to obtain mapping information corresponding to the target image, and is specifically configured to:
performing feature mapping on the target image through a first sub-feature mapping layer to obtain first mapping information, wherein the size of the first mapping information is the same as that of the target image;
performing feature mapping on the first mapping information through a second sub-feature mapping layer to obtain second mapping information, wherein the size of the second mapping information is half of that of the first mapping information;
performing feature mapping on the second mapping information through a third sub-feature mapping layer to obtain third mapping information, wherein the size of the third mapping information is the same as that of the second mapping information;
and performing feature mapping on the third mapping information through a fourth sub-feature mapping layer to obtain fourth mapping information, wherein the size of the fourth mapping information is half of that of the third mapping information, and the fourth mapping information is mapping information corresponding to the target image.
Optionally, the adaptive clustering layer in the first recognition unit 620 includes an encoder, a first convolution layer, and a second convolution layer.
Optionally, the first recognition unit 620 obtains a feature matrix based on the mapping information and the feature information through the adaptive clustering layer, and is specifically configured to:
obtaining, by an encoder, a first vector based on the mapping information and the scaled feature information;
performing feature transformation on the first vector through the first convolution layer to obtain a second vector, wherein the second vector comprises a plurality of feature points;
classifying each feature point among the plurality of feature points in the second vector through the second convolution layer, and determining the category of each feature point, wherein the category of a feature point is the target line of that point in the multi-line text;
and obtaining a feature matrix according to the category of the feature points, the mapping information, and the scaled feature information.
Optionally, the first recognition unit 620 obtains a feature matrix according to the category of the feature points, the mapping information, and the scaled feature information, and is specifically configured to:
dividing target feature points with the same category among the plurality of feature points into a group, with the groups separated by a preset character;
determining the position of each target feature point in each group within the scaled feature information, and determining the vector of the target feature point in the mapping information according to its position;
and forming a feature matrix from the positions and vectors of the target feature points.
Optionally, the first recognition unit 620 determines the vector of the target feature point in the mapping information according to the position of the target feature point, and is specifically configured to:
sorting the target feature points in each group by position;
and determining the vector of each target feature point in the mapping information according to its sorted position.
Optionally, the first recognition unit 620 obtains a probability matrix corresponding to the target image based on the feature matrix through the output layer, and is specifically configured to:
and obtaining a probability matrix corresponding to the target image through the output layer based on the feature matrix and a preset dictionary, wherein the preset dictionary comprises a plurality of characters.
Optionally, the apparatus 600 further includes a training unit, and the training unit is specifically configured to:
acquiring a sample image and characters in the sample image;
inputting the sample image into the constructed text recognition model to obtain a preset character corresponding to the sample image;
and updating the network parameters of the text recognition model according to the preset characters, the characters in the sample image and the preset loss function.
Optionally, the training unit obtains the sample image and the characters in the sample image, and is specifically configured to:
acquiring a sample image, wherein the sample image comprises a plurality of lines of texts;
and labeling the characters included in each line of text, thereby determining the characters in the sample image, wherein the characters of different lines belonging to the same sample image are separated by a preset character.
The image processing apparatus in the embodiment shown in fig. 6 can be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, and are not described herein again.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a method according to an embodiment of the present disclosure.
The disclosed exemplary embodiments also provide a non-transitory computer readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is for causing the computer to perform a method according to an embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when being executed by a processor of a computer, is adapted to cause the computer to carry out the method according to the embodiments of the present disclosure.
Referring to fig. 7, a block diagram of an electronic device 700, which may be a server or a client of the present disclosure and is an example of a hardware device applicable to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant as examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 702 or a computer program loaded from a storage unit 708 into a random access memory (RAM) 703. The RAM 703 may also store various programs and data required for the operation of the electronic device 700. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700; it may receive input numeric or character information and generate key signal inputs related to user settings and/or function control of the electronic device. The output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 708 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth(TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
The computing unit 701 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above. For example, in some embodiments, the image processing method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. In some embodiments, the computing unit 701 may be configured to perform the image processing method by any other suitable means (e.g., by means of firmware).
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (11)

1. An image processing method, comprising:
acquiring a target image, wherein the target image comprises a plurality of lines of text, and each line of text in the plurality of lines of text comprises at least one character;
recognizing characters in the target image through a pre-trained text recognition model, wherein the text recognition model comprises a feature mapping layer, a feature extraction layer, an adaptive clustering layer, and an output layer, the feature mapping layer is used for performing feature mapping on the target image to obtain mapping information corresponding to the target image, the feature extraction layer is used for performing feature extraction on the target image to obtain feature information corresponding to the target image, the adaptive clustering layer is used for obtaining a feature matrix based on the mapping information and the feature information, and the output layer is used for obtaining a probability matrix corresponding to the target image based on the feature matrix;
and obtaining the recognition result of the characters in the target image according to the probability matrix corresponding to the target image.
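For orientation, a high-level skeleton of the pipeline recited in claim 1 is sketched below; each sub-module stands in for the corresponding layer, and all names are assumptions rather than the claimed implementation.

import torch.nn as nn

class TextRecognitionModel(nn.Module):
    def __init__(self, feature_mapping, feature_extraction,
                 adaptive_clustering, output_layer):
        super().__init__()
        self.feature_mapping = feature_mapping        # image -> mapping information
        self.feature_extraction = feature_extraction  # image -> feature information
        self.adaptive_clustering = adaptive_clustering
        self.output_layer = output_layer

    def forward(self, image):
        mapping_info = self.feature_mapping(image)
        feature_info = self.feature_extraction(image)
        feature_matrix = self.adaptive_clustering(mapping_info, feature_info)
        return self.output_layer(feature_matrix)      # probability matrix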
2. The method of claim 1, wherein the feature mapping layer comprises a plurality of sub-feature mapping layers, and the plurality of sub-feature mapping layers are sequentially connected;
the performing, by the feature mapping layer, feature mapping on the target image to obtain mapping information corresponding to the target image includes:
performing feature mapping on the target image through a first sub-feature mapping layer to obtain first mapping information, wherein the size of the first mapping information is the same as that of the target image;
performing feature mapping on the first mapping information through a second sub-feature mapping layer to obtain second mapping information, wherein the size of the second mapping information is half of that of the first mapping information;
performing feature mapping on the second mapping information through a third sub-feature mapping layer to obtain third mapping information, wherein the size of the third mapping information is the same as that of the second mapping information;
and performing feature mapping on the third mapping information through a fourth sub-feature mapping layer to obtain fourth mapping information, wherein the size of the fourth mapping information is half of that of the third mapping information, and the fourth mapping information is mapping information corresponding to the target image.
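One way to realize the four sequentially connected sub-feature mapping layers of claim 2 is sketched below, assuming each sub-layer is a 3x3 convolution and that "half the size" means halving the spatial resolution with stride 2; the channel counts are arbitrary assumptions.

import torch.nn as nn

class FeatureMappingLayer(nn.Module):
    def __init__(self, in_ch=3, ch=64):
        super().__init__()
        self.sub1 = nn.Conv2d(in_ch, ch, 3, stride=1, padding=1)  # first: same size as the target image
        self.sub2 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)     # second: half the size
        self.sub3 = nn.Conv2d(ch, ch, 3, stride=1, padding=1)     # third: same size
        self.sub4 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)     # fourth: half the size again

    def forward(self, x):
        # The fourth mapping information serves as the mapping information of the target image.
        return self.sub4(self.sub3(self.sub2(self.sub1(x))))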
3. The method of claim 1, wherein the adaptive clustering layer comprises an encoder, a first convolutional layer, and a second convolutional layer;
obtaining a feature matrix based on the mapping information and the feature information by the adaptive clustering layer, including:
obtaining, by the encoder, a first vector based on the mapping information and the scaled feature information;
performing feature transformation on the first vector through the first convolution layer to obtain a second vector, wherein the second vector comprises a plurality of feature points;
classifying each feature point of the plurality of feature points in the second vector through the second convolution layer, and determining a category of the feature point, wherein the category of the feature point is the target line, among the plurality of lines of text, to which the feature point belongs;
and obtaining a feature matrix according to the category of the feature points, the mapping information and the scaled feature information.
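A minimal sketch of one plausible adaptive clustering layer follows, assuming the encoder fuses the mapping information with the scaled feature information (the two inputs are assumed to share spatial size), the first convolution transforms features, and the second convolution classifies each feature point into one of an assumed maximum number of text lines; all names and layer choices are illustrative.

import torch
import torch.nn as nn

class AdaptiveClusteringLayer(nn.Module):
    def __init__(self, ch=64, max_lines=16):
        super().__init__()
        self.encoder = nn.Conv2d(2 * ch, ch, 3, padding=1)  # fuse the two inputs -> first vector
        self.conv1 = nn.Conv2d(ch, ch, 1)                   # feature transform -> second vector
        self.conv2 = nn.Conv2d(ch, max_lines, 1)            # per-feature-point line classification

    def forward(self, mapping_info, scaled_feature_info):
        first = self.encoder(torch.cat([mapping_info, scaled_feature_info], dim=1))
        second = self.conv1(first)
        categories = self.conv2(second).argmax(dim=1)       # target line of each feature point
        return second, categories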
4. The method according to claim 3, wherein obtaining a feature matrix according to the category of the feature point, the mapping information, and the scaled feature information comprises:
dividing target feature points having the same category among the plurality of feature points into groups, wherein the groups are separated from one another by a preset character;
determining the position of each target feature point in each group in the scaled feature information, and determining a vector of the target feature point in the mapping information according to the position of the target feature point;
and forming a feature matrix according to the positions of the target feature points and the vectors of the target feature points.
5. The method according to claim 4, wherein the determining the vector of the target feature point in the mapping information according to the position of the target feature point comprises:
sorting the target feature points in each group according to their positions;
and determining the vector of each sorted target feature point in the mapping information according to its position.
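Claims 4 and 5 can be pictured with the following hypothetical grouping step: feature points sharing a category are gathered, sorted by position within their group, and stacked into the feature matrix. Dropping the batch dimension for clarity, and assuming sorting by horizontal coordinate, a sketch is:

import torch

def build_feature_matrix(second, categories):
    # second: (C, H, W) transformed features; categories: (H, W) line index per feature point
    groups = []
    for line_id in categories.unique(sorted=True):
        ys, xs = torch.nonzero(categories == line_id, as_tuple=True)
        order = torch.argsort(xs)                         # sort by position within the group (claim 5)
        groups.append(second[:, ys[order], xs[order]].T)  # (num_points, C) for this line
    return torch.cat(groups, dim=0)                       # stacked feature matrix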
6. The method according to claim 1, wherein obtaining, by the output layer, a probability matrix corresponding to the target image based on the feature matrix comprises:
and obtaining a probability matrix corresponding to the target image through the output layer based on the feature matrix and a preset dictionary, wherein the preset dictionary comprises a plurality of characters.
7. The method of claim 1, further comprising:
acquiring a sample image and characters in the sample image;
inputting the sample image into a constructed text recognition model to obtain predicted characters corresponding to the sample image;
and updating the network parameters of the text recognition model according to the predicted characters, the characters in the sample image, and a preset loss function.
8. The method of claim 7, wherein the obtaining the sample image and the characters in the sample image comprises:
acquiring a sample image, wherein the sample image comprises a plurality of lines of text;
and labeling the characters included in each of the plurality of lines of text to determine the characters included in each line of text in the sample image, wherein the characters of different lines of text belonging to one sample image are separated by a preset character.
9. An image processing apparatus, comprising:
the device comprises an acquisition unit, a processing unit and a display unit, wherein the acquisition unit is used for acquiring a target image, the target image comprises a plurality of lines of texts, and each line of text in the plurality of lines of texts comprises at least one character;
the system comprises a first identification unit, a second identification unit and a third identification unit, wherein the first identification unit is used for identifying characters in a target image through a pre-trained text identification model, the text identification model comprises a feature mapping layer, a feature extraction layer, a self-adaptive clustering layer and an output layer, the feature mapping layer is used for performing feature mapping on the target image to obtain mapping information corresponding to the target image, the feature extraction layer is used for performing feature extraction on the target image to obtain feature information corresponding to the target image, the self-adaptive clustering layer is used for obtaining a feature matrix based on the mapping information and the feature information, and the output layer is used for obtaining a probability matrix corresponding to the target image based on the feature matrix;
and the second identification unit is used for obtaining the identification result of the characters in the target image according to the probability matrix corresponding to the target image.
10. An electronic device, comprising:
a processor; and
a memory for storing a program, wherein the program is stored in the memory,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the method according to any one of claims 1-8.
11. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-8.
CN202210376081.XA 2022-04-11 2022-04-11 Image processing method and device, electronic equipment and storage medium Pending CN114708581A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210376081.XA CN114708581A (en) 2022-04-11 2022-04-11 Image processing method and device, electronic equipment and storage medium


Publications (1)

Publication Number Publication Date
CN114708581A (en) 2022-07-05

Family

ID=82173802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210376081.XA Pending CN114708581A (en) 2022-04-11 2022-04-11 Image processing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114708581A (en)

Similar Documents

Publication Publication Date Title
WO2020221298A1 (en) Text detection model training method and apparatus, text region determination method and apparatus, and text content determination method and apparatus
CN113254654B (en) Model training method, text recognition method, device, equipment and medium
US11810319B2 (en) Image detection method, device, storage medium and computer program product
CN113205160B (en) Model training method, text recognition method, model training device, text recognition device, electronic equipment and medium
CN110942004A (en) Handwriting recognition method and device based on neural network model and electronic equipment
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
CN114022882B (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN114639096A (en) Text recognition method and device, electronic equipment and storage medium
CN114973229A (en) Text recognition model training method, text recognition device, text recognition equipment and medium
CN112364821A (en) Self-recognition method and device for power mode data of relay protection device
CN113723367B (en) Answer determining method, question judging method and device and electronic equipment
US11610396B2 (en) Logo picture processing method, apparatus, device and medium
CN115273103A (en) Text recognition method and device, electronic equipment and storage medium
CN115100659A (en) Text recognition method and device, electronic equipment and storage medium
CN113792133B (en) Question judging method and device, electronic equipment and medium
CN114758331A (en) Text recognition method and device, electronic equipment and storage medium
CN115273057A (en) Text recognition method and device, dictation correction method and device and electronic equipment
CN114708581A (en) Image processing method and device, electronic equipment and storage medium
CN114758330A (en) Text recognition method and device, electronic equipment and storage medium
CN114708580A (en) Text recognition method, model training method, device, apparatus, storage medium, and program
CN114663886A (en) Text recognition method, model training method and device
CN108021918B (en) Character recognition method and device
CN114140802B (en) Text recognition method and device, electronic equipment and storage medium
CN112418217A (en) Method, apparatus, device and medium for recognizing characters
CN116798048A (en) Text recognition method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination