CN116563840B - Scene text detection and recognition method based on weakly supervised cross-modal contrastive learning - Google Patents
Scene text detection and recognition method based on weakly supervised cross-modal contrastive learning
- Publication number
- CN116563840B (application number CN202310828211.3A)
- Authority
- CN
- China
- Prior art keywords
- text
- image
- feature map
- sample
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G06V20/63—Scene text, e.g. street names
- G06N3/0455—Auto-encoder networks; Encoder-decoder networks
- G06N3/0464—Convolutional networks [CNN, ConvNet]
- G06N3/048—Activation functions
- G06N3/0895—Weakly supervised learning, e.g. semi-supervised or self-supervised learning
- G06V10/82—Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V20/70—Labelling scene content, e.g. deriving syntactic or semantic representations
- G06V30/1444—Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
- G06V30/18—Extraction of features or characteristics of the image
- G06V30/1918—Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
Abstract
The invention provides a scene text detection and recognition method based on weakly supervised cross-modal contrastive learning, and relates to the technical field of image processing. The method comprises the following steps: inputting an image to be recognized into a first image encoder in a text recognition model to obtain a first image feature map; obtaining a probability feature map, a character position feature map and a character semantic feature map based on the first image feature map; and determining a text recognition result for the image to be recognized based on these three feature maps. The text recognition model is trained on multiple groups of first training data, each group comprising a sample image to be recognized together with its corresponding first text content label and text region position label. The text region position labels are generated by a label generation model, which is itself trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label in that image. The invention reduces labeling cost.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a scene text detection and recognition method based on weakly supervised cross-modal contrastive learning.
Background
The purpose of scene text detection and recognition is to detect and recognize the text instances contained in scene pictures. The diversity of text size, shape, arrangement direction and scene makes locating text difficult, and the sequential structure of characters and the diversity of fonts within a text instance make recognizing text challenging.
For scene text detection and recognition tasks, prior-art model training relies not only on character-level annotation of the text instances in a dataset but also on a text box annotation for each text instance as supervision information; that is, both the text content and the text positions in the sample images must be manually annotated. The latter makes the dataset labeling required by conventional scene text detection and recognition methods expensive.
Disclosure of Invention
The invention provides a scene text detection and recognition method based on weakly supervised cross-modal contrastive learning, which addresses the prior-art defect of high dataset labeling cost for scene text detection and recognition tasks and reduces that labeling cost.
The scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention comprises the following steps:
acquiring an image to be recognized, inputting the image to be recognized into a first image encoder in a trained text recognition model, and acquiring a first image feature map output by the first image encoder;
inputting the first image feature map separately to an anchor estimator, a sampling module and a recognition module in the text recognition model, and acquiring a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the recognition module, wherein the value of a pixel point in the probability feature map reflects the probability that the position of the pixel point is a text region, the feature vector corresponding to a pixel point in the character position feature map reflects the pixel distance between the pixel point and each character included in the text region where the pixel point is located, and the feature vector corresponding to a pixel point in the character semantic feature map reflects the probability that the position of the pixel point is each preset character, each text region comprising a text word and each text word consisting of at least one character;
determining the character positions in each text region based on the probability feature map and the character position feature map, and determining the characters in each text region in the character semantic feature map based on those character positions, so as to determine a text recognition result for the image to be recognized, wherein the text recognition result comprises the position of each text region in the image to be recognized and the text content in that text region, each text region containing one text word;
the text recognition model is obtained by training on multiple groups of first training data, each group comprising a sample image to be recognized and the first text content label and text region position label corresponding to that sample image; the text region position labels are generated by a trained label generation model, which is trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label in that sample image.
According to the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention, the process of generating the text region position label with the label generation model comprises:
inputting the sample image to be recognized to a second image encoder in the label generation model to obtain a second image feature map output by the second image encoder, and inputting the first text content label corresponding to the sample image to the text encoder in the label generation model to obtain the text feature output by the text encoder;
generating an activation feature map based on the second image feature map and the text feature, wherein the pixel value at each pixel point of the activation feature map reflects the similarity between the feature vector corresponding to that pixel point in the second image feature map and the text feature;
and taking the positions of the pixel points whose pixel values in the activation feature map exceed a preset threshold as the text region position label.
According to the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention, the training process of the label generation model comprises:
inputting a first sample image to be annotated, from a plurality of sample images to be annotated, to the second image encoder, and inputting the second text content label of that image to the text encoder, to obtain the sample second image feature map corresponding to the first sample image output by the second image encoder and the sample text feature output by the text encoder;
generating a first sample activation feature map based on the sample second image feature map corresponding to the first sample image and the sample text feature;
point-multiplying the first sample activation feature map with the sample second image feature map corresponding to the first sample image to obtain a first weighted picture feature vector;
and acquiring a first training loss based on the similarity between the first weighted picture feature vector and the sample text feature, and updating the parameters of the label generation model based on the first training loss.
According to the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention, updating the parameters of the label generation model based on the first training loss comprises:
inputting a second sample image to be annotated, from the plurality of sample images to be annotated, to the second image encoder to obtain the sample second image feature map corresponding to the second sample image;
generating a second sample activation feature map based on the sample second image feature map corresponding to the second sample image and the sample text feature;
point-multiplying the second sample activation feature map with the sample second image feature map corresponding to the second sample image to obtain a second weighted picture feature vector;
acquiring a second training loss based on the similarity between the second weighted picture feature vector and the sample text feature;
and updating the parameters of the label generation model based on the first training loss and the second training loss.
According to the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention, determining the character positions in each text region based on the probability feature map and the character position feature map comprises:
determining at least one first target pixel point in the character position feature map, based on the positions of the pixel points in the probability feature map whose pixel values exceed a preset threshold;
and determining the character positions in each text region based on the feature vectors corresponding to the first target pixel points in the character position feature map.
According to the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention, the second image encoder comprises a plurality of convolution layers, and inputting the sample image to be recognized to the second image encoder in the label generation model to obtain the second image feature map output by the second image encoder comprises:
convolving the sample image with a first convolution layer to obtain a first feature map, convolving the first feature map with a second convolution layer to obtain a second feature map, convolving the second feature map with a third convolution layer to obtain a third feature map, and convolving the third feature map with a fourth convolution layer to obtain a fourth feature map, the sizes of the first, second, third and fourth feature maps decreasing in turn;
up-sampling the fourth feature map and the third feature map and connecting them with the second feature map to obtain a fifth feature map, and up-sampling the third feature map and the second feature map and connecting them with the first feature map to obtain a sixth feature map;
and taking the fifth feature map and the sixth feature map as inputs of a multi-scale deformable attention module, and obtaining the second image feature map output by the multi-scale deformable attention module.
According to the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention, the training process of the text recognition model comprises:
inputting the sample image to be recognized into the text recognition model, and acquiring the sample text recognition result output by the text recognition model and the sample probability feature map output by the anchor estimator in the text recognition model;
obtaining a third loss based on the sample text recognition result, the first text content label corresponding to the sample image and the text region position label;
obtaining a fourth loss based on the sample probability feature map and the activation feature map corresponding to the text region position label;
and updating the parameters of the text recognition model based on the third loss and the fourth loss (a sketch follows this list).
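The concrete loss forms are not specified above; the following is a minimal sketch under stated assumptions: cross-entropy for character supervision and an L1 penalty on offsets for the third loss, and binary cross-entropy between the sample probability map and the activation map for the fourth loss. All function and tensor names are hypothetical.
```python
import torch.nn.functional as F

def recognition_training_losses(char_logits, char_targets,
                                pred_offsets, target_offsets,
                                prob_map, activation_map):
    """Illustrative combination of the third and fourth losses.

    char_logits:    (N, num_chars + 1) predictions at sampled points
    char_targets:   (N,) character indices from the first text content label
    pred_offsets:   (N, 2K) sampling-module offsets at anchor points
    target_offsets: (N, 2K) offsets derived from the text region position label
    prob_map:       (B, 1, H, W) anchor estimator output, values in [0, 1]
    activation_map: (B, 1, H, W) activation map from the label generation model
    """
    # Third loss: supervision from the text content label and the
    # generated text region position label (forms assumed).
    third_loss = (F.cross_entropy(char_logits, char_targets)
                  + F.l1_loss(pred_offsets, target_offsets))
    # Fourth loss: align the sample probability feature map with the
    # activation feature map corresponding to the position label.
    fourth_loss = F.binary_cross_entropy(prob_map, activation_map)
    return third_loss + fourth_loss
```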
The invention also provides a scene text detection and recognition device based on weakly supervised cross-modal contrastive learning, which comprises:
the image encoding module, which is used for acquiring an image to be recognized, inputting the image to be recognized into a first image encoder in a trained text recognition model, and acquiring a first image feature map output by the first image encoder;
the feature processing module, which is used for inputting the first image feature map separately to an anchor estimator, a sampling module and a recognition module in the text recognition model, and acquiring a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the recognition module, wherein the value of a pixel point in the probability feature map reflects the probability that the position of the pixel point is a text region, the feature vector corresponding to a pixel point in the character position feature map reflects the pixel distance between the pixel point and each character included in the text region where the pixel point is located, and the feature vector corresponding to a pixel point in the character semantic feature map reflects the probability that the position of the pixel point is each preset character, each text region comprising a text word and each text word consisting of at least one character;
the text detection and recognition module, which is used for determining the character positions in each text region based on the probability feature map and the character position feature map, and determining the characters in each text region in the character semantic feature map based on those character positions, so as to determine a text recognition result for the image to be recognized, wherein the text recognition result comprises the position of each text region in the image to be recognized and the text content in that text region, each text region containing one text word;
the text recognition model is obtained by training on multiple groups of first training data, each group comprising a sample image to be recognized and the first text content label and text region position label corresponding to that sample image; the text region position labels are generated by a trained label generation model, which is trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label in that sample image.
The invention also provides an electronic device, comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, wherein the processor, when executing the program, implements the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning described above.
The invention also provides a non-transitory computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning described above.
According to the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention, the label generation model is trained from images and the text content labels of those images, and the trained label generation model then generates the text region position labels. The sample images, the text content labels and the generated text region position labels together supervise the training of the text recognition model, so manual annotation of text positions in the images is unnecessary. Within the text recognition model, the image feature map of the input image is processed in three ways, yielding a probability feature map reflecting text region positions, a character position feature map reflecting the character positions within text regions, and a character semantic feature map reflecting the characters themselves; the final text characters and text regions are determined step by step from these three feature maps, which ensures the accuracy of the text recognition result. The method provided by the invention requires only the content labels of text instances as supervision signals, with no manual labeling of text positions, so the dataset labeling cost is greatly reduced while the accuracy of scene text detection and recognition results is preserved.
Drawings
In order to more clearly illustrate the technical solutions of the invention or of the prior art, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below depict some embodiments of the invention; other drawings can be derived from them by a person skilled in the art without inventive effort.
FIG. 1 is a flow diagram of the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention;
FIG. 2 is a first schematic diagram of a conventional scene text detection and recognition method;
FIG. 3 is a second schematic diagram of a conventional scene text detection and recognition method;
FIG. 4 is a schematic diagram of the model framework of the method provided by the invention;
FIG. 5 is a schematic diagram of the workflow of the label generation model in the method provided by the invention;
FIG. 6 is a schematic diagram of the workflow of the first image encoder in the method provided by the invention;
FIG. 7 is a schematic diagram of the workflow of the text encoder in the label generation model in the method provided by the invention;
FIG. 8 is a schematic diagram of the process of generating weighted picture feature vectors in the method provided by the invention;
FIG. 9 is a schematic diagram of the workflow of the text recognition model in the method provided by the invention;
FIG. 10 is a first application diagram of the text recognition result output by the method provided by the invention;
FIG. 11 is a second application diagram of the text recognition result output by the method provided by the invention;
FIG. 12 is a schematic diagram showing the effect of the method provided by the invention;
FIG. 13 is a schematic structural diagram of the scene text detection and recognition device based on weakly supervised cross-modal contrastive learning provided by the invention;
FIG. 14 is a schematic structural diagram of the electronic device provided by the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The scene text detection and recognition method based on weakly supervised cross-modal contrastive learning provided by the invention is described below with reference to FIGS. 1-12. The method can be applied to electronic devices with computing capability, including but not limited to computers, mobile terminals and wearable intelligent devices.
In the prior art, scene text detection and recognition models need both the text content annotations and the text box annotations of a dataset. These models mainly comprise two-stage methods, as shown in FIG. 2, and end-to-end one-stage methods, as shown in FIG. 3; both require text content annotation and text box annotation of the dataset. Because scene text detection and recognition models need large datasets for training in practical applications, this brings high annotation cost.
Addressing the prior-art requirement that both the text content and the text boxes in the dataset be annotated when training a model for the scene text detection and recognition task, the invention provides a scene text detection and recognition method based on weakly supervised cross-modal contrastive learning, aiming to reduce the cost of annotating the training dataset for this task.
As shown in FIG. 1, the method provided by the invention comprises the following steps:
s110, acquiring an image to be recognized, inputting the image to be recognized into a first image encoder in a trained text recognition model, and acquiring a first image feature map output by the first image encoder;
s120, respectively inputting the first image feature map to an anchor estimator, a sampling module and an identification module in the text identification model, and obtaining a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the identification module, wherein the value of a pixel point in the probability feature map reflects the probability that the position of the pixel point is a text region, the feature vector corresponding to the pixel point in the character position feature map reflects the pixel distance between the pixel point and each character included in the text region where the pixel point is located, the feature vector corresponding to the pixel point in the character semantic feature map reflects the probability that the position of the pixel point is each preset character, each text region comprises a text word, and one text word consists of at least one character;
S130, determining character positions in the text areas based on the probability feature map and the character position feature map, determining characters in each text area in the character semantic feature map based on the character positions in the text areas to determine text recognition results in the images to be recognized, wherein the text recognition results comprise the positions of the text areas in the images to be recognized and text contents in the text areas, and each text area contains a text word.
The text recognition model is obtained by training on multiple groups of first training data, each group comprising a sample image to be recognized and the first text content label and text region position label corresponding to that sample image; the text region position labels are generated by a trained label generation model, which is trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label in that sample image.
According to the method provided by the invention, the label generation model is trained on data of two modalities: images and the text content labels in those images. The trained label generation model generates text region position labels, and the sample images, the text content labels and the generated position labels then supervise the training of the text recognition model, so manual annotation of text positions is unnecessary. Within the text recognition model, the image feature map of the input image undergoes three kinds of processing to obtain a probability feature map reflecting text region positions, a character position feature map reflecting character positions within text regions, and a character semantic feature map reflecting the characters; the final text characters and text regions are determined step by step from these three feature maps, ensuring the accuracy of the text recognition result. Only the content labels of text instances are needed as supervision signals, with no manual labeling of text positions, so dataset labeling cost is greatly reduced while the accuracy of scene text detection and recognition results is maintained.
As shown in FIG. 4, the method provided by the invention first trains the label generation model shown in the left part of FIG. 4, generates text region position labels with the trained label generation model, combines the text content labels corresponding to the sample images to be recognized with the generated text region position labels to train the text recognition model shown in the right part of FIG. 4, and finally executes the scene text detection and recognition task with the trained text recognition model. A sketch of this two-stage pipeline is given below.
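The following sketch summarizes the two-stage procedure; the patent prescribes no API, so all function and method names here are hypothetical, and the two training routines are injected as callables to keep the sketch self-contained.
```python
from typing import Any, Callable, List, Tuple

def train_two_stage(
    pairs: List[Tuple[Any, str]],
    train_label_model: Callable[[List[Tuple[Any, str]]], Any],
    train_recognition_model: Callable[[list], Any],
):
    """Two-stage weakly supervised training, following FIG. 4.

    pairs: (sample image, text content label) tuples; only text content
    is annotated, text box positions are never hand-labeled.
    """
    # Stage 1: cross-modal contrastive training of the label generation
    # model on (image, text content) pairs alone.
    label_model = train_label_model(pairs)

    # Stage 2: the trained label model produces text region position
    # labels; content labels plus generated position labels then
    # supervise the text recognition model.
    first_training_data = [
        (img, text, label_model.generate_position_label(img, text))
        for img, text in pairs
    ]
    return train_recognition_model(first_training_data)
```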
Specifically, the process of generating the text region position label with the label generation model comprises:
inputting the sample image to be recognized to a second image encoder in the label generation model to obtain a second image feature map output by the second image encoder, and inputting the first text content label corresponding to the sample image to the text encoder in the label generation model to obtain the text feature output by the text encoder;
generating an activation feature map based on the second image feature map and the text feature, wherein the pixel value at each pixel point of the activation feature map reflects the similarity between the feature vector corresponding to that pixel point in the second image feature map and the text feature;
and taking the positions of the pixel points whose pixel values in the activation feature map exceed a preset threshold as the text region position label.
For the sample images to be recognized that are used to train the text recognition model, the corresponding text region position labels are generated by the trained label generation model. As shown in FIG. 5, the label generation model comprises a second image encoder and a text encoder: the sample image to be recognized is input to the second image encoder for feature extraction to obtain the second image feature map, and the first text content label corresponding to the sample image is input to the text encoder for feature extraction to obtain the text feature.
Specifically, the second image encoder comprises a plurality of convolution layers, and inputting the sample image to be recognized to the second image encoder in the label generation model to obtain the second image feature map output by the second image encoder comprises:
convolving the sample image with a first convolution layer to obtain a first feature map, convolving the first feature map with a second convolution layer to obtain a second feature map, convolving the second feature map with a third convolution layer to obtain a third feature map, and convolving the third feature map with a fourth convolution layer to obtain a fourth feature map, the sizes of the first, second, third and fourth feature maps decreasing in turn;
up-sampling the fourth feature map and the third feature map and connecting them with the second feature map to obtain a fifth feature map, and up-sampling the third feature map and the second feature map and connecting them with the first feature map to obtain a sixth feature map;
and taking the fifth feature map and the sixth feature map as inputs of a multi-scale deformable attention module, and obtaining the second image feature map output by the multi-scale deformable attention module.
After the sample image input to the second image encoder is preprocessed, multi-scale features are extracted at different convolution depths by the successive convolution layers and then fused. In the example shown in FIG. 6, the input RGB sample image, after preprocessing, passes through convolution layers conv2_x, conv3_x, conv4_x and conv5_x to produce feature maps c2, c3, c4 and c5 of decreasing size. c5 and c4 are up-sampled and connected with c3 to form the feature f1 at 1/8 of the original size; c4 and c3 are up-sampled and connected with c2 to form the feature f2 at 1/4 of the original size. The feature maps of different sizes obtained by this multi-scale fusion allow the model to respond to text of different sizes in the image. f1 and f2 serve as the differently sized inputs of the multi-scale deformable self-attention module, which finally produces the image encoding feature at 1/8 of the original size with dimension 512. A sketch of this encoder follows.
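A minimal PyTorch sketch of the multi-scale extraction and fusion described above; the backbone is assumed to be ResNet-50-style with stages exposed as conv2_x..conv5_x (channel widths 256/512/1024/2048 are assumptions), and the deformable attention module is omitted.
```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondImageEncoder(nn.Module):
    """Multi-scale feature extraction and fusion, per FIG. 6 (a sketch)."""

    def __init__(self, backbone):
        super().__init__()
        self.backbone = backbone  # exposes conv2_x .. conv5_x stages
        # 1x1 projections after concatenation (512-dim output assumed)
        self.proj_f1 = nn.Conv2d(2048 + 1024 + 512, 512, kernel_size=1)
        self.proj_f2 = nn.Conv2d(1024 + 512 + 256, 512, kernel_size=1)

    def forward(self, x):
        c2 = self.backbone.conv2_x(x)   # 1/4 size
        c3 = self.backbone.conv3_x(c2)  # 1/8
        c4 = self.backbone.conv4_x(c3)  # 1/16
        c5 = self.backbone.conv5_x(c4)  # 1/32

        def up(t, ref):  # bilinear up-sampling to the reference size
            return F.interpolate(t, size=ref.shape[-2:], mode="bilinear",
                                 align_corners=False)

        f1 = self.proj_f1(torch.cat([up(c5, c3), up(c4, c3), c3], dim=1))
        f2 = self.proj_f2(torch.cat([up(c4, c2), up(c3, c2), c2], dim=1))
        # f1 (1/8 size) and f2 (1/4 size) then feed the multi-scale
        # deformable attention module (omitted here), which returns the
        # 512-dim second image feature map at 1/8 of the original size.
        return f1, f2
```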
The text encoder encodes the second text content label in the sample image to be annotated; the specific process of obtaining the sample text feature comprises:
acquiring the individual vector mark corresponding to each character in the second text content label, and concatenating them to obtain the text mark;
determining the order mark corresponding to each character based on the sequence information of the characters in the second text content label, and fusing the text mark with the order marks to obtain an intermediate text feature vector;
and modeling the relevance within the intermediate text feature vector to obtain the text feature.
Modeling the relevance within the text feature vector may be achieved with a Transformer structure. In the example shown in FIG. 7, to distinguish the different characters of a given text label, an individual C-dimensional vector mark is learned for each character in the alphabet. The vector mark of each character in a text label of K characters is then looked up, and the lookup results are concatenated in order to obtain a KxC-dimensional text mark. Meanwhile, to capture the sequence information among the characters in the text label, an order mark is learned for each character position, and the text mark is fused with the order marks of the corresponding positions by character-level feature addition, so that a Transformer structure can perform feature interaction among the character marks of the text mark and model the correlation among them. Finally, the character marks are averaged to obtain the 512-dimensional text feature vector. A sketch of such an encoder follows.
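A minimal PyTorch sketch of this text encoder, assuming C=512, a maximum of K=25 characters and a small Transformer encoder; the layer counts and head count are assumptions.
```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Sketch of the label generation model's text encoder (FIG. 7)."""

    def __init__(self, alphabet_size, C=512, K=25, num_layers=3, heads=8):
        super().__init__()
        self.char_embed = nn.Embedding(alphabet_size, C)  # individual vector marks
        self.pos_embed = nn.Parameter(torch.zeros(K, C))  # order marks
        layer = nn.TransformerEncoderLayer(d_model=C, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, char_ids):
        """char_ids: (B, K) character indices of the text content label."""
        tokens = self.char_embed(char_ids)             # text mark, (B, K, C)
        tokens = tokens + self.pos_embed[: char_ids.size(1)]  # add order marks
        tokens = self.encoder(tokens)                  # relevance modeling
        return tokens.mean(dim=1)                      # (B, C) text feature
```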
After the second image feature map and the text feature are obtained, the appearance similarity between the text feature and each pixel of the second image feature map is measured to obtain the activation feature map. Specifically, the pixel value at each pixel point of the activation feature map reflects the similarity between the feature vector corresponding to that pixel point in the second image feature map and the text feature, where the similarity may be characterized by cosine similarity. Each pixel of the activation feature map is a continuous value in [0, 1] rather than binary; a higher value indicates a stronger response to the text encoding feature. Learning the activation map in this soft manner during training simplifies the optimization of gradient propagation and preserves far more similarity information than binary values would. In addition, the activation feature map shows how well the text encoding feature vector matches the pixels of the scene image, and the pixel with the highest activation peak can be taken as the anchor point of the text; the anchor point may be the center point of the text region. That is, the positions of the pixel points whose pixel values in the activation feature map are sufficiently large (greater than a preset threshold) are taken as the text region position label, reflecting the position of the text region. A sketch follows.
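A sketch of the activation map, anchor point and position label computation; mapping cosine similarity into [0, 1] by clamping negatives to zero and the threshold value of 0.5 are both assumptions.
```python
import torch
import torch.nn.functional as F

def activation_map_and_anchor(image_feat, text_feat, threshold=0.5):
    """Cosine-similarity activation map, anchor point and position label.

    image_feat: (C, H, W) second image feature map
    text_feat:  (C,) text feature from the text encoder
    """
    img = F.normalize(image_feat.flatten(1), dim=0)  # (C, H*W), unit pixel vectors
    txt = F.normalize(text_feat, dim=0)              # (C,)
    sim = (txt @ img).clamp(min=0)                   # soft values in [0, 1]
    act = sim.view(image_feat.shape[1:])             # (H, W) activation map
    anchor = (act == act.max()).nonzero()[0]         # most active pixel = anchor
    position_label = (act > threshold).nonzero()     # text region position label
    return act, anchor, position_label
```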
In order for the label generation model to generate accurate text region position labels, its training process comprises:
inputting a first sample image to be annotated, from a plurality of sample images to be annotated, to the second image encoder, and inputting the second text content label of that image to the text encoder, to obtain the sample second image feature map corresponding to the first sample image output by the second image encoder and the sample text feature output by the text encoder;
generating a first sample activation feature map based on the sample second image feature map corresponding to the first sample image and the sample text feature;
point-multiplying the first sample activation feature map with the sample second image feature map corresponding to the first sample image to obtain a first weighted picture feature vector;
and acquiring a first training loss based on the similarity between the first weighted picture feature vector and the sample text feature, and updating the parameters of the label generation model based on the first training loss.
Updating the parameters of the label generation model based on the first training loss comprises:
inputting a second sample image to be annotated, from the plurality of sample images to be annotated, to the second image encoder to obtain the sample second image feature map corresponding to the second sample image;
generating a second sample activation feature map based on the sample second image feature map corresponding to the second sample image and the sample text feature;
point-multiplying the second sample activation feature map with the sample second image feature map corresponding to the second sample image to obtain a second weighted picture feature vector;
acquiring a second training loss based on the similarity between the second weighted picture feature vector and the sample text feature;
and updating the parameters of the label generation model based on the first training loss and the second training loss.
As shown in FIG. 8, for the sample images to be annotated used to train the label generation model, the learned activation feature map serves as a weighting that aggregates the image features related to the text encoding feature vector, yielding the weighted picture feature vector corresponding to each sample image. The sample text feature corresponding to the second text content label of a sample image and the weighted picture feature vector of that same image form a positive sample pair; the sample text feature of one sample image and a weighted picture feature vector derived from a different sample image form a negative sample pair. The label generation model is trained to maximize the similarity between the two feature vectors in each positive pair and minimize the similarity between the two feature vectors in each negative pair. The text feature vector thereby acts as a cluster center that draws in all images related to it as positive pairs, so the model learns the appearance patterns shared by all regional images related to the text, while the negative pairs prevent the model from learning a uniform pattern for different texts, which would cause mode collapse. A sketch of such a contrastive objective follows.
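One common way to realize this objective is an InfoNCE-style batch loss, sketched below; the InfoNCE form and the temperature value are assumptions, since the patent only states the maximize/minimize goals. Diagonal entries of the similarity matrix are the positive pairs, off-diagonal entries the negative pairs.
```python
import torch
import torch.nn.functional as F

def label_model_contrastive_loss(weighted_img_feats, text_feats, tau=0.07):
    """Sketch of the first + second training losses as InfoNCE.

    weighted_img_feats: (B, C) activation-weighted picture feature vectors
    text_feats:         (B, C) matching sample text features
    """
    img = F.normalize(weighted_img_feats, dim=1)
    txt = F.normalize(text_feats, dim=1)
    logits = txt @ img.t() / tau                 # (B, B) pairwise similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Cross-entropy pulls each positive pair together and pushes all
    # cross-image negative pairs apart.
    return F.cross_entropy(logits, targets)
```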
The text region position labels generated by the trained label generation model are then used as supervision signals to train the text recognition model. The operation of the text recognition model is described first.
As shown in FIG. 9, after an image to be recognized is input into the text recognition model, it is first processed by the first image encoder to obtain the first image feature map, which is then input separately to the anchor estimator, the sampling module and the recognition module in the text recognition model to obtain the probability feature map output by the anchor estimator, the character position feature map output by the sampling module and the character semantic feature map output by the recognition module. Determining the character positions in each text region based on the probability feature map and the character position feature map comprises:
determining at least one first target pixel point in the character position feature map, based on the positions of the pixel points in the probability feature map whose pixel values exceed a preset threshold;
and determining the character position in each text region based on the feature vector corresponding to the first target pixel point in the character position feature map.
The pixel value of a pixel point in the probability feature map reflects the probability that the position of that pixel point is a text region. For example, the anchor estimator may consist of a 1x1 convolution layer and an activation layer with sigmoid as the activation function, mapping the first image feature map to a probability feature map of the same size with dimension 1; the closer a pixel value is to 1, the greater the probability that the position is an anchor point, i.e., that text exists in that region. When the pixel value of a pixel point in the probability feature map exceeds the preset threshold, that pixel point can be judged to be the anchor point of a text region, and the pixel point at the corresponding position in the character position feature map is taken as a first target pixel point.
Each pixel point in the character position feature map corresponds to a feature vector of 2K dimensions, where K is the preset maximum number of characters in a single word. For example, the sampling module consists of three 3x3 convolution layers, one 1x1 convolution layer, and regularization and ReLU activation functions between the convolution layers; it maps the first image feature map to a three-dimensional tensor of sampling-point coordinate offsets (i.e., the character position feature map) of the same HxW size with 2K channels. The values of the 2K-dimensional vector at pixel coordinates (i, j) are the horizontal and vertical coordinate offsets (Δx_k, Δy_k) of each of the K sampling points relative to (i, j); the coordinates of the k-th sampling point are then (i+Δx_k, j+Δy_k). The K sampling points correspond to the specific position of each character in a text instance. A decoding sketch follows.
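A sketch of decoding anchor points and character positions from the two feature maps; the threshold value and the (row, column) coordinate convention are assumptions.
```python
import torch

def decode_character_positions(prob_map, offset_map, threshold=0.5):
    """Turn the probability and character position feature maps into
    per-character coordinates (a sketch).

    prob_map:   (H, W) anchor probabilities from the anchor estimator
    offset_map: (2K, H, W) per-pixel offsets from the sampling module
    Returns the anchor points and, per anchor, K character coordinates.
    """
    anchors = (prob_map > threshold).nonzero()      # (N, 2) first target pixels
    per_anchor_chars = []
    for i, j in anchors.tolist():
        offsets = offset_map[:, i, j].view(-1, 2)   # (K, 2) = (Δx_k, Δy_k)
        base = torch.tensor([i, j], dtype=offsets.dtype)
        per_anchor_chars.append(offsets + base)     # k-th char at (i+Δx_k, j+Δy_k)
    return anchors, per_anchor_chars
```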
From the feature vector corresponding to each first target pixel point in the character position feature map, the positions of the characters around that first target pixel point can be determined. Once the character positions in each text region are determined, the character corresponding to each position is looked up in the character semantic feature map, giving the text recognition result.
The feature vector corresponding to each pixel point in the character semantic feature map reflects which character the position of that pixel point corresponds to. For example, the recognition module consists of four 3x3 convolution layers, one 1x1 convolution layer, and regularization and ReLU activation functions between the convolution layers; it maps the first image feature map to a three-dimensional character probability tensor (i.e., the character semantic feature map) of the same size with dimension equal to the character set size + 1. Each pixel predicts its probability over {null character} ∪ {character set}, and the character with the highest probability indicates that the image region corresponding to this pixel is that character. A readout sketch follows.
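A sketch of reading the text of one region out of the character semantic feature map at the sampled positions; treating channel 0 as the null character and clamping coordinates to the map are assumptions.
```python
import torch

def read_characters(semantic_map, char_positions, alphabet):
    """Read out one text word from the character semantic feature map.

    semantic_map:   (num_chars + 1, H, W) character probability tensor
    char_positions: (K, 2) sampled coordinates for one text region
    alphabet:       string or list of the preset character set
    """
    H, W = semantic_map.shape[1:]
    chars = []
    for x, y in char_positions.round().long().tolist():
        x = max(0, min(H - 1, x))           # clamp to the feature map
        y = max(0, min(W - 1, y))
        idx = semantic_map[:, x, y].argmax().item()
        if idx != 0:                        # channel 0 = null character (assumed)
            chars.append(alphabet[idx - 1])
    return "".join(chars)
```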
The method provided by the invention can run inference locally on a Linux platform with at least 4 GB of GPU memory; training requires at least 8 GB. It can also be deployed in the cloud to provide services. After the label generation model (model one) and the text recognition model (model two) are trained, several tasks beyond scene text detection and recognition can be completed with the recognition result. As shown in FIG. 10, the recognized text may be further translated; as shown in FIG. 11, the recognition result may serve an image retrieval task conditioned on whether a picture contains a given text.
By locating and recognizing image text with weakly supervised deep learning, the method provided by the invention greatly reduces the annotation cost required for training a deep model: excellent scene-image text detection, localization and recognition are achieved with only the text content of the images annotated. Experimental results show that the method provided by the invention is well ahead of existing weakly supervised methods of the same kind on three datasets, as shown in Table 1.
TABLE 1
On the four datasets in Table 1, visualizations of the detection and recognition results of the method provided by the invention are shown in FIG. 12; the method produces good qualitative results.
The recognition results of the method provided by the invention were applied to an image retrieval task, i.e., given a text query, retrieving all scene pictures that contain the specified text. Experiments show that the retrieval effect is superior to other scene text detection and recognition models built on the same kind of principle, and is not inferior to specialized methods from the image retrieval field; the results are shown in Table 2.
TABLE 2
Although the anchor points marking text are determined only by contrastive learning on the training set, model one (the label generation model) provided by the invention also achieves a good text localization effect on the test set, as shown in Table 3. The method can therefore serve as a dataset annotation method: given only the text contained in an image, it produces the center position label of that text, enabling automatic scene text annotation.
TABLE 3
The scene text detection and recognition device based on weakly supervised cross-modal contrastive learning provided by the invention is described below; this device and the scene text detection and recognition method described above may be referred to in correspondence with each other. As shown in FIG. 13, the device comprises:
the image encoding module 1310, configured to acquire an image to be recognized, input the image to be recognized to a first image encoder in a trained text recognition model, and acquire a first image feature map output by the first image encoder;
the feature processing module 1320 is configured to input the first image feature map to an anchor estimator, a sampling module and a recognition module in the text recognition model, respectively, obtain a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the recognition module, where a value of a pixel point in the probability feature map reflects a probability that the position of the pixel point is a text region, a feature vector corresponding to the pixel point in the character position feature map reflects a pixel distance between the pixel point and each character included in the text region where the pixel point is located, and a feature vector corresponding to the pixel point in the character semantic feature map reflects a probability that the position of the pixel point is each preset character, each text region includes a text word, and one text word is composed of at least one character;
A text detection recognition module 1330 configured to determine a character position in each of the text regions based on the probability feature map and the character position feature map, determine a character in each of the text regions based on the character position in the text region in the character semantic feature map to determine a text recognition result in the image to be recognized, where the text recognition result includes a position of each of the text regions in the image to be recognized and text content in the text region, and each of the text regions includes a text word;
the text recognition model is obtained by training based on a plurality of groups of first training data, and each group of first training data comprises a sample image to be recognized, a first text content label and a text region position label corresponding to the sample image to be recognized; the text region position labels in the sample image to be identified are generated based on a trained label generation model, the label generation model is trained based on a plurality of groups of second training data, and each group of second training data comprises the sample image to be marked and the second text content labels in the sample image to be marked.
FIG. 14 illustrates the physical structure of an electronic device. As shown in FIG. 14, the electronic device may comprise: a processor 1410, a communications interface 1420, a memory 1430 and a communication bus 1440, where the processor 1410, the communication interface 1420 and the memory 1430 communicate with each other via the communication bus 1440. The processor 1410 may invoke logic instructions in the memory 1430 to perform the scene text detection and recognition method based on weakly supervised cross-modal contrastive learning, the method comprising: acquiring an image to be recognized, inputting the image to be recognized into a first image encoder in a trained text recognition model, and acquiring a first image feature map output by the first image encoder;
respectively inputting the first image feature map to an anchor estimator, a sampling module and a recognition module in the text recognition model, and obtaining a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the recognition module, wherein the value of a pixel point in the probability feature map reflects the probability that the position of the pixel point is a text region, the feature vector corresponding to the pixel point in the character position feature map reflects the pixel distance between the pixel point and each character included in the text region where the pixel point is, the feature vector corresponding to the pixel point in the character semantic feature map reflects the probability that the position of the pixel point is each preset character, each text word comprises a text word, and each text word consists of at least one character;
Determining character positions in the text regions based on the probability feature map and the character position feature map, determining characters in each text region in the character semantic feature map based on the character positions in the text regions to determine text recognition results in the image to be recognized, wherein the text recognition results comprise the positions of the text regions in the image to be recognized and text contents in the text regions, and each text region comprises a text word;
the text recognition model is trained on multiple groups of first training data, each group comprising a sample image to be recognized together with its corresponding first text content label and text region position label. The text region position labels are generated by a trained label generation model, which is itself trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label for that image.
In addition, the logic instructions in the memory 1430 described above may be implemented in the form of software functional units and may be stored in a computer-readable storage medium when sold or used as a standalone product. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In still another aspect, the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the scene text detection and recognition method based on weak supervision cross-modal contrast learning provided by the methods above, the method comprising: acquiring an image to be recognized, inputting the image to be recognized to a first image encoder in a trained text recognition model, and obtaining a first image feature map output by the first image encoder;
respectively inputting the first image feature map to an anchor estimator, a sampling module and a recognition module in the text recognition model, and obtaining a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the recognition module, where the value of a pixel in the probability feature map reflects the probability that the pixel lies in a text region, the feature vector of a pixel in the character position feature map reflects the pixel distance between that pixel and each character in the text region containing it, the feature vector of a pixel in the character semantic feature map reflects the probability that the pixel corresponds to each preset character, each text region comprises a text word, and each text word consists of at least one character;
determining the character positions in each text region based on the probability feature map and the character position feature map, and determining the characters in each text region from the character semantic feature map based on those character positions, to obtain a text recognition result for the image to be recognized, where the text recognition result comprises the position of each text region in the image to be recognized and the text content in that region, and each text region comprises a text word;
the text recognition model is trained on multiple groups of first training data, each group comprising a sample image to be recognized together with its corresponding first text content label and text region position label. The text region position labels are generated by a trained label generation model, which is itself trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label for that image.
The apparatus embodiments described above are merely illustrative: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the solution without inventive effort.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or directly by hardware. Based on this understanding, the foregoing technical solution, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a computer-readable storage medium, such as a ROM/RAM, a magnetic disk, or an optical disk, comprising several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method described in the respective embodiments or in some parts of the embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A scene text detection and recognition method based on weak supervision cross-modal contrast learning, characterized by comprising the following steps:
acquiring an image to be recognized, inputting the image to be recognized to a first image encoder in a trained text recognition model, and obtaining a first image feature map output by the first image encoder;
respectively inputting the first image feature map to an anchor estimator, a sampling module and a recognition module in the text recognition model, and obtaining a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the recognition module, where the value of a pixel in the probability feature map reflects the probability that the pixel lies in a text region, the feature vector of a pixel in the character position feature map reflects the pixel distance between that pixel and each character in the text region containing it, the feature vector of a pixel in the character semantic feature map reflects the probability that the pixel corresponds to each preset character, each text region comprises a text word, and each text word consists of at least one character;
determining the character positions in each text region based on the probability feature map and the character position feature map, and determining the characters in each text region from the character semantic feature map based on those character positions, to obtain a text recognition result for the image to be recognized, where the text recognition result comprises the position of each text region in the image to be recognized and the text content in that region, and each text region comprises a text word;
the text recognition model is trained on multiple groups of first training data, each group comprising a sample image to be recognized together with its corresponding first text content label and text region position label. The text region position labels are generated by a trained label generation model, which is itself trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label for that image.
2. The scene text detection and recognition method based on weak supervision cross-modal contrast learning of claim 1, wherein the generating of the text region position labels based on the label generation model comprises:
inputting the sample image to be recognized to a second image encoder in the label generation model to obtain a second image feature map output by the second image encoder, and inputting the first text content label corresponding to the sample image to be recognized to a text encoder in the label generation model to obtain a text feature output by the text encoder;
generating an activation feature map based on the second image feature map and the text feature, where the value of each pixel in the activation feature map reflects the similarity between the feature vector of the corresponding pixel in the second image feature map and the text feature;
and taking, as the text region position label, the positions of the pixels in the activation feature map whose values are greater than a preset threshold.
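As a concrete illustration of claim 2, the following sketch computes an activation feature map as the cosine similarity between each pixel's feature vector and the text feature, then thresholds it into a position pseudo-label. Cosine similarity and the threshold value are our assumptions; the claim only requires that pixel values reflect similarity.

```python
import torch
import torch.nn.functional as F

def activation_map(image_feats, text_feat, threshold=0.5):
    """Cosine-similarity activation map (illustrative sketch).

    image_feats: (C, H, W) second image feature map
    text_feat:   (C,)      text feature from the text encoder
    Returns the activation map and the thresholded pseudo-label mask.
    """
    C, H, W = image_feats.shape
    pix = F.normalize(image_feats.reshape(C, -1), dim=0)  # unit-norm feature per pixel
    txt = F.normalize(text_feat, dim=0)                   # unit-norm text feature
    act = (txt @ pix).reshape(H, W)                       # cosine similarity per pixel
    mask = act > threshold                                # text region position pseudo-label
    return act, mask
```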
3. The scene text detection and recognition method based on weak supervision cross-modal contrast learning of claim 2, wherein the training process of the label generation model comprises:
inputting a first sample image to be annotated, from a plurality of sample images to be annotated, to the second image encoder, and inputting the second text content label of that sample image to the text encoder, to obtain a sample second image feature map corresponding to the first sample image to be annotated, output by the second image encoder, and a sample text feature output by the text encoder;
generating a first sample activation feature map based on the sample second image feature map corresponding to the first sample image to be annotated and the sample text feature;
performing element-wise multiplication of the first sample activation feature map and the sample second image feature map corresponding to the first sample image to be annotated, to obtain a first weighted image feature vector;
and acquiring a first training loss based on the similarity between the first weighted image feature vector and the sample text feature, and updating the parameters of the label generation model based on the first training loss.
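A minimal sketch of claim 3's loss, assuming the "point multiplication" is an element-wise product followed by spatial average pooling, and that the training loss is one minus the cosine similarity between the pooled image feature and the text feature. Both readings are our assumptions, not the patented formula.

```python
import torch
import torch.nn.functional as F

def weighted_image_feature(act_map, image_feats):
    """act_map: (H, W) sample activation feature map; image_feats: (C, H, W)."""
    weighted = act_map.unsqueeze(0) * image_feats  # element-wise weighting per pixel
    return weighted.flatten(1).mean(dim=1)         # pool to a (C,) feature vector

def first_training_loss(weighted_feat, text_feat):
    # Higher similarity between the weighted image feature and its paired
    # text feature means lower loss.
    return 1.0 - F.cosine_similarity(weighted_feat, text_feat, dim=0)
```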
4. The scene text detection and recognition method based on weak supervision cross-modal contrast learning of claim 3, wherein the updating of the parameters of the label generation model based on the first training loss comprises:
inputting a second sample image to be annotated, from the plurality of sample images to be annotated, to the second image encoder to obtain a sample second image feature map corresponding to the second sample image to be annotated;
generating a second sample activation feature map based on a sample second image feature map corresponding to the second sample image to be annotated and the sample text feature;
performing element-wise multiplication of the second sample activation feature map and the sample second image feature map corresponding to the second sample image to be annotated, to obtain a second weighted image feature vector;
acquiring a second training loss based on the similarity between the second weighted image feature vector and the sample text feature;
and updating the parameters of the label generation model based on the first training loss and the second training loss.
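Claim 4 pairs the first (matched) sample with a second (unmatched) one, which reads naturally as a contrastive objective: pull the matched weighted feature toward the text feature and push the unmatched one away. The InfoNCE-style form and the temperature below are our assumptions.

```python
import torch
import torch.nn.functional as F

def combined_contrastive_loss(pos_feat, neg_feat, text_feat, temperature=0.07):
    """pos_feat pairs with text_feat; neg_feat comes from the second sample image."""
    pos_sim = F.cosine_similarity(pos_feat, text_feat, dim=0) / temperature
    neg_sim = F.cosine_similarity(neg_feat, text_feat, dim=0) / temperature
    # Negative log-probability that the positive pair wins the 2-way softmax.
    return -F.log_softmax(torch.stack([pos_sim, neg_sim]), dim=0)[0]
```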
5. The scene text detection and recognition method based on weak supervision cross-modal contrast learning of claim 1, wherein the determining of the character positions in each text region based on the probability feature map and the character position feature map comprises:
determining at least one first target pixel in the character position feature map based on the positions of the pixels in the probability feature map whose values are greater than a preset threshold;
and determining the character positions in each text region based on the feature vectors of the first target pixels in the character position feature map.
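The following sketch decodes claims 1 and 5: threshold the probability map to select target pixels, read each pixel's predicted per-character offsets from the character position map, and look up the most probable preset character at each resulting position in the semantic map. The tensor shapes and the handling of duplicate anchors are assumptions.

```python
import torch

def decode_text(prob_map, char_pos_map, char_sem_map, threshold=0.5):
    """prob_map: (H, W); char_pos_map: (H, W, K, 2); char_sem_map: (H, W, V)."""
    H, W, K, _ = char_pos_map.shape
    results = []
    for y, x in (prob_map > threshold).nonzero().tolist():  # first target pixels
        chars = []
        for k in range(K):
            dy, dx = char_pos_map[y, x, k].tolist()  # pixel distance to k-th character
            cy, cx = int(round(y + dy)), int(round(x + dx))
            if 0 <= cy < H and 0 <= cx < W:
                # Character identity = most probable preset character at that position.
                chars.append(char_sem_map[cy, cx].argmax().item())
        results.append(((y, x), chars))
    # A practical decoder would also merge anchors that fall inside the same
    # text region (e.g. by non-maximum suppression); that step is omitted here.
    return results
```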
6. The scene text detection and recognition method based on weak supervision cross-modal contrast learning of claim 2, wherein the second image encoder comprises a plurality of convolution layers, and the inputting of the sample image to be recognized to the second image encoder in the label generation model to obtain the second image feature map output by the second image encoder comprises:
convolving the sample image to be recognized through a first convolution layer to obtain a first feature map, convolving the first feature map through a second convolution layer to obtain a second feature map, convolving the second feature map through a third convolution layer to obtain a third feature map, and convolving the third feature map through a fourth convolution layer to obtain a fourth feature map, where the sizes of the first, second, third and fourth feature maps decrease in sequence;
upsampling the fourth feature map and the third feature map and concatenating them with the second feature map to obtain a fifth feature map, and upsampling the third feature map and the second feature map and concatenating them with the first feature map to obtain a sixth feature map;
and taking the fifth feature map and the sixth feature map as inputs to a multi-scale deformable attention module, and obtaining the second image feature map output by the multi-scale deformable attention module.
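A sketch of the pyramid in claim 6, assuming stride-2 convolutions, bilinear upsampling, and channel-wise concatenation for the "connection" step; the channel counts are arbitrary, and the deformable-attention stage is left abstract since its internals are not specified here.

```python
import torch
import torch.nn as nn

class SecondImageEncoderSketch(nn.Module):
    """Illustrative pyramid for claim 6 (strides, channels, and kernels assumed)."""

    def __init__(self, c=64):
        super().__init__()
        self.conv1 = nn.Conv2d(3, c, 3, stride=2, padding=1)  # -> first feature map
        self.conv2 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # -> second (smaller)
        self.conv3 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # -> third (smaller still)
        self.conv4 = nn.Conv2d(c, c, 3, stride=2, padding=1)  # -> fourth (smallest)
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)

    def forward(self, x):
        f1 = self.conv1(x)
        f2 = self.conv2(f1)
        f3 = self.conv3(f2)
        f4 = self.conv4(f3)
        # Fifth map: upsample f4 (twice) and f3 to f2's size, then concatenate with f2.
        f5 = torch.cat([self.up(self.up(f4)), self.up(f3), f2], dim=1)
        # Sixth map: upsample f3 (twice) and f2 to f1's size, then concatenate with f1.
        f6 = torch.cat([self.up(self.up(f3)), self.up(f2), f1], dim=1)
        # In the claim, f5 and f6 then feed a multi-scale deformable attention module.
        return f5, f6
```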
7. The scene text detection and recognition method based on weak supervision cross-modal contrast learning of claim 2, wherein the training process of the text recognition model comprises:
inputting the sample image to be recognized into the text recognition model, and acquiring a sample text recognition result output by the text recognition model and a sample probability feature map output by the anchor estimator in the text recognition model;
acquiring a third loss based on the sample text recognition result and on the first text content label and the text region position label corresponding to the sample image to be recognized;
acquiring a fourth loss based on the sample probability feature map and the activation feature map corresponding to the text region position label;
and updating the parameters of the text recognition model based on the third loss and the fourth loss.
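A sketch of one training step of claim 7. The recognition_loss helper and the model output structure are hypothetical names for illustration; rendering the fourth loss as a pixel-wise binary cross-entropy between the predicted probability map and the activation map behind the position label (rescaled to [0, 1]) is one plausible reading, not the patented formula.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, image, text_label, region_label, act_map):
    """One illustrative step of the text recognition model's training."""
    prob_map, char_pos_map, char_sem_map = model(image)
    # Third loss: supervise the decoded recognition result with the first text
    # content label and the generated text region position label.
    # recognition_loss is a hypothetical helper, shown only for structure.
    third_loss = recognition_loss(char_pos_map, char_sem_map, text_label, region_label)
    # Fourth loss: align the anchor estimator's probability map with the
    # activation map underlying the position label (act_map assumed in [0, 1]).
    fourth_loss = F.binary_cross_entropy(prob_map, act_map)
    loss = third_loss + fourth_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```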
8. A scene text detection and recognition apparatus based on weak supervision cross-modal contrast learning, characterized by comprising:
an image encoding module, configured to acquire an image to be recognized, input the image to be recognized to a first image encoder in a trained text recognition model, and obtain a first image feature map output by the first image encoder;
a feature processing module, configured to respectively input the first image feature map to an anchor estimator, a sampling module and a recognition module in the text recognition model, and to obtain a probability feature map output by the anchor estimator, a character position feature map output by the sampling module and a character semantic feature map output by the recognition module, where the value of a pixel in the probability feature map reflects the probability that the pixel lies in a text region, the feature vector of a pixel in the character position feature map reflects the pixel distance between that pixel and each character in the text region containing it, the feature vector of a pixel in the character semantic feature map reflects the probability that the pixel corresponds to each preset character, each text region comprises a text word, and one text word consists of at least one character;
a text detection and recognition module, configured to determine the character positions in each text region based on the probability feature map and the character position feature map, and to determine the characters in each text region from the character semantic feature map based on those character positions, to obtain a text recognition result for the image to be recognized, where the text recognition result comprises the position of each text region in the image to be recognized and the text content in that region, and each text region comprises a text word;
the text recognition model is trained on multiple groups of first training data, each group comprising a sample image to be recognized together with its corresponding first text content label and text region position label. The text region position labels are generated by a trained label generation model, which is itself trained on multiple groups of second training data, each group comprising a sample image to be annotated and the second text content label for that image.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the scene text detection and recognition method based on weak supervision cross-modal contrast learning of any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium having stored thereon a computer program, characterized in that the computer program, when executed by a processor, implements the scene text detection and recognition method based on weak supervision cross-modal contrast learning of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310828211.3A CN116563840B (en) | 2023-07-07 | 2023-07-07 | Scene text detection and recognition method based on weak supervision cross-mode contrast learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310828211.3A CN116563840B (en) | 2023-07-07 | 2023-07-07 | Scene text detection and recognition method based on weak supervision cross-mode contrast learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116563840A CN116563840A (en) | 2023-08-08 |
CN116563840B (en) | 2023-09-05
Family
ID=87500427
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310828211.3A Active CN116563840B (en) | 2023-07-07 | 2023-07-07 | Scene text detection and recognition method based on weak supervision cross-mode contrast learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116563840B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117851883B (en) * | 2024-01-03 | 2024-08-30 | Zhejiang Lab (之江实验室) | Cross-modal large language model-based scene text detection and recognition method |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107179906B (en) * | 2017-04-11 | 2020-05-29 | 山东大学 | Method for optimizing TickerView control of android system |
CN108304835B (en) * | 2018-01-30 | 2019-12-06 | 百度在线网络技术(北京)有限公司 | character detection method and device |
CN108540646A (en) * | 2018-03-12 | 2018-09-14 | 广东欧珀移动通信有限公司 | Message prompt method, device, equipment and storage medium |
CN108924362A (en) * | 2018-07-17 | 2018-11-30 | 维沃移动通信有限公司 | A kind of notification event processing method and terminal |
CN112181233B (en) * | 2020-10-27 | 2022-07-15 | 深圳传音控股股份有限公司 | Message processing method, intelligent terminal and computer readable storage medium |
- 2023-07-07: CN application CN202310828211.3A filed and granted as CN116563840B (en); legal status: Active
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110059685A (en) * | 2019-04-26 | 2019-07-26 | 腾讯科技(深圳)有限公司 | Word area detection method, apparatus and storage medium |
CN111488826A (en) * | 2020-04-10 | 2020-08-04 | 腾讯科技(深圳)有限公司 | Text recognition method and device, electronic equipment and storage medium |
CN111860348A (en) * | 2020-07-21 | 2020-10-30 | 国网山东省电力公司青岛供电公司 | Deep learning-based weak supervision power drawing OCR recognition method |
WO2022257578A1 (en) * | 2021-06-07 | 2022-12-15 | 京东科技信息技术有限公司 | Method for recognizing text, and apparatus |
CN115937875A (en) * | 2021-09-30 | 2023-04-07 | 上海复旦微电子集团股份有限公司 | Text recognition method and device, storage medium and terminal |
CN115984876A (en) * | 2022-12-15 | 2023-04-18 | 际络科技(上海)有限公司 | Text recognition method and device, electronic equipment, vehicle and storage medium |
Non-Patent Citations (1)
Title |
---|
Research progress of deep convolutional neural networks for image semantic segmentation (深度卷积神经网络图像语义分割研究进展); Qing Chen et al. (青晨等); Journal of Image and Graphics (《中国图象图形学报》), Issue 06; full text *
Also Published As
Publication number | Publication date |
---|---|
CN116563840A (en) | 2023-08-08 |
Legal Events
Date | Code | Title | Description
---|---|---|---
 | PB01 | Publication | 
 | SE01 | Entry into force of request for substantive examination | 
 | GR01 | Patent grant | 