CN114067321A - Text detection model training method, device, equipment and storage medium - Google Patents

Text detection model training method, device, equipment and storage medium

Info

Publication number
CN114067321A
CN114067321A
Authority
CN
China
Prior art keywords
text
predicted
text object
target
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210040015.5A
Other languages
Chinese (zh)
Other versions
CN114067321B (en)
Inventor
单鼎一
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210040015.5A priority Critical patent/CN114067321B/en
Publication of CN114067321A publication Critical patent/CN114067321A/en
Application granted granted Critical
Publication of CN114067321B publication Critical patent/CN114067321B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Abstract

The embodiment of the application provides a text detection model training method, apparatus, device, and storage medium, which can be applied to scenarios such as maps, vehicle-mounted systems, artificial intelligence, and assisted driving. The method comprises the following steps: a sample image set is obtained, wherein each sample image contains at least one real text object; joint iterative training is performed on the text detection model to be trained and a global semantic segmentation model based on the sample image set, and the trained target text detection model is output. During training, the two models supervise and learn from each other, and their loss functions jointly optimize the gradient, which improves the accuracy and robustness of the trained target text detection model. Furthermore, the trained target text detection model obtains text objects by performing target detection on the image, without pixel-level feature clustering, thereby avoiding problems caused by the clustering radius and improving the accuracy and efficiency of text detection.

Description

Text detection model training method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of artificial intelligence, and in particular to a text detection model training method, apparatus, device, and storage medium.
Background
With the development of artificial intelligence technology, scene text detection technology, which extracts text content from images, has emerged.
In the related art, scene text recognition separates the foreground and background of an image to obtain the text regions, and then performs pixel-level feature clustering on those regions to obtain the text content.
However, this scheme depends heavily on conditions such as the clustering radius, is sensitive to outliers, and a single clustering radius can hardly handle text of different sizes at the same time, resulting in low text detection accuracy.
Disclosure of Invention
The embodiment of the application provides a text detection model training method, apparatus, device, and storage medium, which are used to improve the accuracy of text detection.
In one aspect, an embodiment of the present application provides a text detection model training method, including:
obtaining a set of sample images, wherein each sample image contains at least one real text object;
performing joint iterative training on a text detection model to be trained and a global semantic segmentation model based on the sample image set, and outputting a trained target text detection model; wherein, in each iterative training process, the following operations are executed:
performing target detection on a sample image through the text detection model to obtain at least one first predicted text object and corresponding predicted attribute information, and performing image segmentation on the sample image through the global semantic segmentation model to obtain at least one second predicted text object;
and determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, the at least one first predicted text object and corresponding predicted attribute information, and the at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
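The per-iteration procedure above can be sketched as follows. This is a minimal illustrative NumPy skeleton, not the patent's implementation: `detect`, `segment`, and the loss forms are stand-ins, and a real system would backpropagate the target loss value through both networks.

```python
import numpy as np

rng = np.random.default_rng(0)

def detect(image):
    """Stand-in for the text detection model: predicted boxes
    (x, y, w, h) and per-box class scores."""
    return rng.random((2, 4)), rng.random((2, 2))

def segment(image):
    """Stand-in for the global semantic segmentation model: a
    per-pixel foreground probability map."""
    return rng.random(image.shape[:2])

def detection_loss(boxes, scores, gt_boxes, gt_labels):
    # L1 box regression plus a negative-log-likelihood class term.
    pos = np.abs(boxes - gt_boxes).mean()
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    cls = -np.log(probs[np.arange(len(gt_labels)), gt_labels] + 1e-9).mean()
    return pos + cls

def segmentation_loss(prob_map, gt_mask):
    # Binary cross-entropy over all pixels.
    p = np.clip(prob_map, 1e-9, 1 - 1e-9)
    return -(gt_mask * np.log(p) + (1 - gt_mask) * np.log(1 - p)).mean()

def train_step(image, gt_boxes, gt_labels, gt_mask):
    boxes, scores = detect(image)           # first predicted text objects
    prob_map = segment(image)               # second predicted text objects
    first_loss = detection_loss(boxes, scores, gt_boxes, gt_labels)
    second_loss = segmentation_loss(prob_map, gt_mask)
    target_loss = first_loss + second_loss  # jointly optimized gradient
    return target_loss                      # parameter adjustment would follow

image = rng.random((32, 32, 3))
loss = train_step(image, rng.random((2, 4)), np.array([0, 1]),
                  (rng.random((32, 32)) > 0.5).astype(float))
print(float(loss) > 0)  # True
```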
In one aspect, an embodiment of the present application provides a text detection model training apparatus, where the apparatus includes:
a first obtaining module, configured to obtain a sample image set, where each sample image contains at least one real text object;
the model training module is used for performing joint iterative training on a text detection model to be trained and a global semantic segmentation model based on the sample image set and outputting a trained target text detection model; wherein, in each iterative training process, the following operations are executed:
performing target detection on a sample image through the text detection model to obtain at least one first predicted text object and corresponding predicted attribute information, and performing image segmentation on the sample image through the global semantic segmentation model to obtain at least one second predicted text object;
and determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, the at least one first predicted text object and corresponding predicted attribute information, and the at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
Optionally, the model training module is specifically configured to:
carrying out feature extraction on the sample image to obtain a plurality of sample feature images with different sizes;
cutting out a plurality of corresponding initial text box images from the plurality of sample feature images, and adjusting the plurality of initial text box images to the same size to obtain a plurality of sample text box images;
performing instance segmentation on the plurality of sample text box images to obtain the at least one first predicted text object;
and performing attribute prediction on the plurality of sample text box images to obtain prediction attribute information corresponding to the at least one first prediction text object.
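The crop-and-resize step described above, cutting initial text box images from multi-scale sample feature images and adjusting them to one size, can be illustrated with a simplified nearest-neighbour crop, a stand-in for an RoIAlign-style operation; the 7×7 output size and the feature-map shapes are assumptions for illustration.

```python
import numpy as np

def crop_and_resize(feature_map, box, out_size=7):
    """Crop box = (x1, y1, x2, y2) from a feature map and resample it to
    out_size x out_size by nearest-neighbour indexing -- a simplified
    stand-in for an RoIAlign-style operation."""
    x1, y1, x2, y2 = box
    ys = np.linspace(y1, y2 - 1, out_size).round().astype(int)
    xs = np.linspace(x1, x2 - 1, out_size).round().astype(int)
    return feature_map[np.ix_(ys, xs)]

# Sample feature images of different sizes, e.g. from a feature pyramid,
# each with one initial text box (coordinates are illustrative).
feature_maps = [np.random.rand(64, 64), np.random.rand(32, 32)]
initial_boxes = [(4, 4, 20, 12), (2, 2, 10, 30)]

sample_text_box_images = [crop_and_resize(m, b)
                          for m, b in zip(feature_maps, initial_boxes)]
shapes = {img.shape for img in sample_text_box_images}
print(shapes)  # {(7, 7)} -- all crops adjusted to the same size
```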
Optionally, the predicted attribute information includes predicted position information and a predicted category;
the model training module is specifically configured to:
performing bounding box regression on the plurality of sample text box images to obtain predicted position information corresponding to the at least one first predicted text object;
and performing bounding box classification on the plurality of sample text box images to obtain a prediction category corresponding to each of the at least one first predicted text object.
Optionally, the model training module is specifically configured to:
for the plurality of sample feature images, respectively executing the following steps:
determining a text box size matching an image size of a sample feature image;
generating a plurality of initial text boxes corresponding to the text box size in the sample feature image;
cutting a plurality of initial text box images from the sample feature image based on the plurality of initial text boxes.
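A minimal sketch of the per-feature-map steps above. The rule for matching box size to feature-map size is an assumed heuristic (the patent does not give a formula): coarser feature maps have larger receptive fields, so they are assigned larger boxes.

```python
def box_size_for(feat_size, image_size=256):
    """Match the text-box size to the feature map: coarse maps (small
    feat_size) have large receptive fields, so they get larger boxes.
    The formula is an assumed heuristic, not the patent's rule."""
    return max(2, image_size // feat_size // 4)

def generate_initial_boxes(feat_size, box_size):
    """Tile square initial text boxes over a feat_size x feat_size
    sample feature image; each box is (x1, y1, x2, y2)."""
    return [(x, y, x + box_size, y + box_size)
            for y in range(0, feat_size - box_size + 1, box_size)
            for x in range(0, feat_size - box_size + 1, box_size)]

for feat_size in (64, 32, 16):
    size = box_size_for(feat_size)
    boxes = generate_initial_boxes(feat_size, size)
    print(feat_size, size, len(boxes))  # finer maps -> smaller, more boxes
```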
Optionally, the model training module is specifically configured to:
performing feature extraction on the sample image to obtain a target sample feature image;
performing semantic segmentation on the target sample feature image to obtain a predicted global category corresponding to each pixel in the sample image;
generating a feature vector corresponding to each pixel in the sample image based on the target sample feature image;
and performing instance segmentation on the sample image based on the predicted global category and the feature vector corresponding to each pixel to obtain the at least one second predicted text object.
Optionally, the model training module is specifically configured to:
determining the target pixels whose predicted global category is the foreground category from the pixels;
clustering each target pixel based on the feature vector of each target pixel to obtain at least one target pixel set and a predicted text object label corresponding to each target pixel set;
and obtaining at least one second predicted text object based on the predicted text object labels respectively corresponding to the at least one target pixel set and the at least one target pixel set.
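The clustering steps above can be sketched as a greedy grouping of foreground pixels by their feature vectors. The threshold `tau` and the greedy merge rule are illustrative assumptions, not the patent's clustering procedure.

```python
import numpy as np

def cluster_pixels(global_cls, embeddings, tau=0.5):
    """Greedily group foreground pixels (global_cls == 1) into predicted
    text objects: a pixel joins the nearest existing cluster centre if
    its feature vector is within tau, else it starts a new cluster.
    Background pixels keep the label -1."""
    labels = -np.ones(len(global_cls), dtype=int)
    centres = []
    for i in np.flatnonzero(global_cls == 1):
        dists = [np.linalg.norm(embeddings[i] - c) for c in centres]
        if dists and min(dists) < tau:
            labels[i] = int(np.argmin(dists))
        else:
            centres.append(embeddings[i])
            labels[i] = len(centres) - 1
    return labels

cls = np.array([1, 1, 0, 1, 1])           # predicted global categories
emb = np.array([[0.0, 0.0], [0.05, 0.0],  # two nearby foreground pixels
                [9.0, 9.0],               # background pixel (ignored)
                [5.0, 5.0], [5.1, 5.0]])  # two more, far from the first pair
labels = cluster_pixels(cls, emb)
print(labels)  # [ 0  0 -1  1  1]
```

Each distinct label then corresponds to one second predicted text object, and -1 marks background pixels that belong to none.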
Optionally, the model training module is specifically configured to:
determining a first loss value based on at least one real text object and corresponding real attribute information in the sample image and the at least one first predicted text object and corresponding predicted attribute information;
determining a second loss value based on at least one real text object in the sample image and the at least one second predicted text object;
determining the target loss value based on the first loss value and the second loss value.
Optionally, the real attribute information includes real position information and a real category, and the predicted attribute information includes predicted position information and a predicted category;
the model training module is specifically configured to:
determining a first instance segmentation loss value based on the at least one real text object and the at least one first predicted text object;
determining a position loss value based on the real position information corresponding to the at least one real text object and the predicted position information corresponding to the at least one first predicted text object;
determining a category loss value based on a real category corresponding to each of the at least one real text object and a prediction category corresponding to each of the at least one first predicted text object;
determining the first loss value based on the first instance segmentation loss value, the position loss value, and the class loss value.
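One plausible instantiation of the three components of the first loss value, under assumed loss choices (Dice for the instance segmentation term, smooth L1 for position, negative log-likelihood for category); the patent does not fix these forms.

```python
import numpy as np

def dice_loss(pred_mask, gt_mask, eps=1e-6):
    """Instance-segmentation term: 1 - Dice overlap between predicted
    and real text-object masks."""
    inter = (pred_mask * gt_mask).sum()
    return 1 - (2 * inter + eps) / (pred_mask.sum() + gt_mask.sum() + eps)

def smooth_l1(pred_pos, gt_pos):
    """Position term over box coordinates."""
    d = np.abs(pred_pos - gt_pos)
    return np.where(d < 1, 0.5 * d ** 2, d - 0.5).mean()

def class_nll(pred_probs, gt_class):
    """Category term: negative log-likelihood of the real class."""
    return -np.log(pred_probs[gt_class] + 1e-9)

first_loss = (dice_loss(np.array([1.0, 1, 0]), np.array([1.0, 0, 0]))
              + smooth_l1(np.array([10.0, 10, 20, 20]),
                          np.array([12.0, 10, 21, 20]))
              + class_nll(np.array([0.9, 0.1]), 0))
print(round(float(first_loss), 3))
```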
Optionally, the model training module is specifically configured to:
determining a semantic segmentation loss value based on a real global class of each pixel corresponding to the at least one real text object and a predicted global class of each pixel corresponding to the at least one second predicted text object in the sample image;
determining a second instance segmentation loss value based on the real text object labels of the pixels corresponding to the at least one real text object and the predicted text object labels of the pixels corresponding to the at least one second predicted text object in the sample image;
determining the second loss value based on the semantic segmentation loss value and the second instance segmentation loss value.
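Similarly, a hedged sketch of the second loss value, assuming per-pixel binary cross-entropy for the semantic segmentation term and a simple label-disagreement rate as a stand-in for the second instance segmentation term:

```python
import numpy as np

def semantic_seg_loss(pred_fg, gt_fg, eps=1e-9):
    """Per-pixel binary cross-entropy between predicted and real
    global (foreground/background) classes."""
    p = np.clip(pred_fg, eps, 1 - eps)
    return -(gt_fg * np.log(p) + (1 - gt_fg) * np.log(1 - p)).mean()

def instance_label_loss(pred_labels, gt_labels):
    """Second instance-segmentation term: fraction of pixels whose
    predicted text-object label disagrees with the real label (a crude
    stand-in for the patent's unspecified instance loss)."""
    return float(np.mean(pred_labels != gt_labels))

second_loss = (semantic_seg_loss(np.array([0.9, 0.8, 0.1]),
                                 np.array([1.0, 1.0, 0.0]))
               + instance_label_loss(np.array([0, 0, 1]),
                                     np.array([0, 1, 1])))
print(round(second_loss, 3))
```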
Optionally, a model prediction module is further included;
the model prediction module is specifically configured to:
acquiring an image to be processed after joint iterative training has been performed on the text detection model to be trained and the global semantic segmentation model based on the sample image set and the trained target text detection model has been output;
extracting the features of the image to be processed by adopting the target text detection model to obtain a plurality of target feature images with different sizes, and cutting out a plurality of corresponding candidate text box images from the plurality of target feature images;
adjusting the images of the candidate text boxes to be the same size by adopting the target text detection model to obtain a plurality of target text box images;
and performing instance segmentation on the plurality of target text box images by adopting the target text detection model to obtain the at least one target text object.
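The four inference steps above chain together as a simple pipeline. The sketch below uses stub callables in place of the trained target text detection model's heads, purely to show the data flow.

```python
import numpy as np

def detect_text(image, model):
    """Inference with a trained target text detection model:
    multi-scale feature extraction -> candidate text box crops ->
    uniform resize -> instance segmentation. `model` is a dict of
    stub callables standing in for the real network heads."""
    feature_maps = model["extract"](image)             # several sizes
    crops = [model["crop"](f) for f in feature_maps]   # candidate text box images
    resized = [model["resize"](c) for c in crops]      # same-size target text box images
    return model["segment"](resized)                   # target text objects

stub = {
    "extract": lambda img: [img, img[::2, ::2]],
    "crop":    lambda f: f[:4, :4],
    "resize":  lambda c: c,  # crops already share a size in this toy setup
    "segment": lambda crops: [f"text_object_{i}" for i in range(len(crops))],
}

objs = detect_text(np.zeros((8, 8)), stub)
print(objs)  # ['text_object_0', 'text_object_1']
```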
In one aspect, an embodiment of the present application provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the steps of the text detection model training method when executing the program.
In one aspect, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device; when the program runs on the computer device, it causes the computer device to execute the steps of the above text detection model training method.
In one aspect, the present application provides a computer program product including a computer program stored on a computer-readable storage medium, the computer program including program instructions that, when executed by a computer device, cause the computer device to perform the steps of the above text detection model training method.
In the embodiment of the application, joint iterative training is performed on the text detection model to be trained and the global semantic segmentation model, and the trained target text detection model is output. During training, the two models supervise and learn from each other, and their loss functions jointly optimize the gradient, which improves the accuracy and robustness of the trained target text detection model. Furthermore, the trained target text detection model obtains text objects by performing target detection on the image, without pixel-level feature clustering, thereby avoiding problems caused by the clustering radius and improving the accuracy and efficiency of text detection.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. It is obvious that the drawings in the following description are only some embodiments of the present invention, and that those skilled in the art can obtain other drawings based on them without inventive effort.
Fig. 1 is a schematic diagram of a system architecture according to an embodiment of the present application;
fig. 2 is a schematic diagram of an image to be processed according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of a detection result of an image to be processed according to an embodiment of the present disclosure;
FIG. 4 is a schematic diagram of a map application interface provided in an embodiment of the present application;
fig. 5 is a schematic flowchart of a text detection model training method according to an embodiment of the present disclosure;
FIG. 6 is a diagram illustrating a real text object according to an embodiment of the present application;
fig. 7 is a schematic network structure diagram of a text detection model according to an embodiment of the present application;
fig. 8 is a first flowchart illustrating a text detection method according to an embodiment of the present application;
fig. 9 is a schematic network structure diagram of a feature extraction module according to an embodiment of the present disclosure;
fig. 10 is a schematic network structure diagram of a global semantic segmentation model according to an embodiment of the present application;
fig. 11 is a flowchart illustrating a second text detection method according to an embodiment of the present application;
fig. 12 is a schematic network structure diagram of a text detection model and a global semantic segmentation model according to an embodiment of the present application;
fig. 13 is a schematic structural diagram of a text detection model training apparatus according to an embodiment of the present application;
fig. 14 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit it.
For convenience of understanding, terms referred to in the embodiments of the present invention are explained below.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the capabilities of perception, reasoning and decision making.
Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies theories and methods that enable effective communication between humans and computers in natural language, and is a science integrating linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use every day, so it is closely related to linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, robot question answering, knowledge graphs, and the like. For example, in the embodiment of the present application, target detection is performed on an image to be processed to obtain at least one target text object.
POI: the information of interest around a geographic location in a map may also be referred to as any meaningful point on the map with non-geographic significance, such as a store, a bar, a gas station, etc.
OCR: optical Character Recognition, refers to the process of an electronic device examining a printed Character on paper, determining its shape by detecting dark and light patterns, and then translating the shape into a computer text using Character Recognition methods.
The following is a description of the design concept of the embodiments of the present application.
In the related art, scene text recognition separates the foreground and background of an image to obtain the text regions, and then performs pixel-level feature clustering on those regions to obtain the text content. However, this scheme depends heavily on conditions such as the clustering radius, is sensitive to outliers, and a single clustering radius can hardly handle text of different sizes at the same time, resulting in low text detection accuracy.
In view of this, an embodiment of the present application provides a text detection model training method, including:
A sample image set is obtained, wherein each sample image contains at least one real text object. Joint iterative training is performed on the text detection model to be trained and the global semantic segmentation model based on the sample image set, and the trained target text detection model is output; wherein, in each iterative training process, the following operations are executed:
and carrying out image segmentation on the sample image through a global semantic segmentation model to obtain at least one second predicted text object. And then determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, at least one first predicted text object and corresponding predicted attribute information, and at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
After the target text detection model is obtained in the above manner, the image to be processed is input into the target text detection model for text detection, and at least one target text object in the image to be processed is obtained.
In the embodiment of the application, joint iterative training is performed on the text detection model to be trained and the global semantic segmentation model, and the trained target text detection model is output. During training, the two models supervise and learn from each other, and their loss functions jointly optimize the gradient, which improves the accuracy and robustness of the trained target text detection model. Furthermore, the trained target text detection model obtains text objects by performing target detection on the image, without pixel-level feature clustering, thereby avoiding problems caused by the clustering radius and improving the accuracy and efficiency of text detection.
Reference is made to fig. 1, which is a block diagram of a system architecture to which embodiments of the present application are applicable. The architecture comprises at least a terminal device 101 and a server 102. There may be one or more terminal devices 101 and one or more servers 102; the application does not particularly limit their numbers.
The terminal device 101 may have a target application installed therein, where the target application may be a client application, a web page version application, an applet application, or the like. In an actual application scenario, the target application may be any application with a text detection function. The terminal device 101 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, an intelligent voice interaction device, an intelligent household appliance, an intelligent sound box, an intelligent watch, an intelligent vehicle-mounted device, and the like, but is not limited thereto. The embodiment of the application can be applied to various scenes, including but not limited to the map field, vehicle-mounted scenes, cloud technology, artificial intelligence, intelligent traffic and driving assistance.
The server 102 may be a background server of the target application that provides corresponding services for it. The server 102 may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a Content Delivery Network (CDN), and big data and artificial intelligence platforms. The terminal device 101 and the server 102 may be directly or indirectly connected through wired or wireless communication, which is not limited herein.
The text detection model training method in the embodiment of the present application may be executed by the terminal device 101, may also be executed by the server 102, and may also be executed by the terminal device 101 and the server 102 interactively.
The following detailed description takes the case where the text detection model training method is executed by the server 102 as an example:
a model training stage:
the terminal device 101 sends a set of sample images, each containing at least one real text object, to the server 102. The server 102 performs joint iterative training on the text detection model to be trained and the global semantic segmentation model based on the sample image set, and outputs a trained target text detection model; wherein, in each iterative training process, the following operations are executed:
and carrying out image segmentation on the sample image through a global semantic segmentation model to obtain at least one second predicted text object. And then determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, at least one first predicted text object and corresponding predicted attribute information, and at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
After the target text detection model is obtained, the target text detection model is saved in the server 102.
And a text detection stage:
the user submits the image to be processed on the terminal device 101, and the terminal device 101 sends the image to be processed to the server 102. The server 102 inputs the image to be processed into a target text detection model for text detection, and at least one target text object in the image to be processed is obtained. The server 102 sends at least one target text object to the terminal device 101. The terminal device 101 presents at least one target text object on a display interface. Specifically, during the display, the target text object may be displayed separately, or the target text object may be displayed at a corresponding position in the image to be processed.
In practical applications, the text detection model training method in the embodiment of the present application is suitable for any scenario that needs to detect text objects in images, such as POI updating in a map, bus stop sign name recognition, and traffic sign recognition. The following takes updating POIs in a map as an example:
referring to fig. 2, in order to obtain a target text detection model for an image to be processed photographed near a target position a, after the target text detection model is obtained by using the text detection model training method in the embodiment of the present application, the image to be processed shown in fig. 2 is input into the target text detection model, a text object is detected for the image to be processed by using the target text detection model, and a detection result is output. The detection result is shown in fig. 3 and includes a target text box 201 and a target text box 202. Respectively carrying out text recognition on the target text box 201 and the target text box 202 to obtain that the text content in the target text box 201 is Chongqing special fish banker and the text content in the target text box 202 is Qianye flower shop. If the Chongqing characteristic fish banker and the spica do not exist in the POI of the target position A in the map, the Chongqing characteristic fish banker and the spica are added into the map as the POI of the target position.
When the user searches for information around the target position A in the map application, as shown in fig. 4, the map application displays surrounding information of the target position A such as "Chongqing Featured Fish Banker" and "Qianye Flower Shop".
Based on the system architecture diagram shown in fig. 1, an embodiment of the present application provides a flow of a text detection model training method. As shown in fig. 5, the flow may be executed by the terminal device 101 or the server 102 shown in fig. 1, or executed by the terminal device 101 and the server 102 interactively, and includes the following steps:
step S501, a sample image set is acquired.
Specifically, each sample image contains at least one real text object, which is a text object pre-annotated in the sample image and may be a polygon containing text content.
For example, as shown in fig. 6, the sample image includes two pre-annotated real text boxes, namely a real text box 601 and a real text box 602, where the text content contained in the real text box 601 is "Chongqing Featured Fish Banker" and the text content contained in the real text box 602 is "Qianye Flower Shop".
And step S502, performing joint iterative training on the text detection model to be trained and the global semantic segmentation model based on the sample image set, and outputting a trained target text detection model.
Specifically, in each iterative training process, the following operations are performed:
and carrying out image segmentation on the sample image through a global semantic segmentation model to obtain at least one second predicted text object. And then determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, at least one first predicted text object and corresponding predicted attribute information, and at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
In a specific implementation, the real attribute information includes real position information and a real category, and the predicted attribute information includes predicted position information and a predicted category. In each iterative training process, at least one sample image may be randomly selected from the sample image set without replacement for training; alternatively, the sample images in the set may be ordered in advance, and at least one sample image selected sequentially according to the ordering for training.
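The two sampling strategies mentioned above (random selection without replacement, or sequential selection after a fixed ordering) can be sketched as:

```python
import random

def batches_without_replacement(samples, batch_size, seed=0):
    """Randomly draw batches without replacement until every sample is used."""
    pool = list(samples)
    random.Random(seed).shuffle(pool)
    return [pool[i:i + batch_size] for i in range(0, len(pool), batch_size)]

def batches_in_order(samples, batch_size):
    """Alternative: order the set in advance and take batches sequentially."""
    return [samples[i:i + batch_size] for i in range(0, len(samples), batch_size)]

sample_images = [f"img_{i}" for i in range(5)]
random_batches = batches_without_replacement(sample_images, 2)
ordered_batches = batches_in_order(sample_images, 2)
print(len(random_batches), len(ordered_batches))  # 3 3
```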
In each iteration, the target loss value is used to adjust the parameters of the text detection model to be trained and the global semantic segmentation model, so that the first predicted text object and its predicted attribute information become increasingly similar to the real text object and its real attribute information, and the second predicted text object becomes increasingly similar to the real text object.
The end condition of the iterative training may be that the number of training iterations reaches a preset number, or that the target loss value satisfies a preset convergence condition.
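A minimal sketch of a joint-training loop with the two end conditions above (the step budget, tolerance, and `step_fn` interface are assumptions; `step_fn` stands for one joint update that returns the target loss value):

```python
def train_jointly(step_fn, max_steps=100, tol=1e-6):
    # Stop when the step budget is exhausted, or when the target loss
    # value changes by less than tol between iterations (convergence).
    prev = None
    for step in range(1, max_steps + 1):
        loss = step_fn()
        if prev is not None and abs(prev - loss) < tol:
            return step, loss
        prev = loss
    return max_steps, loss

# toy step function whose loss decays to a fixed point
losses = iter([1.0, 0.5, 0.25, 0.25, 0.25])
steps, final = train_jointly(lambda: next(losses))
```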
In the embodiment of the application, the text detection model to be trained and the global semantic segmentation model are jointly and iteratively trained, and the trained target text detection model is output. The two models supervise and learn from each other during training, and their loss functions jointly optimize the gradient, which improves the accuracy and robustness of the trained target text detection model. Secondly, the trained target text detection model obtains text objects by performing target detection on the image, without pixel-level feature clustering, so problems caused by the clustering radius are avoided and the accuracy and efficiency of text detection are improved.
Optionally, referring to fig. 7, a schematic diagram of the network structure of a text detection model provided in the embodiment of the present application is shown. The text detection model includes a feature extraction module, a candidate box prediction module, a feature clipping module, a first instance segmentation branch, a bounding box regression branch, and a bounding box classification branch.
Based on the above network structure, the embodiment of the present application performs target detection on a sample image through a text detection model in at least the following manner, to obtain at least one first predicted text object and corresponding predicted attribute information, as shown in fig. 8, and includes the following steps:
step S801 is to perform feature extraction on the sample image to obtain a plurality of sample feature images of different sizes.
Specifically, feature extraction is performed on the sample image through the feature extraction module to obtain a plurality of sample feature images of different sizes. The feature extraction module includes a backbone network and a multi-scale feature fusion module. The backbone network adopts a deep residual network such as ResNet-101, which guarantees that a deeper network can be trained, and the multi-scale feature fusion module adopts a Feature Pyramid Network (FPN) to achieve better fusion of image features.
The backbone network includes a plurality of identity blocks and convolution blocks. A convolution block includes several convolutional layers, normalization layers, and activation layers, and is used to obtain image features. Specifically, the lower convolutional layers extract basic image features such as edges and textures, while the upper convolutional layers abstractly combine these low-level texture features. The normalization layer normalizes the image features toward a standard normal distribution, and the activation layer applies a nonlinear mapping to the extracted image features, enhancing the generalization ability of the model.
The identity block includes a shortcut (direct connection) structure and an identity mapping, so it introduces no extra parameters and does not increase computational complexity, while ensuring effective gradient back-propagation so that the gradient does not vanish when training a deep network.
The multi-scale feature fusion module includes a plurality of upsampling blocks. Each upsampling block receives not only the image features output by the previous upsampling step but also the image features of the same size extracted by the backbone network. To better fuse the feature information, the two sets of image features are added in the upsampling block and a convolution operation is applied for information fusion, obtaining the feature image.
For example, referring to fig. 9, a network structure schematic diagram of the feature extraction module provided in the embodiment of the present application includes a backbone network and a multi-scale feature fusion module. The backbone network includes 4 residual modules, namely residual module 1, residual module 2, residual module 3 and residual module 4, each of which includes an identity block and a convolution block. The multi-scale feature fusion module includes 3 upsampling blocks, namely upsampling block 1, upsampling block 2 and upsampling block 3.
The sample image is input into the backbone network. Residual module 1 downsamples the sample image to obtain feature image C2 and inputs it to residual module 2. Residual module 2 downsamples C2 to obtain feature image C3 and inputs it to residual module 3. Residual module 3 downsamples C3 to obtain feature image C4 and inputs it to residual module 4. Residual module 4 downsamples C4 to obtain feature image C5. After a 1x1 convolution is applied to feature image C5, feature image M5 is obtained and input to upsampling block 1.
Upsampling block 1 upsamples feature image M5 to obtain feature image M4, which is input to upsampling block 2. Upsampling block 2 upsamples M4 to obtain feature image M3, which is input to upsampling block 3. Upsampling block 3 upsamples M3 to obtain feature image M2. Feature image C4 is the same size as feature image M4, C3 the same size as M3, and C2 the same size as M2.
The feature image M5 is subjected to 3x3 convolution processing, and a sample feature image P1 is obtained. After performing 1x1 convolution processing on the feature image C4, fusing the feature image C4 with the feature image M4, and performing 3x3 convolution processing on the image obtained through fusion to obtain a sample feature image P2. After performing 1x1 convolution processing on the feature image C3, fusing the feature image C3 with the feature image M3, and performing 3x3 convolution processing on the image obtained through fusion to obtain a sample feature image P3. And after carrying out 1x1 convolution processing on the feature image C2, fusing the feature image C2 with the feature image M2, and then carrying out 3x3 convolution processing on the image obtained by fusion to obtain a sample feature image P4, wherein the sample feature image P1, the sample feature image P2, the sample feature image P3 and the sample feature image P4 are sample feature images with different sizes.
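The C→M→P dataflow above can be sketched shape-wise in NumPy. This is a minimal sketch only: the 1x1 and 3x3 convolutions are omitted, and nearest-neighbour upsampling stands in for the upsampling blocks.

```python
import numpy as np

def upsample2x(x):
    # nearest-neighbour 2x upsampling, standing in for an upsampling block
    return x.repeat(2, axis=0).repeat(2, axis=1)

# C2..C5: backbone outputs; spatial size halves at each residual module
C = {l: np.ones((2 ** (8 - l), 2 ** (8 - l))) for l in (2, 3, 4, 5)}
M = {5: C[5]}                  # M5 = 1x1 conv of C5 (conv omitted here)
for l in (4, 3, 2):            # top-down: upsample, add the lateral feature
    M[l] = upsample2x(M[l + 1]) + C[l]
P = {l: M[l] for l in (5, 4, 3, 2)}  # P maps = 3x3 conv of M (omitted here)
```

Note how the shape constraint from the description (C4 matches M4, C3 matches M3, C2 matches M2) is exactly what makes the elementwise addition valid.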
Step S802, cutting out a plurality of corresponding initial text box images from the plurality of sample feature images.
Specifically, sample text boxes are determined from the sample feature images through the candidate box prediction module. Each pixel in a sample feature image can generate a plurality of sample text boxes of different sizes, and the size of each sample text box is determined by two parameters, scale and aspect ratio. The candidate box prediction module may be a Region Proposal Network (RPN).
Because the feature extraction module extracts a plurality of sample feature images with different sizes, initial text boxes with different sizes can be cut in the sample feature images with different sizes. Specifically, for a plurality of sample feature images, the following steps are respectively performed:
a text box size matching an image size of a sample feature image is determined, and then a plurality of initial text boxes corresponding to the text box size are generated in the sample feature image. And then cutting a plurality of initial text box images from the sample characteristic image based on a plurality of initial text boxes.
In a specific implementation, a large-size sample feature image is matched with large-size initial text boxes, and a small-size sample feature image with small-size initial text boxes. In this way, initial text boxes corresponding to small text objects can be cut from the small-size sample feature images, and initial text boxes corresponding to large text objects from the large-size sample feature images, which improves detection accuracy for text objects of different sizes and reduces subsequent filtering of sample text boxes.
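One concrete size-matching rule is the FPN-style level-assignment heuristic sketched below. The constants (canonical size 224, level range 2-5) are assumptions for illustration, not values stated in this application.

```python
import math

def match_pyramid_level(box_w, box_h, canonical=224, canonical_level=4,
                        lo=2, hi=5):
    # Larger boxes map to higher (coarser) pyramid levels, smaller boxes
    # to lower (finer) levels; the result is clamped to the level range.
    level = canonical_level + math.log2(math.sqrt(box_w * box_h) / canonical)
    return max(lo, min(hi, math.floor(level)))
```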
In step S803, the plurality of initial text box images are adjusted to the same size, and a plurality of sample text box images are obtained.
Specifically, the following steps are respectively executed for a plurality of initial text box images through a feature clipping module:
An initial text box image is divided into a plurality of candidate regions of the same size according to a preset division ratio. Bilinear interpolation is then performed at a number of sampling points within each candidate region to obtain 4 sampled pixel values per region, and the maximum of these 4 values is taken as the target pixel value of that region. The resized sample text box image is obtained from the target pixel values of all candidate regions. Because the pixel values at the four sampling-point coordinates are obtained through bilinear interpolation, a discontinuous operation becomes continuous, the error when mapping back to the original image is smaller, and the consistency of feature dimensions is ensured.
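The per-region computation above (bilinear interpolation at sampling points, then a max over the 4 samples) can be sketched as follows; the function names are illustrative:

```python
import numpy as np

def bilinear(img, y, x):
    # interpolate img at a continuous coordinate (y, x)
    y0, x0 = int(np.floor(y)), int(np.floor(x))
    y1, x1 = min(y0 + 1, img.shape[0] - 1), min(x0 + 1, img.shape[1] - 1)
    dy, dx = y - y0, x - x0
    return ((1 - dy) * (1 - dx) * img[y0, x0] + (1 - dy) * dx * img[y0, x1]
            + dy * (1 - dx) * img[y1, x0] + dy * dx * img[y1, x1])

def region_value(img, sample_points):
    # 4 sampling points per candidate region; keep the maximum sample
    return max(bilinear(img, y, x) for y, x in sample_points)

img = np.array([[0.0, 1.0], [2.0, 3.0]])
v = bilinear(img, 0.5, 0.5)   # centre of the 2x2 grid
```

Because (y, x) may be non-integer, the operation is continuous in the box coordinates, which is the property the text attributes to bilinear interpolation.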
Step S804, performing instance segmentation on the plurality of sample text box images to obtain at least one first predicted text object.
Specifically, an instance in the embodiment of the present application represents one text object, and each sample text box image corresponds to one text object. Through the first instance segmentation branch, a Fully Convolutional Network (FCN) processes the resized sample text box images to obtain the first predicted text objects. Because one first predicted text object may correspond to several sample text box images, after instance segmentation produces multiple segmentation results, a non-maximum suppression algorithm removes duplicate results, and the retained results are output as the first predicted text objects.
Step S805, performing attribute prediction on the plurality of sample text box images to obtain prediction attribute information corresponding to each of the at least one first prediction text object.
Specifically, the predicted attribute information includes predicted position information and a predicted category. Bounding box regression is performed on the plurality of sample text box images through the bounding box regression branch to obtain the predicted position information corresponding to each of the at least one first predicted text object; the predicted position information may be the position coordinates of the predicted text object in the sample image. When the predicted text object is a quadrilateral text box, the predicted position information may be the position coordinates of the four corners of the text box.
Bounding box classification is performed on the plurality of sample text box images through the bounding box classification branch to obtain the predicted category corresponding to each of the at least one first predicted text object. The predicted categories may be two classes (text and non-text), three classes (clear text, blurred text, and non-text), or other class sets, which is not specifically limited in this application.
In the embodiment of the application, feature extraction on the sample image yields a plurality of sample feature images of different sizes, and the corresponding sample text box images obtained from them are used for text object detection, improving detection accuracy for text objects of different sizes. Multi-dimensional text detection is performed on the sample text box images through the instance segmentation branch, the bounding box regression branch and the bounding box classification branch, so the model can be parameter-adjusted with multiple loss values during training, improving the performance of the trained target text detection model.
Optionally, referring to fig. 10, a network structure diagram of a global semantic segmentation model provided in the embodiment of the present application is shown. The global semantic segmentation model includes a feature extraction module, a semantic segmentation branch, a feature learning module, and a second instance segmentation branch.
Based on the above network structure, the embodiment of the present application performs image segmentation on a sample image through a global semantic segmentation model to obtain at least one second predicted text object, as shown in fig. 11, and includes the following steps:
step S1101, performing feature extraction on the sample image, and obtaining a target sample feature image.
Specifically, feature extraction is performed on the sample image through a feature extraction module, and a target sample feature image is obtained. The global semantic segmentation model and the text detection model may correspond to the same feature extraction module, and the process of extracting features by the feature extraction module is described in the foregoing, and is not described herein again.
The target sample feature image may be: and performing feature extraction on the sample image to obtain one sample feature image in a plurality of sample feature images with different sizes. For example, the target sample feature image is the sample feature image P4 in fig. 9.
Step S1102, performing semantic segmentation on the target sample feature image to obtain a predicted global category corresponding to each pixel in the sample image.
Specifically, the predicted global category includes a foreground category and a background category. And performing semantic segmentation on the target sample characteristic image through a semantic segmentation branch to obtain a prediction global category corresponding to each pixel in the sample image, wherein 1 is used for representing a foreground category, and 0 is used for representing a background category. Of course, other numerical values may also be used to represent the foreground category and the background category in the embodiment of the present application, and are not specifically limited herein.
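A minimal sketch of the per-pixel foreground/background decision, assuming the semantic segmentation branch emits one logit per pixel and a sigmoid-plus-threshold rule (the activation and threshold are assumptions, not stated in this application):

```python
import numpy as np

def predict_global_category(logits, threshold=0.5):
    # sigmoid probability per pixel; 1 = foreground, 0 = background
    prob = 1.0 / (1.0 + np.exp(-logits))
    return (prob > threshold).astype(int)

mask = predict_global_category(np.array([[-2.0, 3.0], [0.1, -0.1]]))
```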
Step S1103 generates a feature vector corresponding to each pixel in the sample image based on the target sample feature image.
Specifically, an 8-dimensional feature vector is generated for each pixel by the feature learning module.
Step S1104, performing instance segmentation on the sample image based on the predicted global category and the feature vector corresponding to each pixel, to obtain at least one second predicted text object.
Specifically, the second instance segmentation branch determines, from all pixels, the target pixels whose predicted global category is the foreground category, and then clusters the target pixels based on their feature vectors to obtain at least one target pixel set and the predicted text object label corresponding to each set. At least one second predicted text object is obtained based on the at least one target pixel set and its corresponding predicted text object label.
In a specific implementation, after the target pixels whose predicted global category is the foreground category are determined from all pixels, they are combined with the corresponding feature vectors to obtain the feature vectors of all foreground target pixels, that is, the pixel features of all text objects. Each text object corresponds to a text object label; for example, different text objects are represented by different numbers.
Based on the feature vectors of the target pixels, a density clustering algorithm clusters the target pixels so that target pixels of the same text object are as similar as possible and target pixels of different text objects are as distinct as possible. Finally, the second predicted text objects are determined from the clustered target pixel sets and their corresponding predicted text object labels.
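The clustering step can be sketched with a greedy distance-threshold rule over the 8-dimensional pixel embeddings. This is a minimal stand-in for the density clustering named above (a real implementation would use e.g. DBSCAN); the function name and `eps` value are illustrative.

```python
import numpy as np

def cluster_foreground_pixels(features, eps=1.0):
    # Assign each foreground pixel's feature vector to the first existing
    # cluster centre within eps; otherwise start a new cluster. Each
    # resulting cluster index plays the role of a predicted text object label.
    labels, centres = [], []
    for f in features:
        for k, c in enumerate(centres):
            if np.linalg.norm(f - c) < eps:
                labels.append(k)
                break
        else:
            centres.append(f)
            labels.append(len(centres) - 1)
    return labels

feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
labels = cluster_foreground_pixels(feats)
```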
In the embodiment of the application, during training, at least one second predicted text object is detected from the sample image by combining semantic segmentation and instance segmentation, so that when model parameters are adjusted, the two loss functions of semantic segmentation and instance segmentation jointly optimize gradient learning, improving the effect of joint training.
Optionally, during each iteration, a first loss value is determined based on at least one real text object and corresponding real attribute information, and at least one first predicted text object and corresponding predicted attribute information in the sample image. A second loss value is then determined based on the at least one real text object and the at least one second predicted text object in the sample image. And then determining a target loss value based on the first loss value and the second loss value.
Specifically, after target detection is performed on a sample image through a text detection model to obtain at least one first predicted text object and corresponding predicted attribute information, a first loss value is determined based on the at least one real text object and corresponding real attribute information in the sample image and the at least one first predicted text object and corresponding predicted attribute information.
And after the sample image is subjected to image segmentation through the global semantic segmentation model to obtain at least one second predicted text object, determining a second loss value based on at least one real text object in the sample image and the at least one second predicted text object.
And then combining a first loss value corresponding to the text detection model and a second loss value corresponding to the global semantic segmentation model to obtain a target loss value, and performing parameter adjustment on the text detection model and the global semantic segmentation model by adopting the target loss value.
In the embodiment of the application, in the process of performing combined iterative training on the text detection model and the global semantic segmentation model to be trained, parameter adjustment is performed on the text detection model and the global semantic segmentation model by combining respective corresponding loss values of the text detection model and the global semantic segmentation model, so that mutual supervision and learning of the text detection model and the global semantic segmentation model is realized, and thus the robustness and the accuracy of text detection are improved.
Optionally, the text detection model includes three branches, namely the first instance segmentation branch, the bounding box regression branch, and the bounding box classification branch. The results output by the three branches can all guide model training, so the first loss value corresponding to the text detection model may be determined based on the loss values corresponding to the three branches, as follows:
a first instance segmentation loss value is determined based on the at least one real text object and the at least one first predicted text object. A position loss value is determined based on the respective real position information of the at least one real text object and the respective predicted position information of the at least one first predicted text object. A category loss value is determined based on a real category to which each of the at least one real text object corresponds and a prediction category to which each of the at least one first predicted text object corresponds. A first penalty value is determined based on the first instance segmentation penalty value, the location penalty value, and the category penalty value.
In a specific implementation, each real text object corresponds to a predicted text object. A first instance segmentation loss value is determined with a loss function based on the at least one real text object and the corresponding predicted text objects. The smaller the first instance segmentation loss value, the closer the predicted text objects produced by the text detection model are to the real text objects.
A position loss value is determined with a loss function based on the real position information of the at least one real text object and the predicted position information of the corresponding predicted text objects. The smaller the position loss value, the closer the predicted position information produced by the text detection model is to the real position information.
A category loss value is determined with a loss function based on the real category corresponding to each of the at least one real text object and the predicted category of the corresponding predicted text object. The smaller the category loss value, the closer the predicted categories produced by the text detection model are to the real categories. The loss functions used to calculate the first instance segmentation loss value, the position loss value, and the category loss value may be the same or different.
It should be noted that the manner of determining the first loss value in the embodiment of the present application is not limited to the above. The first instance segmentation loss value alone may be used as the first loss value; alternatively, the first loss value may be determined based on the first instance segmentation loss value and the position loss value, or based on the first instance segmentation loss value and the category loss value; other implementations are also possible, and the present application is not specifically limited.
In the embodiment of the application, the model is parameter-adjusted by combining the instance segmentation loss value of the instance segmentation branch, the position loss value of the bounding box regression branch, and the category loss value of the bounding box classification branch, which optimizes gradient learning during training and provides effective guidance for model training, improving the performance of the trained target text detection model and the accuracy of text detection.
Optionally, the global semantic segmentation model includes two branches, namely the semantic segmentation branch and the second instance segmentation branch. The results output by the two branches can both guide model training, so the second loss value corresponding to the global semantic segmentation model may be determined based on the loss values corresponding to the two branches, as follows:
A semantic segmentation loss value is determined based on the real global category of each pixel corresponding to the at least one real text object in the sample image and the predicted global category of each pixel corresponding to the at least one second predicted text object. A second instance segmentation loss value is determined based on the real text object label of each pixel corresponding to the at least one real text object and the predicted text object label of each pixel corresponding to the at least one second predicted text object. A second loss value is then determined based on the semantic segmentation loss value and the second instance segmentation loss value.
Specifically, before model training, a real global category and a real text object label corresponding to each pixel in a sample image are marked, wherein the real global category comprises a real foreground category and a real background category, the real foreground category is represented by 1, and the real background category is represented by 0. Different real text object tags are represented by different numbers.
And determining a semantic segmentation loss value by adopting a cross entropy loss function based on the real global class of each pixel corresponding to at least one real text object in the sample image and the predicted global class of each pixel corresponding to the corresponding second predicted text object.
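The cross-entropy computation over the 1/0 global categories can be sketched as follows (a binary formulation is assumed, since the global category here has exactly two classes):

```python
import numpy as np

def pixel_cross_entropy(p_fg, y):
    # binary cross-entropy between the predicted foreground probability
    # and the 1/0 real global category, averaged over all pixels
    p = np.clip(p_fg, 1e-7, 1 - 1e-7)   # avoid log(0)
    return float(np.mean(-(y * np.log(p) + (1 - y) * np.log(1 - p))))

good = pixel_cross_entropy(np.array([0.99, 0.01]), np.array([1, 0]))
bad = pixel_cross_entropy(np.array([0.01, 0.99]), np.array([1, 0]))
```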
A second instance segmentation loss value is determined with a loss function based on the real text object label of each pixel corresponding to the at least one real text object in the sample image and the predicted text object label of each pixel corresponding to the corresponding second predicted text object. The second instance segmentation loss value includes an intra-class aggregation loss and an inter-class discrimination loss: the intra-class aggregation loss drives pixels of the same text object to be as similar as possible, and the inter-class discrimination loss drives pixels of different text objects to differ as much as possible.
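One common concrete form of such a two-term loss is the hinged discriminative loss (after De Brabandere et al., 2017); this formulation and its margins are assumptions, as this application does not specify the exact loss function.

```python
import numpy as np

def discriminative_loss(emb, labels, delta_v=0.5, delta_d=1.5):
    # intra-class term: pull each pixel toward its text object's mean vector;
    # inter-class term: push the mean vectors of different text objects apart
    means = {l: emb[labels == l].mean(axis=0) for l in np.unique(labels)}
    pull = np.mean([
        np.mean(np.maximum(0.0, np.linalg.norm(emb[labels == l] - mu, axis=1)
                           - delta_v) ** 2)
        for l, mu in means.items()])
    ms = list(means.values())
    push_terms = [max(0.0, delta_d - np.linalg.norm(ms[i] - ms[j])) ** 2
                  for i in range(len(ms)) for j in range(i + 1, len(ms))]
    push = np.mean(push_terms) if push_terms else 0.0
    return float(pull + push)

# two tight, well-separated text objects incur zero loss under the margins
emb = np.array([[0.0, 0.0], [0.2, 0.0], [4.0, 4.0], [4.2, 4.0]])
labels = np.array([0, 0, 1, 1])
loss = discriminative_loss(emb, labels)
```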
A second loss value is determined based on the semantic segmentation loss value and the second instance segmentation loss value; alternatively, the second loss value may be determined based on the semantic segmentation loss value, the second instance segmentation loss value, and a regularization loss value for model complexity.
In the embodiment of the application, the model is parameter-adjusted through the semantic segmentation loss value of the semantic segmentation branch and the instance segmentation loss value of the instance segmentation branch, which optimizes gradient learning during training and provides effective guidance for model training. Meanwhile, introducing the global semantic segmentation model to guide the training of the text detection model improves the training effect compared with training the text detection model alone, further improving the robustness and accuracy of text detection.
In order to better explain the embodiment of the present application, a text detection model training method provided by the embodiment of the present application is described below with reference to a specific implementation scenario, where a flow of the method may be executed by the terminal device 101 or the server 102 shown in fig. 1, or may be executed by the terminal device 101 and the server 102 interactively.
Referring to fig. 12, a network structure diagram of the text detection model and the global semantic segmentation model provided in the embodiment of the present application includes a feature extraction module, a target detection branch, and a global segmentation branch. The feature extraction module includes a backbone network and a multi-scale feature fusion module, and the target detection branch includes a candidate box prediction module, a feature clipping module, a first instance segmentation branch, a bounding box regression branch, and a bounding box classification branch. The global segmentation branch includes a semantic segmentation branch, a feature learning module, and a second instance segmentation branch.
In the training process, a set of sample images is obtained, wherein each sample image contains at least one real text object labeled in advance and real position information and a real category of each real text object. In addition, real global categories and real text object labels of pixels corresponding to each real text object are marked in advance, wherein the real global categories comprise real foreground categories and real background categories, each real text object label identifies one text object, and different real text object labels are represented by different numbers.
Based on the sample image set, joint iterative training is performed on the text detection model to be trained and the global semantic segmentation model, and the trained target text detection model is output, wherein the following operations are performed in each iterative training process:
a feature extraction module: and performing feature extraction on the sample image to obtain a plurality of sample feature images with different sizes.
Target detection branch: a plurality of corresponding initial text box images are cut from the plurality of sample feature images through the candidate box prediction module, which may be a Region Proposal Network. The feature clipping module adjusts the initial text box images to the same size using ROI Align, obtaining a plurality of sample text box images. Through the first instance segmentation branch, a fully convolutional network performs instance segmentation on the sample text box images to obtain at least one first predicted text object. A first instance segmentation loss value is determined with a loss function based on the at least one real text object and the corresponding predicted text objects.
Bounding box regression is performed on the plurality of sample text box images through the bounding box regression branch to obtain the predicted position information corresponding to each of the at least one first predicted text object. A position loss value is determined with a loss function based on the real position information of the at least one real text object and the predicted position information of the corresponding predicted text objects.
Bounding box classification is performed on the plurality of sample text box images through the bounding box classification branch to obtain the predicted category corresponding to each of the at least one first predicted text object. A category loss value is determined with a loss function based on the real category of each of the at least one real text object and the predicted category of the corresponding predicted text objects.
Global segmentation branch: the target sample feature image is obtained from the plurality of sample feature images of different sizes. Semantic segmentation is performed on the target sample feature image through the semantic segmentation branch to obtain the predicted global category corresponding to each pixel in the sample image. A semantic segmentation loss value is determined with a cross-entropy loss function based on the real global category of each pixel corresponding to the at least one real text object in the sample image and the predicted global category of each pixel corresponding to the corresponding second predicted text object.
A feature learning module generates a feature vector corresponding to each pixel in the sample image based on the target sample feature image. The processing results of the semantic segmentation branch and the feature learning module are merged and input into a second instance segmentation branch. The second instance segmentation branch determines, from all pixels, the target pixels whose predicted global category is the foreground category, and clusters the target pixels based on their feature vectors to obtain at least one target pixel set and a predicted text object label corresponding to each target pixel set. At least one second predicted text object is then obtained from the at least one target pixel set and the corresponding predicted text object labels. A second instance segmentation loss value is determined using a loss function based on the real text object label of each pixel corresponding to the at least one real text object in the sample image and the predicted text object label of each pixel corresponding to the corresponding second predicted text object.
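The foreground-pixel clustering performed by the second instance segmentation branch can be sketched with a toy greedy algorithm: pixels whose feature vectors lie within a fixed radius of a cluster's first member share a predicted text object label. The clustering method and the radius are not fixed by the embodiment; everything below is an illustrative stand-in.

```python
import numpy as np

def cluster_foreground(global_cls, features, radius=0.5):
    """Label each foreground pixel (global_cls == 1) with an instance id by
    greedily grouping pixels whose feature vectors lie within `radius` of a
    cluster's first member; background pixels keep the label -1."""
    labels = -np.ones(len(global_cls), dtype=int)
    centers = []
    for i in np.where(global_cls == 1)[0]:
        for k, center in enumerate(centers):
            if np.linalg.norm(features[i] - center) < radius:
                labels[i] = k
                break
        else:                                    # no nearby cluster: start one
            centers.append(features[i])
            labels[i] = len(centers) - 1
    return labels

cls = np.array([1, 1, 0, 1, 1])                       # per-pixel global class
feat = np.array([[0.0], [0.1], [5.0], [3.0], [3.1]])  # per-pixel feature vectors
instance_labels = cluster_foreground(cls, feat)       # two target pixel sets
```

Each distinct non-negative label corresponds to one target pixel set, i.e. one second predicted text object.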
A target loss value is determined based on the first instance segmentation loss value, the position loss value, the category loss value, the semantic segmentation loss value, and the second instance segmentation loss value, and the parameters of the text detection model and the global semantic segmentation model are then adjusted using the target loss value.
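Combining the five branch losses into the target loss value can be sketched as follows. The equal weighting and the numeric values are assumptions, since the embodiment does not specify how the losses are combined.

```python
# Hypothetical per-branch loss values; the names mirror the five losses above.
losses = {
    "first_instance_seg": 0.42,
    "position": 0.18,
    "category": 0.09,
    "semantic_seg": 0.31,
    "second_instance_seg": 0.27,
}
# Equal weights are an assumption; the embodiment does not fix the weighting.
weights = {name: 1.0 for name in losses}
target_loss = sum(weights[name] * value for name, value in losses.items())
print(round(target_loss, 2))  # 1.27 — used to adjust both models' parameters
```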
When the number of training iterations reaches a preset number, iterative training stops and the trained target text detection model is output, where the target text detection model includes the feature extraction module and the target detection branch but does not include the global segmentation branch.
In the embodiment of the application, the text detection model to be trained and the global semantic segmentation model undergo joint iterative training, and the trained target text detection model is output. During training the two models supervise each other, and the loss functions corresponding to the two models jointly optimize the gradient, improving the accuracy and robustness of the trained target text detection model. Second, the trained target text detection model obtains text objects by performing target detection on the image, without pixel-level feature clustering, which avoids the problems caused by the clustering radius and improves the accuracy and efficiency of text detection.
Based on the system architecture diagram shown in fig. 1, an embodiment of the present application provides a flow of a text detection method, where the flow of the method may be executed by the terminal device 101 or the server 102 shown in fig. 1, or may be executed by the terminal device 101 and the server 102 interactively, and includes the following steps:
acquiring an image to be processed, inputting the image to be processed into the target text detection model for text detection, and obtaining at least one target text object in the image to be processed.
Specifically, the target text object may be a polygon whose interior contains text content. After the image to be processed is input into the target text detection model, the model performs feature extraction on the image to be processed to obtain a plurality of target feature images of different sizes, and cuts out a plurality of corresponding candidate text box images from the plurality of target feature images. The target text detection model adjusts the plurality of candidate text box images to the same size to obtain a plurality of target text box images, and then performs instance segmentation on the plurality of target text box images to obtain the at least one target text object. The process by which the target text detection model performs text detection on the image to be processed is the same as the process, described above, by which the text detection model to be trained performs text detection on the sample image, and is not repeated here.
In the embodiment of the application, the text detection model to be trained and the global semantic segmentation model undergo joint iterative training, and the trained target text detection model is output. During training the two models supervise each other, and the loss functions corresponding to the two models jointly optimize the gradient, improving the accuracy and robustness of the trained target text detection model and thereby further improving text detection accuracy. Second, the trained target text detection model obtains target text objects by performing target detection on the image to be processed, without pixel-level feature clustering, which avoids the problems caused by the clustering radius and improves the accuracy and efficiency of text detection.
In one possible embodiment, the image to be processed is a peripheral image of a target location. The peripheral image is input into the target text detection model for text detection; after at least one target text object in the peripheral image is obtained, text recognition is performed on each target text object in the peripheral image to obtain the text content in the target text object. The peripheral information of the target location in the map application is then updated based on the obtained text contents.
Specifically, OCR is used to perform text recognition on each target text object to obtain the text content in the target text object, where the text content may be, for example, the name of a store, a gas station, or a bank. It is then determined whether the peripheral information of the target location in the map application already contains the text content; if not, the text content is added to the map application as peripheral information of the target location.
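The update rule just described, namely add the recognized text to the target location's peripheral information only if it is not already present, can be sketched as follows. The place names are hypothetical.

```python
def update_surrounding_info(surrounding_info, recognized_texts):
    """Add each recognized text to the target location's peripheral
    information in the map application only if it is not already present."""
    for text in recognized_texts:
        if text not in surrounding_info:
            surrounding_info.append(text)
    return surrounding_info

# Hypothetical OCR results from target text objects in a peripheral image.
info = ["Sunrise Bakery"]
detected = ["Sunrise Bakery", "Hilltop Gas Station"]
updated = update_surrounding_info(info, detected)
print(updated)  # ['Sunrise Bakery', 'Hilltop Gas Station']
```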
When points of interest (POIs) in a map are updated, factors such as inconsistent text distribution and scale, cluttered backgrounds, poor brightness and contrast, occlusion, illumination changes, and perspective deformation in the collected peripheral images degrade text detection accuracy, which in turn affects the accuracy of subsequent POI updates. In the embodiment of the application, joint iterative training is performed on the text detection model to be trained and the global semantic segmentation model, the trained target text detection model is output, and text detection is then performed on the peripheral image using the target text detection model. This effectively improves the accuracy of text detection on peripheral images, thereby improving the accuracy and efficiency of POI updates, keeping map data fresh, and providing a better user experience.
Based on the same technical concept, an embodiment of the present application provides a schematic structural diagram of a text detection model training apparatus, as shown in fig. 13, the apparatus 1300 includes:
a first obtaining module 1301, configured to obtain a sample image set, where each sample image includes at least one real text object;
a model training module 1302, configured to perform joint iterative training on a text detection model to be trained and a global semantic segmentation model based on the sample image set, and output a trained target text detection model; wherein, in each iterative training process, the following operations are executed:
performing target detection on a sample image through the text detection model to obtain at least one first predicted text object and corresponding predicted attribute information, and performing image segmentation on the sample image through the global semantic segmentation model to obtain at least one second predicted text object;
and determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, the at least one first predicted text object and corresponding predicted attribute information, and the at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
Optionally, the model training module 1302 is specifically configured to:
performing feature extraction on the sample image to obtain a plurality of sample feature images of different sizes;
cutting out a plurality of corresponding initial text box images from the plurality of sample feature images, and adjusting the plurality of initial text box images to the same size to obtain a plurality of sample text box images;
performing instance segmentation on the plurality of sample text box images to obtain the at least one first predicted text object;
and performing attribute prediction on the plurality of sample text box images to obtain prediction attribute information corresponding to the at least one first predicted text object.
Optionally, the prediction attribute information includes prediction location information and a prediction category;
the model training module 1302 is specifically configured to:
performing bounding-box regression on the plurality of sample text box images to obtain predicted position information corresponding to the at least one first predicted text object;
and performing box classification on the plurality of sample text box images to obtain a prediction category corresponding to each of the at least one first predicted text object.
Optionally, the model training module 1302 is specifically configured to:
for the plurality of sample feature images, respectively executing the following steps:
determining a text box size matching the image size of one sample feature image;
generating, in the one sample feature image, a plurality of initial text boxes corresponding to the text box size;
and cutting a plurality of initial text box images from the one sample feature image based on the plurality of initial text boxes.
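The size-matching step above can be illustrated with a simple inverse mapping: coarser (smaller) feature maps receive larger text box sizes. The specific mapping and constants below are assumptions; the embodiment only states that the text box size matches the image size of the feature map.

```python
def text_box_size_for(feature_size, base_size=256, base_feature_size=64):
    """Pick a text box size matched to a feature map's scale: smaller
    (coarser) feature maps receive larger boxes. Constants are illustrative."""
    return base_size * base_feature_size // feature_size

for feature_size in (64, 32, 16):            # three sample feature image sizes
    print(feature_size, text_box_size_for(feature_size))
# prints: 64 256, then 32 512, then 16 1024
```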
Optionally, the model training module 1302 is specifically configured to:
performing feature extraction on the sample image to obtain a target sample feature image;
performing semantic segmentation on the target sample feature image to obtain a predicted global category corresponding to each pixel in the sample image;
generating a feature vector corresponding to each pixel in the sample image based on the target sample feature image;
and performing instance segmentation on the sample image based on the predicted global category and the feature vector corresponding to each pixel to obtain the at least one second predicted text object.
Optionally, the model training module 1302 is specifically configured to:
determining, from the pixels, target pixels whose predicted global category is the foreground category;
clustering each target pixel based on the feature vector of each target pixel to obtain at least one target pixel set and a predicted text object label corresponding to each target pixel set;
and obtaining the at least one second predicted text object based on the at least one target pixel set and the predicted text object label corresponding to each target pixel set.
Optionally, the model training module 1302 is specifically configured to:
determining a first loss value based on at least one real text object and corresponding real attribute information in the sample image and the at least one first predicted text object and corresponding predicted attribute information;
determining a second loss value based on at least one real text object in the sample image and the at least one second predicted text object;
determining the target loss value based on the first loss value and the second loss value.
Optionally, the real attribute information includes real position information and real category, and the predicted attribute information includes predicted position information and predicted category;
the model training module 1302 is specifically configured to:
determining a first instance segmentation loss value based on the at least one real text object and the at least one first predicted text object;
determining a position loss value based on the real position information corresponding to the at least one real text object and the predicted position information corresponding to the at least one first predicted text object;
determining a category loss value based on a real category corresponding to each of the at least one real text object and a prediction category corresponding to each of the at least one first predicted text object;
determining the first loss value based on the first instance segmentation loss value, the position loss value, and the category loss value.
Optionally, the model training module 1302 is specifically configured to:
determining a semantic segmentation loss value based on a real global class of each pixel corresponding to the at least one real text object and a predicted global class of each pixel corresponding to the at least one second predicted text object in the sample image;
determining a second instance segmentation loss value based on the real text object label of each pixel corresponding to the at least one real text object in the sample image and the predicted text object label of each pixel corresponding to the at least one second predicted text object;
determining the second loss value based on the semantic segmentation loss value and the second instance segmentation loss value.
Optionally, a model prediction module 1303 is further included;
the model prediction module 1303 is specifically configured to:
acquiring an image to be processed after joint iterative training has been performed on the text detection model to be trained and the global semantic segmentation model based on the sample image set and the trained target text detection model has been output;
extracting the features of the image to be processed by adopting the target text detection model to obtain a plurality of target feature images with different sizes, and cutting out a plurality of corresponding candidate text box images from the plurality of target feature images;
adjusting the images of the candidate text boxes to be the same size by adopting the target text detection model to obtain a plurality of target text box images;
and performing instance segmentation on the plurality of target text box images using the target text detection model to obtain the at least one target text object.
In the embodiment of the application, the text detection model to be trained and the global semantic segmentation model undergo joint iterative training, and the trained target text detection model is output. During training the two models supervise each other, and the loss functions corresponding to the two models jointly optimize the gradient, improving the accuracy and robustness of the trained target text detection model. Second, the trained target text detection model obtains text objects by performing target detection on the image, without pixel-level feature clustering, which avoids the problems caused by the clustering radius and improves the accuracy and efficiency of text detection.
Based on the same technical concept, an embodiment of the present application provides a computer device, which may be the terminal device and/or the server shown in fig. 1. As shown in fig. 14, the computer device includes at least one processor 1401 and a memory 1402 connected to the at least one processor. The specific connection medium between the processor 1401 and the memory 1402 is not limited in the embodiment of the present application; in fig. 14, the processor 1401 and the memory 1402 are connected through a bus, as an example. The bus may be divided into an address bus, a data bus, a control bus, and so on.
In the embodiment of the present application, the memory 1402 stores instructions executable by the at least one processor 1401, and the at least one processor 1401 can execute the steps of the text detection model training method by executing the instructions stored in the memory 1402.
The processor 1401 is the control center of the computer device. It may connect the various parts of the computer device using various interfaces and lines, and performs text detection model training and text detection by running or executing the instructions stored in the memory 1402 and calling up the data stored in the memory 1402. Optionally, the processor 1401 may include one or more processing units and may integrate an application processor, which mainly handles the operating system, user interface, and application programs, and a modem processor, which mainly handles wireless communication. It will be appreciated that the modem processor may alternatively not be integrated into the processor 1401. In some embodiments, the processor 1401 and the memory 1402 may be implemented on the same chip; in other embodiments, they may be implemented separately on their own chips.
The processor 1401 may be a general-purpose processor such as a Central Processing Unit (CPU), a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present Application. A general purpose processor may be a microprocessor or any conventional processor or the like. The steps of a method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware processor, or may be implemented by a combination of hardware and software modules in a processor.
The memory 1402, as a non-volatile computer-readable storage medium, may be used to store non-volatile software programs, non-volatile computer-executable programs, and modules. The memory 1402 may include at least one type of storage medium, for example a flash memory, a hard disk, a multimedia card, a card-type memory, a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Programmable Read-Only Memory (PROM), a Read-Only Memory (ROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), a magnetic memory, a magnetic disk, or an optical disk. The memory 1402 may also be, but is not limited to, any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer. The memory 1402 in the embodiments of the present application may also be a circuit or any other device capable of performing a storage function, for storing program instructions and/or data.
Based on the same inventive concept, embodiments of the present application provide a computer-readable storage medium storing a computer program executable by a computer device, which, when the program is run on the computer device, causes the computer device to perform the steps of the above-mentioned text detection model training method.
Based on the same inventive concept, embodiments of the present application provide a computer program product comprising a computer program stored on a computer-readable storage medium, the computer program comprising program instructions that, when executed by a computer device, cause the computer device to perform the steps of the above-mentioned text detection model training method.
It should be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (14)

1. A text detection model training method is characterized by comprising the following steps:
obtaining a set of sample images, wherein each sample image contains at least one real text object;
performing joint iterative training on a text detection model to be trained and a global semantic segmentation model based on the sample image set, and outputting a trained target text detection model; wherein, in each iterative training process, the following operations are executed:
performing target detection on a sample image through the text detection model to obtain at least one first predicted text object and corresponding predicted attribute information, and performing image segmentation on the sample image through the global semantic segmentation model to obtain at least one second predicted text object;
and determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, the at least one first predicted text object and corresponding predicted attribute information, and the at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
2. The method of claim 1, wherein the performing target detection on the sample image by the text detection model to obtain at least one first predicted text object and corresponding prediction attribute information comprises:
carrying out feature extraction on the sample image to obtain a plurality of sample feature images with different sizes;
cutting out a plurality of corresponding initial text box images from the plurality of sample feature images, and adjusting the plurality of initial text box images to the same size to obtain a plurality of sample text box images;
performing instance segmentation on the plurality of sample text box images to obtain the at least one first predicted text object;
and performing attribute prediction on the plurality of sample text box images to obtain prediction attribute information corresponding to the at least one first prediction text object.
3. The method of claim 2, wherein the prediction attribute information includes prediction location information and a prediction category;
the predicting attribute information of the sample text box images to obtain the prediction attribute information corresponding to each of the at least one first prediction text object includes:
performing bounding-box regression on the plurality of sample text box images to obtain predicted position information corresponding to the at least one first predicted text object;
and performing box classification on the plurality of sample text box images to obtain a prediction category corresponding to each of the at least one first predicted text object.
4. The method of claim 2, wherein said cutting out a corresponding plurality of initial text box images from a plurality of sample feature images comprises:
for the plurality of sample feature images, respectively executing the following steps:
determining a text box size matching the image size of one sample feature image;
generating, in the one sample feature image, a plurality of initial text boxes corresponding to the text box size;
and cutting a plurality of initial text box images from the one sample feature image based on the plurality of initial text boxes.
5. The method of claim 1, wherein the image segmenting the sample image by the global semantic segmentation model to obtain at least one second predicted text object comprises:
performing feature extraction on the sample image to obtain a target sample feature image;
performing semantic segmentation on the target sample feature image to obtain a predicted global category corresponding to each pixel in the sample image;
generating a feature vector corresponding to each pixel in the sample image based on the target sample feature image;
and performing instance segmentation on the sample image based on the predicted global category and the feature vector corresponding to each pixel to obtain at least one second predicted text object.
6. The method of claim 5, wherein the performing instance segmentation on the sample image based on the predicted global category and the feature vector corresponding to each pixel to obtain at least one second predicted text object comprises:
determining, from the pixels, target pixels whose predicted global category is the foreground category;
clustering each target pixel based on the feature vector of each target pixel to obtain at least one target pixel set and a predicted text object label corresponding to each target pixel set;
and obtaining the at least one second predicted text object based on the at least one target pixel set and the predicted text object label corresponding to each target pixel set.
7. The method of claim 1, wherein determining a target loss value based on at least one real text object and corresponding real attribute information, the at least one first predicted text object and corresponding predicted attribute information, the at least one second predicted text object in the sample image comprises:
determining a first loss value based on at least one real text object and corresponding real attribute information in the sample image and the at least one first predicted text object and corresponding predicted attribute information;
determining a second loss value based on at least one real text object in the sample image and the at least one second predicted text object;
determining the target loss value based on the first loss value and the second loss value.
8. The method of claim 7, wherein the real attribute information includes real location information and real category, and the predicted attribute information includes predicted location information and predicted category;
determining a first loss value based on at least one real text object and corresponding real attribute information in the sample image and the at least one first predicted text object and corresponding predicted attribute information, comprising:
determining a first instance segmentation loss value based on the at least one real text object and the at least one first predicted text object;
determining a position loss value based on the real position information corresponding to the at least one real text object and the predicted position information corresponding to the at least one first predicted text object;
determining a category loss value based on a real category corresponding to each of the at least one real text object and a prediction category corresponding to each of the at least one first predicted text object;
determining the first loss value based on the first instance segmentation loss value, the position loss value, and the category loss value.
9. The method of claim 7, wherein determining a second loss value based on at least one real text object in the sample image and the at least one second predicted text object comprises:
determining a semantic segmentation loss value based on a real global class of each pixel corresponding to the at least one real text object and a predicted global class of each pixel corresponding to the at least one second predicted text object in the sample image;
determining a second instance segmentation loss value based on the real text object label of each pixel corresponding to the at least one real text object in the sample image and the predicted text object label of each pixel corresponding to the at least one second predicted text object;
determining the second loss value based on the semantic segmentation loss value and the second instance segmentation loss value.
10. The method according to any one of claims 1 to 9, wherein, after performing joint iterative training on the text detection model to be trained and the global semantic segmentation model based on the sample image set and outputting a trained target text detection model, the method further comprises:
acquiring an image to be processed;
extracting the features of the image to be processed by adopting the target text detection model to obtain a plurality of target feature images with different sizes, and cutting out a plurality of corresponding candidate text box images from the plurality of target feature images;
adjusting the images of the candidate text boxes to be the same size by adopting the target text detection model to obtain a plurality of target text box images;
and performing instance segmentation on the plurality of target text box images using the target text detection model to obtain at least one target text object.
11. A text detection model training device, comprising:
a first obtaining module, configured to obtain a sample image set, where each sample image contains at least one real text object;
a model training module, configured to perform joint iterative training on a text detection model to be trained and a global semantic segmentation model based on the sample image set, and output a trained target text detection model; wherein, in each iterative training process, the following operations are performed:
performing target detection on a sample image through the text detection model to obtain at least one first predicted text object and corresponding predicted attribute information, and performing image segmentation on the sample image through the global semantic segmentation model to obtain at least one second predicted text object;
and determining a target loss value based on at least one real text object and corresponding real attribute information in the sample image, the at least one first predicted text object and corresponding predicted attribute information, and the at least one second predicted text object, and performing parameter adjustment by adopting the target loss value.
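The per-iteration operations of claim 11 can be sketched as a toy loop. The two callables stand in for the text detection model and the global semantic segmentation model, the shared parameter is a single scalar, and gradients come from central finite differences; all of these are simplifying assumptions so the sketch stays dependency-free.

```python
def target_loss(first_loss, second_loss, weight=1.0):
    # The target loss combines the detection loss on the first predicted
    # text objects with the segmentation loss on the second predicted
    # text objects; the weight is an assumption, not part of the claim.
    return first_loss + weight * second_loss

def joint_iterative_training(samples, detect_loss, segment_loss,
                             param=0.0, lr=0.1, iterations=20):
    # detect_loss / segment_loss each map (sample, param) to a loss value.
    eps = 1e-4
    for _ in range(iterations):
        for s in samples:
            total = lambda p: target_loss(detect_loss(s, p),
                                          segment_loss(s, p))
            # Central finite-difference gradient of the target loss.
            grad = (total(param + eps) - total(param - eps)) / (2 * eps)
            param -= lr * grad  # parameter adjustment with the target loss
    return param
```

With quadratic stand-in losses the shared parameter converges to their common minimum, mirroring how the joint loss drives both branches during training.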
12. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the steps of the method of any of claims 1 to 10 are performed when the program is executed by the processor.
13. A computer-readable storage medium having stored thereon a computer program executable by a computer device, the computer program, when run on the computer device, causing the computer device to perform the steps of the method of any one of claims 1 to 10.
14. A computer program product, characterized in that the computer program product comprises a computer program stored on a computer-readable storage medium, the computer program comprising program instructions which, when executed by a computer device, cause the computer device to carry out the steps of the method of any one of claims 1-10.
CN202210040015.5A 2022-01-14 2022-01-14 Text detection model training method, device, equipment and storage medium Active CN114067321B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210040015.5A CN114067321B (en) 2022-01-14 2022-01-14 Text detection model training method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114067321A true CN114067321A (en) 2022-02-18
CN114067321B CN114067321B (en) 2022-04-08

Family

ID=80230845

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210040015.5A Active CN114067321B (en) 2022-01-14 2022-01-14 Text detection model training method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114067321B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019192397A1 (en) * 2018-04-04 2019-10-10 华中科技大学 End-to-end recognition method for scene text in any shape
CN110858217A (en) * 2018-08-23 2020-03-03 北大方正集团有限公司 Method and device for detecting microblog sensitive topics and readable storage medium
CN111553351A (en) * 2020-04-26 2020-08-18 佛山市南海区广工大数控装备协同创新研究院 Semantic segmentation based text detection method for arbitrary scene shape
CN112308053A (en) * 2020-12-29 2021-02-02 北京易真学思教育科技有限公司 Detection model training and question judging method and device, electronic equipment and storage medium
CN112733822A (en) * 2021-03-31 2021-04-30 上海旻浦科技有限公司 End-to-end text detection and identification method
US20210342621A1 (en) * 2020-12-18 2021-11-04 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for character recognition and processing

Non-Patent Citations (2)

Title
LIU Sixin et al.: "Violent Microblog Text Classification Based on Improved TFIDF-Logistic Regression", Journal of Jilin University (Information Science Edition) *
WANG Hongtao et al.: "Research on Natural Scene English Text Recognition Based on STN-CRNN", Journal of Wuhan University of Technology (Information & Management Engineering Edition) *

Cited By (7)

Publication number Priority date Publication date Assignee Title
CN114419641A (en) * 2022-03-15 2022-04-29 腾讯科技(深圳)有限公司 Training method and device of text separation model, electronic equipment and storage medium
CN114511043A (en) * 2022-04-18 2022-05-17 苏州浪潮智能科技有限公司 Image understanding method, device, equipment and medium
CN114724133A (en) * 2022-04-18 2022-07-08 北京百度网讯科技有限公司 Character detection and model training method, device, equipment and storage medium
CN114511043B (en) * 2022-04-18 2022-07-08 苏州浪潮智能科技有限公司 Image understanding method, device, equipment and medium
WO2023201963A1 (en) * 2022-04-18 2023-10-26 苏州浪潮智能科技有限公司 Image caption method and apparatus, and device and medium
CN115376137A (en) * 2022-08-02 2022-11-22 北京百度网讯科技有限公司 Optical character recognition processing and text recognition model training method and device
CN115376137B (en) * 2022-08-02 2023-09-26 北京百度网讯科技有限公司 Optical character recognition processing and text recognition model training method and device

Also Published As

Publication number Publication date
CN114067321B (en) 2022-04-08

Similar Documents

Publication Publication Date Title
CN114067321B (en) Text detection model training method, device, equipment and storage medium
US20220172518A1 (en) Image recognition method and apparatus, computer-readable storage medium, and electronic device
US20150016668A1 (en) Settlement mapping systems
CN110796204A (en) Video tag determination method and device and server
CN111078940B (en) Image processing method, device, computer storage medium and electronic equipment
CN113822314A (en) Image data processing method, apparatus, device and medium
Kiranyaz et al. Automatic object extraction over multiscale edge field for multimedia retrieval
CN114332894A (en) Image text detection method and device
US20220147732A1 (en) Object recognition method and system, and readable storage medium
CN114168768A (en) Image retrieval method and related equipment
CN113537187A (en) Text recognition method and device, electronic equipment and readable storage medium
Han et al. Circular array targets detection from remote sensing images based on saliency detection
CN115249306B (en) Image segmentation model training method, image processing device and storage medium
CN113723515B (en) Moire pattern recognition method, device, equipment and medium based on image recognition
CN113065459B (en) Video instance segmentation method and system based on dynamic condition convolution
Veeravasarapu et al. Model-driven simulations for computer vision
CN115147532A (en) Image processing method, device and equipment, storage medium and program product
CN114332599A (en) Image recognition method, image recognition device, computer equipment, storage medium and product
CN114692715A (en) Sample labeling method and device
CN113505866B (en) Image analysis method and device based on edge material data enhancement
CN116309612B (en) Semiconductor silicon wafer detection method, device and medium based on frequency decoupling supervision
CN111680722B (en) Content identification method, device, equipment and readable storage medium
CN114612901A (en) Image change recognition method, device, equipment and storage medium
Parlewar et al. An Efficient Saliency Detection Using Wavelet Fusion
Sun et al. Learning graph structures with transformer for weakly supervised semantic segmentation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant