CN115273054A - Character detection method and device, computer equipment and storage medium

Info

Publication number
CN115273054A
Authority
CN
China
Prior art keywords
character
map
word
target picture
distribution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210692483.0A
Other languages
Chinese (zh)
Inventor
田越
周建东
吴得泱
杜锟
曾峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huizhou Yonghui Intelligent Technology Co ltd
Datuo Shandong Internet Of Things Technology Co ltd
Original Assignee
Huizhou Yonghui Intelligent Technology Co ltd
Datuo Shandong Internet Of Things Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huizhou Yonghui Intelligent Technology Co ltd, Datuo Shandong Internet Of Things Technology Co ltd filed Critical Huizhou Yonghui Intelligent Technology Co ltd
Priority to CN202210692483.0A priority Critical patent/CN115273054A/en
Publication of CN115273054A publication Critical patent/CN115273054A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/60 Type of objects
    • G06V 20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V 20/63 Scene text, e.g. street names
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/30 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/32 Normalisation of the pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V 10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/16 Image preprocessing
    • G06V 30/162 Quantising the image signal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/16 Image preprocessing
    • G06V 30/164 Noise filtering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/16 Image preprocessing
    • G06V 30/166 Normalisation of pattern dimensions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/18 Extraction of features or characteristics of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/19 Recognition using electronic means
    • G06V 30/191 Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V 30/1918 Fusion techniques, i.e. combining data from various sources, e.g. sensor fusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G06V 30/41 Analysis of document content
    • G06V 30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application belongs to the technical field of data processing, and relates to a character detection method and device, computer equipment and a storage medium. The character detection method comprises the following steps: acquiring a fusion feature map corresponding to a target picture; inputting the fusion feature map into a convolutional network to obtain a character offset map, a character distribution map, a word distribution map and a word center line map of the target picture; determining a polygon bounding box according to the character offset map, the word distribution map and the character distribution map; identifying an area corresponding to a polygon bounding box that intersects the word center line map as a character area; and decoding the character area to obtain a character detection result corresponding to the target picture. The method and the device can improve the accuracy of the character detection result.

Description

Character detection method and device, computer equipment and storage medium
Technical Field
The present application relates to the field of data processing technologies, and in particular, to a method and an apparatus for detecting characters, a computer device, and a storage medium.
Background
Accurately and efficiently recognizing character information in pictures is of great significance for application scenarios such as autonomous driving, map navigation and image search. Character detection is a prerequisite step of character recognition, and the quality of the detection result directly affects the recognition result.
Current text detection methods usually use special operators (for example, color, texture, or a specified rectangular box) to detect the characters in a picture, which realizes character detection in simple scenes. However, unlike scanned paper documents, characters in natural scenes may exhibit perspective, deformation, strong or dim light, noisy backgrounds and other characteristics, so existing text detection methods give unsatisfactory character recognition results in natural scenes.
Disclosure of Invention
An embodiment of the present application aims to provide a text detection method, a text detection device, a computer device, and a storage medium, so as to at least solve the problem in the prior art that a text recognition result in a natural scene is not ideal.
In order to solve the above technical problem, an embodiment of the present application provides a text detection method, which adopts the following technical scheme:
acquiring a fusion feature map corresponding to a target picture; inputting the fusion feature graph into a convolution network to obtain a character offset graph, a character distribution graph, a word distribution graph and a word center line graph of the target picture; determining a polygonal bounding box according to the character offset map, the word distribution map and the character distribution map; identifying an area corresponding to the polygon bounding box having an intersection with the word centerline map as a text area; and decoding the character area to obtain a character detection result corresponding to the target picture.
Further, the step of obtaining the fusion feature map corresponding to the target picture includes: acquiring a target picture; performing feature extraction on the target picture by adopting a ResNet50 feature extractor to obtain at least one initial feature map; and performing feature fusion processing on the initial feature map by using an up-sampling mode to obtain a fusion feature map.
Further, before the above-mentioned feature extraction is performed on the target picture by using the ResNet50 feature extractor to obtain at least one initial feature map, the text detection method further includes: and carrying out scaling processing and normalization processing on the target image.
Further, the step of determining the polygon bounding box according to the character offset map, the word distribution map and the character distribution map includes: carrying out binarization processing on the word distribution diagram to obtain a first word distribution diagram; performing noise filtration on the character offset map and the character distribution map according to the first word distribution map to obtain a first character offset map and a first character distribution map; carrying out binarization processing on the first character offset map to obtain a second character offset map; performing difference value processing on the second character offset map and the first word distribution map to obtain a comprehensive word distribution map; extracting a plurality of character feature point coordinates from the first character distribution map; and determining a polygonal surrounding frame according to the character feature point coordinates and the comprehensive word distribution diagram.
Further, the step of identifying an area corresponding to the polygon bounding box having an intersection with the word center line drawing as a character area includes: performing noise filtration on the word center line graph according to the first word distribution graph to obtain a first word center line graph; the first word distribution diagram is obtained by carrying out binarization processing on the word distribution diagram; carrying out binarization processing on the first word center line graph to obtain a second word center line graph; and identifying an area corresponding to a polygonal bounding box intersected with the second word center line graph as a character area.
Further, before the step of inputting the fused feature map into a convolutional network to obtain a character offset map, a character distribution map, a word distribution map, and a word center line map of the target picture, the text detection method further includes: and training by adopting a weak supervision strategy and a preset convolution network model to obtain the convolution network by taking a history fusion characteristic graph of a history picture as input and taking a history character offset graph, a history character distribution graph, a history word distribution graph and a history word central line graph as output.
In order to solve the above technical problem, an embodiment of the present application further provides a text detection device, which adopts the following technical scheme:
the first acquisition module is used for acquiring a fusion feature map corresponding to a target picture; the second acquisition module is used for inputting the fusion feature map into a convolution network to acquire a character offset map, a character distribution map, a word distribution map and a word center line map of the target picture; the determining module is used for determining a polygonal surrounding box according to the character offset map, the word distribution map and the character distribution map; the character recognition module is used for recognizing an area corresponding to the polygonal bounding box intersected with the word center line graph as a character area; and the character decoding module is used for decoding the character area to obtain a character detection result corresponding to the target picture.
Further, the first obtaining module comprises an obtaining submodule, an extracting submodule and a first processing submodule; the acquisition submodule is used for acquiring a target picture; the extraction submodule is used for extracting the features of the target picture by adopting a ResNet50 feature extractor to obtain at least one initial feature map; and the first processing submodule is used for carrying out feature fusion processing on the initial feature map by using an up-sampling mode to obtain a fusion feature map.
Further, the first obtaining module further comprises a second processing submodule; and the second processing submodule is used for carrying out scaling processing and normalization processing on the target image.
Further, the determining module comprises a word processing sub-module, a filtering sub-module, a character processing sub-module, a difference processing sub-module, a coordinate extracting sub-module and a determining sub-module; the word processing submodule is used for carrying out binarization processing on the word distribution diagram to obtain a first word distribution diagram; the filtering submodule is used for carrying out noise filtering on the character offset map and the character distribution map according to the first word distribution map to obtain a first character offset map and a first character distribution map; the character processing submodule is used for carrying out binarization processing on the first character offset map to obtain a second character offset map; the difference-taking processing submodule is used for carrying out difference-taking processing on the second character offset map and the first word distribution map to obtain a comprehensive word distribution map; the coordinate extraction submodule is used for extracting a plurality of character feature point coordinates from the first character distribution diagram; and the determining submodule is used for determining a polygonal surrounding frame according to the character feature point coordinates and the comprehensive word distribution diagram.
Furthermore, the character recognition module comprises a noise filtering submodule, a central line graph processing submodule and an identification submodule; the noise filtering submodule is used for carrying out noise filtering on the word central line graph according to the first word distribution graph to obtain a first word central line graph; the first word distribution diagram is obtained by carrying out binarization processing on the word distribution diagram; the central line graph processing submodule is used for carrying out binarization processing on the first word central line graph to obtain a second word central line graph; and the identification submodule is used for identifying an area corresponding to a polygonal surrounding frame intersected with the second word central line graph as a character area.
Further, the character detection device further comprises a training module; the training module is used for taking a historical fusion characteristic diagram of a historical picture as input, taking a historical character offset diagram, a historical character distribution diagram, a historical word distribution diagram and a historical word central line diagram as output, and training by adopting a weak supervision strategy and a preset convolution network model to obtain the convolution network.
In order to solve the above technical problem, an embodiment of the present application further provides a computer device, where the computer device includes a memory and a processor, where the memory stores a computer program, and the processor implements the steps of the text detection method when executing the computer program.
In order to solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the steps of the text detection method are implemented as described above.
Compared with the prior art, the embodiment of the application mainly has the following beneficial effects: and inputting the fusion feature map corresponding to the target picture into a convolution network to obtain a character offset map, a character distribution map, a word distribution map and a word center line map of the target picture. And then, determining a polygon enclosing frame according to the character offset map, the word distribution map and the character distribution map, and identifying an area corresponding to the polygon enclosing frame with intersection with the word center line map as a character area. And finally, decoding the character area to obtain a character detection result corresponding to the target picture. Therefore, the character offset map and the character distribution map corresponding to the character module, the word distribution map and the word center line map corresponding to the word module share the fusion characteristic map of the target picture, namely, the character information in the picture is detected by adopting two monitoring signals of the character and the word, and the accuracy of the character detection result is improved.
Drawings
In order to more clearly illustrate the solution of the present application, a brief description will be given below of the drawings required for use in the description of the embodiments of the present application, and it is obvious that the drawings in the following description are some embodiments of the present application, and that other drawings can be obtained by those skilled in the art without inventive effort.
FIG. 1 is an exemplary system architecture diagram in which the present application may be applied;
FIG. 2 is a flow diagram of one embodiment of a text detection method of the present application;
FIG. 3 is a flow chart for one embodiment of step S21 in FIG. 2;
FIG. 4 is a flowchart of one embodiment of step S23 of FIG. 2;
FIG. 5 is a flow diagram for one embodiment of step S24 in FIG. 2;
FIG. 6 is a schematic structural diagram of one embodiment of a text detection device of the present application;
FIG. 7 is a schematic block diagram of one embodiment of a computer device according to the present application.
Detailed Description
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs; the terminology used in the description of the application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application; the terms "including" and "having," and any variations thereof, in the description and claims of this application and the description of the above figures are intended to cover non-exclusive inclusions. The terms "first," "second," and the like in the description and claims of this application or in the above-described drawings are used for distinguishing between different objects and not for describing a particular order.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
In order to make the technical solutions better understood by those skilled in the art, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings.
As shown in fig. 1, the system architecture 100 may include terminal devices 101, 102, 103, a network 104, and a server 105. The network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105. Network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal devices 101, 102, 103 to interact with a server 105 over a network 104 to receive or send messages or the like. The terminal devices 101, 102, 103 may have various communication client applications installed thereon, such as a web browser application, a shopping application, a search application, an instant messaging tool, a mailbox client, social platform software, and the like.
The terminal devices 101, 102, 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smart phones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop portable computers, desktop computers, and the like.
The server 105 may be a server providing various services, such as a background server providing support for pages displayed on the terminal devices 101, 102, 103.
It should be noted that the text detection method provided in the embodiment of the present application may be applied to the server device 105, and may also be applied to the terminal devices 101, 102, and 103. The server device 105 and the terminal devices 101, 102, 103 may be collectively referred to as electronic devices. That is, the main executing body of the text detection method provided in the embodiment of the present application may be a text detection device, and the text detection device may be the electronic device (e.g., the server device 105 or the terminal devices 101, 102, and 103).
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
With continued reference to FIG. 2, a flow diagram of one embodiment of a text detection method according to the present application is shown. The text detection method comprises the following steps:
and S21, acquiring a fusion feature map corresponding to the target picture.
Specifically, fig. 3 is a schematic flowchart of a process for obtaining a fusion feature map corresponding to a target picture according to an embodiment of the present disclosure. Referring to fig. 3, steps S211 to S213 are included.
In step S211, a target picture is acquired.
And step S212, performing feature extraction on the target picture by adopting a ResNet50 feature extractor to obtain at least one initial feature map.
After the target picture is acquired, the target picture is first scaled.
Specifically, for a target picture of size H × W, the target picture is scaled so that its short side equals a first numerical value of pixels, and black pixels are padded along the two long sides so that the long-side pixel length is evenly divisible by a second numerical value, giving the scaled target picture.
And then, carrying out normalization processing on the target picture after the scaling processing.
Specifically, the pixel value of the target picture after scaling is divided by the third numerical value, and then the target picture after scaling is normalized in the BGR channel according to the preset mean value and the preset variance, so as to obtain the preprocessed picture with the preset size.
For example, for a target picture of size H × W, the target picture is scaled so that its short side is 800 pixels, and black pixels are padded along the two long sides so that the long-side pixel length is evenly divisible by 32, giving the scaled target picture. Then, the pixel values of the scaled target picture are divided by 255 and normalized per BGR channel with the mean pixel_means = [0.406, 0.456, 0.485] and the variance pixel_stds = [0.225, 0.224, 0.229], giving a preprocessed picture of size 3 × H' × W'.
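A minimal preprocessing sketch, assuming OpenCV and NumPy; the scale value 800, the divisor 32 and the BGR mean/variance follow the example above, while the function name and padding scheme are illustrative rather than taken from the patent:

```python
import cv2
import numpy as np

def preprocess(img_bgr: np.ndarray,
               short_side: int = 800,
               divisor: int = 32,
               mean=(0.406, 0.456, 0.485),
               std=(0.225, 0.224, 0.229)) -> np.ndarray:
    h, w = img_bgr.shape[:2]
    # Scale so that the short side equals `short_side` pixels.
    scale = short_side / min(h, w)
    new_h, new_w = int(round(h * scale)), int(round(w * scale))
    img = cv2.resize(img_bgr, (new_w, new_h))
    # Pad with black pixels until both sides are divisible by `divisor`
    # (800 is already a multiple of 32, so in effect only the long side grows).
    pad_h = (divisor - new_h % divisor) % divisor
    pad_w = (divisor - new_w % divisor) % divisor
    img = cv2.copyMakeBorder(img, 0, pad_h, 0, pad_w,
                             cv2.BORDER_CONSTANT, value=(0, 0, 0))
    # Divide by 255, then normalize per BGR channel with the given mean/variance.
    img = img.astype(np.float32) / 255.0
    img = (img - np.array(mean, np.float32)) / np.array(std, np.float32)
    # Channel-first layout: (3, H', W').
    return img.transpose(2, 0, 1)
```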
Finally, a ResNet50 feature extractor is adopted to extract features from the preprocessed picture, obtaining at least one initial feature map.
Specifically, a ResNet50 feature extractor is adopted to extract features of the preprocessed picture, so that 5 groups of feature maps F0, F1, F2, F3 and F4 with different sizes and different channel depths are obtained. The channel depth and size of F0 are (128, H'/2, W'/2), of F1 (256, H'/4, W'/4), of F2 (512, H'/8, W'/8), of F3 (1024, H'/16, W'/16), and of F4 (2048, H'/32, W'/32).
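A sketch of such multi-scale feature extraction, assuming PyTorch and torchvision; note that the stem of a stock ResNet50 outputs 64 channels rather than the 128 stated above, so the backbone in the text is presumably modified, and this code is only an approximation:

```python
import torch.nn as nn
from torchvision.models import resnet50

class ResNet50Features(nn.Module):
    def __init__(self):
        super().__init__()
        r = resnet50(weights=None)
        self.stem = nn.Sequential(r.conv1, r.bn1, r.relu)   # stride 2
        self.pool = r.maxpool                                # stride 4
        self.layer1, self.layer2 = r.layer1, r.layer2        # strides 4, 8
        self.layer3, self.layer4 = r.layer3, r.layer4        # strides 16, 32

    def forward(self, x):
        f0 = self.stem(x)                 # (64, H'/2, W'/2) in stock ResNet50
        f1 = self.layer1(self.pool(f0))   # (256, H'/4, W'/4)
        f2 = self.layer2(f1)              # (512, H'/8, W'/8)
        f3 = self.layer3(f2)              # (1024, H'/16, W'/16)
        f4 = self.layer4(f3)              # (2048, H'/32, W'/32)
        return f0, f1, f2, f3, f4
```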
In this embodiment, the shallow feature maps contain more detail and position information, which is more beneficial for detecting small characters, while the deep feature maps contain more contextual semantic information, which is more beneficial for detecting large characters; this further suppresses false-positive predictions and improves the accuracy of character detection.
And step S213, performing feature fusion processing on the initial feature map by using an up-sampling mode to obtain a fusion feature map.
Specifically, F1, F2, F3 and F4 are used as fusion sources, and the shallow feature maps are fused with the deep feature maps in an up-sampling manner. For example, the fusion follows the rule F1' = Up_Sample(Reduce_Channel(F2'), Reduce_Channel(F1)), where Up_Sample denotes element-wise addition (after up-sampling to a matching size) and Reduce_Channel is a convolutional neural network consisting of a convolution layer, a batch normalization layer and a non-linear activation layer; dimensionality reduction of the feature map is achieved by changing the size of the convolution kernel in the convolution layer. In this way, F1' with channel depth and size (256, H'/4, W'/4), F2' with (256, H'/8, W'/8), F3' with (256, H'/16, W'/16) and F4' with (256, H'/32, W'/32) are obtained.
Then, F1', F2', F3', F4' and F0, F1, F2, F3, F4 are concatenated along the channel dimension to obtain an initial fused feature map. A convolution kernel of size 1×1 is then used to reduce the dimensionality of the initial fused feature map, giving the final fused feature map F5' with channel depth and size (256, H'/4, W'/4).
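A sketch of this top-down fusion, assuming PyTorch; Reduce_Channel is modeled as convolution, batch normalization and ReLU, and the kernel sizes and the resolution chosen for concatenation are assumptions rather than values from the patent:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def reduce_channel(in_ch, out_ch=256):
    return nn.Sequential(nn.Conv2d(in_ch, out_ch, kernel_size=1),
                         nn.BatchNorm2d(out_ch),
                         nn.ReLU(inplace=True))

class FusionNeck(nn.Module):
    def __init__(self, chans=(128, 256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        # One Reduce_Channel block per fusion source F1..F4.
        self.reduce = nn.ModuleList(reduce_channel(c, out_ch) for c in chans[1:])
        self.final = nn.Conv2d(4 * out_ch + sum(chans), out_ch, kernel_size=1)

    def forward(self, f0, f1, f2, f3, f4):
        r1, r2, r3, r4 = [red(f) for red, f in zip(self.reduce, (f1, f2, f3, f4))]
        up = lambda x, ref: F.interpolate(x, size=ref.shape[-2:],
                                          mode="bilinear", align_corners=False)
        f4p = r4
        f3p = r3 + up(f4p, r3)   # e.g. F3' = Up_Sample(F4', Reduce_Channel(F3))
        f2p = r2 + up(f3p, r2)
        f1p = r1 + up(f2p, r1)   # F1' ends up at 1/4 of the input resolution
        # Resample all maps to the resolution of F1' and concatenate along channels.
        feats = [f1p, f2p, f3p, f4p, f0, f1, f2, f3, f4]
        feats = [f if f.shape[-2:] == f1p.shape[-2:] else up(f, f1p) for f in feats]
        return self.final(torch.cat(feats, dim=1))   # F5': (256, H'/4, W'/4)
```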
And S22, inputting the fusion feature map into a convolution network, and acquiring a character offset map, a character distribution map, a word distribution map and a word center line map of the target picture.
Optionally, before step S22, the text detection method further includes: taking a historical fusion feature map of a historical picture as input and a historical character offset map, a historical character distribution map, a historical word distribution map and a historical word center line map as output, and training with a weak supervision strategy and a preset convolutional network model to obtain the convolutional network.
The preset convolution network model comprises a convolution layer, a batch normalization layer, a nonlinear activation layer and a deconvolution layer.
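A sketch of a prediction head built from these layers, assuming PyTorch; the channel counts, the two-stage up-sampling back to the input resolution, and the way the five output channels are split among the four maps are assumptions consistent with the map sizes given later in the text:

```python
import torch.nn as nn

class DetectionHead(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(inplace=True),
            # Two stride-2 deconvolutions take the 1/4-resolution fused map back to (H', W').
            nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2),
            nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2),
        )
        # 1 channel: character offset map; 2: character distribution map;
        # 1: word distribution map; 1: word center line map.
        self.out = nn.Conv2d(32, 5, kernel_size=1)

    def forward(self, fused):
        maps = self.out(self.body(fused))
        char_offset = maps[:, 0:1]
        char_dist = maps[:, 1:3]
        word_dist = maps[:, 3:4]
        word_center = maps[:, 4:5]
        return char_offset, char_dist, word_dist, word_center
```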
The historical character offset map is a feature map formed by first distances between character centers in the historical pictures and the center points of the historical pictures (the first distance value of the background is 0).
The historical character distribution map is a feature map formed by second distances between each point in the historical picture and the center of the character to which the point belongs (the second distance value to which the background belongs is 0).
Optionally, the first distance and the second distance may be euclidean distances.
The historical word distribution map is a characteristic map formed by character areas in the historical picture. For example, if a certain coordinate point is located in the text area, the label value of the coordinate point is 1, otherwise, the label value is 0.
The historical word center line map is a feature map formed by the center lines of the text regions in the historical pictures. For example, after the historical word distribution map is determined, the geometric center line (without width) of the historical word distribution map is first determined. Then, 20 equally spaced points (the number can be chosen according to the actual situation, for example according to the size of the historical word distribution map) are taken on the geometric center line, and at each point a circle with a suitable diameter (chosen according to the size of the actual text region) is used to cover the text region. Finally, the average diameter of the 20 circles is computed; a suitable fraction of the average diameter (for example, 70%) is taken as the width of the center line, and the head and tail of the original geometric center line are each retracted by a suitable length (for example, 50% of the average diameter) to give the length of the center line, yielding the historical word center line map.
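A sketch of generating this label, assuming shapely and OpenCV, and assuming the geometric center line and the word polygon are already available; only the 20 sample points, the 70% width and the 50% retraction come from the example above, everything else is illustrative:

```python
import cv2
import numpy as np
from shapely.geometry import LineString, Polygon

def word_centerline_label(centerline_pts, word_poly_pts, label_hw,
                          n_samples=20, width_ratio=0.7, shrink_ratio=0.5):
    line = LineString(centerline_pts)
    poly = Polygon(word_poly_pts)
    # The distance from each sampled center-line point to the word boundary acts as
    # the radius of the circle covering the text region at that point.
    radii = [line.interpolate(t, normalized=True).distance(poly.exterior)
             for t in np.linspace(0.0, 1.0, n_samples)]
    avg_diameter = 2.0 * float(np.mean(radii))
    # Retract head and tail of the center line, then draw it with the chosen width.
    start = shrink_ratio * avg_diameter
    end = max(start, line.length - shrink_ratio * avg_diameter)
    pts = [line.interpolate(d) for d in np.linspace(start, end, n_samples)]
    pts = np.array([[p.x, p.y] for p in pts], dtype=np.int32).reshape(-1, 1, 2)
    mask = np.zeros(label_hw, dtype=np.uint8)
    cv2.polylines(mask, [pts], isClosed=False, color=1,
                  thickness=max(1, int(width_ratio * avg_diameter)))
    return mask
```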
Specifically, the historical fusion feature map of a historical picture is used as input and the historical character offset map, historical character distribution map, historical word distribution map and historical word center line map are used as output to train the preset convolutional network model, giving a primary convolutional network. Then, the adjustment fusion feature map of an adjustment picture is fed into the primary convolutional network, and the primary convolutional network is adjusted according to the adjustment word distribution map and adjustment word center line map it outputs for the adjustment picture; after iterating a preset number of times, the adjusted convolutional network is obtained. Next, the test fusion feature map of a test picture is fed into the adjusted convolutional network, which outputs a test character offset map and a test character distribution map (together referred to as character pseudo labels), a test word distribution map, and a test word center line map. The quality of the test character offset map and the test character distribution map is evaluated using the test word distribution map and the test word center line map, and the positions with lower scores in the test character offset map and test character distribution map are set as don't-care positions, giving the test convolutional network. Finally, the test convolutional network is iteratively trained with the character pseudo labels until the loss function converges, giving the final convolutional network.
In the present embodiment, the word distribution map and the word center line map are used as supervision signals to generate character pseudo labels, and the positions corresponding to low-quality character pseudo labels are set as don't-care points. In this way, the number of iterations in the convolutional network training process is reduced and the training speed of the convolutional network is improved.
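A minimal sketch of the don't-care masking applied to the character pseudo labels, assuming PyTorch; the text does not specify how the word-level maps score the pseudo labels, so the per-position confidence is taken here as a given input, and the loss form is an assumption:

```python
import torch
import torch.nn.functional as F

def masked_character_loss(pred, pseudo_label, confidence, thresh=0.5):
    # Positions scored below the threshold are treated as don't-care and excluded.
    care = (confidence >= thresh).float()   # 1 = supervised, 0 = don't care
    loss = F.mse_loss(pred, pseudo_label, reduction="none")
    return (loss * care).sum() / care.sum().clamp(min=1.0)
```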
The fusion feature map of the target picture is then fed into the trained convolutional network, whose outputs are the character offset map, character distribution map, word distribution map and word center line map of the target picture. The size of the character offset map is (H', W'), of the character distribution map (2, H', W'), of the word distribution map (H', W'), and of the word center line map (H', W').
In step S23, a polygon bounding box is determined according to the character offset map, the word distribution map and the character distribution map.
Specifically, fig. 4 is a schematic flowchart of a process for determining a polygon bounding box according to an embodiment of the present disclosure. Referring to fig. 4, steps S231 to S236 are included.
Step S231, performs binarization processing on the word distribution map to obtain a first word distribution map.
For example, the word distribution map is binarized with a threshold value of 0.5, and pixels of points greater than or equal to 0.5 in the word distribution map are set to 255 and pixels of points smaller than 0.5 in the word distribution map are set to 0, resulting in a first word distribution map.
Step S232, performing noise filtering on the character offset map and the character distribution map according to the first word distribution map to obtain a first character offset map and a first character distribution map.
Specifically, in the character offset map and the character distribution map, a point corresponding to a point having a pixel of 0 in the first word distribution map is set to be 0, so that the first character offset map and the first character distribution map are obtained.
In step S233, the first character offset map is binarized to obtain a second character offset map.
For example, the first character offset map is binarized with a threshold value of 0.7: pixels of points greater than or equal to 0.7 in the first character offset map are set to 255, and pixels of points smaller than 0.7 in the first character offset map are set to 0, resulting in the second character offset map.
Step S234, performing a difference processing on the second character offset map and the first word distribution map to obtain a comprehensive word distribution map.
Specifically, the comprehensive word distribution map consists of the points whose pixel value is 0 in the second character offset map and 255 in the first word distribution map; that is, the second character offset map is subtracted from the first word distribution map.
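A minimal NumPy sketch of steps S231 to S234, using the example thresholds 0.5 and 0.7; the exact pixel conventions are assumptions consistent with the description above, and the character distribution map is assumed to have shape (2, H, W):

```python
import numpy as np

def postprocess_maps(word_dist, char_offset, char_dist):
    # S231: threshold 0.5 -> first word distribution map (values 255 / 0).
    first_word = np.where(word_dist >= 0.5, 255, 0).astype(np.uint8)
    # S232: zero out the character maps where the first word map is 0 (noise filtering).
    keep = first_word > 0
    first_char_offset = np.where(keep, char_offset, 0.0)
    first_char_dist = np.where(keep[None, ...], char_dist, 0.0)
    # S233: threshold 0.7 -> second character offset map.
    second_char_offset = np.where(first_char_offset >= 0.7, 255, 0).astype(np.uint8)
    # S234: difference: remove character-center pixels from the word map.
    comprehensive_word = np.where(second_char_offset == 255, 0, first_word)
    return (first_word, first_char_offset, first_char_dist,
            second_char_offset, comprehensive_word)
```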
In step S235, a plurality of character feature point coordinates are extracted from the first character distribution map.
Specifically, a pre-trained extraction module may be used to extract the coordinates of the feature points of the plurality of characters from the first character distribution map.
In step S236, a polygon bounding box is determined according to the character feature point coordinates and the integrated word distribution map.
Specifically, according to the character feature point coordinates, the positions of a group of points on at least two boundaries are obtained from the comprehensive word distribution diagram, and then the positions of all the points in the group are connected to obtain a polygonal surrounding frame of a character area in any shape.
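A heavily simplified sketch of step S236, assuming OpenCV: connected regions of the comprehensive word distribution map that contain at least one character feature point are kept, and their contour points are used as the polygon bounding boxes. This is only one plausible reading of connecting the boundary points; the pre-trained extraction module mentioned in step S235 is not reproduced here:

```python
import cv2
import numpy as np

def polygon_boxes(comprehensive_word, char_points):
    contours, _ = cv2.findContours(comprehensive_word, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        # Keep the region if any character feature point falls inside its contour.
        if any(cv2.pointPolygonTest(cnt, (float(x), float(y)), False) >= 0
               for x, y in char_points):
            boxes.append(cnt.reshape(-1, 2))   # polygon vertices of arbitrary shape
    return boxes
```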
Step S24, recognizing the area corresponding to the polygon enclosing frame with the intersection with the word center line graph as a character area.
Specifically, fig. 5 is a schematic flow chart illustrating a process of recognizing a text region according to an embodiment of the present application. Referring to fig. 5, steps S241 to S243 are included.
And step S241, performing noise filtration on the word center line graph according to the first word distribution graph to obtain a first word center line graph.
Specifically, a point in the word centerline map corresponding to a point with a pixel of 0 in the first word distribution map is set to be 0, so as to obtain a first word centerline map. The first word distribution diagram is obtained by carrying out binary processing on the word distribution diagram.
And step S242, carrying out binarization processing on the first word centerline graph to obtain a second word centerline graph.
For example, the first word center line graph is binarized with a threshold value of 0.5, and the pixels of the points greater than or equal to 0.5 in the first word center line graph are set to 255, and the pixels of the points less than 0.5 in the first word center line graph are set to 0, resulting in a second word center line graph.
In step S243, the region corresponding to the polygon bounding box intersecting with the second word center line map is identified as a text region.
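A sketch of steps S241 to S243, assuming NumPy and OpenCV; a polygon bounding box is treated as intersecting the second word center line map if its rasterized interior overlaps any center-line pixel:

```python
import cv2
import numpy as np

def select_text_regions(word_centerline, first_word, polygon_boxes):
    # S241: noise filtering with the first word distribution map.
    first_centerline = np.where(first_word > 0, word_centerline, 0.0)
    # S242: binarization with threshold 0.5.
    second_centerline = np.where(first_centerline >= 0.5, 255, 0).astype(np.uint8)
    text_regions = []
    for poly in polygon_boxes:
        mask = np.zeros_like(second_centerline)
        cv2.fillPoly(mask, [np.asarray(poly, dtype=np.int32)], 255)
        # S243: keep the box if it intersects the second word center line map.
        if np.any((mask > 0) & (second_centerline > 0)):
            text_regions.append(poly)
    return text_regions
```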
And S25, decoding the character area to obtain a character detection result corresponding to the target picture.
In this embodiment, the fusion feature map corresponding to the target picture is input into the convolution network, and a character offset map, a character distribution map, a word distribution map, and a word center line map of the target picture are obtained. And then, determining a polygonal surrounding frame according to the character offset map, the word distribution map and the character distribution map, and identifying an area corresponding to the polygonal surrounding frame with intersection with the word center line map as a character area. And finally, decoding the character area to obtain a character detection result corresponding to the target picture. Therefore, the character offset graph and the character distribution graph corresponding to the character module, the word distribution graph and the word center line graph corresponding to the word module share the fusion characteristic graph of the target picture, namely, the character information and the word information in the picture are detected by adopting two monitoring signals of the character and the word, and the accuracy of the character detection result is improved.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments described above can be implemented by a computer program, which can be stored in a computer-readable storage medium, and can include the processes of the embodiments of the methods described above when the computer program is executed. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
It should be understood that, although the steps in the flowcharts of the figures are shown in the order indicated by the arrows, they are not necessarily performed in that order. Unless explicitly stated herein, the steps are not strictly limited to the order shown and may be performed in other orders. Moreover, at least a portion of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and which are not necessarily performed in sequence but may be performed alternately or in turns with other steps or with at least a portion of the sub-steps or stages of other steps.
With further reference to fig. 6, as an implementation of the method shown in fig. 2, the present application provides an embodiment of a text detection apparatus, where the embodiment of the apparatus corresponds to the embodiment of the method shown in fig. 2, and the apparatus can be applied to various electronic devices.
As shown in fig. 6, the character detection device 60 of the present embodiment includes: a first acquisition module 61, a second acquisition module 62, a determination module 63, a text recognition module 64, and a text decoding module 65. Wherein:
the first obtaining module 61 is configured to obtain a fusion feature map corresponding to a target picture; a second obtaining module 62, configured to input the fusion feature map into a convolution network, and obtain a character offset map, a character distribution map, a word distribution map, and a word center line map of the target picture; a determining module 63, configured to determine a polygon bounding box according to the character offset map, the single-word distribution map, and the character distribution map; a character recognition module 64 configured to recognize an area corresponding to a polygon bounding box intersecting with the single-word centerline map as a character area; and the character decoding module 65 is configured to decode the character region to obtain a character detection result corresponding to the target picture.
In this embodiment, the fusion feature map corresponding to the target picture is input into the convolution network, and a character offset map, a character distribution map, a word distribution map, and a word center line map of the target picture are obtained. And then, determining a polygonal surrounding frame according to the character offset map, the word distribution map and the character distribution map, and identifying an area corresponding to the polygonal surrounding frame with intersection with the word center line map as a character area. And finally, decoding the character area to obtain a character detection result corresponding to the target picture. Therefore, the character offset graph and the character distribution graph corresponding to the character module, the word distribution graph and the word center line graph corresponding to the word module share the fusion characteristic graph of the target picture, namely, the character information and the word information in the picture are detected by adopting two monitoring signals of the character and the word, and the accuracy of the character detection result is improved.
In some optional implementations of this embodiment, the first obtaining module 61 includes an obtaining sub-module, an extracting sub-module, and a first processing sub-module; the acquisition sub-module is used for acquiring a target picture; the extraction sub-module is used for extracting the features of the target picture by adopting a ResNet50 feature extractor to obtain at least one initial feature map; and the first processing sub-module is used for performing feature fusion processing on the initial feature map by using an up-sampling mode to obtain a fusion feature map.
In this embodiment, a ResNet50 feature extractor is used to extract features of a target picture, so that feature maps with different depths can be obtained. The shallow feature map contains more details and position information, which is more beneficial to detecting small characters, and the deep feature map contains more context semantic information, which is more beneficial to detecting large characters, so that false positive prediction can be further inhibited, and the accuracy of character detection is improved.
In some optional implementations of this embodiment, the first obtaining module 61 further includes a second processing sub-module; and the second processing submodule is used for carrying out scaling processing and normalization processing on the target image.
In some optional implementation manners of this embodiment, the determining module 63 includes a word processing sub-module, a filtering sub-module, a character processing sub-module, a difference processing sub-module, a coordinate extracting sub-module, and a determining sub-module; the word processing submodule is used for carrying out binarization processing on the word distribution diagram to obtain a first word distribution diagram; the filtering submodule is used for carrying out noise filtering on the character offset map and the character distribution map according to the first word distribution map to obtain a first character offset map and a first character distribution map; the character processing submodule is used for carrying out binarization processing on the first character offset map to obtain a second character offset map; the difference-taking processing sub-module is used for carrying out difference-taking processing on the second character offset map and the first word distribution map to obtain a comprehensive word distribution map; the coordinate extraction submodule is used for extracting a plurality of character feature point coordinates from the first character distribution diagram; and the determining submodule is used for determining the polygonal surrounding frame according to the character feature point coordinates and the comprehensive word distribution diagram.
In some optional implementations of this embodiment, the text recognition module 64 includes a noise filtering sub-module, a centerline map processing sub-module, and a recognition sub-module; the noise filtering submodule is used for carrying out noise filtering on the word central line graph according to the first word distribution graph to obtain a first word central line graph; the first word distribution diagram is obtained by carrying out binarization processing on the word distribution diagram; the central line graph processing submodule is used for carrying out binarization processing on the first word central line graph to obtain a second word central line graph; and the recognition submodule is used for recognizing the area corresponding to the polygon enclosing frame which has intersection with the second word central line graph as a character area.
In some optional implementations of this embodiment, the text detection apparatus further includes a training module; and the training module is used for taking the historical fusion characteristic graph of the historical picture as input, taking the historical character deviation graph, the historical character distribution graph, the historical word distribution graph and the historical word center line graph as output, and training by adopting a weak supervision strategy and a preset convolution network model to obtain the convolution network.
In the present embodiment, the word distribution map and the word center line map are used as the supervisory signal, character pseudo labels are generated, and then the positions corresponding to the character pseudo labels having low quality are set as non-emphasis points. Therefore, the iteration times in the convolutional network training process are reduced, and the training speed of the convolutional network is improved.
In order to solve the technical problem, an embodiment of the present application further provides a computer device. Referring to fig. 7, fig. 7 is a block diagram of a basic structure of a computer device according to the present embodiment.
The computer device 70 comprises a memory 71, a processor 72 and a network interface 73 communicatively connected to each other via a system bus. It is noted that only a computer device 70 having components 71-73 is shown, but it is understood that not all of the shown components are required and that more or fewer components may alternatively be implemented. As will be understood by those skilled in the art, the computer device is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions; its hardware includes, but is not limited to, a microprocessor, an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), a Digital Signal Processor (DSP), an embedded device, and the like.
The computer device can be a desktop computer, a notebook, a palm computer, a cloud server and other computing devices. The computer equipment can carry out man-machine interaction with a user in a keyboard mode, a mouse mode, a remote controller mode, a touch panel mode or a voice control equipment mode.
The memory 71 includes at least one type of readable storage medium including a flash memory, a hard disk, a multimedia card, a card type memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a Read Only Memory (ROM), an Electrically Erasable Programmable Read Only Memory (EEPROM), a Programmable Read Only Memory (PROM), a magnetic memory, a magnetic disk, an optical disk, etc. In some embodiments, the storage 71 may be an internal storage unit of the computer device 70, such as a hard disk or a memory of the computer device 70. In other embodiments, the memory 71 may also be an external storage device of the computer device 70, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), or the like, provided on the computer device 70. Of course, the memory 71 may also include both internal and external storage devices of the computer device 70. In this embodiment, the memory 71 is generally used for storing an operating system installed on the computer device 70 and various application software, such as program codes of a text detection method. In addition, the memory 71 may be used to temporarily store various types of data that have been output or are to be output.
The processor 72 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data Processing chip in some embodiments. The processor 72 generally serves to control the overall operation of the computer device 70. In this embodiment, the processor 72 is configured to execute the program code stored in the memory 71 or process data, for example, execute the program code of the text detection method.
The network interface 73 may include a wireless network interface or a wired network interface, and the network interface 73 is generally used to establish a communication connection between the computer device 70 and other electronic devices.
In this embodiment, the character offset map and the character distribution map corresponding to the character module, and the word distribution map and the word center line map corresponding to the word module share the fusion feature map of the target picture, that is, the character information in the picture is detected by using two monitoring signals, namely, the character and the word, so that the accuracy of the character detection result is improved.
The present application provides another embodiment, which is to provide a computer-readable storage medium storing a text detection program, where the text detection program is executable by at least one processor to cause the at least one processor to execute the steps of the text detection method as described above.
In this embodiment, the character offset map and the character distribution map corresponding to the character module, and the word distribution map and the word center line map corresponding to the word module share the fusion feature map of the target picture, that is, the character information in the picture is detected by using two monitoring signals, namely, the character and the word, so that the accuracy of the character detection result is improved.
Through the description of the foregoing embodiments, it is clear to those skilled in the art that the method of the foregoing embodiments may be implemented by software plus a necessary general hardware platform, and certainly may also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present application may be embodied in the form of a software product, which is stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (such as a mobile phone, a computer, a server, an air conditioner, or a network device) to execute the method according to the embodiments of the present application.
It is to be understood that the above-described embodiments are merely illustrative of some, but not restrictive, of the broad invention, and that the appended drawings illustrate preferred embodiments of the invention and do not limit the scope of the invention. This application is capable of embodiments in many different forms and is provided for the purpose of enabling a thorough understanding of the disclosure of the application. Although the present application has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that the present application may be practiced without these specific details or with equivalents of some of the features described in the foregoing embodiments. All equivalent structures made by using the contents of the specification and the drawings of the present application are directly or indirectly applied to other related technical fields and are within the protection scope of the present application.

Claims (10)

1. A character detection method is characterized by comprising the following steps:
acquiring a fusion feature map corresponding to a target picture;
inputting the fusion feature graph into a convolution network to obtain a character offset graph, a character distribution graph, a word distribution graph and a word center line graph of the target picture;
determining a polygon enclosing frame according to the character offset map, the word distribution map and the character distribution map;
identifying an area corresponding to the polygon bounding box having an intersection with the word centerline map as a text area;
and decoding the character area to obtain a character detection result corresponding to the target picture.
2. The text detection method according to claim 1, wherein the obtaining of the fused feature map corresponding to the target picture comprises:
acquiring a target picture;
performing feature extraction on the target picture by adopting a ResNet50 feature extractor to obtain at least one initial feature map;
and performing feature fusion processing on the initial feature map by using an up-sampling mode to obtain a fusion feature map.
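One possible sketch of this kind of feature extraction and up-sampling fusion, assuming torchvision's ResNet-50 as the backbone (the choice of stages, the bilinear up-sampling and the crude channel truncation are illustrative assumptions, not requirements of the claim):

import torch
import torch.nn.functional as F
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "c2", "layer2": "c3", "layer3": "c4", "layer4": "c5"})

def fuse(feats):
    """Upsample deeper maps and merge them top-down into one fusion feature map."""
    names = ["c5", "c4", "c3", "c2"]
    fused = feats[names[0]]
    for name in names[1:]:
        fused = F.interpolate(fused, size=feats[name].shape[-2:],
                              mode="bilinear", align_corners=False)
        # A real model would use 1x1 convolutions to match channel counts;
        # the sketch simply truncates channels to keep the example short.
        c = min(fused.shape[1], feats[name].shape[1])
        fused = fused[:, :c] + feats[name][:, :c]
    return fused

x = torch.randn(1, 3, 640, 640)      # preprocessed target picture
fused_map = fuse(backbone(x))        # fusion feature map at 1/4 of the input resolution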
3. The character detection method according to claim 2, wherein before the feature extraction is performed on the target picture by using the ResNet50 feature extractor to obtain the at least one initial feature map, the character detection method further comprises:
and performing scaling processing and normalization processing on the target picture.
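A minimal preprocessing sketch in this spirit; the 640x640 target size and the ImageNet channel statistics are assumptions, since the claim fixes neither:

import cv2
import numpy as np

def preprocess(img_bgr, size=640):
    """Scale the target picture to a fixed size and normalize it channel-wise."""
    img = cv2.resize(img_bgr, (size, size), interpolation=cv2.INTER_LINEAR)
    img = img[:, :, ::-1].astype(np.float32) / 255.0           # BGR -> RGB, scale to [0, 1]
    mean = np.array([0.485, 0.456, 0.406], np.float32)
    std = np.array([0.229, 0.224, 0.225], np.float32)
    img = (img - mean) / std
    return img.transpose(2, 0, 1)[None]                        # (1, 3, H, W) for the network

batch = preprocess(np.zeros((480, 360, 3), np.uint8))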
4. The character detection method according to claim 1, wherein the determining a polygon bounding box according to the character offset map, the word distribution map and the character distribution map comprises:
performing binarization processing on the word distribution map to obtain a first word distribution map;
performing noise filtering on the character offset map and the character distribution map according to the first word distribution map to obtain a first character offset map and a first character distribution map;
performing binarization processing on the first character offset map to obtain a second character offset map;
performing difference processing on the second character offset map and the first word distribution map to obtain a comprehensive word distribution map;
extracting a plurality of character feature point coordinates from the first character distribution map;
and determining a polygon bounding box according to the character feature point coordinates and the comprehensive word distribution map.
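Read literally, these operations might be approximated as below; the 0.5 and 0.7 thresholds, the reading of the difference value processing as a per-pixel subtraction, and the peak-based extraction of character feature points are assumptions made only for illustration:

import cv2
import numpy as np

def polygon_boxes(char_offset, char_dist, word_dist, thr=0.5):
    # Binarize the word distribution map -> first word distribution map.
    word_bin = (word_dist > thr).astype(np.float32)
    # Noise-filter the character maps with the word mask -> first offset / distribution maps.
    off1 = char_offset * word_bin
    char1 = char_dist * word_bin
    # Binarize the magnitude of the offset map -> second character offset map.
    off2 = (np.linalg.norm(off1, axis=0) > thr).astype(np.float32)
    # Difference processing of the offset map against the word map -> comprehensive word map.
    comprehensive = np.clip(word_bin - off2, 0, 1).astype(np.uint8)
    # Character feature points taken as peaks of the filtered character distribution map.
    ys, xs = np.where(char1 > 0.7)
    points = np.stack([xs, ys], axis=1) if len(xs) else np.empty((0, 2), int)
    # Polygon bounding boxes from the connected regions of the comprehensive word map.
    contours, _ = cv2.findContours(comprehensive, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = [c.reshape(-1, 2) for c in contours if cv2.contourArea(c) > 0]
    return boxes, points

h = w = 160
boxes, pts = polygon_boxes(np.zeros((2, h, w), np.float32),
                           np.random.rand(h, w).astype(np.float32),
                           np.random.rand(h, w).astype(np.float32))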
5. The character detection method according to claim 1, wherein the identifying, as a character area, an area corresponding to the polygon bounding box that intersects the word center line map comprises:
performing noise filtering on the word center line map according to a first word distribution map to obtain a first word center line map, the first word distribution map being obtained by performing binarization processing on the word distribution map;
performing binarization processing on the first word center line map to obtain a second word center line map;
and identifying an area corresponding to a polygon bounding box that intersects the second word center line map as a character area.
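Under the same illustrative assumptions (a 0.5 threshold and a mask-overlap test for the intersection), this step might be sketched as:

import cv2
import numpy as np

def filter_boxes_by_centerline(boxes, centerline, word_bin, thr=0.5):
    """Keep polygon bounding boxes whose region overlaps the cleaned word center line."""
    # Noise-filter the center line map with the first word distribution map, then binarize.
    line1 = centerline * word_bin
    line2 = (line1 > thr).astype(np.uint8)
    kept = []
    for poly in boxes:
        mask = np.zeros_like(line2)
        cv2.fillPoly(mask, [np.asarray(poly, np.int32)], 1)
        if (mask & line2).any():
            kept.append(poly)          # this region is treated as a character area
    return kept

demo = filter_boxes_by_centerline(
    [np.array([[10, 10], [50, 10], [50, 30], [10, 30]])],
    np.ones((160, 160), np.float32),
    np.ones((160, 160), np.float32))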
6. The character detection method according to claim 1, wherein before the fusion feature map is input into the convolution network to obtain the character offset map, the character distribution map, the word distribution map and the word center line map of the target picture, the character detection method further comprises:
taking a historical fusion feature map of a historical picture as an input, taking a historical character offset map, a historical character distribution map, a historical word distribution map and a historical word center line map as outputs, and training with a weak supervision strategy and a preset convolution network model to obtain the convolution network.
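A bare-bones training step in this spirit is sketched below; the particular loss terms, their equal weighting and the way character-level pseudo labels would be synthesized from word-level annotations under the weak supervision strategy are assumptions, since the claim names only the strategy itself:

import torch
import torch.nn.functional as F

def training_step(model, fused_map, targets, optimizer):
    """One weakly supervised step: word maps carry real labels, character maps use
    pseudo labels derived from word-level annotations (derivation not shown)."""
    char_off, char_dist, word_dist, centerline = model(fused_map)
    loss = (F.mse_loss(char_off, targets["char_offset"]) +
            F.binary_cross_entropy(char_dist, targets["char_dist"]) +
            F.binary_cross_entropy(word_dist, targets["word_dist"]) +
            F.binary_cross_entropy(centerline, targets["centerline"]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Such a step could be driven by the two-branch head sketched earlier; the weak supervision lives entirely in how the character-level targets are generated, which the claim leaves open.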
7. A character detection apparatus, comprising:
the first acquisition module is used for acquiring a fusion feature map corresponding to a target picture;
the second acquisition module is used for inputting the fusion feature map into a convolution network to acquire a character offset map, a character distribution map, a word distribution map and a word center line map of the target picture;
the determining module is used for determining a polygon bounding box according to the character offset map, the word distribution map and the character distribution map;
the character recognition module is used for identifying an area corresponding to the polygon bounding box that intersects the word center line map as a character area;
and the character decoding module is used for decoding the character area to obtain a character detection result corresponding to the target picture.
8. The character detection apparatus according to claim 7, wherein the first acquisition module comprises an acquisition submodule, an extraction submodule and a first processing submodule;
the acquisition submodule is used for acquiring a target picture;
the extraction submodule is used for extracting the features of the target picture by adopting a ResNet50 feature extractor to obtain at least one initial feature map;
and the first processing submodule is used for performing feature fusion processing on the initial feature map by using an up-sampling mode to obtain a fusion feature map.
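Viewed as software structure, claims 7 and 8 roughly correspond to an object composed of per-step callables; the sketch below uses invented names purely for illustration and is not the claimed apparatus itself:

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CharacterDetectionApparatus:
    """Hypothetical composition mirroring the claimed modules."""
    first_acquisition: Callable    # target picture -> fusion feature map
    second_acquisition: Callable   # fusion feature map -> four maps
    determine_boxes: Callable      # maps -> polygon bounding boxes
    recognize_regions: Callable    # boxes + word center line map -> character areas
    decode: Callable               # character areas -> character detection result

    def run(self, picture) -> List:
        fused = self.first_acquisition(picture)
        char_off, char_dist, word_dist, centerline = self.second_acquisition(fused)
        boxes = self.determine_boxes(char_off, char_dist, word_dist)
        areas = self.recognize_regions(boxes, centerline)
        return self.decode(areas)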
9. A computer device comprising a memory in which a computer program is stored and a processor which, when executing the computer program, carries out the steps of the character detection method according to any one of claims 1 to 6.
10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the steps of the character detection method according to any one of claims 1 to 6.
CN202210692483.0A, filed 2022-06-17: Character detection method and device, computer equipment and storage medium (status: Pending; published as CN115273054A)

Priority Applications (1)

Application Number: CN202210692483.0A; Priority/Filing Date: 2022-06-17; Title: Character detection method and device, computer equipment and storage medium

Publications (1)

Publication Number: CN115273054A; Publication Date: 2022-11-01

Family

ID=83761620

Family Applications (1)

Application Number: CN202210692483.0A; Priority/Filing Date: 2022-06-17; Title: Character detection method and device, computer equipment and storage medium

Country Status (1)

Country: CN; Publication: CN115273054A (en)

Similar Documents

Publication Title
CN112685565A (en) Text classification method based on multi-mode information fusion and related equipment thereof
CN112396049A (en) Text error correction method and device, computer equipment and storage medium
CN113657274B (en) Table generation method and device, electronic equipment and storage medium
CN110503682B (en) Rectangular control identification method and device, terminal and storage medium
CN113313083B (en) Text detection method and device
CN112686243A (en) Method and device for intelligently identifying picture characters, computer equipment and storage medium
CN108021863B (en) Electronic device, age classification method based on image and storage medium
CN112395834B (en) Brain graph generation method, device and equipment based on picture input and storage medium
CN112861842A (en) Case text recognition method based on OCR and electronic equipment
CN113159026A (en) Image processing method, image processing apparatus, electronic device, and medium
CN111192312A (en) Depth image acquisition method, device, equipment and medium based on deep learning
JP2023119593A (en) Method and apparatus for recognizing document image, storage medium, and electronic device
CN112651399B (en) Method for detecting same-line characters in inclined image and related equipment thereof
CN114724133A (en) Character detection and model training method, device, equipment and storage medium
CN114049646A (en) Bank card identification method and device, computer equipment and storage medium
CN113378837A (en) License plate shielding identification method and device, electronic equipment and storage medium
CN111079749B (en) End-to-end commodity price tag character recognition method and system with gesture correction
CN110826488B (en) Image identification method and device for electronic document and storage equipment
CN116994267A (en) Nameplate VIN code identification method, device, storage medium and equipment
CN111783781A (en) Malicious clause identification method, device and equipment based on product agreement character identification
CN116484215A (en) Diffusion model-based text generation model training and text generation method and device
CN115273054A (en) Character detection method and device, computer equipment and storage medium
JP2024507308A (en) Image sample generation method, text recognition method, device, electronic device, storage medium and computer program
CN113011132B (en) Vertical text recognition method, device, computer equipment and storage medium
CN115019321A (en) Text recognition method, text model training method, text recognition device, text model training equipment and storage medium

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination