CN111582265A - Text detection method and device, electronic equipment and storage medium - Google Patents

Text detection method and device, electronic equipment and storage medium

Info

Publication number
CN111582265A
CN111582265A
Authority
CN
China
Prior art keywords
text
detected
local
detection network
detection
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202010409367.4A
Other languages
Chinese (zh)
Inventor
毕研广
胡志强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Sensetime Intelligent Technology Co Ltd
Original Assignee
Shanghai Sensetime Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Sensetime Intelligent Technology Co Ltd filed Critical Shanghai Sensetime Intelligent Technology Co Ltd
Priority to CN202010409367.4A priority Critical patent/CN111582265A/en
Publication of CN111582265A publication Critical patent/CN111582265A/en
Withdrawn legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/22: Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00: Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10: Character recognition
    • G06V30/14: Image acquisition
    • G06V30/148: Segmentation of character regions
    • G06V30/153: Segmentation of character regions using recognition of characters or words

Abstract

The disclosure relates to a text detection method and apparatus, an electronic device, and a storage medium. The method includes: acquiring a text to be detected in an image, and inputting the text to be detected into a text detection network for feature extraction to obtain first feature data; and performing global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection. The method and apparatus improve both the accuracy and the processing speed of text detection.

Description

Text detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of detection, and in particular, to a text detection method and apparatus, an electronic device, and a storage medium.
Background
Text is an important carrier of information and appears in many application scenarios, such as streets, books, menus, invoices, license plates, and product packaging. Accurately detecting the characters in a text is a technical problem to be solved: the more accurate the detection result, the more accurate the resulting character recognition. In some practical scenarios, however, the text is irregular, for example curved into a ring or undulating up and down. Compared with regular text, predicting a text border for irregular text, and thereby finally determining the text border's actual position, is considerably more difficult, so a simple and efficient detection scheme for irregular text is urgently needed. No effective solution exists in the related art.
Disclosure of Invention
In view of this, the present disclosure provides a technical solution for text detection.
According to an aspect of the present disclosure, there is provided a text detection method, the method including:
acquiring a text to be detected in an image, and inputting the text to be detected into a text detection network for feature extraction to obtain first feature data;
and performing global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection.
By adopting the method and the device, the text to be detected can be input into the text detection network for feature extraction to obtain the first feature data, and global segmentation and local regression can be performed according to the text detection network and the first feature data to obtain a prediction result for text detection. The prediction result enables text border prediction for the text to be detected, from which the actual position of the text border is finally determined.
In a possible implementation manner, after obtaining the prediction result for text detection, the method further includes: and reconstructing a text border of the text to be detected according to the prediction result.
By adopting the method and the device, the text frame of the text to be detected can be reconstructed according to the prediction result, namely, after the text frame of the text to be detected is predicted according to the prediction result, the actual position of the text frame can be determined according to the reconstruction processing based on the prediction result.
In a possible implementation manner, performing global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection includes:
performing global segmentation processing on the first feature data in the text detection network to obtain a text confidence of the text to be detected;
performing local regression processing on the first feature data in the text detection network to obtain a local upper and lower boundary distance of the text to be detected and a local angle of the text to be detected;
the prediction result includes: the text confidence of the text to be detected, the local upper and lower boundary distance of the text to be detected, and the local angle of the text to be detected.
By adopting the method and the device, the text confidence of the text to be detected can be obtained by globally segmenting the first feature data, and the local upper and lower boundary distance and the local angle of the text to be detected can be obtained by locally regressing the first feature data. Taking these three quantities together as the prediction result enables text border prediction for the text to be detected, from which the actual position of the text border is finally determined.
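The two branches above can be pictured as a shared feature map feeding a segmentation head and a regression head. The following is a minimal numpy sketch; the 1x1-convolution weights `w_seg` and `w_reg` and the exact head shapes are illustrative assumptions, not taken from the patent:

```python
import numpy as np

def detection_head(features, w_seg, w_reg):
    """Sketch of the two prediction branches on shared features.

    features: (C, H, W) feature map (the "first feature data").
    w_seg:    (C,)    weights of a hypothetical 1x1 segmentation head.
    w_reg:    (3, C)  weights of a hypothetical 1x1 regression head.
    """
    # Global segmentation branch: per-pixel text confidence via a sigmoid.
    logits = np.tensordot(w_seg, features, axes=([0], [0]))   # (H, W)
    confidence = 1.0 / (1.0 + np.exp(-logits))
    # Local regression branch: per-pixel upper distance, lower distance, angle.
    reg = np.tensordot(w_reg, features, axes=([1], [0]))      # (3, H, W)
    d_up, d_down, angle = reg[0], reg[1], reg[2]
    return confidence, d_up, d_down, angle
```

Both branches read the same first feature data, which is what makes the later multi-task training natural: the segmentation and regression losses propagate back into the shared features.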
In a possible implementation manner, after performing global segmentation processing on the first feature data in the text detection network to obtain a text confidence, the method further includes:
obtaining a segmentation mask according to the text confidence of the text to be detected;
and masking the text to be detected according to the segmentation mask to obtain the distance between the upper and lower boundaries of the text to be detected.
By adopting the method and the device, a segmentation mask can be obtained according to the text confidence of the text to be detected, and masking the text to be detected according to the segmentation mask yields the upper and lower boundary distance of the text to be detected. Note that this distance refers to the whole text border of the text to be detected and is distinct from the local upper and lower boundary distance; from it, the center line region of the text to be detected, used to determine anchor point positions, can be obtained.
In a possible implementation manner, the text to be detected includes at least one line of text;
before the obtaining of the first feature data, the method further includes:
obtaining second feature data in response to the feature extraction, wherein the second feature data is used for representing any line of text in the text to be detected;
and segmenting any line of text from the text to be detected according to the second feature data and the text detection network.
By adopting the method and the device, the features of any line of text in the text to be detected can be obtained from the second feature data, so that any line of text can be segmented from the text to be detected according to the second feature data and the text detection network, and text detection processing can be performed on that line. In this processing, after the prediction result for text detection is obtained, text border prediction for the text to be detected is realized through the prediction result, and the actual position of the text border is finally determined.
In a possible implementation manner, reconstructing a text border of the text to be detected according to the prediction result includes:
and reconstructing a text border of the text to be detected according to the distance between the upper boundary and the lower boundary of the text to be detected, the local distance between the upper boundary and the lower boundary of the text to be detected and the local angle of the text to be detected.
By adopting the method and the device, the distance between the upper and lower boundaries of the text to be detected, the local upper and lower boundary distance of the text to be detected, and the local angle of the text to be detected can be obtained from the prediction result, so that the text border of the text to be detected can be reconstructed from these three quantities.
In a possible implementation manner, reconstructing a text border of the text to be detected according to the distance between the upper boundary and the lower boundary of the text to be detected, the distance between the local upper boundary and the local lower boundary of the text to be detected, and the local angle of the text to be detected includes:
performing fusion processing according to the distance between the upper boundary and the lower boundary of the text to be detected to obtain a center line region of the text to be detected;
obtaining local boundaries respectively corresponding to at least two anchor points on the center line region, according to the local upper and lower boundary distances of the text to be detected at those anchor points and the local angles of the text to be detected;
and splicing the local boundaries corresponding to the at least two anchor points to obtain a text frame of the text to be detected.
By adopting the method and the device, fusion processing can be performed according to the upper and lower boundary distances of the text to be detected to obtain the center line region of the text to be detected. Local boundaries corresponding to at least two anchor points on the center line region can then be obtained from the local upper and lower boundary distances and local angles at those anchor points, and splicing these local boundaries finally yields the text border of the text to be detected.
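The splicing step above can be sketched as follows. This is an illustrative geometric reading, assuming each anchor carries its local angle and its distances to the upper and lower boundaries, and that the border polygon is the upper boundary traversed left to right followed by the lower boundary right to left; the patent does not prescribe these exact conventions:

```python
import math

def reconstruct_border(anchors):
    """anchors: list of (x, y, d_up, d_down, theta) sampled along the
    center line; theta is the local text direction in radians.
    Returns the stitched border polygon: upper boundary left-to-right,
    then lower boundary right-to-left."""
    upper, lower = [], []
    for x, y, d_up, d_down, theta in anchors:
        # Unit normal to the local text direction.
        nx, ny = -math.sin(theta), math.cos(theta)
        upper.append((x - nx * d_up, y - ny * d_up))
        lower.append((x + nx * d_down, y + ny * d_down))
    return upper + lower[::-1]
```

For horizontal text (theta = 0) this degenerates to the familiar axis-aligned box; for curved or undulating text, each anchor contributes its own locally rotated boundary segment, which is what lets the spliced border follow an irregular shape.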
In a possible implementation manner, after obtaining the centerline region of the text to be detected, the method further includes:
and sampling the central line area to obtain at least two anchor points on the central line area.
By adopting the method and the device, at least two anchor points on the central line area can be obtained through sampling processing of the central line area, so that local boundaries corresponding to the at least two anchor points are spliced by taking the at least two anchor points on the central line area as references, and finally a text frame of the text to be detected is obtained.
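One simple sampling rule consistent with the description is to pick evenly spaced points along the ordered center line; the uniform spacing here is an assumption for illustration, since the text also mentions selecting representative points such as peaks or valleys:

```python
def sample_anchors(centerline, num_anchors):
    """Pick num_anchors (>= 2) evenly spaced points from an ordered
    list of (x, y) center line points, always keeping both end points."""
    if num_anchors >= len(centerline):
        return list(centerline)
    step = (len(centerline) - 1) / (num_anchors - 1)
    return [centerline[round(i * step)] for i in range(num_anchors)]
```

More anchors give a border that follows the curve more closely, at the cost of more local-boundary computations during reconstruction.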
In a possible implementation manner, the method further includes:
training a detection network for multitask training according to text training data to obtain the text detection network;
the text detection network is a trained detection network.
By adopting the method and the device, a detection network can be trained for multiple tasks according to text training data to obtain the text detection network, i.e., the trained detection network. Because the text detection network is obtained from text training data, performing text detection with it improves detection precision while remaining simple and easy to implement. Moreover, compared with a text detection network obtained by single-task training, multi-task training improves both the detection accuracy and the processing speed of text detection.
In a possible implementation manner, the training a detection network for multitask training according to text training data to obtain the text detection network includes:
executing a global segmentation processing task on the text training data to obtain a first loss function;
executing a local regression processing task on the text training data to obtain a second loss function and a third loss function;
obtaining a fourth loss function according to the first loss function, the second loss function and the third loss function;
and adjusting network parameters according to the back propagation of the fourth loss function to obtain the text detection network.
By adopting the method and the device, in the process of training the text detection network, a total loss function can be obtained from the global segmentation task and from the upper and lower boundary distance regression and direction regression of the local regression task. That is, the fourth loss function (the total loss) is obtained from the first, second, and third loss functions, and the network parameters are adjusted by back-propagating it, yielding the text detection network. Using the total loss function obtained in multi-task training improves the detection accuracy of text detection with the text detection network.
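A minimal sketch of the fourth (total) loss. The concrete loss forms here, binary cross-entropy for segmentation and L1 for the two regressions, are illustrative assumptions; the patent only specifies that the fourth loss is obtained from the first three and back-propagated to adjust the network parameters:

```python
import numpy as np

def multitask_loss(conf_pred, conf_gt, dist_pred, dist_gt, ang_pred, ang_gt):
    eps = 1e-7
    p = np.clip(conf_pred, eps, 1 - eps)
    # First loss: binary cross-entropy for the global segmentation task.
    l1 = -np.mean(conf_gt * np.log(p) + (1 - conf_gt) * np.log(1 - p))
    # Second loss: L1 regression of the local upper/lower boundary distances.
    l2 = np.mean(np.abs(dist_pred - dist_gt))
    # Third loss: L1 regression of the local angles.
    l3 = np.mean(np.abs(ang_pred - ang_gt))
    # Fourth loss: the total that is back-propagated during training.
    return l1 + l2 + l3
```

In practice the three terms are often weighted to balance the tasks; an unweighted sum is used here only to mirror "obtaining a fourth loss function according to the first, second and third loss functions."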
According to an aspect of the present disclosure, there is provided a text detection apparatus, the apparatus including:
an acquisition unit, configured to acquire a text to be detected in an image, and input the text to be detected into a text detection network for feature extraction to obtain first feature data;
and a text detection unit, configured to perform global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection.
In a possible implementation manner, the apparatus further includes a reconstructing unit configured to:
and reconstructing a text border of the text to be detected according to the prediction result.
In a possible implementation manner, the text detection unit is configured to:
performing global segmentation processing on the first feature data in the text detection network to obtain a text confidence of the text to be detected;
performing local regression processing on the first feature data in the text detection network to obtain a local upper and lower boundary distance of the text to be detected and a local angle of the text to be detected;
the prediction result includes: the text confidence of the text to be detected, the local upper and lower boundary distance of the text to be detected, and the local angle of the text to be detected.
In a possible implementation manner, the apparatus further includes a boundary determining unit, configured to:
obtaining a segmentation mask according to the text confidence of the text to be detected;
and masking the text to be detected according to the segmentation mask to obtain the distance between the upper boundary and the lower boundary of the text to be detected.
In a possible implementation manner, the text to be detected includes at least one line of text;
the apparatus further comprises a text segmentation unit configured to:
obtaining second feature data in response to the feature extraction, wherein the second feature data is used for representing any line of text in the text to be detected;
and segmenting any line of text from the text to be detected according to the second feature data and the text detection network.
In a possible implementation manner, the reconstructing unit is configured to:
and reconstructing a text border of the text to be detected according to the distance between the upper boundary and the lower boundary of the text to be detected, the local distance between the upper boundary and the lower boundary of the text to be detected and the local angle of the text to be detected.
In a possible implementation manner, the reconstructing unit is configured to:
performing fusion processing according to the distance between the upper boundary and the lower boundary of the text to be detected to obtain a center line region of the text to be detected;
obtaining local boundaries corresponding to the at least two anchor points respectively according to the local upper and lower boundary distances of the text to be detected and the local angles of the text to be detected, wherein the local upper and lower boundary distances of the text to be detected correspond to the at least two anchor points on the central line area respectively;
and splicing the local boundaries corresponding to the at least two anchor points to obtain a text frame of the text to be detected.
In a possible implementation manner, the apparatus further includes a sampling unit, configured to:
and sampling the central line area to obtain at least two anchor points on the central line area.
In a possible implementation manner, the apparatus further includes a training unit, configured to:
training a detection network for multitask training according to text training data to obtain the text detection network;
the text detection network is a trained detection network.
In a possible implementation manner, the training unit is configured to:
executing a global segmentation processing task on the text training data to obtain a first loss function;
executing a local regression processing task on the text training data to obtain a second loss function and a third loss function;
obtaining a fourth loss function according to the first loss function, the second loss function and the third loss function;
and adjusting network parameters according to the back propagation of the fourth loss function to obtain the text detection network.
According to an aspect of the present disclosure, there is provided an electronic device including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured to perform any of the methods described above.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of the above.
By adopting the embodiment of the disclosure, in response to an image detection operation, a text to be detected in the image can be obtained, a center line region of the text to be detected is obtained according to the text to be detected and a text detection network, and a text border of the text to be detected is determined by taking the center line region as a reference.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments, which proceeds with reference to the accompanying drawings.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
Fig. 1 shows a flow diagram of a text detection method according to an embodiment of the present disclosure.
Fig. 2-4 show schematic diagrams of irregular text sequences in text to be detected according to an embodiment of the present disclosure.
Fig. 5 shows a schematic diagram of recognition of an irregular text sequence center line in a text to be detected according to an embodiment of the present disclosure.
Fig. 6 illustrates a schematic diagram of multitasking in a text detection method according to an embodiment of the present disclosure.
Fig. 7 shows a schematic diagram of post-processing in a text detection method according to an embodiment of the present disclosure.
Fig. 8 shows a block diagram of a text detection apparatus according to an embodiment of the present disclosure.
Fig. 9 shows a block diagram of an electronic device according to an embodiment of the disclosure.
Fig. 10 shows a block diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, instance, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
The term "and/or" herein merely describes an association between associated objects, meaning that three relationships may exist; for example, "A and/or B" may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the term "at least one" herein means any one of a plurality, or any combination of at least two of a plurality; for example, "including at least one of A, B, and C" may mean including any one or more elements selected from the set consisting of A, B, and C.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present disclosure. It will be understood by those skilled in the art that the present disclosure may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present disclosure.
Fig. 1 shows a flowchart of a text detection method according to an embodiment of the present disclosure. The method is applied to a text detection apparatus; for example, when the apparatus is deployed in a terminal device, a server, or another processing device, it may perform recognition of the center line region of a text to be detected, text border detection of the text to be detected, and other processing. The terminal device may be a User Equipment (UE), a mobile device, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, or the like. In some possible implementations, the method may be implemented by a processor calling computer-readable instructions stored in a memory. As shown in fig. 1, the process includes:
step S101, a text to be detected in an image is obtained, and the text to be detected is input into a text detection network for feature extraction, so that first feature data is obtained.
In an example, the text to be detected in the image may be acquired in response to an image detection operation, and may be recognized using Optical Character Recognition (OCR); either a conventional OCR recognition framework or a deep-learning-based OCR recognition framework may be adopted. After the text to be detected is identified, it can be input into a text detection network for feature extraction, obtaining first feature data that at least characterizes the upper and lower boundaries of the text to be detected.
The conventional OCR recognition framework may include a preprocessing stage and a recognition stage. The preprocessing stage may correct the image (for example, a skewed image hinders recognition of the text to be detected), and the recognition stage recognizes the text sequence in the text to be detected to obtain a recognition result. The deep-learning-based OCR recognition framework differs in that text sequence samples are used as training data; once the framework is obtained by deep learning, it recognizes the text sequence in the text to be detected to obtain the recognition result.
Step S102, performing global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection.
In an example, the first feature data may be subjected to global segmentation processing in a text detection network to obtain a text confidence of the text to be detected, and the first feature data may be subjected to local regression processing in the text detection network to obtain a local upper and lower boundary distance of the text to be detected and a local angle of the text to be detected.
In one example, the prediction result may include: the text confidence of the text to be detected, the local upper and lower boundary distance of the text to be detected and the local angle of the text to be detected can reconstruct the text border of the text to be detected according to the prediction result.
By adopting the method and the device, the text to be detected can be input into the text detection network for feature extraction to obtain the first feature data, and global segmentation and local regression can be performed according to the text detection network and the first feature data to obtain a prediction result for text detection. The prediction result enables text border prediction for the text to be detected, from which the actual position of the text border is finally determined.
It should be noted that the text detection network is obtained by training on text training data, so the prediction result obtained by global segmentation and local regression with the text detection network and the first feature data better meets the precision requirement of text detection. That is, after the first feature data is obtained from the text to be detected, the prediction result is output according to the first feature data and the text detection network, and the actual position of the text border is finally determined from it.
In an example, figs. 2 to 4 show schematic diagrams of irregular text sequences in a text to be detected according to an embodiment of the present disclosure. As shown in figs. 2 to 4, in some practical scenarios the text sequences are irregular, for example curved into a ring or undulating up and down, as indicated by a first text border 101 and a second text border 102. For such irregular text sequences, a deep-learning-based OCR recognition framework alone cannot accurately recognize the text sequence and the text border. To achieve accurate recognition as simply and efficiently as possible, after the first feature data is obtained from the text to be detected, the prediction result is output according to the first feature data and the text detection network, and the text border of the text to be detected is reconstructed from the prediction result. This end-to-end detection mode is simple and efficient, improving both detection accuracy and processing speed.
In one example, after the global segmentation processing is performed on the first feature data in the text detection network to obtain a text confidence, the method further includes: obtaining a segmentation mask according to the text confidence of the text to be detected, and masking the text to be detected according to the segmentation mask to obtain the distance between the upper and lower boundaries of the text to be detected. For example, after the confidence is obtained through global segmentation, text and non-text pixels can be classified and clustered according to the confidence to obtain a segmentation mask; masking the text to be detected with this mask then yields the upper and lower boundary distance of the text to be detected (i.e., the distance between the upper and lower boundaries of the whole text border).
In one example, reconstructing a text border of the text to be detected according to the prediction result includes: reconstructing the text border of the text to be detected according to the distance between the upper boundary and the lower boundary of the text to be detected, the local upper and lower boundary distance of the text to be detected, and the local angle of the text to be detected. For example, a center line region of the text to be detected may be obtained by performing fusion processing on the upper and lower boundary distances of the text to be detected, and local boundaries corresponding to at least two anchor points on the center line region may be obtained according to the local upper and lower boundary distances and the local angles of the text to be detected; that is, for each anchor point, the corresponding local boundary can be obtained from its local angle and local upper and lower boundary distances. The local boundaries corresponding to the at least two anchor points are then spliced to obtain the text border of the text to be detected.
In one example, the center line region is sampled to obtain at least two anchor points on the center line region. The anchor points are representative selected points in the center line region that serve as reference bases (for example, points at a peak or valley position of the irregular shape); local boundaries corresponding to each of the at least two anchor points are then obtained according to those anchor points.
In an example, the text to be detected may include at least one line of text. In the feature extraction process of step S101, second feature data representing any line of text in the text to be detected may also be obtained. After any line of text is segmented from the text to be detected according to the second feature data and the text detection network, text detection processing is performed on that line: for example, global segmentation and local regression may be performed according to the text detection network and the first feature data to obtain a prediction result for text detection, text border prediction of the text to be detected may be realized according to the prediction result, and the actual position of the text border may finally be determined according to the prediction result.
Fig. 5 shows a schematic diagram of center line recognition of an irregular text sequence in a text to be detected according to an embodiment of the present disclosure. As shown in fig. 5, a text sequence "ABCDE" has an up-and-down shape and is identified by a first text border 101 and a second text border 102. Because a conventional deep-learning-based OCR recognition framework cannot accurately recognize the text sequence "ABCDE" and its text border, the text detection network of the present disclosure can be adopted: feature extraction is performed for the distance between the upper and lower boundaries, global segmentation and local regression are performed according to the text detection network and the first feature data to obtain a prediction result for text detection, and the text border of the text to be detected is reconstructed according to the prediction result, so as to accurately recognize the text border corresponding to the text sequence "ABCDE" (identified by the first text border 101 and the second text border 102). After feature extraction, first feature data representing the upper and lower boundaries of the text sequence "ABCDE" can be obtained, and global segmentation and local regression are performed on the first feature data to obtain a prediction result, where the prediction result at least includes: the character confidence of the text to be detected, the local upper and lower boundary distance of the text to be detected, and the local angle of the text to be detected.
A segmentation mask is obtained according to the character confidence of the text to be detected, and mask processing is performed on the text to be detected according to the segmentation mask to obtain the distance between the upper boundary and the lower boundary of the text to be detected. Fusion processing is performed on the upper and lower boundary distances of the text to be detected to obtain a center line region of the text to be detected (the center line region is identified by an upper frame 11 and a lower frame 12 of the center line region). Anchor points 13 serving as reference bases are acquired in the center line region (there may be one or more anchor points 13; fig. 5 only shows one example, and the number or position of the anchor points is not limited to this example in practical application). Local boundaries corresponding to each of the at least two anchor points are obtained according to the local upper and lower boundary distances and local angles of the text to be detected corresponding to the at least two anchor points on the center line region, and the local boundaries corresponding to the at least two anchor points are spliced to obtain the text border of the text to be detected. The present disclosure can also obtain the center line 14 of the center line region according to the anchor points and the text detection network, so that the position of the text border of the text sequence "ABCDE" is finally determined by using the center line 14.
In a possible implementation manner, the method further includes: training a detection network by multitask training according to text training data to obtain the text detection network; that is, the text detection network is a trained detection network. The multitask training in the present disclosure covers, for example, a global segmentation task for roughly segmenting a text sequence and a local regression task for determining a text border (regression of both the boundary and the angle). Since the two training branches can share one set of network parameters instead of each branch using its own independent parameters (i.e., two sets of network parameters), the processing efficiency can be improved and the training complexity can be reduced.
In a possible implementation manner, training a detection network by multitask training according to text training data to obtain the text detection network includes: executing a global segmentation processing task on the text training data to obtain a first loss function (e.g., a cross-entropy loss function); and executing a local regression processing task on the text training data to obtain a second loss function (e.g., an IoU loss function in a dimensionality-reduced form) and a third loss function (e.g., a symmetric trigonometric function). A fourth loss function (i.e., a total loss function) may be obtained according to the first loss function, the second loss function, and the third loss function, and the text detection network may be obtained after adjusting the network parameters according to back propagation of the fourth loss function. Through the learning of multitask training, the text detection network can continuously update its network parameters, so that the required text detection precision is achieved.
In a possible implementation manner, when the multitask training is performed, the global segmentation task may be trained on the text training data using a plurality of processing paths in its training branch; likewise, the local regression task may be trained on the text training data using a plurality of processing paths in its training branch. By realizing the global segmentation task with multiple paths, a more accurate text detection result can be obtained than with a single path.
Application example:
for text detection, with the development of deep learning and computer vision technology, convolutional neural networks can be adopted as detection tools to realize text detection. Application scenarios applicable to text detection include: real-time text translation; document/invoice/menu recognition; recognition of street signs, shop names, and text on product packaging; license plate recognition; and the like. In text detection, recognizing irregular characters is particularly difficult, for example, characters in a curved ring shape or an up-and-down shape. Due to the particularity of the shape of irregular text, conventional text detection requires a convolutional neural network obtained by complex modeling, with many detection parameters or several cascaded stages, to finally obtain the position of the text border of the text to be detected. Compared with such a convolutional neural network obtained through complex modeling, a simpler and more efficient detection scheme is needed, that is, achieving more accurate detection of the text border position with a simpler neural network.
By adopting the present disclosure, the neural network can be the text detection network obtained after training. The training process of the text detection network is based on multitask learning, which includes three tasks: global segmentation processing, regression processing of local boundary detection, and regression processing of local direction detection. The global segmentation processing may use a classification loss function (such as a cross-entropy loss function, which is suited to classification) to realize rough segmentation of the text. The regression processing of local boundary detection may use a normalized loss function (such as an IoU loss function, which is insensitive to scale and can effectively regress text boundaries at different scales) to realize detection of the distance between the upper and lower boundaries. The regression processing of local direction detection may use a symmetric loss function (e.g., a symmetric trigonometric function, which can make the network converge better given that the real and predicted values of nearly horizontal text are likely to deviate widely) to obtain the angle information. In an example, during training of the text detection network, the cross-entropy loss function, the IoU loss function, and the symmetric trigonometric function are added with weights of 1, 1 and 10, respectively, to obtain a total loss function; the text detection network is trained according to the total loss function, and its network parameters are continuously updated until convergence, so that the determined position of the text border is obtained directly from the text detection network.
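The weighted combination of the three task losses can be sketched as below; the function signature is illustrative (the disclosure specifies only the weights 1, 1 and 10, not an API):

```python
def total_loss(seg_loss, boundary_loss, angle_loss,
               w_seg=1.0, w_boundary=1.0, w_angle=10.0):
    """Weighted sum of the three per-task losses: global segmentation
    (cross-entropy), local boundary regression (IoU-style) and local
    direction regression (symmetric trigonometric). The default weights
    1, 1 and 10 follow the example in the text."""
    return w_seg * seg_loss + w_boundary * boundary_loss + w_angle * angle_loss

# 1*0.3 + 1*0.2 + 10*0.05 = 1.0
loss = total_loss(seg_loss=0.3, boundary_loss=0.2, angle_loss=0.05)
```

In practice `seg_loss`, `boundary_loss` and `angle_loss` would be tensors produced by the three branches, and the scalar `loss` is what back propagation is run on.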
For the IoU loss function, an IoU loss function in a dimensionality-reduced form can be adopted, that is: the original IoU loss function is subjected to dimensionality reduction to obtain an IoU loss function in a dimensionality-reduced form, and the variables corresponding to the upper and lower boundary points (such as the distances to the upper and lower boundary points) are constrained according to this loss function. It should be noted that in regression processing, the original IoU loss function is generally used to optimize the four distances (top, bottom, left and right) at each position, i.e., to optimize a 2D rectangular box, while in the regression processing of local boundary detection in the present disclosure there are only upper and lower boundary distances, so only the two distances (top and bottom) at each position need to be optimized through the above dimensionality reduction, thereby reducing the computation amount of network training.
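A minimal sketch of such a dimensionality-reduced IoU loss follows. The exact expression is not given in the text; the overlap-over-union construction for the vertical axis and the common -log(IoU) form are assumptions here:

```python
import math

def iou_loss_1d(top_pred, bottom_pred, top_gt, bottom_gt):
    """Dimensionality-reduced IoU loss over the vertical axis only.

    Instead of the four distances of a 2D box, only the distances from a
    position up to the top boundary and down to the bottom boundary are
    constrained. The -log(IoU) form is a common choice, assumed here.
    """
    intersection = min(top_pred, top_gt) + min(bottom_pred, bottom_gt)
    union = max(top_pred, top_gt) + max(bottom_pred, bottom_gt)
    return -math.log(intersection / union)

iou_loss_1d(3.0, 4.0, 3.0, 4.0)   # perfect prediction: loss is 0
iou_loss_1d(2.0, 4.0, 3.0, 4.0)   # under-predicted top distance: loss > 0
```

Because the left and right distances drop out, each position contributes two regression targets instead of four, which is the computation saving described above.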
Fig. 6 is a schematic diagram illustrating multitask processing in a text detection method according to an embodiment of the present disclosure. As shown in fig. 6, the two processing branches can share a skeleton network used as a feature extractor to extract semantic features, i.e., share one set of skeleton-network parameters for feature extraction, so as to reduce the training complexity of the text detection network. Both global segmentation processing and local regression can be realized with multiple paths, which yields a better training result for the text detection network than a single path. Moreover, by adopting multiple paths and a shared skeleton network, the expression capability of the text detection network can be improved without increasing the total amount of network parameters, and features at different scales can be utilized more effectively.
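The shared-skeleton, two-branch structure can be sketched in PyTorch style. All layer sizes, names, and the two-convolution skeleton are placeholders; the disclosure does not specify the architecture at this level of detail:

```python
import torch
import torch.nn as nn

class TextDetectionNet(nn.Module):
    """One shared skeleton (backbone) feeding two task heads, so both
    branches train against a single set of feature-extraction parameters."""
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(          # shared feature extractor
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
        )
        self.seg_head = nn.Conv2d(16, 1, 1)     # global segmentation: character confidence
        self.reg_head = nn.Conv2d(16, 3, 1)     # local regression: top/bottom distance + angle

    def forward(self, x):
        feat = self.backbone(x)                 # computed once, reused by both branches
        return torch.sigmoid(self.seg_head(feat)), self.reg_head(feat)

img = torch.randn(1, 3, 32, 64)
confidence, regression = TextDetectionNet()(img)
```

Because `feat` is computed once and consumed by both heads, gradients from both task losses update the same skeleton parameters, which is the parameter-sharing arrangement described above.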
As shown in fig. 6, the character confidence may be obtained through global segmentation processing; a segmentation mask is then obtained after classifying and clustering positions into "text" and "non-text" according to the character confidence, the text to be detected is masked according to the segmentation mask, and the upper boundary distance (regressed upwards from the highlighted line in fig. 6), the lower boundary distance (regressed downwards from the highlighted line in fig. 6), and the local angle are obtained after local regression on the region obtained through the masking processing. The processing branch of local regression comprises regression processing of local boundary detection and regression processing of local direction detection: the upper and lower boundary distances are obtained through the regression processing of local boundary detection, and the local angle is obtained through the regression processing of local direction detection.
Fig. 7 is a schematic diagram illustrating post-processing in a text detection method according to an embodiment of the present disclosure, where the post-processing is a process of applying the trained text detection network, as shown in fig. 7, the following contents are included:
First, an image containing a text to be detected is input into the text detection network, and the global segmentation processing, the masking processing, and the local regression are performed to obtain an upper boundary distance and a lower boundary distance. For any position, if the upper boundary distance and the lower boundary distance are relatively close, the position is likely to lie in the center line region, and fusion processing may be performed on the upper and lower boundary distances to obtain the confidence of the center line region. Performing the fusion processing at all positions yields the confidence of every candidate center line region, from which the possible center line regions can be obtained.
The fusion process may be performed by using the following formula (1), where in formula (1), Z is a confidence of the center line region, x is an upper boundary distance, and y is a lower boundary distance.
Z = 2 * min(x, y) / (x + y)    (1)
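Formula (1) translates directly into code; the function name is an illustrative choice:

```python
def centerline_confidence(x, y):
    """Formula (1): Z = 2*min(x, y)/(x + y), where x and y are the upper and
    lower boundary distances at a position. Z is 1 when the two distances are
    equal (the position sits exactly on the center line) and decreases toward
    0 as the position moves toward either boundary."""
    return 2 * min(x, y) / (x + y)

centerline_confidence(5, 5)   # on the center line -> 1.0
centerline_confidence(1, 9)   # close to one boundary -> 0.2
```

Thresholding this confidence map over all positions gives the candidate center line regions described in the preceding step.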
Second, adaptive uniform sampling is performed in the center line region: starting from one end of the center line region, sampling is performed at a preset interval until the other end of the center line region is reached, obtaining anchor points serving as reference bases in the center line region. As shown in fig. 7, multiple text sequences correspond to multiple center line regions, and each center line region is uniformly sampled to obtain the anchor points of that region.
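The end-to-end uniform sampling can be sketched as below; representing the center line as an ordered list of points and the choice of interval are illustrative assumptions:

```python
def sample_anchors(centerline_points, step=4):
    """Uniformly sample anchor points along an ordered list of center-line
    points, from one end toward the other at a fixed interval, making sure
    the far end point is also represented."""
    anchors = centerline_points[::step]
    if anchors[-1] != centerline_points[-1]:
        anchors.append(centerline_points[-1])   # include the other end point
    return anchors

pts = [(x, 10) for x in range(0, 13)]   # a horizontal center line of 13 points
anchors = sample_anchors(pts, step=4)   # anchors at x = 0, 4, 8, 12
```

For multiple center line regions, the same sampling is simply applied to each region's point list independently.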
Third, the lower boundary distance and the upper boundary distance are obtained through the regression processing of local boundary detection, and the local angle is obtained through the regression processing of local direction detection; a local boundary is obtained according to the lower boundary distance, the upper boundary distance, and the local angle; and finally, the position of the text border of any line of text sequence in the text to be detected is obtained by splicing the local boundaries. As shown in fig. 7, multiple lines of text sequences correspond to multiple text border positions.
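The per-anchor boundary construction and the splicing can be sketched as follows. The perpendicular-offset geometry (offsetting each anchor along the normal of its local text direction) is an illustrative reading of the reconstruction step, not a formula given in the text:

```python
import math

def local_boundary_points(anchor, top_dist, bottom_dist, angle):
    """Project an anchor point to its local upper and lower boundary points
    using the regressed distances and the local angle (radians, 0 means
    horizontal text)."""
    ax, ay = anchor
    nx, ny = -math.sin(angle), math.cos(angle)   # unit normal to the text direction
    upper = (ax - nx * top_dist, ay - ny * top_dist)
    lower = (ax + nx * bottom_dist, ay + ny * bottom_dist)
    return upper, lower

def splice_border(local_boundaries):
    """Splice per-anchor boundaries into one closed polygon: walk the upper
    boundary in order, then the lower boundary in reverse."""
    uppers = [u for u, _ in local_boundaries]
    lowers = [l for _, l in local_boundaries]
    return uppers + lowers[::-1]

# Horizontal text (angle 0): each anchor projects straight up and down.
bounds = [local_boundary_points((x, 10), 3, 3, 0.0) for x in (0, 5, 10)]
polygon = splice_border(bounds)
```

For curved or wavy text, each anchor carries its own angle, so the spliced polygon bends with the text instead of being forced into a rectangle.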
By adopting the present disclosure, the text detection network trained by global segmentation processing and local regression can perform text detection on characters of different shapes based on any learned character shape. In particular, when a large number of dense text sequences exist in the text to be detected, any line of text sequence can be detected within a dense paragraph through the global segmentation processing, avoiding the problem of dense characters sticking together during segmentation. Through local regression, the local character boundaries are spliced end to end to reconstruct the text border, which reduces the network parameters to be learned without complex post-processing. Through the multi-path global segmentation and local regression sharing the network parameters of the skeleton network, the amount of network training is simplified, convergence is faster, and the character shape characteristics in complex scenes (irregular characters) can be learned better, so that the detection accuracy and the detection processing speed can be improved through a simple and efficient end-to-end detection mode.
It will be understood by those skilled in the art that in the method of the present invention, the order of writing the steps does not imply a strict order of execution and any limitations on the implementation, and the specific order of execution of the steps should be determined by their function and possible inherent logic.
The above-mentioned method embodiments can be combined with each other to form combined embodiments without departing from the principle and logic; for reasons of space, the details are not repeated in this disclosure.
In addition, the present disclosure also provides a text detection apparatus, an electronic device, a computer-readable storage medium, and a program, each of which can be used to implement any one of the text detection methods provided by the present disclosure; for the corresponding technical solutions, refer to the descriptions in the method section, which are not repeated here.
Fig. 8 shows a block diagram of a text detection apparatus according to an embodiment of the present disclosure, as shown in fig. 8, including: the acquiring unit 31 is configured to acquire a text to be detected in an image, input the text to be detected into a text detection network, and perform feature extraction to obtain first feature data; and the text detection unit 32 is configured to perform global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection.
In a possible implementation manner, the apparatus further includes a reconstructing unit configured to: and reconstructing a text border of the text to be detected according to the prediction result.
In a possible implementation manner, the text detection unit is configured to: performing global segmentation processing on the first characteristic data in the text detection network to obtain a character confidence coefficient of a text to be detected; performing local regression processing on the first characteristic data in the text detection network to obtain a local upper and lower boundary distance of the text to be detected and a local angle of the text to be detected; the prediction result comprises: the text confidence of the text to be detected, the local upper and lower boundary distance of the text to be detected and the local angle of the text to be detected.
In a possible implementation manner, the apparatus further includes a boundary determining unit, configured to: obtaining a segmentation mask according to the character confidence of the text to be detected; and masking the text to be detected according to the segmentation mask to obtain the distance between the upper boundary and the lower boundary of the text to be detected.
In a possible implementation manner, the text to be detected includes at least one line of text; the apparatus further comprises a text segmentation unit configured to: responding to the feature extraction to obtain second feature data, wherein the second feature data is used for representing any line of text in the text to be detected; and segmenting the text to be detected into any line of text according to the second characteristic data and the text detection network.
In a possible implementation manner, the reconstructing unit is configured to: and reconstructing a text border of the text to be detected according to the distance between the upper boundary and the lower boundary of the text to be detected, the local distance between the upper boundary and the lower boundary of the text to be detected and the local angle of the text to be detected.
In a possible implementation manner, the reconstructing unit is configured to: performing fusion processing according to the distance between the upper boundary and the lower boundary of the text to be detected to obtain a center line region of the text to be detected; obtaining local boundaries corresponding to the at least two anchor points respectively according to the local upper and lower boundary distances of the text to be detected and the local angles of the text to be detected, wherein the local upper and lower boundary distances of the text to be detected correspond to the at least two anchor points on the central line area respectively; and splicing the local boundaries corresponding to the at least two anchor points to obtain a text frame of the text to be detected.
In a possible implementation manner, the apparatus further includes a sampling unit, configured to: and sampling the central line area to obtain at least two anchor points on the central line area.
In a possible implementation manner, the apparatus further includes a training unit, configured to: training a detection network for multitask training according to text training data to obtain the text detection network; the text detection network is a trained detection network.
In a possible implementation manner, the training unit is configured to: executing a global segmentation processing task on the text training data to obtain a first loss function; executing a local regression processing task on the text training data to obtain a second loss function and a third loss function; obtaining a fourth loss function according to the first loss function, the second loss function and the third loss function; and adjusting network parameters according to the back propagation of the fourth loss function to obtain the text detection network.
In some embodiments, functions of or modules included in the apparatus provided in the embodiments of the present disclosure may be used to execute the method described in the above method embodiments, and specific implementation thereof may refer to the description of the above method embodiments, and for brevity, will not be described again here.
Embodiments of the present disclosure also provide a computer-readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the above-mentioned method. The computer readable storage medium may be a volatile computer readable storage medium or a non-volatile computer readable storage medium.
The disclosed embodiments also provide a computer program product comprising computer readable code, which when run on a device, a processor in the device executes instructions for implementing the detection method provided in any of the above embodiments.
The embodiments of the present disclosure also provide another computer program product for storing computer readable instructions, which when executed cause a computer to perform the operations of the detection method provided in any of the above embodiments.
The computer program product may be embodied in hardware, software or a combination thereof. In an alternative embodiment, the computer program product is embodied in a computer storage medium, and in another alternative embodiment, the computer program product is embodied in a Software product, such as a Software Development Kit (SDK), or the like.
An embodiment of the present disclosure further provides an electronic device, including: a processor; a memory for storing processor-executable instructions; wherein the processor is configured as the above method.
The electronic device may be provided as a terminal, server, or other form of device.
Fig. 9 is a block diagram illustrating an electronic device 800 in accordance with an example embodiment. For example, the electronic device 800 may be a terminal such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, or a personal digital assistant.
Referring to fig. 9, electronic device 800 may include one or more of the following components: processing component 802, memory 804, power component 806, multimedia component 808, audio component 810, input/output (I/O) interface 812, sensor component 814, and communication component 816.
The processing component 802 generally controls overall operation of the electronic device 800, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 802 may include one or more processors 820 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 802 can include one or more modules that facilitate interaction between the processing component 802 and other components. For example, the processing component 802 can include a multimedia module to facilitate interaction between the multimedia component 808 and the processing component 802.
The memory 804 is configured to store various types of data to support operations at the electronic device 800. Examples of such data include instructions for any application or method operating on the electronic device 800, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 804 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 806 provides power to the various components of the electronic device 800. The power components 806 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the electronic device 800.
The multimedia component 808 includes a screen that provides an output interface between the electronic device 800 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 808 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the electronic device 800 is in an operation mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 810 is configured to output and/or input audio signals. For example, the audio component 810 includes a Microphone (MIC) configured to receive external audio signals when the electronic device 800 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 804 or transmitted via the communication component 816. In some embodiments, audio component 810 also includes a speaker for outputting audio signals.
The I/O interface 812 provides an interface between the processing component 802 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 814 includes one or more sensors for providing various aspects of state assessment for the electronic device 800. For example, the sensor assembly 814 may detect an open/closed state of the electronic device 800, the relative positioning of components, such as a display and keypad of the electronic device 800, the sensor assembly 814 may also detect a change in the position of the electronic device 800 or a component of the electronic device 800, the presence or absence of user contact with the electronic device 800, orientation or acceleration/deceleration of the electronic device 800, and a change in the temperature of the electronic device 800. Sensor assembly 814 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 814 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 814 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 816 is configured to facilitate wired or wireless communication between the electronic device 800 and other devices. The electronic device 800 may access a wireless network based on a communication standard, such as WiFi, 2G or 3G, or a combination thereof. In an exemplary embodiment, the communication component 816 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 816 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the electronic device 800 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a non-transitory computer-readable storage medium, such as the memory 804, is also provided that includes computer program instructions executable by the processor 820 of the electronic device 800 to perform the above-described methods.
Fig. 10 is a block diagram illustrating an electronic device 900 in accordance with an example embodiment. For example, the electronic device 900 may be provided as a server. Referring to fig. 10, electronic device 900 includes a processing component 922, which further includes one or more processors, and memory resources, represented by memory 932, for storing instructions, such as applications, that are executable by processing component 922. The application programs stored in memory 932 may include one or more modules that each correspond to a set of instructions. Further, the processing component 922 is configured to execute instructions to perform the above-described methods.
The electronic device 900 may also include a power component 926 configured to perform power management of the electronic device 900, a wired or wireless network interface 950 configured to connect the electronic device 900 to a network, and an input/output (I/O) interface 958. The electronic device 900 may operate based on an operating system stored in memory 932, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
In an exemplary embodiment, a non-transitory computer readable storage medium, such as the memory 932, is also provided that includes computer program instructions executable by the processing component 922 of the electronic device 900 to perform the above-described method.
The present disclosure may be systems, methods, and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied thereon for causing a processor to implement various aspects of the present disclosure.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present disclosure may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), may be personalized by utilizing state information of the computer-readable program instructions, and this electronic circuitry may execute the computer-readable program instructions to implement aspects of the present disclosure.
Various aspects of the present disclosure are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the disclosure. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Different embodiments of the present application may be combined with each other without departing from their logic. The description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the descriptions of the other embodiments.
Having described embodiments of the present disclosure, the foregoing description is intended to be exemplary rather than exhaustive and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies found in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

1. A text detection method, the method comprising:
acquiring a text to be detected in an image, inputting the text to be detected into a text detection network for feature extraction, and acquiring first feature data;
and carrying out global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection.
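The two parallel branches of claim 1 can be pictured, purely as an illustrative sketch and not as the claimed implementation, as two heads that share one extracted feature map: a global segmentation head producing a per-pixel text confidence and a local regression head producing per-pixel geometry. All array shapes, the random "features", and the 1×1-projection weights below are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# First feature data extracted by the text detection network (hypothetical:
# C channels over an H x W spatial grid).
C, H, W = 8, 16, 32
features = rng.standard_normal((C, H, W))

# Global segmentation head: per-pixel text confidence in [0, 1].
w_seg = rng.standard_normal(C)
text_confidence = sigmoid(np.einsum("c,chw->hw", w_seg, features))

# Local regression head: per-pixel (upper distance, lower distance, local angle).
w_reg = rng.standard_normal((3, C))
local_geometry = np.einsum("oc,chw->ohw", w_reg, features)

# The prediction result (cf. claim 3) combines the outputs of both branches.
prediction = {
    "text_confidence": text_confidence,          # (H, W)
    "upper_lower_distance": local_geometry[:2],  # (2, H, W)
    "local_angle": local_geometry[2],            # (H, W)
}
```

In a real network the two heads would be learned convolutions over shared backbone features; the random weights here only fix the shapes of the two outputs.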
2. The method of claim 1, wherein after obtaining the prediction result for text detection, the method further comprises:
and reconstructing a text border of the text to be detected according to the prediction result.
3. The method of claim 2, wherein the processing of global segmentation and local regression according to the text detection network and the first feature data to obtain the prediction result for text detection comprises:
performing global segmentation processing on the first feature data in the text detection network to obtain a text confidence of the text to be detected;
performing local regression processing on the first feature data in the text detection network to obtain a local upper and lower boundary distance of the text to be detected and a local angle of the text to be detected;
the prediction result comprises: the text confidence of the text to be detected, the local upper and lower boundary distance of the text to be detected and the local angle of the text to be detected.
4. The method according to claim 3, wherein after the global segmentation processing is performed on the first feature data in the text detection network to obtain the text confidence, the method further comprises:
obtaining a segmentation mask according to the text confidence of the text to be detected;
and masking the text to be detected according to the segmentation mask to obtain the distance between the upper boundary and the lower boundary of the text to be detected.
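The masking step of claim 4 can be sketched with made-up numbers: the per-pixel text confidence is thresholded into a binary segmentation mask (the 0.5 threshold is an assumption, not part of the claim), which then zeroes out boundary-distance predictions outside text regions:

```python
import numpy as np

# Hypothetical per-pixel text confidence (cf. claim 3) for a 3x4 grid.
text_confidence = np.array([
    [0.10, 0.90, 0.80, 0.20],
    [0.20, 0.95, 0.90, 0.10],
    [0.05, 0.30, 0.40, 0.00],
])

# Segmentation mask obtained from the text confidence (illustrative threshold).
seg_mask = text_confidence > 0.5

# Hypothetical upper/lower boundary-distance map: channel 0 is the distance to
# the upper boundary, channel 1 the distance to the lower boundary.
distances = np.stack([np.full((3, 4), 5.0), np.full((3, 4), 3.0)])

# Masking: distance predictions outside the text region are zeroed out, so
# only pixels inside text contribute to the boundary estimate.
masked_distances = distances * seg_mask
```

The broadcast multiplies both distance channels by the same (3, 4) mask.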
5. The method according to any one of claims 1-4, wherein the text to be detected comprises at least one line of text;
before the obtaining of the first feature data, the method further includes:
responding to the feature extraction to obtain second feature data, wherein the second feature data is used for representing any line of text in the text to be detected;
and segmenting any line of text from the text to be detected according to the second feature data and the text detection network.
6. The method according to claim 4, wherein reconstructing the text border of the text to be detected according to the prediction result comprises:
and reconstructing a text border of the text to be detected according to the distance between the upper boundary and the lower boundary of the text to be detected, the local distance between the upper boundary and the lower boundary of the text to be detected and the local angle of the text to be detected.
7. The method according to claim 6, wherein reconstructing the text border of the text to be detected according to the distance between the upper and lower boundaries of the text to be detected, the local distance between the upper and lower boundaries of the text to be detected, and the local angle of the text to be detected comprises:
performing fusion processing according to the distance between the upper boundary and the lower boundary of the text to be detected to obtain a centerline region of the text to be detected;
obtaining local boundaries respectively corresponding to at least two anchor points according to the local upper and lower boundary distances of the text to be detected and the local angles of the text to be detected, wherein the local upper and lower boundary distances of the text to be detected respectively correspond to the at least two anchor points on the centerline region;
and splicing the local boundaries corresponding to the at least two anchor points to obtain a text border of the text to be detected.
8. The method according to claim 7, wherein after obtaining the centerline region of the text to be detected, the method further comprises:
sampling the centerline region to obtain the at least two anchor points on the centerline region.
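The border reconstruction of claims 6 to 8 can be sketched as follows; the perpendicular-offset geometry and all sample values are illustrative assumptions, not the claimed formulas. Each sampled centerline anchor, together with its local upper/lower distances and local angle, yields one upper and one lower boundary point, and the per-anchor local boundaries are then spliced into a closed text border:

```python
import math

def local_boundary(anchor, d_up, d_down, angle):
    """Upper and lower boundary points for one centerline anchor.

    `angle` is the local text direction; the boundary points are assumed to
    lie along the perpendicular (normal) direction at the given distances.
    """
    x, y = anchor
    nx, ny = -math.sin(angle), math.cos(angle)  # unit normal to the text direction
    upper = (x + d_up * nx, y + d_up * ny)
    lower = (x - d_down * nx, y - d_down * ny)
    return upper, lower

def splice_text_border(anchors, dists_up, dists_down, angles):
    """Splice the per-anchor local boundaries into one closed border polygon."""
    uppers, lowers = [], []
    for a, du, dd, th in zip(anchors, dists_up, dists_down, angles):
        up, low = local_boundary(a, du, dd, th)
        uppers.append(up)
        lowers.append(low)
    # Upper boundary left-to-right, then lower boundary right-to-left.
    return uppers + lowers[::-1]

# Hypothetical horizontal text line: three anchors on the centerline, angle 0.
anchors = [(0.0, 0.0), (10.0, 0.0), (20.0, 0.0)]
border = splice_text_border(anchors, [4.0] * 3, [4.0] * 3, [0.0] * 3)
# border is a 6-point polygon: 3 upper points followed by 3 lower points.
```

For curved text, per-anchor angles differ and the spliced polygon bends to follow the centerline, which is what makes this local-boundary formulation fit non-straight text.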
9. The method according to any one of claims 1-8, further comprising:
training a detection network for multitask training according to text training data to obtain the text detection network;
the text detection network is a trained detection network.
10. The method of claim 9, wherein training a detection network for multitasking training according to text training data to obtain the text detection network comprises:
executing a global segmentation processing task on the text training data to obtain a first loss function;
executing a local regression processing task on the text training data to obtain a second loss function and a third loss function;
obtaining a fourth loss function according to the first loss function, the second loss function and the third loss function;
and adjusting network parameters according to the back propagation of the fourth loss function to obtain the text detection network.
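The multitask loss of claim 10 can be sketched as follows. The choice of binary cross-entropy for the segmentation task and smooth-L1 for the two regression tasks, the single-pixel values, and the equal weighting are all assumptions for illustration; the actual loss forms and weights are training choices not fixed by the claim:

```python
import math

def bce(p, y):
    """Binary cross-entropy for the global segmentation task (first loss)."""
    eps = 1e-7
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def smooth_l1(pred, target):
    """Smooth-L1 for the local regression tasks (second and third losses)."""
    d = abs(pred - target)
    return 0.5 * d * d if d < 1.0 else d - 0.5

# Hypothetical single-pixel predictions and targets.
loss_seg = bce(0.9, 1.0)            # first loss: global segmentation task
loss_dist = smooth_l1(4.2, 4.0)     # second loss: upper/lower boundary distance
loss_angle = smooth_l1(0.15, 0.10)  # third loss: local angle

# Fourth loss: combination of the three (equal weights are an assumption);
# this scalar is what gets back-propagated to adjust the network parameters.
loss_total = loss_seg + loss_dist + loss_angle
```

In practice each term would be averaged over all pixels (or all positive pixels) before being combined.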
11. A text detection apparatus, characterized in that the apparatus comprises:
an acquisition unit, which is used for acquiring a text to be detected in an image and inputting the text to be detected into a text detection network for feature extraction to obtain first feature data;
and a text detection unit, which is used for carrying out global segmentation and local regression processing according to the text detection network and the first feature data to obtain a prediction result for text detection.
12. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to: perform the method of any one of claims 1 to 10.
13. A computer readable storage medium having computer program instructions stored thereon, which when executed by a processor implement the method of any one of claims 1 to 10.
CN202010409367.4A 2020-05-14 2020-05-14 Text detection method and device, electronic equipment and storage medium Withdrawn CN111582265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010409367.4A CN111582265A (en) 2020-05-14 2020-05-14 Text detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111582265A 2020-08-25

Family ID=72115477

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112200202A (en) * 2020-10-29 2021-01-08 上海商汤智能科技有限公司 Text detection method and device, electronic equipment and storage medium

Citations (5)

Publication number Priority date Publication date Assignee Title
CN109117836A (en) * 2018-07-05 2019-01-01 中国科学院信息工程研究所 Text detection localization method and device under a kind of natural scene based on focal loss function
CN109492638A (en) * 2018-11-07 2019-03-19 北京旷视科技有限公司 Method for text detection, device and electronic equipment
CN110298298A (en) * 2019-06-26 2019-10-01 北京市商汤科技开发有限公司 Target detection and the training method of target detection network, device and equipment
CN110569835A (en) * 2018-06-06 2019-12-13 北京搜狗科技发展有限公司 Image identification method and device and electronic equipment
US20200012876A1 (en) * 2017-09-25 2020-01-09 Tencent Technology (Shenzhen) Company Limited Text detection method, storage medium, and computer device

Non-Patent Citations (2)

Title
CHUHUI XUE: "MSR: Multi-Scale Shape Regression for Scene Text Detection" *
熊彬程: "Research on Text Detection in Natural Scene Images" *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200825