CN111695377A - Text detection method and device and computer equipment

Info

Publication number
CN111695377A
Authority
CN
China
Prior art keywords
image
neural network
text region
text
information
Prior art date
Legal status
Granted
Application number
CN201910188639.XA
Other languages
Chinese (zh)
Other versions
CN111695377B (en)
Inventor
王杰
李明键
钮毅
Current Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Original Assignee
Hangzhou Hikvision Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Hikvision Digital Technology Co Ltd
Priority to CN201910188639.XA
Publication of CN111695377A
Application granted
Publication of CN111695377B
Legal status: Active

Classifications

    • G06V 30/413 - Document analysis: classification of content, e.g. text, photographs or tables
    • G06F 18/2414 - Pattern recognition: smoothing the distance, e.g. radial basis function networks [RBFN]
    • G06F 18/25 - Pattern recognition: fusion techniques
    • G06N 3/045 - Neural networks: combinations of networks
    • G06N 3/08 - Neural networks: learning methods
    • G06V 10/464 - Image features: salient features, e.g. SIFT, using a plurality of salient features, e.g. bag-of-words [BoW] representations
    • G06V 30/10 - Character recognition
    • Y02D 10/00 - Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The application provides a text detection method and device and computer equipment. The text detection method provided by the application comprises the following steps: acquiring specified information from an image to be detected containing a text, inputting the specified information into a pre-established target neural network for constructing a spatial relationship between the text in the image and an attention target, outputting spatial information by the target neural network, and correcting a candidate text region according to the spatial information to obtain a finally selected text region in the image to be detected. The specified information comprises a feature vector of the candidate text region located from the image to be detected, and the attention target comprises at least one of the text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image. The text detection method and device and the computer equipment can accurately locate the text region in the image to be detected.

Description

Text detection method and device and computer equipment
Technical Field
The present application relates to the field of image detection, and in particular, to a text detection method and apparatus, and a computer device.
Background
With the widespread use of image acquisition devices, image detection techniques based on image content are receiving more and more attention. Among the contents contained in an image, text information is the easiest to understand, so text recognition technology has attracted extensive attention.
Text recognition mainly comprises text detection and character recognition. Text detection refers to locating a text region in an image to be detected; character recognition refers to recognizing the text region and outputting the text information. In a text detection method disclosed in the related art, a large number of anchor points are laid out, anchor points close to the text are screened out by a correlation algorithm, and the offsets between the anchor points and the text are regressed to obtain a text region. This method performs text detection with a fixed receptive field only, and its accuracy is low.
Disclosure of Invention
In view of this, the present application provides a text detection method, a text detection apparatus, and a computer device, so as to improve the accuracy of text detection.
A first aspect of the present application provides a text detection method, including:
acquiring specified information from an image to be detected containing a text; the specified information comprises a feature vector of a candidate text region located from the image to be detected;
inputting the specified information into a pre-established target neural network for constructing a spatial relationship between a text in the image and an attention target, and outputting spatial information by the target neural network; wherein the attention target includes at least one of a text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image;
and correcting the candidate text region according to the spatial information to obtain a finally selected text region in the image to be detected.
A second aspect of the present application provides a text detection apparatus comprising an element generation module, a spatial relationship modeling module, and a text detection module, wherein,
the element generation module is used for acquiring specified information from an image to be detected containing a text; the specified information comprises a feature vector of a candidate text region located from the image to be detected;
the spatial relationship modeling module is used for inputting the specified information into a pre-established target neural network for constructing a spatial relationship between a text in the image and an attention target, and the target neural network outputs spatial information; wherein the attention target includes at least one of the text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image;
and the text detection module is used for correcting the candidate text region according to the spatial information to obtain a finally selected text region in the image to be detected.
A third aspect of the present application provides a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods provided by the first aspect of the present application.
A fourth aspect of the present application provides a computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the methods provided in the first aspect of the present application when executing the program.
The application provides a text detection method and device and computer equipment. Specified information is acquired from an image to be detected containing a text; the specified information is input into a pre-established target neural network for constructing a spatial relationship between the text in the image and an attention target; the target neural network outputs spatial information; and a candidate text region is corrected according to the spatial information to obtain a finally selected text region in the image to be detected. The specified information comprises a feature vector of the candidate text region located from the image to be detected, and the attention target comprises at least one of the text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image. In this way, the spatial relationship between the text and the attention target is fully considered, and the spatial information is fully utilized to locate the finally selected text region, so that missed detection and false detection can be avoided and the accuracy of text detection can be improved.
Drawings
Fig. 1 is a flowchart of a first embodiment of a text detection method provided in the present application;
fig. 2 is a flowchart of a second embodiment of a text detection method provided in the present application;
fig. 3 is a flowchart of a third embodiment of a text detection method provided in the present application;
fig. 4 is a flowchart of a fourth embodiment of a text detection method provided in the present application;
fig. 5 is a hardware structure diagram of a computer device in which a text detection apparatus according to an exemplary embodiment of the present application is located;
fig. 6 is a schematic structural diagram of a first embodiment of a text detection apparatus provided in the present application.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with the present application. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present application, as detailed in the appended claims.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this application and the appended claims, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present application. The word "if" as used herein may be interpreted as "upon", "when", or "in response to determining", depending on the context.
The application provides a text detection method, a text detection device and computer equipment, and aims to provide a text detection method with high accuracy.
In the following, specific examples are given to describe the technical solutions of the present application in detail. The following several specific embodiments may be combined with each other, and details of the same or similar concepts or processes may not be repeated in some embodiments.
Fig. 1 is a flowchart of a first embodiment of a text detection method provided in the present application. Referring to fig. 1, the method provided in this embodiment may include:
s101, acquiring specified information from an image to be detected containing a text, wherein the specified information comprises a feature vector of a candidate text region positioned from the image to be detected.
Specifically, the specific implementation process of this step may include:
(1) extracting the features of the image to be detected to obtain a feature map;
(2) acquiring the specified information from the feature map.
Specifically, the feature extraction may be performed on the image to be detected by using a conventional method. For example, a Scale-invariant Feature Transform (SIFT) algorithm is used to extract features of an image to be detected. Of course, a neural network may also be used to perform feature extraction on the image to be detected, for example, in an embodiment, a specific implementation process of this step may include:
inputting the image to be detected into a neural network for feature extraction, and extracting the features of the image to be detected by a specified layer in the neural network; the designated layer comprises a convolutional layer, or the designated layer comprises a convolutional layer and at least one of a pooling layer and a fully-connected layer; and determining the output result of the specified layer as the feature map.
Specifically, the neural network for feature extraction may include a convolutional layer, and the convolutional layer is configured to perform filtering processing on the input image to be detected. In this case, the filtering result output by the convolutional layer is the extracted feature map. In addition, the neural network for feature extraction may further include a pooling layer and/or a fully connected layer. For example, in an embodiment, the neural network for feature extraction includes a convolutional layer, a pooling layer, and a fully connected layer, where the convolutional layer is configured to perform filtering processing on the input image to be detected, the pooling layer is configured to compress the filtering result, and the fully connected layer is configured to aggregate the compressed result. In this case, the aggregation result output by the fully connected layer is the extracted feature map.
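As a rough illustration of the above, the following is a minimal PyTorch sketch of such a feature-extraction network. The layer sizes, the use of max pooling for compression, and a 1 × 1 convolution standing in for the aggregation step are assumptions for illustration only; the patent does not fix a concrete architecture.

```python
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    """Sketch of the feature-extraction network: a convolutional layer filters
    the input image, a pooling layer compresses the filtering result, and an
    aggregation step produces the final feature map."""

    def __init__(self, in_channels=3, feat_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, feat_channels, kernel_size=3, padding=1)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # A 1x1 convolution stands in for the aggregation step so that the
        # output remains a spatial feature map (an assumption of this sketch).
        self.aggregate = nn.Conv2d(feat_channels, feat_channels, kernel_size=1)

    def forward(self, image):                 # image: (N, 3, H, W)
        x = torch.relu(self.conv(image))      # filtering
        x = self.pool(x)                      # compression
        return self.aggregate(x)              # aggregated feature map
```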
Further, in an embodiment, the specified information includes a feature vector of a candidate text region located from the image to be detected; in another embodiment, the specified information additionally includes at least one of a feature vector of a specified target that is located from the image to be detected and has a spatial relationship with the candidate text region, and attribute information of the image to be detected. The following description takes as an example the case where the specified information includes all three: the feature vector of the candidate text region, the feature vector of the specified target, and the attribute information of the image to be detected.
The specified target may be a hidden variable or a user-defined target related to the task. For example, if the vehicle carrying a license plate is not annotated, it can be treated as a hidden variable; if the vehicle is annotated, it can be treated as a user-defined specified target and used to assist license plate detection.
In addition, the attribute information of the image to be detected may include a deformation attribute, a color attribute, a font attribute, a texture attribute, a perspective attribute, and the like of the image to be detected. In the present embodiment, this is not limited. The following description will take the example that the attribute information of the image to be detected includes the deformation attribute of the image to be detected. It should be noted that, when the attribute information of the image to be detected is the deformation attribute of the image to be detected, the deformation attribute may be represented by the rotation angle θ of each pixel point in the image to be detected.
Specifically, in an embodiment, the process of obtaining the feature vector of the candidate text region from the feature map may include:
(1) inputting the feature map into a neural network for information extraction, carrying out convolution processing on the feature map by a first convolution layer in the neural network to obtain a convolution processing result, processing the convolution processing result by a softmax layer of the neural network, and outputting the probability that each pixel point in the image to be detected belongs to a text; and carrying out convolution processing on the characteristic graph by a second convolution layer in the neural network, and outputting the deviation of each pixel point distance text in the image to be detected.
For example, in one embodiment, the size of the image to be detected is 9 × 9, the dimension of the obtained feature map is 9 × 9 × 256, and the first convolution layer of the neural network used for information extraction has a 1 × 1 kernel with 2 output channels, so the dimension of the convolution processing result output by the first convolution layer is 9 × 9 × 2. Further, after the convolution processing result is processed by the softmax layer, the dimension of the output processing result is 9 × 9 × 1, representing the probability that each pixel point in the image to be detected belongs to the text. For another example, in this embodiment, the second convolution layer in the neural network for information extraction has a 1 × 1 kernel with 8 output channels, so after the second convolution layer performs convolution processing on the feature map, the dimension of the output convolution processing result is 9 × 9 × 8, representing the deviation of each pixel point in the image to be detected from the text. In this example, the deviation of a pixel from the text is characterized by the deviations of the pixel from the four corner points of the text (corresponding to the 8 channels of the convolution processing result).
(2) locating the candidate text region in the image to be detected according to the probability that each pixel point in the image to be detected belongs to the text and the deviation of each pixel point in the image to be detected from the text, to obtain the feature vector of the candidate text region.
Specifically, based on the probability that each pixel point in the image to be detected belongs to the text and the deviation of each pixel point in the image to be detected from the text, the candidate text region in the image to be detected can be positioned, and then the feature vector of the candidate text region is obtained. Further, the feature vector of the candidate text region includes a coordinate of a center point of the candidate text region, a width value, a height value, an angle, a feature vector corresponding to the candidate text region in the feature map, and a confidence of the candidate text region (where the confidence of the candidate text region may be equal to an average value of probabilities that target pixel points located in the candidate text region in the image to be detected belong to the text). It should be noted that, for the specific implementation principle and implementation process related to locating the candidate text region in the image to be detected according to the probability that each pixel point in the image to be detected belongs to the text and the deviation of each pixel point in the image to be detected from the text, to obtain the feature vector of the candidate text region, reference may be made to the description in the related art, and details are not repeated here.
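As a rough sketch of steps (1) and (2) above, the following module (continuing the imports of the previous sketch) implements the two convolutional branches, assuming 1 × 1 kernels and the 9 × 9 example dimensions; decoding candidate regions from these two outputs is sketched later in the first embodiment.

```python
class InformationExtractor(nn.Module):
    """Sketch of the information-extraction network: a first 1x1 convolution
    followed by softmax yields the per-pixel text probability; a second 1x1
    convolution yields the 8-channel deviations to the four text corners."""

    def __init__(self, feat_channels=256):
        super().__init__()
        self.cls_conv = nn.Conv2d(feat_channels, 2, kernel_size=1)  # first convolution layer
        self.reg_conv = nn.Conv2d(feat_channels, 8, kernel_size=1)  # second convolution layer

    def forward(self, feature_map):                     # (N, 256, 9, 9)
        logits = self.cls_conv(feature_map)             # (N, 2, 9, 9)
        text_prob = torch.softmax(logits, dim=1)[:, 1]  # (N, 9, 9): softmax layer output
        deviations = self.reg_conv(feature_map)         # (N, 8, 9, 9): corner deviations
        return text_prob, deviations
```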
In addition, in another embodiment, the process of obtaining a feature vector of a specific target having a spatial relationship with the candidate text region from the feature map may include:
(1) inputting the feature map into a neural network for information extraction, performing convolution processing on the feature map by a first convolution layer in the neural network to obtain a convolution processing result, processing the convolution processing result by a softmax layer of the neural network, and outputting the probability that each pixel point in the image to be detected belongs to the specified target; and carrying out convolution processing on the characteristic graph by a second convolution layer in the neural network, and outputting the deviation of each pixel point in the image to be detected from the specified target.
(2) locating the specified target in the image to be detected according to the probability that each pixel point in the image to be detected belongs to the specified target and the deviation of each pixel point in the image to be detected from the specified target, to obtain the feature vector of the specified target.
Specifically, the specific implementation process and implementation principle of obtaining the feature vector of the designated target are similar to the specific implementation process and implementation principle of obtaining the feature vector of the candidate text region, and are not described herein again.
Further, in another embodiment, when the attribute information of the image to be detected is deformation information of the image to be detected, and the deformation information is represented by a rotation angle θ of each pixel point in the image to be detected, the process of extracting the attribute information of the image to be detected from the feature map may include:
(1) inputting the feature map into a neural network for information extraction, performing convolution processing on the feature map by a convolution layer in the neural network to obtain a convolution processing result, performing normalization processing on the convolution processing result by a softmax layer in the neural network to obtain a normalization result, and converting, by a bias layer of the neural network, the normalization result into the rotation angle θ of each pixel point in the image to be detected.
Specifically, suppose the size of the image to be detected is 9 × 9, the dimension of the obtained feature map is 9 × 9 × 256, and the convolution layer of the neural network used for information extraction has a 1 × 1 kernel with 2 output channels; then, after the convolution processing, the dimension of the obtained convolution processing result is 9 × 9 × 2. The softmax layer normalizes the convolution processing result to obtain a normalization result with a dimension of 9 × 9 × 1. Further, after the bias layer converts the normalization result, the dimension of the obtained conversion result is 9 × 9 × 1, representing the rotation angle θ of each pixel point in the image to be detected.
It should be noted that the process of acquiring other attribute information is similar to the above process, and is not described herein again.
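Continuing the sketch, an attribute branch along these lines could produce the per-pixel rotation angle θ. The affine mapping performed by the bias layer (here from [0, 1] to [-π/2, π/2]) is an assumption; the patent only states that the normalization result is converted into θ.

```python
import math

class AngleExtractor(nn.Module):
    """Sketch of the attribute branch: convolution -> softmax -> bias layer
    converting the normalization result into a per-pixel rotation angle."""

    def __init__(self, feat_channels=256):
        super().__init__()
        self.conv = nn.Conv2d(feat_channels, 2, kernel_size=1)

    def forward(self, feature_map):                                      # (N, 256, 9, 9)
        normalized = torch.softmax(self.conv(feature_map), dim=1)[:, 1]  # (N, 9, 9) in [0, 1]
        theta = (normalized - 0.5) * math.pi   # "bias" mapping to [-pi/2, pi/2] (assumed)
        return theta
```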
S102, inputting the specified information into a pre-established target neural network for constructing a spatial relationship between a text in an image and an attention target, and outputting spatial information by the target neural network; wherein the attention target includes at least one of the text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image.
S103, correcting the candidate text region according to the spatial information to obtain a finally selected text region in the image to be detected.
Specifically, the spatial information can be obtained by inputting the above specified information into the target neural network.
For example, in an embodiment, when the specified information includes the feature vector of the candidate text region, the specific implementation of step S102 may include:
(1) inputting the specified information into a first neural network in the target neural network, and processing the specified information by the first neural network to obtain the confidence of the candidate text region and a position probability map of the suspected text region in the image to be detected.
Specifically, the first neural network is used for constructing the spatial relationship between texts. Its input is the feature vector of a candidate text region, and its output is the confidence of the candidate text region and the position probability map of the suspected text region in the image to be detected. For example, in one embodiment, the feature vector of the candidate text region 1 and the feature vector of the candidate text region 2 are input, and the position probability map of the suspected text region 3 is output. It should be noted that the position probability map of the suspected text region 3 refers to the position probability distribution of the next possible text given that all current candidate text regions exist, and includes the probability that each pixel in the image to be detected belongs to the suspected text region and the deviation of each pixel from the suspected text region.
Specifically, after step S101, the feature vector of at least one candidate text region may be extracted. For example, in one embodiment, the feature vector of the candidate text region 1 and the feature vector of the candidate text region 2 are extracted. In this step, the feature vector of the candidate text region 1 and the feature vector of the candidate text region 2 are input into the first neural network; the concat layer of the first neural network then fuses the feature vectors to obtain fused specified information, and the fully connected layer of the first neural network performs weighting processing on the fused specified information to obtain the confidences of the candidate text regions and the position probability map of the suspected text region in the image to be detected.
It should be noted that, for the specific implementation principle and implementation procedure of the fusion process and the weighting process, reference may be made to the description in the related art, and details are not described here.
With reference to the foregoing example, in an embodiment, the size of the image to be detected is 9 × 9, the dimension of the feature vector of the candidate text region 1 is n, and the dimension of the feature vector of the candidate text region 2 is n. After the fusion processing, the dimension of the fused specified information is 2n, and the dimension of the full-connection coefficients (network parameters learned in advance) of the fully connected layer of the first neural network is 2n × (1 + 1 + 9 × 9 + 8 × 9 × 9). After the weighting processing, the dimension of the weighting processing result is 1 + 1 + 9 × 9 + 8 × 9 × 9, where the first two dimensions represent the confidences of the candidate text region 1 and the candidate text region 2, the next 9 × 9 dimensions represent the probability that each pixel point in the image to be detected belongs to the suspected text region, and the last 8 × 9 × 9 dimensions represent the deviation of each pixel point in the image to be detected from the suspected text region (the deviation of a pixel from the suspected text region is characterized by its deviations from the four corner points of the region, that is, 8 dimensions per pixel). It should be noted that the combination of the probability that each pixel in the image to be detected belongs to the suspected text region and the deviation of each pixel from the suspected text region is the position probability map of the suspected text region in the image to be detected.
(2) Determining the confidence of the candidate text region and the position probability map as the spatial information.
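A minimal sketch of such a first neural network is given below, following the concat + fully connected structure of the worked example (n-dimensional region vectors, a 9 × 9 image); the sigmoid activations on the confidence and probability outputs are assumptions.

```python
class TextTextRelationNet(nn.Module):
    """Sketch of the first neural network: concatenate the feature vectors of
    two candidate text regions and apply one fully connected layer, producing
    the two confidences plus the position probability map of the suspected
    text region (a per-pixel probability and 8 per-pixel corner deviations)."""

    def __init__(self, n=128, h=9, w=9):
        super().__init__()
        self.h, self.w = h, w
        out_dim = 1 + 1 + h * w + 8 * h * w     # confidences + prob map + deviations
        self.fc = nn.Linear(2 * n, out_dim)     # full-connection coefficients: 2n x out_dim

    def forward(self, region1, region2):        # each: (N, n)
        fused = torch.cat([region1, region2], dim=1)         # concat layer
        out = self.fc(fused)                                 # weighting processing
        conf = torch.sigmoid(out[:, :2])                     # confidences of regions 1, 2
        prob = torch.sigmoid(out[:, 2:2 + self.h * self.w])  # per-pixel probability
        dev = out[:, 2 + self.h * self.w:]                   # per-pixel corner deviations
        return conf, prob.view(-1, self.h, self.w), dev.view(-1, 8, self.h, self.w)
```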
Further, in this embodiment, after obtaining the spatial information in step S102, in step S103, the final text region may be determined according to the following method, where the method includes:
(1) determining a first candidate text region and the confidence of the first candidate text region according to the position probability map.
Specifically, the specific implementation of this step may include: in the position probability map, searching for first target pixel points whose probability (as introduced above, the probability that a pixel point in the image to be detected belongs to the text) is greater than a first preset threshold; searching, in a specified neighborhood of each first target pixel point, for second target pixel points whose probability is greater than a second preset threshold; determining the second target pixel points as the pixel points for constructing a first candidate text region; and constructing the first candidate text region from the second target pixel points according to their deviations from the text in the position probability map (for the specific implementation of this step, reference may be made to the related art; a rough sketch is also given after this list).
Further, in an embodiment, an average value of probabilities that each pixel point in the first candidate text region belongs to the text may be determined as the confidence of the first candidate text region.
It should be noted that the first preset threshold is greater than the second preset threshold, and specific values of the first preset threshold and the second preset threshold are set according to actual needs, and in this embodiment, the specific values are not limited. For example, in one embodiment, the first predetermined threshold is 0.7, and the second predetermined threshold is 0.5.
(2) judging whether the probability corresponding to the candidate text region in the position probability map is smaller than a preset threshold.
Specifically, according to the feature vector of the candidate text region, the position coordinates of the candidate text region can be known, and then the position of the candidate text region in the position probability map is known, so that the probability corresponding to the candidate text region in the position probability map is obtained. For example, in an embodiment, an average value of probabilities that all pixel points in the candidate text region in the position probability map belong to the text is determined as a probability corresponding to the candidate text region.
The preset threshold is set according to actual needs, and in the present embodiment, the preset threshold is not limited thereto. For example, in one embodiment, the predetermined threshold may be 0.3.
(3) if so, deleting the candidate text region, and performing non-maximum suppression processing on the first candidate text region according to the confidence of the first candidate text region to obtain the finally selected text region;
(4) if not, determining the candidate text region as a second candidate text region, and performing non-maximum suppression processing on the first candidate text region and the second candidate text region according to the confidence of the first candidate text region and the confidence of the second candidate text region to obtain the finally selected text region.
Specifically, the specific implementation principle and implementation step of the non-maximum suppression processing may be referred to in the description of the related art, and are not described herein again.
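A rough NumPy sketch of steps (1) through (4) follows. The neighborhood radius, the use of axis-aligned boxes for the overlap test, the corner sampling in the probability check, and the greedy NMS are all simplifying assumptions; the threshold values follow the examples above (0.7, 0.5, 0.3).

```python
import numpy as np

def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x0, y0, x1, y1)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def build_first_candidates(prob_map, deviations, t1=0.7, t2=0.5, k=1):
    """Step (1): seed at pixels with probability > t1, gather neighbours with
    probability > t2, and build a region from their corner deviations."""
    regions, confidences = [], []
    for y, x in np.argwhere(prob_map > t1):               # first target pixels
        y0, x0 = max(0, y - k), max(0, x - k)
        members = np.argwhere(prob_map[y0:y + k + 1, x0:x + k + 1] > t2)
        if members.size == 0:
            continue
        corners, probs = [], []
        for dy, dx in members:                            # second target pixels
            yy, xx = y0 + dy, x0 + dx
            # 8 deviations -> 4 corner points, relative to the pixel location
            corners.append(deviations[:, yy, xx].reshape(4, 2) + np.array([xx, yy]))
            probs.append(prob_map[yy, xx])
        regions.append(np.mean(corners, axis=0))          # averaged corner estimate
        confidences.append(float(np.mean(probs)))         # mean pixel probability
    return regions, confidences

def nms(regions, confidences, iou_thresh=0.5):
    """Greedy non-maximum suppression over the regions' bounding boxes."""
    boxes = [(r[:, 0].min(), r[:, 1].min(), r[:, 0].max(), r[:, 1].max()) for r in regions]
    keep = []
    for i in np.argsort(confidences)[::-1]:
        if all(iou(boxes[i], boxes[j]) < iou_thresh for j in keep):
            keep.append(i)
    return [regions[i] for i in keep]

def select_final_regions(cand_regions, cand_confs, first_regions, first_confs,
                         prob_map, t=0.3):
    """Steps (2)-(4): drop original candidates whose probability in the map is
    below t (sampled at their corner pixels, a simplification), then run NMS."""
    kept_r, kept_c = list(first_regions), list(first_confs)
    h, w = prob_map.shape
    for region, conf in zip(cand_regions, cand_confs):
        xs = region[:, 0].astype(int).clip(0, w - 1)
        ys = region[:, 1].astype(int).clip(0, h - 1)
        if prob_map[ys, xs].mean() >= t:                  # kept as second candidate region
            kept_r.append(region)
            kept_c.append(conf)
    return nms(kept_r, kept_c)
```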
In the method provided by this embodiment, specified information is acquired from an image to be detected containing a text, the specified information is input into a pre-established target neural network for constructing a spatial relationship between the text in the image and an attention target, the target neural network outputs spatial information, and a candidate text region is corrected according to the spatial information to obtain a finally selected text region in the image to be detected. The specified information comprises a feature vector of the candidate text region located from the image to be detected, and the attention target comprises at least one of the text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image. Therefore, the spatial relationship between the text and the attention target is fully considered, and the spatial information is fully utilized to locate the finally selected text region, so that the accuracy can be improved.
In the following, some more specific examples are given to describe the technical solutions provided in the present application in detail.
Fig. 2 is a flowchart of a second embodiment of a text detection method provided in the present application. In the method provided by this embodiment, the specified information further includes a feature vector of a specified target that is located from the image to be detected and has a spatial relationship with the candidate text region, and step S102 may include:
s201, inputting the specified information into a second neural network in the target neural network, processing the specified information by the second neural network, and outputting the confidence of the candidate text region and the position probability map of the text suspected region in the image to be detected.
Specifically, for a specific implementation method and an implementation principle related to extracting a feature vector of a specified target, reference may be made to the description in the foregoing embodiments, and details are not described here.
In addition, the second neural network is used for constructing the spatial relationship between the text and the specified target. Its input may be the feature vector of the candidate text region and the feature vector of the specified target, and its output may be the confidence of the candidate text region and the position probability map of the suspected text region in the image to be detected. For example, in one embodiment, the feature vector of the candidate text region 1 and the feature vector of the specified target are input, and the confidence of the candidate text region 1 and the position probability map of the suspected text region 2 are output.
Further, the processing of the specified information by the second neural network may include: performing fusion processing on the specified information to obtain fused specified information, and performing weighting processing on the fused specified information. For example, in an embodiment, the second neural network may include a concat layer and a fully connected layer, where the concat layer is configured to fuse the specified information to obtain the fused specified information, and the fully connected layer is configured to perform the weighting processing on the fused specified information.
It should be noted that, through the second neural network, missed text regions can be recovered, text targets are prevented from being lost, and falsely detected texts are screened out, so that the accuracy of text detection can be improved. For example, a license plate is generally located on a vehicle body, and a shop name is generally within the extent of a shop front; therefore, by training and learning the positional relationship between the specified target and the text, falsely detected texts that do not satisfy a certain positional relationship with the specified target can be screened out.
S202, determining the confidence of the candidate text region and the position probability map as the spatial information.
Specifically, in this embodiment, after obtaining the spatial information, in step S103, the final text region may be determined according to the method described in the above embodiment, and details are not repeated here.
According to the method provided by this embodiment, the spatial relationship between the text and the specified target can be constructed through the second neural network to obtain the spatial information. Therefore, when the finally selected text region is determined based on the obtained spatial information, missed text regions can be recovered, text targets are prevented from being lost, and falsely detected regions are screened out, so that the accuracy of text detection can be improved.
Fig. 3 is a flowchart of a third embodiment of a text detection method provided in the present application. Referring to fig. 3, in the method provided in this embodiment, the specifying information further includes attribute information of the image to be detected. Step S102 may include:
s301, inputting the specification information into a third neural network in the target neural network, processing the specification information by the third neural network, and outputting the corrected position coordinates of the candidate text regions.
Specifically, the third neural network is used for constructing a spatial relationship between the text and the attribute information of the image to be detected, and the input of the third neural network may be a feature vector of the candidate text region and the attribute information of the image to be detected, and the output may be a corrected position coordinate of the candidate text region. For example, in an embodiment, the feature vector of the candidate text region 1 and the rotation angle θ of each pixel point in the image to be detected are input, and the corrected position coordinates of the candidate text region 1 are output.
It should be noted that the processing of the specified information by the third neural network may include: fusing the specified information to obtain fused specified information, and performing weighting processing on the fused specified information. For example, in an embodiment, the third neural network may include a concat layer and a fully connected layer, where the concat layer is configured to fuse the specified information to obtain the fused specified information, and the fully connected layer is configured to perform the weighting processing on the fused specified information.
For example, in an embodiment, the dimension of the feature vector of the candidate text region 1 is n and the dimension of the attribute information of the image to be detected is 1, so the dimension of the fused specified information is n + 1. Further, the dimension of the full-connection coefficients of the fully connected layer of the third neural network is (n + 1) × 8, so the dimension of the weighting processing result obtained after the weighting processing is 8, representing the corrected position coordinates of the candidate text region 1 (the position coordinates are represented by the coordinates of the four corner points of the candidate text region, hence 8 dimensions). It should be noted that, for the specific implementation procedures and principles of the fusion processing and the weighting processing, reference may be made to the related art, and details are not described here.
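A corresponding sketch of this third neural network, under the same assumptions as the earlier sketches:

```python
class TextAttributeRelationNet(nn.Module):
    """Sketch of the third neural network: concatenate the n-dimensional
    candidate-region vector with the 1-dimensional image attribute and apply
    a fully connected layer with (n + 1) x 8 coefficients, yielding the
    corrected coordinates of the region's four corner points."""

    def __init__(self, n=128):
        super().__init__()
        self.fc = nn.Linear(n + 1, 8)

    def forward(self, region_vec, attribute):               # (N, n), (N, 1)
        fused = torch.cat([region_vec, attribute], dim=1)   # concat layer
        return self.fc(fused)                               # corrected corner coordinates
```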
S302, determining the corrected position coordinates of the candidate text region as the spatial information.
Accordingly, in this embodiment, when the spatial information is the corrected position coordinates of the candidate text region, in step S103 the position of the candidate text region may be finely adjusted based on the corrected position coordinates to obtain the finally selected text region. For example, in one embodiment, the finally selected text region may be determined directly from the corrected position coordinates of the candidate text region. For another example, in another embodiment, the finally selected text region may be determined based on both the corrected position coordinates of the candidate text region and the initial position coordinates of the candidate text region determined in step S101 (the initial position coordinates may be determined from the probability that each pixel point in the image to be detected belongs to the text and the deviation of each pixel point from the text); for example, the finally selected text region is determined from the average of the corrected position coordinates and the initial position coordinates.
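For instance, the averaging variant mentioned above amounts to nothing more than the following (names are illustrative):

```python
# Average the corrected and the initial corner coordinates of the
# candidate text region to obtain the finally selected text region.
final_coords = (corrected_coords + initial_coords) / 2.0
```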
According to the method provided by this embodiment, the spatial information is obtained by constructing the spatial relationship between the text and the attribute information of the image to be detected, and the finally selected text region is determined according to the spatial information, so that fine adjustment of the position of the text region can be realized and the accuracy of text detection improved.
Fig. 4 is a flowchart of a fourth embodiment of a text detection method provided in the present application. Referring to fig. 4, in the method provided by this embodiment, the specified information further includes a feature vector of a specified target that is located from the image to be detected and has a spatial relationship with the candidate text region, and attribute information of the image to be detected. Step S102 may include:
and S401, inputting the specified information into a fourth neural network in the target neural network, processing the specified information by the fourth neural network, and outputting the confidence of the candidate text region, the corrected position coordinates of the candidate text region and the position probability map of the text suspected region in the image to be detected.
For the method for acquiring the feature vector of the specified target and the attribute information of the image to be detected, reference may be made to the description in the previous embodiments, and details are not repeated here.
Specifically, the fourth neural network is used for constructing the spatial relationship among the text, the specified target, and the attribute information of the image to be detected. Its input may be the feature vector of the candidate text region, the feature vector of the specified target, and the attribute information of the image to be detected, and its output is the confidence of the candidate text region, the corrected position coordinates of the candidate text region, and the position probability map of the suspected text region in the image to be detected. For example, in one embodiment, the feature vector of the candidate text region 1, the feature vector of the candidate text region 2, the feature vector of the specified target, and the rotation angle θ of each pixel point in the image to be detected are input, and the confidence and corrected position coordinates of the candidate text region 1, the confidence and corrected position coordinates of the candidate text region 2, and the position probability map of the suspected text region 3 are output.
It should be noted that, in an embodiment, the fourth neural network may include a concat layer and a fully connected layer, where the concat layer is configured to fuse the specified information to obtain fused specified information, and the fully connected layer is configured to perform weighting processing on the fused specified information and output the confidence of the candidate text region, the corrected position coordinates of the candidate text region, and the position probability map of the suspected text region in the image to be detected.
With reference to the above example, in an embodiment, the size of the image to be detected is 9 × 9, the dimension of the feature vector of the candidate text region 1 is n, the dimension of the feature vector of the candidate text region 2 is n, the dimension of the feature vector of the specified target is n, and the dimension of the attribute information of the image to be detected is 1, so the dimension of the fused specified information is 3n + 1. In this example, the dimension of the full-connection coefficients of the fully connected layer is (3n + 1) × (9 + 9 + 9 × 9 + 8 × 9 × 9); after the weighting processing, the dimension of the weighting processing result is 9 + 9 + 9 × 9 + 8 × 9 × 9, where the first 9 dimensions represent the confidence (1 dimension) and the corrected position coordinates (8 dimensions) of the candidate text region 1, the next 9 dimensions represent the confidence and the corrected position coordinates of the candidate text region 2, and the last 9 × 9 + 8 × 9 × 9 dimensions represent the position probability map of the suspected text region in the image to be detected, that is, the probability that each pixel in the image to be detected belongs to the text and the deviation of each pixel from the text.
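A sketch of the fourth neural network under the same assumptions (two candidate regions, one specified target, one scalar attribute, a 9 × 9 image):

```python
class FullRelationNet(nn.Module):
    """Sketch of the fourth neural network: fuse two candidate-region vectors,
    the specified-target vector, and the image attribute, then apply one fully
    connected layer producing, for each candidate region, a confidence plus 8
    corrected coordinates, followed by the position probability map."""

    def __init__(self, n=128, h=9, w=9):
        super().__init__()
        self.h, self.w = h, w
        out_dim = 9 + 9 + h * w + 8 * h * w
        self.fc = nn.Linear(3 * n + 1, out_dim)

    def forward(self, region1, region2, target_vec, attribute):
        fused = torch.cat([region1, region2, target_vec, attribute], dim=1)  # (N, 3n+1)
        out = self.fc(fused)
        head1 = out[:, :9]                    # conf (1) + corrected coords (8), region 1
        head2 = out[:, 9:18]                  # same for candidate region 2
        prob = out[:, 18:18 + self.h * self.w].view(-1, self.h, self.w)
        dev = out[:, 18 + self.h * self.w:].view(-1, 8, self.h, self.w)
        return head1, head2, prob, dev
```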
S402, determining the confidence of the candidate text region, the corrected position coordinates of the candidate text region, and the position probability map as the spatial information.
Specifically, in this implementation, the spatial information includes the confidence of the candidate text region, the corrected position coordinates of the candidate text region, and the position probability map of the suspected text region in the image to be detected. In step S103, the finally selected text region may be determined by the same method as in the first embodiment, the only difference being the following: when judging whether the probability corresponding to the candidate text region in the position probability map is smaller than the preset threshold, the position of the candidate text region in the position probability map may be obtained according to the corrected position coordinates of the candidate text region, and the corresponding probability is then read from the position probability map. Alternatively, the candidate text region may first be refined based on its corrected position coordinates and its initial position coordinates (see the description above) to obtain adjusted position coordinates (for example, in an embodiment, the adjusted position coordinates are equal to the average of the corrected position coordinates and the initial position coordinates), and the position of the candidate text region in the position probability map is then determined based on the adjusted position coordinates. In addition, the non-maximum suppression processing is performed according to the adjusted position coordinates of the candidate text region.
According to the method provided by this embodiment, the fourth neural network constructs the spatial relationship among the text, the specified target, and the attribute information of the image to be detected, and thereby yields the spatial information. Therefore, when the finally selected text region is determined based on the obtained spatial information, missed text regions can be recovered, text targets are prevented from being lost, and the position of the text region can be finely adjusted, so that the accuracy of text detection can be improved.
Specifically, the target neural network is pre-established by the following method:
acquiring a training sample set; the training sample set comprises a plurality of pictures;
establishing a standby neural network for constructing the spatial relationship between the text in the image and the attention target; the input of the standby neural network is the specified information, and the output is the spatial information;
and training the standby neural network by adopting the training sample set to obtain the target neural network.
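A minimal training loop along these lines might look as follows. The loss function, optimizer, and sample format are assumptions; the patent only specifies that the standby network is trained on the sample set to obtain the target network.

```python
def train_target_network(standby_net, sample_loader, epochs=10, lr=1e-3):
    """Train the standby neural network on the training sample set; the
    trained network is then used as the target neural network."""
    optimizer = torch.optim.Adam(standby_net.parameters(), lr=lr)
    for _ in range(epochs):
        for specified_info, target_spatial_info in sample_loader:
            predicted = standby_net(specified_info)
            loss = nn.functional.mse_loss(predicted, target_spatial_info)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return standby_net
```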
Corresponding to the embodiment of the text detection method, the application also provides an embodiment of a text detection device.
The embodiments of the text detection apparatus can be applied to computer equipment. The apparatus embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the apparatus in a logical sense is formed by the processor of the computer device where it is located reading corresponding computer program instructions from storage into memory and running them. From a hardware aspect, fig. 5 shows a hardware structure diagram of the computer device where the text detection apparatus according to an exemplary embodiment of the present application is located. In addition to the memory 510, the processor 520, the storage 530, and the network interface 540 shown in fig. 5, the computer device where the apparatus is located may further include other hardware according to the actual function of the text detection apparatus, which is not described again here.
Fig. 6 is a schematic structural diagram of a first embodiment of a text detection apparatus provided in the present application. Referring to fig. 6, the text detection apparatus provided in the present application may include an element generation module 610, a spatial relationship modeling module 620, and a text detection module 630, wherein,
the element generation module 610 is configured to acquire specified information from an image to be detected containing a text; the specified information comprises a feature vector of a candidate text region located from the image to be detected;
the spatial relationship modeling module 620 is configured to input the specified information into a pre-established target neural network for constructing a spatial relationship between a text in an image and an attention target, and output spatial information by the target neural network; wherein the attention target includes at least one of the text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image;
the text detection module 630 is configured to correct the candidate text region according to the spatial information, so as to obtain a finally selected text region in the image to be detected.
The apparatus of this embodiment may be used to implement the technical solution of the method embodiment shown in fig. 1, and the implementation principle and the technical effect are similar, which are not described herein again.
Further, the spatial relationship modeling module 620 is specifically configured to input the specified information into a first neural network in the target neural network, process the specified information by the first neural network, output a confidence of the candidate text region and a position probability map of a suspected text region in the image to be detected, and determine the confidence of the candidate text region and the position probability map as the spatial information.
Further, the specified information also comprises a feature vector of a specified target which is positioned from the image to be detected and has a spatial relationship with the candidate text region; the spatial relationship modeling module 620 is specifically configured to input the specified information into a second neural network in the target neural network, process the specified information by the second neural network, output the confidence of the candidate text region and the position probability map of the suspected text region in the image to be detected, and determine the confidence of the candidate text region and the position probability map as the spatial information.
Further, the specified information also comprises attribute information of the image to be detected; the spatial relationship modeling module 620 is specifically configured to input the specified information into a third neural network in the target neural network, process the specified information by the third neural network, output the corrected position coordinates of the candidate text region, and determine the corrected position coordinates of the candidate text region as the spatial information.
Further, the specified information also comprises a feature vector of a specified target which is positioned from the image to be detected and has a spatial relationship with the candidate text region and attribute information of the image to be detected; the spatial relationship modeling module 620 is specifically configured to input the specified information into a fourth neural network in the target neural network, process the specified information by the fourth neural network, output the confidence of the candidate text region, the corrected position coordinates of the candidate text region, and the position probability map of the suspected text region in the image to be detected, and determine the confidence of the candidate text region, the corrected position coordinates of the candidate text region, and the position probability map as the spatial information.
Further, the processing of the specified information includes:
performing fusion processing on the specified information to obtain fused specified information, and performing weighting processing on the fused specified information, as illustrated in the sketch below.
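As one possible reading of this fusion-then-weighting step, and of the fourth neural network described above, the sketch below concatenates the three kinds of specified information, fuses them, applies a learned weighting, and emits the three kinds of spatial information. The layer choices and all names are assumptions made for illustration, not the patent's prescribed design.

    import torch
    import torch.nn as nn

    class FourthNet(nn.Module):
        """Sketch of the fourth neural network: consumes text-region features,
        related-target features, and image attribute features."""

        def __init__(self, dim: int):
            super().__init__()
            self.fuse = nn.Linear(3 * dim, dim)    # fusion of the specified information
            self.attn = nn.Linear(dim, dim)        # weighting of the fused information
            self.conf_head = nn.Linear(dim, 1)     # confidence of the candidate region
            self.coord_head = nn.Linear(dim, 4)    # corrected position coordinates
            self.prob_head = nn.Conv2d(dim, 1, 1)  # position probability map

        def forward(self, text_feat, target_feat, attr_feat, feat_map):
            # Fusion processing: concatenate and project the specified information.
            fused = torch.relu(self.fuse(torch.cat([text_feat, target_feat, attr_feat], dim=-1)))
            # Weighting processing: gate the fused information with learned weights.
            weighted = fused * torch.sigmoid(self.attn(fused))
            confidence = torch.sigmoid(self.conf_head(weighted))
            coords = self.coord_head(weighted)
            prob_map = torch.sigmoid(self.prob_head(feat_map))
            return confidence, coords, prob_map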
Further, the text detection module 630 is specifically configured to:
determining a first candidate text region and the confidence of the first candidate text region according to the position probability map;
judging whether the probability corresponding to the candidate text region in the position probability map is smaller than a preset threshold;
if so, deleting the candidate text region, and performing non-maximum suppression on the first candidate text region according to the confidence of the first candidate text region to obtain the final selected text region;
if not, determining the candidate text region as a second candidate text region, and performing non-maximum suppression on the first candidate text region and the second candidate text region according to the confidence of the first candidate text region and the confidence of the second candidate text region to obtain the final selected text region; a code sketch of this correction procedure follows.
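A compact sketch of this correction logic is given below. Thresholding against the position probability map and IoU-based non-maximum suppression are standard techniques; how the first candidate text regions are derived from the probability map is not fixed by this embodiment, so they are simply passed in as inputs. All names are hypothetical, and boxes are assumed to use integer pixel coordinates (x1, y1, x2, y2).

    import numpy as np

    def nms(boxes, scores, iou_thresh=0.5):
        """Standard IoU-based non-maximum suppression; returns kept indices."""
        order = scores.argsort()[::-1]
        keep = []
        while order.size > 0:
            i = order[0]
            keep.append(int(i))
            rest = order[1:]
            xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
            yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
            xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
            yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
            inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
            area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
            area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
            iou = inter / (area_i + area_r - inter + 1e-9)
            order = rest[iou <= iou_thresh]
        return keep

    def correct_regions(cand_boxes, cand_scores, first_boxes, first_scores,
                        prob_map, thresh=0.5):
        """Delete candidate regions the probability map does not support,
        keep the rest as second candidates, then suppress overlaps."""
        boxes, scores = list(first_boxes), list(first_scores)
        for box, score in zip(cand_boxes, cand_scores):
            x1, y1, x2, y2 = box
            if prob_map[y1:y2, x1:x2].mean() >= thresh:  # second candidate region
                boxes.append(box)
                scores.append(score)
            # below the threshold: the candidate region is deleted
        boxes, scores = np.asarray(boxes), np.asarray(scores)
        keep = nms(boxes, scores)
        return boxes[keep], scores[keep]  # final selected text regions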
Further, the present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the methods provided in the first aspect of the present application.
In particular, computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including, by way of example, semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, and CD-ROM and DVD-ROM disks.
With continued reference to fig. 5, the present application further provides a computer device comprising a memory 510, a processor 520, and a computer program stored on the memory 510 and executable on the processor 520, wherein the processor 520, when executing the program, performs the steps of any of the methods provided in the first aspect of the present application.
The above description is only an exemplary embodiment of the present application and is not intended to limit the present application; any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within the scope of protection of the present application.

Claims (10)

1. A text detection method, the method comprising:
acquiring specified information from an image to be detected containing text; the specified information comprises a feature vector of a candidate text region located in the image to be detected;
inputting the specified information into a pre-established target neural network for constructing a spatial relationship between a text in the image and an attention target, and outputting spatial information by the target neural network; wherein the attention target includes at least one of a text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image;
and correcting the candidate text region according to the spatial information to obtain a final selected text region in the image to be detected.
2. The method according to claim 1, wherein the inputting the specified information into a pre-established target neural network for constructing a spatial relationship between text in an image and an attention target, and outputting spatial information by the target neural network, comprises:
inputting the specified information into a first neural network in the target neural network, processing the specified information by the first neural network, and outputting the confidence of the candidate text region and a position probability map of the suspected text region in the image to be detected;
determining the confidence of the candidate text region and the position probability map as the spatial information.
3. The method according to claim 1, wherein the specified information further comprises a feature vector of a specified target that is located in the image to be detected and has a spatial relationship with the candidate text region; the inputting the specified information into a pre-established target neural network for constructing a spatial relationship between text in an image and an attention target, and outputting spatial information by the target neural network, comprises:
inputting the specified information into a second neural network in the target neural network, processing the specified information by the second neural network, and outputting the confidence of the candidate text region and a position probability map of the suspected text region in the image to be detected;
determining the confidence of the candidate text region and the position probability map as the spatial information.
4. The method according to claim 1, wherein the specified information further includes attribute information of the image to be detected; the inputting the specified information into a pre-established target neural network for constructing a spatial relationship between text in an image and an attention target, and outputting spatial information by the target neural network, comprises:
inputting the specified information into a third neural network in the target neural network, processing the specified information by the third neural network, and outputting the corrected position coordinates of the candidate text region;
and determining the corrected position coordinates of the candidate text region as the spatial information.
5. The method according to claim 1, wherein the specified information further comprises a feature vector of a specified target that is located in the image to be detected and has a spatial relationship with the candidate text region, and attribute information of the image to be detected; the inputting the specified information into a pre-established target neural network for constructing a spatial relationship between text in an image and an attention target, and outputting spatial information by the target neural network, comprises:
inputting the specified information into a fourth neural network in the target neural network, processing the specified information by the fourth neural network, and outputting the confidence of the candidate text region, the corrected position coordinates of the candidate text region and a position probability map of the suspected text region in the image to be detected;
and determining the confidence of the candidate text region, the corrected position coordinates of the candidate text region and the position probability map as the spatial information.
6. The method according to any one of claims 2, 3 and 5, wherein the correcting the candidate text region according to the spatial information to obtain a final selected text region in the image to be detected comprises:
determining a first candidate text region and the confidence of the first candidate text region according to the position probability map;
judging whether the probability corresponding to the candidate text region in the position probability map is smaller than a preset threshold;
if so, deleting the candidate text region, and performing non-maximum suppression on the first candidate text region according to the confidence of the first candidate text region to obtain the final selected text region;
if not, determining the candidate text region as a second candidate text region, and performing non-maximum suppression on the first candidate text region and the second candidate text region according to the confidence of the first candidate text region and the confidence of the second candidate text region to obtain the final selected text region.
7. The method according to claim 1, wherein the target neural network is pre-established by:
acquiring a training sample set; the training sample set comprises a plurality of pictures;
establishing a standby neural network for constructing a spatial relationship between the text in the image and the attention target; the input of the standby neural network is the specified information, and the output is the spatial information;
and training the standby neural network using the training sample set to obtain the target neural network.
8. A text detection apparatus, comprising an element generation module, a spatial relationship modeling module, and a text detection module, wherein,
the element generation module is configured to acquire specified information from an image to be detected containing text; the specified information comprises a feature vector of a candidate text region located in the image to be detected;
the spatial relationship modeling module is configured to input the specified information into a pre-established target neural network for constructing a spatial relationship between text in the image and an attention target, the target neural network outputting spatial information; the attention target includes at least one of the text in the image, a specified target in the image having a spatial relationship with the text, and attribute information of the image;
and the text detection module is configured to correct the candidate text region according to the spatial information to obtain a final selected text region in the image to be detected.
9. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 7.
10. A computer device comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the program, implements the steps of the method according to any one of claims 1 to 7.
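Claim 7 above pre-establishes the target neural network by training a standby neural network. Purely as an illustration of that procedure, the sketch below runs a generic training loop; the optimizer, the loss function, and all names are assumptions, since the claim specifies none of them.

    import torch

    def train_standby_network(standby_net, sample_loader, epochs=10, lr=1e-3):
        """Train the standby neural network (input: specified information,
        output: spatial information) to obtain the target neural network."""
        loss_fn = torch.nn.MSELoss()  # illustrative choice; the claim fixes no loss
        opt = torch.optim.Adam(standby_net.parameters(), lr=lr)
        for _ in range(epochs):
            for specified_info, target_spatial_info in sample_loader:
                predicted = standby_net(specified_info)
                loss = loss_fn(predicted, target_spatial_info)
                opt.zero_grad()
                loss.backward()
                opt.step()
        return standby_net  # the trained standby network serves as the target neural network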
CN201910188639.XA 2019-03-13 2019-03-13 Text detection method and device and computer equipment Active CN111695377B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910188639.XA CN111695377B (en) 2019-03-13 2019-03-13 Text detection method and device and computer equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910188639.XA CN111695377B (en) 2019-03-13 2019-03-13 Text detection method and device and computer equipment

Publications (2)

Publication Number Publication Date
CN111695377A true CN111695377A (en) 2020-09-22
CN111695377B CN111695377B (en) 2023-09-29

Family

ID=72475629

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910188639.XA Active CN111695377B (en) 2019-03-13 2019-03-13 Text detection method and device and computer equipment

Country Status (1)

Country Link
CN (1) CN111695377B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5884296A (en) * 1995-03-13 1999-03-16 Minolta Co., Ltd. Network and image area attribute discriminating device and method for use with said neural network
US20050281455A1 (en) * 2004-06-17 2005-12-22 Chun-Chia Huang System of using neural network to distinguish text and picture in images and method thereof
US20180314715A1 (en) * 2013-05-01 2018-11-01 Cloudsight, Inc. Content Based Image Management and Selection
KR101631694B1 (en) * 2015-08-24 2016-06-21 수원대학교산학협력단 Pedestrian detection method by using the feature of hog-pca and rbfnns pattern classifier
US20180342061A1 (en) * 2016-07-15 2018-11-29 Beijing Sensetime Technology Development Co., Ltd Methods and systems for structured text detection, and non-transitory computer-readable medium
CN106570497A (en) * 2016-10-08 2017-04-19 中国科学院深圳先进技术研究院 Text detection method and device for scene image
CN106980858A (en) * 2017-02-28 2017-07-25 中国科学院信息工程研究所 The language text detection of a kind of language text detection with alignment system and the application system and localization method
CN107886082A (en) * 2017-11-24 2018-04-06 腾讯科技(深圳)有限公司 Mathematical formulae detection method, device, computer equipment and storage medium in image
CN108805131A (en) * 2018-05-22 2018-11-13 北京旷视科技有限公司 Text line detection method, apparatus and system
CN108885699A (en) * 2018-07-11 2018-11-23 深圳前海达闼云端智能科技有限公司 Character identifying method, device, storage medium and electronic equipment
CN109389091A (en) * 2018-10-22 2019-02-26 重庆邮电大学 The character identification system and method combined based on neural network and attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
黄同城;丁友东;: "Research on image text information extraction technology based on wavelet neural networks" (基于小波神经网络的图像文本信息提取技术研究), no. 02 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112330379A (en) * 2020-11-25 2021-02-05 税友软件集团股份有限公司 Invoice content generation method and system, electronic equipment and storage medium
CN112330379B (en) * 2020-11-25 2023-10-31 税友软件集团股份有限公司 Invoice content generation method, invoice content generation system, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN111695377B (en) 2023-09-29

Similar Documents

Publication Publication Date Title
CN108229490B (en) Key point detection method, neural network training method, device and electronic equipment
CN108960211B (en) Multi-target human body posture detection method and system
JP5188334B2 (en) Image processing apparatus, image processing method, and program
JP5714599B2 (en) Fast subspace projection of descriptor patches for image recognition
KR101896357B1 (en) Method, device and program for detecting an object
US11017210B2 (en) Image processing apparatus and method
US20130089260A1 (en) Systems, Methods, and Software Implementing Affine-Invariant Feature Detection Implementing Iterative Searching of an Affine Space
CN110059728B (en) RGB-D image visual saliency detection method based on attention model
US20100290708A1 (en) Image retrieval apparatus, control method for the same, and storage medium
CN111709313B (en) Pedestrian re-identification method based on local and channel combination characteristics
CN111612822B (en) Object tracking method, device, computer equipment and storage medium
CN113269257A (en) Image classification method and device, terminal equipment and storage medium
CN111898428A (en) Unmanned aerial vehicle feature point matching method based on ORB
EP3671635B1 (en) Curvilinear object segmentation with noise priors
CN110020593B (en) Information processing method and device, medium and computing equipment
CN114549861A (en) Target matching method based on feature point and convolution optimization calculation and storage medium
CN113159103B (en) Image matching method, device, electronic equipment and storage medium
CN111695377B (en) Text detection method and device and computer equipment
CN113269752A (en) Image detection method, device terminal equipment and storage medium
CN110969176A (en) License plate sample amplification method and device and computer equipment
CN116704206A (en) Image processing method, device, computer equipment and storage medium
CN116758419A (en) Multi-scale target detection method, device and equipment for remote sensing image
CN116246161A (en) Method and device for identifying target fine type of remote sensing image under guidance of domain knowledge
CN114119970B (en) Target tracking method and device
Wu et al. An accurate feature point matching algorithm for automatic remote sensing image registration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant