CN111738263A - Target detection method and device, electronic equipment and storage medium - Google Patents

Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN111738263A
CN111738263A (application CN202010853517.0A)
Authority
CN
China
Prior art keywords
detection
target
frame
text
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010853517.0A
Other languages
Chinese (zh)
Inventor
Kang Kai (康凯)
Li Bing (李兵)
Qin Yong (秦勇)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202010853517.0A
Publication of CN111738263A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Abstract

The application discloses a target detection method and device, an electronic device, and a storage medium. The scheme is as follows: obtaining, in response to acquisition processing, a to-be-processed picture containing a target text; inputting the picture into a pre-trained target detection network to obtain at least two candidate detection boxes, containing the target text and partial texts of the target text respectively, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence; computing a classification index from the overlapping area of the at least two candidate detection boxes, and classifying and screening the candidate boxes according to that index so as to remove the boxes containing only partial texts of the target text and obtain the target detection box. With the method and device, the accuracy of target detection can be improved.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer vision technologies, and in particular, to a target detection method and apparatus, an electronic device, and a storage medium.
Background
As electronic devices such as portable devices and mobile phone terminals become more intelligent than before and their chips gain stronger analysis capability, image-text information, video information, and the like can be analyzed efficiently through computer vision technology, and the target objects they contain can be detected.
Taking layout analysis as an application scene as an example, in the images acquired for layout analysis most target objects are text information. Because the texture features of text information lack shape information, text is harder to recognize and classify than, for example, video information, so the accuracy of target detection for text is low, which in turn degrades the accuracy of the layout analysis. The related art offers no effective solution for improving the target detection accuracy of text information.
Disclosure of Invention
The application provides a target detection method, a target detection device, electronic equipment and a storage medium.
According to an aspect of the present application, there is provided a target detection method including:
obtaining, in response to acquisition processing, a to-be-processed picture containing a target text;
inputting the picture to be processed into a pre-trained target detection network to obtain at least two candidate detection boxes, containing the target text and partial texts of the target text respectively, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence;
and computing a classification index from the overlapping area of the at least two candidate detection boxes, and classifying and screening the at least two candidate detection boxes according to the classification index to remove the candidate detection boxes containing only partial texts of the target text and obtain the target detection box.
According to another aspect of the present application, there is provided an object detecting apparatus including:
a picture acquisition module, configured to obtain, in response to acquisition processing, a to-be-processed picture containing a target text;
a candidate detection box obtaining module, configured to input the picture to be processed into a pre-trained target detection network to obtain at least two candidate detection boxes, containing the target text and partial texts of the target text respectively, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence;
and a target detection module, configured to compute a classification index from the overlap region of the at least two candidate detection boxes, and to classify and screen the at least two candidate detection boxes according to the classification index so as to remove the candidate detection boxes containing only partial texts of the target text and obtain the target detection box.
According to another aspect of the present application, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method as provided by any one of the embodiments of the present application.
According to another aspect of the present application, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform a method provided by any one of the embodiments of the present application.
With the method and device, a to-be-processed picture containing a target text can be obtained in response to acquisition processing; the picture is input into a pre-trained target detection network, yielding at least two candidate detection boxes that contain the target text and partial texts of the target text, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence; a classification index is computed from the overlapping area of the at least two candidate detection boxes, and the boxes are classified and screened according to that index. In this way the candidate boxes containing the target text and the falsely detected candidate boxes containing only partial texts of the target text can be told apart; the latter are removed, and the target detection box is obtained, which improves the accuracy of target detection for the target text information.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIGS. 1-2 are schematic diagrams of output detection boxes and the filtering of output detection boxes according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an intersection-over-union operation according to an embodiment of the present application;
FIGS. 4-7 are schematic diagrams of candidate detection boxes exhibiting a box-in-box or box-in-box-like pattern according to embodiments of the present application;
FIG. 8 is a schematic flow chart of a target detection method according to an embodiment of the present application;
FIG. 9 is a schematic diagram of an overlap region operation according to an embodiment of the present application;
FIGS. 10-11 are schematic diagrams of calculation forms applied in layout analysis according to embodiments of the present application;
FIG. 12 is a schematic diagram of the structure of a target detection device according to an embodiment of the present application;
FIG. 13 is a block diagram of an electronic device for implementing the target detection method of the embodiments of the present application.
Detailed Description
The following description of exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments to aid understanding, and these details are to be considered exemplary only. Those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted below for clarity and conciseness.
The term "and/or" herein merely describes an association between associated objects and covers three relationships; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. The term "at least one" herein means any one of, or any combination of at least two of, a plurality of items; for example, "including at least one of A, B, C" may mean including any one or more elements selected from the set formed by A, B and C. The terms "first" and "second" are used to distinguish similar objects and do not necessarily imply a sequence or an order, nor do they limit the count to two; a "first" or "second" element may be one or more.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
For target detection, a Non-Maximum Suppression (NMS) algorithm may be used as the post-processing step of a target detection algorithm. The detection algorithm may be the Single Shot MultiBox Detector (SSD), which directly predicts both the category of the target object and the detection boxes for it. FIGS. 1-2 are schematic diagrams of output detection boxes and the filtering of output detection boxes according to an embodiment of the present application. As shown in FIG. 1, the SSD algorithm outputs more detection boxes for the same target (e.g., a human face) than the actual number of targets; the NMS algorithm can then filter the multiple, mutually overlapping boxes output near the same target so that only the single most accurate box is retained on each target, as shown in FIG. 2.
In one example, let B be the set of candidate detection boxes and S the corresponding set of confidence scores, with the two sets in one-to-one correspondence, i.e., each detection box has a given score. Filtering the detection boxes output by the SSD algorithm with the NMS algorithm includes the following steps:
1. Find the detection box M with the highest score, i.e., the box with the highest confidence, and take it as the selected box;
2. Delete box M from the detection box set B;
3. Add box M to the set D of trusted target detection boxes;
4. For each remaining box in set B near box M, take it as a comparison box relative to the selected box M, and delete the comparison box if its overlapping area with the selected box is greater than a threshold;
5. Repeat steps 1-4 until the detection box set B is empty, so that only the most accurate detection box is retained on each target in set D.
Whether the overlapping area of the selected box and a comparison box is greater than the threshold can be judged with the Intersection over Union (IoU). FIG. 3 is a schematic diagram of an intersection-over-union operation according to an embodiment of the present application. As shown in FIG. 3, the numerator of IoU is the intersection of the selected box and the comparison box, i.e., their overlapping region 11, and the denominator of IoU is the union of the selected box and the comparison box, i.e., the region 12 enclosed by the outer border of the two boxes superimposed together.
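For illustration, the following is a minimal sketch of the NMS procedure in steps 1-5 above with IoU as the overlap measure; the function names are hypothetical, and the boxes are assumed to be (x1, y1, x2, y2) corner coordinates, neither of which the patent specifies.

```python
# A minimal sketch of IoU-based NMS (steps 1-5 above); names are hypothetical.
def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def iou(a, b):
    # Intersection rectangle of the selected box and the comparison box.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = area((ix1, iy1, ix2, iy2))          # overlapping region 11 in FIG. 3
    union = area(a) + area(b) - inter           # region 12 in FIG. 3
    return inter / union if union > 0 else 0.0

def nms(boxes, scores, threshold=0.5):
    """Retain, for each target, its highest-scoring box."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    kept = []                                   # the trusted set D
    while order:                                # step 5: repeat until B is empty
        m = order.pop(0)                        # steps 1-2: best remaining box M
        kept.append(boxes[m])                   # step 3: add M to D
        # step 4: drop comparison boxes that overlap M more than the threshold
        order = [i for i in order if iou(boxes[m], boxes[i]) <= threshold]
    return kept
```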
When the target object is a human face, the above target detection algorithm can obtain a fairly accurate target detection box. When the target object is text information, however, the situation differs. For example, in a scene of photographing and judging questions based on layout analysis, the user wants the solution to a question; the question itself should serve as the target text, i.e., the target text to be recognized is the content of the question to be solved. Because the texture features of text information are less distinctive than those of face information, applying the face-oriented detection algorithm may fail to recognize the target text accurately and instead recognize a part of the text content within the target text as the content to be solved, resulting in false detection. Further, since recognition of part of the text content is very likely mixed in with recognition of the question itself, the answer derived from such a target detection result is not the correct answer the user intends.
When the target object is text information, the box area of a candidate detection box obtained by recognizing only part of the text content of a question is smaller than the box area of the candidate box obtained by recognizing the question itself; relative to the latter, the former may be called a "box in box" or a "box-in-box-like" case. Here, a partial text is a text region that overlaps the target text either wholly (box in box) or partially (box-in-box-like) and includes at least one line of a text sequence.
FIGS. 4-7 are schematic diagrams of candidate detection boxes exhibiting the above box-in-box or box-in-box-like pattern. As shown in FIG. 4, taking layout analysis as an example, the target objects in the picture are text information. Because the texture features of text lack shape information, the recognized candidate detection boxes are not accurate enough and misrecognition occurs; for example, a first detection box 21 and a second detection box 22 may both be produced. Through layout analysis the user needs to extract the target text in the first detection box 21 (e.g., the complete derivation process of a mathematical question) for solving, but does not need to solve the partial text in the second detection box 22. The misrecognized candidate boxes, such as the second detection box 22, therefore need to be filtered out.
The present application is not limited to the form of candidate detection box shown in FIG. 4; the "box in box" of FIGS. 5-6 and the "box-in-box-like" case of FIG. 7 are also within its protection scope. The candidate boxes corresponding to the shaded parts of FIGS. 5-7 are misrecognition results that contain only part of the required target text and must be filtered out; filtering them yields a precise target detection box. To this end, the application uses as the classification index the ratio of the Intersection area to the Smaller box area (IoSR, Intersection over Smaller Region) of the candidate boxes, in place of the IoU above, when running the NMS algorithm. Applying this NMS variant to the multiple detection boxes output in a layout analysis scene filters out the box-in-box and box-in-box-like results, which improves the target detection accuracy for the target text in layout analysis and, in turn, the accuracy of the layout analysis itself, yielding a correct solving result for the target text.
Here, the term "second detection box" means: if the selected box is denoted the first detection box (i.e., taken as the more correct detection box), then the second detection box is the smaller box of the two compared with the first detection box.
According to an embodiment of the present application, a target detection method is provided. FIG. 8 is a flowchart of the target detection method according to an embodiment of the present application. The method may be applied to a target detection apparatus, which may be deployed in a terminal, a server, or another processing device for execution. The terminal may be a User Equipment (UE), a mobile device, a cellular phone, a cordless phone, a Personal Digital Assistant (PDA), a handheld device, a computing device, a vehicle-mounted device, a wearable device, and so on. In some possible implementations, the method may also be implemented by a processor calling computer-readable instructions stored in a memory. As shown in FIG. 8, the method includes:
S101: obtain, in response to acquisition processing, a to-be-processed picture containing the target text.
In one example, as shown in FIG. 4, the target text may be a question in a layout page that comprises multiple lines of sub-expressions, where those sub-expressions are partial texts of the target text.
S102: input the picture to be processed into a pre-trained target detection network to obtain at least two candidate detection boxes, containing the target text and partial texts of the target text respectively, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence.
In an example, as shown in FIGS. 4-7, the at least two candidate detection boxes may include, in addition to the candidate box containing the target text, candidate boxes containing only partial texts of the target text.
S103: compute a classification index from the overlap region of the at least two candidate detection boxes, and classify and screen the at least two candidate detection boxes according to the classification index so as to remove the candidate boxes containing only partial texts of the target text and obtain the target detection box.
In an example, the classification index may be an overlap-area index measuring the overlap of the at least two candidate detection boxes, so that the required target detection box can be identified accurately among them.
For the classification screening, the classification index may be compared with a preset threshold to obtain a comparison result. The candidate box containing the target text may serve as the first detection box, and a candidate box containing only partial text of the target text as a second detection box. If the comparison result indicates that a candidate box whose classification index is greater than the preset threshold is a second detection box, that second detection box is deleted from the at least two candidate detection boxes to obtain the target detection box.
With the method and device, a to-be-processed picture containing a target text can be obtained in response to acquisition processing and input into a pre-trained target detection network to obtain at least two candidate detection boxes containing the target text and partial texts of the target text, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence. The at least two candidate boxes can then be classified and screened according to a classification index computed from their overlap region, so that the boxes containing the target text and the boxes containing only partial texts are identified; removing the latter yields the target detection box, improving the target detection accuracy for text information.
In an example, the target detection network may be an SSD network comprising functions for feature extraction, identification, and classification, each corresponding to a different functional module or sub-network. For example, the SSD network may include a feature-extraction sub-network, on top of which several fully-connected layers (for integrating the extracted feature vectors) and activation layers (for applying a non-linearity to the integrated vectors) are added, the final output being the at least two candidate detection boxes. The at least two candidate boxes are then classified and screened (an NMS algorithm may be used) according to the classification index, obtaining the target detection box with the partial-text candidate boxes removed.
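As an illustration only, a network of the shape just described (a feature-extraction sub-network followed by fully-connected and activation layers) might be sketched as follows in PyTorch; the backbone, layer sizes, and output format are assumptions, not the network actually claimed.

```python
# A hypothetical sketch of the network shape described above; the backbone,
# layer sizes, and per-box output format are assumptions.
import torch
import torch.nn as nn

class TextDetector(nn.Module):
    def __init__(self, num_boxes=100):
        super().__init__()
        # Feature-extraction sub-network (placeholder convolutional backbone).
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),
        )
        # Fully-connected layers integrate the extracted feature vectors;
        # activation layers apply the non-linearity mentioned in the text.
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 512), nn.ReLU(),
            nn.Linear(512, num_boxes * 5),  # per box: (x1, y1, x2, y2, score)
        )
        self.num_boxes = num_boxes

    def forward(self, images):
        out = self.head(self.features(images))
        return out.view(-1, self.num_boxes, 5)  # candidate boxes + confidences
```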
In one example, the method further comprises: taking to-be-processed picture data containing target texts as training sample data, and performing a loss operation on the training sample data and the labeled data used for training the target detection network to obtain a loss function; then training the target detection network by back-propagating the loss and adjusting the network weights until the network converges, at which point training ends and the trained network is obtained. The trained target detection network is used as the pre-trained target detection network above.
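A minimal, hypothetical training step matching this description (a loss computed from predictions against labeled data, then back-propagation and weight updates), reusing the sketch above; the loss function and optimizer are assumptions.

```python
# Hypothetical training step for the sketch above; the loss function,
# optimizer, and data format are assumptions.
import torch

model = TextDetector()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
loss_fn = torch.nn.SmoothL1Loss()  # an assumed box-regression loss

def train_step(images, targets):
    """images: (N, 3, H, W) sample pictures; targets: (N, num_boxes, 5) labels."""
    optimizer.zero_grad()
    loss = loss_fn(model(images), targets)  # loss operation: samples vs. labels
    loss.backward()                         # back-propagation of the loss
    optimizer.step()                        # adjust the network weights
    return loss.item()
```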
The embodiments of the present application are not limited to the SSD network type; any neural network capable of realizing the target detection of the embodiments of the present application is within their protection scope.
In an embodiment of the present application, the method further includes: extracting, from the at least two candidate detection boxes, a first detection box containing the target text and a second detection box containing partial text of the target text, performing an overlap region operation on the overlap region of the first and second detection boxes to obtain an overlap region index, and taking the overlap region index as the classification index.
In one example, performing the overlap region operation to obtain the overlap region index includes: taking the overlapping area of the first detection box and the second detection box as the intersection, taking the box area of the second detection box as the denominator in place of the union, and computing the overlap region index (i.e., the IoSR) from the two. FIG. 9 is a schematic diagram of the overlap region operation according to an embodiment of the present application. As shown in FIG. 9, the numerator of the IoSR is the intersection of the selected box (e.g., the first detection box) and the comparison box (e.g., the second detection box), i.e., their overlapping region 61, and the denominator of the IoSR is the area of the comparison box (i.e., the box with the smaller area of the two), i.e., the region 62 of the second detection box.
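In code form, IoSR differs from IoU only in its denominator. A sketch, reusing the area() helper and the corner-coordinate convention assumed in the earlier NMS sketch:

```python
# IoSR: intersection area over the smaller box's area (FIG. 9); a sketch.
def iosr(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = area((ix1, iy1, ix2, iy2))    # overlapping region 61
    smaller = min(area(a), area(b))       # region 62: the smaller box
    return inter / smaller if smaller > 0 else 0.0
```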
Consider the "photographing and judging questions" situation in a layout analysis scene, where overlapping boxes such as box in box or box-in-box-like easily appear among the candidate detection boxes recognizing the target text. Because the texture features of text information are far less distinctive than in scenes such as face detection, it is hard for a neural network to detect the box-in-box or box-in-box-like cases through feature recognition and classification alone. With the present application, the IoSR can be used as the classification index in the NMS algorithm. Since the classification index IoSR is usually larger than the threshold in the comparison with the preset threshold, these cases become much easier to recognize and classify: the classification index is compared with the preset threshold for the classification screening, and if the comparison result is that a candidate box whose classification index exceeds the preset threshold is a second detection box, that second detection box is deleted from the at least two candidate boxes to obtain the target detection box. This raises the probability of detecting the box-in-box or box-in-box-like cases with high accuracy, and can finally improve the accuracy of the layout analysis.
In an example, the overlap region includes: a first region where the first detection box and the second detection box overlap completely (the box-in-box cases of FIGS. 4-6); and/or a second region where the first detection box and the second detection box overlap partially (the box-in-box-like case of FIG. 7).
There may be more than one first detection box, in which case the first detection box with the highest confidence among them can be found first. That is, a selected first detection box and a to-be-compared box adjacent to it (i.e., the second detection box corresponding to the selected first box) are obtained according to confidence. In an embodiment of the present application, extracting, from the at least two candidate detection boxes, a first detection box containing the target text and a second detection box containing partial text of the target text includes: extracting, from the at least two candidate boxes, a first selected box with the highest confidence among the candidate first detection boxes; and obtaining the second comparison box closest to the first detection box according to the distance between the first detection box and the peripheral candidate second detection boxes. Correspondingly, performing the overlap region operation on the overlap region of the first and second detection boxes to obtain the overlap region index includes: performing the overlap region operation on the first selected box and the second comparison box to obtain the overlap region index.
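A hypothetical sketch of this selection step: pick the highest-confidence first box, then the nearest candidate second box. Center-point distance is an assumed distance measure, not one the text specifies.

```python
# Hypothetical selection of the first (selected) box and the nearest
# second (comparison) box; center distance is an assumed measure.
def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def select_pair(first_boxes, first_scores, second_boxes):
    # First selected box: highest confidence among candidate first boxes.
    m = max(range(len(first_boxes)), key=lambda i: first_scores[i])
    selected = first_boxes[m]
    cx, cy = center(selected)
    # Second comparison box: the peripheral candidate closest to the selected box.
    compare = min(second_boxes,
                  key=lambda b: (center(b)[0] - cx) ** 2 + (center(b)[1] - cy) ** 2)
    return selected, compare
```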
In an embodiment of the present application, inputting the picture to be processed into the pre-trained target detection network further includes: extracting, with the pre-trained target detection network, features of the target text in the picture to obtain the texture features of the target text, where the texture features include at least one of the features of column (vertical-form) calculation, step-by-step (expanded-form) calculation, and equation solving used for layout analysis.
It should be noted that the IoSR-based NMS process of the present application is applicable not only to layout analysis scenes but also to farmland area detection and forest detection in aerial images, and even to pedestrian detection, vehicle detection, and the like. In target detection there is a case where the texture of a partial region of a target resembles the texture of the whole target region, i.e., only a part of the target is seen. Taking pedestrian detection as an example, the target detection network sometimes outputs a detection box that frames only the upper half of the body; using IoU in the NMS process cannot easily filter out that half-body box, whereas using the IoSR of the present application can, thereby improving the accuracy of target detection.
In an example, FIGS. 10-11 are schematic diagrams of calculation forms applied in layout analysis according to an embodiment of the present application. FIG. 10 shows a column (vertical-form) calculation, which arranges the computation column by column to simplify it. FIG. 11 shows a step-by-step (expanded-form) calculation, an operation form in which the calculation process is written out completely, line by line. The scope of the present application is not limited to the column and step-by-step calculations; it may also include solving equations, i.e., the process of finding the values of all unknowns in an equation. For example, for the equation ax + b = z with constants a, b, z, the unknown x needs to be solved. To solve the "photographing and judging questions" problem in the layout analysis scene, the specific question types on a page need to be found for subsequent solving. With the target detection method of this embodiment, the specific question types and the corresponding target detection boxes can be recognized from the texture features of the target text, such as at least one of the features of column calculation, step-by-step calculation, and equation solving used for layout analysis, so that the several question types to be determined on the page (e.g., column, step-by-step, and equation-solving calculations) are finally detected.
Application example:
The processing flow of this application example includes the following:
the classification index "IoSR" shown in fig. 9 is used to realize the target detection of the present application. The numerator portion used to calculate the IoSR may use the intersection of two boxes, a selected box (e.g., a first detection box) and an alignment box (e.g., a second detection box), and the denominator portion may use the area of the smaller of the two boxes. The classification index "IoSR" obtained by adopting the calculation method is close to 1 for the "frame in frame" or the "frame in frame" of the class, and the preset threshold value can be generally set to about 0.5, and the first detection frame and the second detection frame output in the layout analysis scene are filtered by adopting the NMS algorithm, namely the frames with the classification index "IoSR" greater than the preset threshold value are filtered, so that the "frame in frame" and the "frame in frame" can be filtered, the target detection accuracy rate for the target text in the layout analysis is improved, the accuracy rate for the layout analysis is finally improved, and the correct solution result for the target text is obtained.
Comparing the classification index IoSR with the original classification index IoU used in the NMS algorithm: mathematically, the area of the smaller of the two boxes is never larger than the area of their union, so the intersection divided by the smaller area (IoSR) is always at least the intersection divided by the union (IoU) under the same conditions. Since the NMS rule filters out boxes whose overlap index exceeds the preset threshold, adopting the IoSR raises the probability that the comparison boxes adjacent to the selected box are filtered out; that is, if two detection boxes intersect, one of them is discarded with higher probability.
The classification index IoSR adopted by the present application suits target detection of text information, and also suits detection scenes where the targets themselves do not overlap, or overlap little. Because the IoSR discards intersecting detection boxes with higher probability, and the detection targets themselves produce no intersections in such scenes, accuracy is not reduced and can even be improved; for example, in access-control face recognition, adopting the IoSR can raise the detection accuracy.
When the targets do overlap, however, the IoSR easily filters out detection boxes that should be kept, reducing detection accuracy. For example, in a large-scale detection scene with overlapping faces (such as face detection in a supermarket or shopping mall), many targets exist at once and easily overlap one another, and adopting the IoSR lowers the detection accuracy.
The specific description is as follows:
in face detection, "frame in frame" or "frame in class" rarely occurs, that is, only a small part of the face, such as the nose or eyes of the face, is not taken as the face target frame in target detection, for example, in fig. 1, all the frames contain most of the face, and the situation that only the nose or eyes are taken as the face to be recognized does not occur. From the image perspective, the characteristics of the face region are very clear and unique, and if the complete face information cannot be extracted in the target detection, the region will not be output as the target region. However, in the layout analysis scenario of the present application, for example, the cross-over calculation of fig. 4, the texture features are represented by a line-by-line expression, and the line becomes shorter and shorter from the top to the bottom, so that it is highly likely that a detection box containing only a part of text, i.e., "box-in-box" or "class-box-in-box" in fig. 5-7, will be output in the target detection. Therefore, the classification index "IoSR" of the present application is more suitable for target detection of text information.
In a large-scale detection scene with overlapping faces, for example face detection of multiple pedestrians, the pedestrians very likely overlap in a staggered manner, and the NMS process applying the IoSR easily discards part of the pedestrian boxes, lowering the model's metrics. This problem does not arise in the layout analysis scene of the present application, because the questions theoretically have no overlapping regions and each question is isolated; applying the IoSR in the NMS process of layout analysis therefore brings only positive improvement and does not lower the detection metrics. In summary, the target detection method and the classification index IoSR of the present application are well suited to target detection of text information and can markedly improve the performance of layout analysis.
According to an embodiment of the present application, a target detection apparatus is provided. FIG. 12 is a schematic structural diagram of the target detection apparatus according to an embodiment of the present application. As shown in FIG. 12, the apparatus includes: a picture acquisition module 81, configured to obtain, in response to acquisition processing, a to-be-processed picture containing a target text; a candidate detection box obtaining module 82, configured to input the picture to be processed into a pre-trained target detection network to obtain at least two candidate detection boxes, containing the target text and partial texts of the target text respectively, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence; and a target detection module 83, configured to compute a classification index from the overlap region of the at least two candidate detection boxes and to classify and screen them according to the classification index so as to remove the candidate boxes containing only partial texts of the target text and obtain the target detection box.
In one embodiment, the apparatus further comprises: a detection box extracting module, configured to extract, from the at least two candidate detection boxes, a first detection box containing the target text and a second detection box containing partial text of the target text; and an overlap processing module, configured to perform an overlap region operation on the overlap region of the first and second detection boxes to obtain an overlap region index, and to take the overlap region index as the classification index.
In one embodiment, the overlap processing module is configured to take the overlapping area of the first detection box and the second detection box as the intersection, take the box area of the second detection box as the union, and perform the overlap region operation on the intersection and the union to obtain the overlap region index.
In one embodiment, the target detection module is configured to compare the classification index with a preset threshold for the classification screening to obtain a comparison result, and, if the comparison result indicates that a candidate box whose classification index is greater than the preset threshold is the second detection box, to delete that second detection box from the at least two candidate boxes to obtain the target detection box.
In one embodiment, the overlap region comprises: a first region where the first detection box and the second detection box overlap completely; and/or a second region where the first detection box and the second detection box overlap partially.
In one embodiment, the detection box extracting module is configured to extract, from the at least two candidate boxes, a first selected box with the highest confidence, and to obtain the second comparison box closest to the first detection box according to the distance between the first detection box and the peripheral candidate second detection boxes; correspondingly, the overlap processing module is configured to perform the overlap region operation on the first selected box and the second comparison box to obtain the overlap region index.
In one embodiment, the candidate detection box obtaining module includes a feature extraction sub-module, configured to extract features of the target text in the picture to be processed according to the pre-trained target detection network to obtain the texture features of the target text, the texture features including at least one of the features of column calculation, step-by-step calculation, and equation solving used for layout analysis.
In one embodiment, the apparatus further includes a training module, configured to take to-be-processed picture data containing target texts as training sample data, perform a loss operation on the training sample data and the labeled data used for training the target detection network to obtain a loss function, train the target detection network by back-propagating the loss to obtain a trained network, and take the trained target detection network as the pre-trained target detection network.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
According to an embodiment of the present application, an electronic device and a readable storage medium are also provided.
Fig. 13 is a block diagram of an electronic device for implementing the object detection method according to the embodiment of the present application. The electronic device may be the aforementioned deployment device or proxy device. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in FIG. 13, the electronic device includes: one or more processors 801, a memory 802, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are interconnected by different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions executed within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to an interface). In other embodiments, multiple processors and/or multiple buses may be used together with multiple memories, as desired. Likewise, multiple electronic devices may be connected, each providing part of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). FIG. 13 takes one processor 801 as an example.
The memory 802 is a non-transitory computer-readable storage medium provided herein, wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the target detection method provided herein. The non-transitory computer-readable storage medium of the present application stores computer instructions for causing a computer to perform the target detection method provided by the present application.
The memory 802, as a non-transitory computer-readable storage medium, may be used to store non-transitory software programs, non-transitory computer-executable programs, and modules, such as the program instructions/modules corresponding to the target detection method in the embodiments of the present application (for example, the picture acquisition module 81, the candidate detection box obtaining module 82, and the target detection module 83 shown in FIG. 12). By running the non-transitory software programs, instructions, and modules stored in the memory 802, the processor 801 executes the various functional applications and data processing of the server, i.e., implements the target detection method of the above method embodiments.
The memory 802 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to use of the electronic device, and the like. Further, the memory 802 may include high speed random access memory and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 802 optionally includes memory located remotely from the processor 801, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the target detection method may further include: an input device 803 and an output device 804. The processor 801, the memory 802, the input device 803, and the output device 804 may be connected by a bus or other means, and are exemplified by a bus in fig. 13.
The input device 803 may receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device, such as a touch screen, keypad, mouse, track pad, touch pad, pointer stick, one or more mouse buttons, track ball, joystick, or other input device. The output devices 804 may include a display device, auxiliary lighting devices (e.g., LEDs), and haptic feedback devices (e.g., vibrating motors), among others. The display device may include, but is not limited to, a Liquid Crystal Display (LCD), a Light Emitting Diode (LED) display, and a plasma display. In some implementations, the display device can be a touch screen.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, ASICs (application-specific integrated circuits), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special- or general-purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
These computer programs (also known as programs, software applications, or code) include machine instructions for a programmable processor, and may be implemented using high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages. As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
With the method and device, a to-be-processed picture containing a target text can be obtained in response to acquisition processing and input into a pre-trained target detection network to obtain at least two candidate detection boxes containing the target text and partial texts of the target text, where a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence. The at least two candidate boxes can then be classified and screened according to a classification index computed from their overlap region, so that the boxes containing the target text and the boxes containing only partial texts are identified; removing the latter yields the target detection box, improving the target detection accuracy for text information.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders; this is not limited herein as long as the desired results of the technical solutions disclosed in the present application can be achieved.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (18)

1. A target detection method, characterized in that the method comprises:
obtaining, in response to acquisition processing, a to-be-processed picture containing a target text;
inputting the picture to be processed into a pre-trained target detection network to obtain at least two candidate detection boxes, containing the target text and partial texts of the target text respectively, wherein a partial text is a text region that overlaps the target text wholly or partially and includes at least one line of a text sequence;
and computing a classification index from the overlapping area of the at least two candidate detection boxes, and classifying and screening the at least two candidate detection boxes according to the classification index to remove the candidate detection boxes containing only partial texts of the target text and obtain the target detection box.
2. The method of claim 1, further comprising:
extracting, from the at least two candidate detection boxes, a first detection box containing the target text and a second detection box containing partial text of the target text;
performing an overlap region operation on the overlap region of the first detection box and the second detection box to obtain an overlap region index;
and taking the overlap region index as the classification index.
3. The method of claim 2, wherein performing the overlap region operation on the overlap region of the first detection box and the second detection box to obtain the overlap region index comprises:
taking the overlapping area of the first detection box and the second detection box as an intersection;
taking the box area of the second detection box as a union;
and performing the overlap region operation on the intersection and the union to obtain the overlap region index.
4. The method according to claim 2 or 3, wherein classifying and screening the at least two candidate detection boxes according to the classification index to remove the candidate detection boxes containing partial text within the target text and obtain the target detection box comprises:
comparing the classification index with a preset threshold used for the classification and screening to obtain a comparison result;
and if the comparison result indicates that a candidate detection box whose classification index is greater than the preset threshold is the second detection box, deleting the second detection box from the at least two candidate detection boxes to obtain the target detection box.
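A small numeric check of this comparison, reusing the overlap_index helper from the earlier sketch; the 0.9 threshold is an assumed value, since the claim only calls it preset:

first = (0.0, 0.0, 100.0, 40.0)   # first detection box: the whole text line
second = (10.0, 5.0, 60.0, 20.0)  # second detection box: part of the line
is_second_box = overlap_index(first, second) > 0.9
# is_second_box is True here (the index is 1.0), so this candidate is
# deleted and the first detection box remains as the target detection box.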
5. The method of claim 2 or 3, wherein the overlap region comprises:
a first region obtained by full overlap of the first detection box and the second detection box; and/or
a second region obtained by partial overlap of the first detection box and the second detection box.
6. The method according to claim 2 or 3, wherein extracting, from the at least two candidate detection boxes, the first detection box containing the target text and the second detection box containing partial text within the target text comprises:
extracting, from the at least two candidate detection boxes, a first selection box with the highest confidence among the candidate first detection boxes;
and obtaining a second comparison box closest to the first selection box according to the distance between the first selection box and each surrounding candidate second detection box;
correspondingly, performing the overlap-region operation on the overlap region of the first detection box and the second detection box to obtain the overlap-region index comprises:
performing the overlap-region operation according to the first selection box and the second comparison box to obtain the overlap-region index.
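A sketch of this pairing step, under the assumption, not stated in the claim, that the distance is measured between box centers; the helper names are illustrative:

import math
from typing import List, Tuple

Box = Tuple[float, float, float, float]

def center(box: Box) -> Tuple[float, float]:
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)

def nearest_second_box(first_selection: Box,
                       second_candidates: List[Box]) -> Box:
    # Return the surrounding candidate second box closest to the first
    # selection box; this pair then feeds the overlap-region operation.
    fx, fy = center(first_selection)
    return min(second_candidates,
               key=lambda b: math.hypot(center(b)[0] - fx,
                                        center(b)[1] - fy))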
7. The method of claim 1, wherein inputting the picture to be processed into the pre-trained target detection network further comprises:
performing feature extraction on the target text in the picture to be processed according to the pre-trained target detection network to obtain texture features of the target text;
wherein the texture features comprise at least one feature used for layout analysis of vertical-form calculation, diagonal-form calculation, and equation calculation.
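The disclosure does not specify how these texture features are extracted, so the following PyTorch module is only a generic illustration, with three assumed layout classes corresponding to vertical-form, diagonal-form, and equation calculations:

import torch
import torch.nn as nn

class TextureFeatureExtractor(nn.Module):
    def __init__(self, num_layout_classes: int = 3):
        super().__init__()
        # Small convolutional backbone whose pooled activations stand in
        # for the texture features of the target text.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.layout_head = nn.Linear(64, num_layout_classes)

    def forward(self, picture: torch.Tensor) -> torch.Tensor:
        features = self.backbone(picture).flatten(1)  # texture features
        return self.layout_head(features)             # layout-analysis logits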
8. The method of claim 1, further comprising:
taking picture data to be processed containing the target text as training sample data;
performing a loss operation according to the training sample data and labeled data used for training the target detection network to obtain a loss function;
and training the target detection network by back propagation of the loss function to obtain a trained target detection network, and taking the trained target detection network as the pre-trained target detection network.
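This training step can be sketched in PyTorch as follows; the actual loss operation is not specified, so a placeholder criterion is used, and the model and data loader are assumed to exist:

import torch

def train_detection_network(model: torch.nn.Module, loader,
                            epochs: int = 10, lr: float = 1e-3):
    criterion = torch.nn.MSELoss()  # placeholder for the unspecified loss
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for samples, labels in loader:  # training samples and labeled data
            loss = criterion(model(samples), labels)  # the loss operation
            optimizer.zero_grad()
            loss.backward()   # back propagation of the loss function
            optimizer.step()
    return model  # the trained, i.e. pre-trained, target detection network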
9. A target detection apparatus, characterized in that the apparatus comprises:
a picture acquisition module, configured to obtain, in response to an acquisition process, a picture to be processed containing a target text;
a candidate detection box obtaining module, configured to input the picture to be processed into a pre-trained target detection network to obtain at least two candidate detection boxes containing the target text and partial text within the target text, wherein the partial text is: a text region that overlaps the target text in whole or in part and includes at least one line of a text sequence;
and a target detection module, configured to compute a classification index according to an overlap region of the at least two candidate detection boxes, and to classify and screen the at least two candidate detection boxes according to the classification index to remove the candidate detection boxes containing partial text within the target text, thereby obtaining a target detection box.
10. The apparatus of claim 9, further comprising:
a detection box extraction module, configured to extract, from the at least two candidate detection boxes, a first detection box containing the target text and a second detection box containing partial text within the target text;
and an overlap processing module, configured to:
perform an overlap-region operation on the overlap region of the first detection box and the second detection box to obtain an overlap-region index;
and take the overlap-region index as the classification index.
11. The apparatus of claim 10, wherein the overlap processing module is configured to:
take the overlap region of the first detection box and the second detection box as an intersection;
take the box area of the second detection box as a union;
and perform the overlap-region operation according to the intersection and the union to obtain the overlap-region index.
12. The apparatus of claim 10 or 11, wherein the target detection module is configured to:
compare the classification index with a preset threshold used for the classification and screening to obtain a comparison result;
and if the comparison result indicates that a candidate detection box whose classification index is greater than the preset threshold is the second detection box, delete the second detection box from the at least two candidate detection boxes to obtain the target detection box.
13. The apparatus of claim 10 or 11, wherein the overlap region comprises:
a first region obtained by full overlap of the first detection box and the second detection box; and/or
a second region obtained by partial overlap of the first detection box and the second detection box.
14. The apparatus of claim 10 or 11, wherein the detection box extraction module is configured to:
extract, from the at least two candidate detection boxes, a first selection box with the highest confidence among the candidate first detection boxes;
and obtain a second comparison box closest to the first selection box according to the distance between the first selection box and each surrounding candidate second detection box;
and the overlap processing module is configured to:
perform the overlap-region operation according to the first selection box and the second comparison box to obtain the overlap-region index.
15. The apparatus of claim 9, wherein the candidate detection box obtaining module comprises a feature extraction sub-module configured to:
perform feature extraction on the target text in the picture to be processed according to the pre-trained target detection network to obtain texture features of the target text;
wherein the texture features comprise at least one feature used for layout analysis of vertical-form calculation, diagonal-form calculation, and equation calculation.
16. The apparatus of claim 9, further comprising a training module configured to:
take picture data to be processed containing the target text as training sample data;
perform a loss operation according to the training sample data and labeled data used for training the target detection network to obtain a loss function;
and train the target detection network by back propagation of the loss function to obtain a trained target detection network, and take the trained target detection network as the pre-trained target detection network.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-8.
18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.
CN202010853517.0A 2020-08-24 2020-08-24 Target detection method and device, electronic equipment and storage medium Pending CN111738263A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010853517.0A CN111738263A (en) 2020-08-24 2020-08-24 Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN111738263A 2020-10-02

Family

ID=72658674

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010853517.0A Pending CN111738263A (en) 2020-08-24 2020-08-24 Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111738263A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190102646A1 (en) * 2017-10-02 2019-04-04 Xnor.ai Inc. Image based object detection
CN108009544A (en) * 2017-12-13 2018-05-08 北京小米移动软件有限公司 Object detection method and device
CN109993040A (en) * 2018-01-03 2019-07-09 北京世纪好未来教育科技有限公司 Text recognition method and device
CN108268869A (en) * 2018-02-13 2018-07-10 北京旷视科技有限公司 Object detection method, apparatus and system
CN109635718A (en) * 2018-12-10 2019-04-16 科大讯飞股份有限公司 A kind of text filed division methods, device, equipment and storage medium
CN109766919A (en) * 2018-12-18 2019-05-17 通号通信信息集团有限公司 Cascade the gradual change type Classification Loss calculation method and system in object detection system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RAYRYENG: "Intersection over union but replacing the union with the minimum area in MATLAB", stackoverflow.com *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112419263A (en) * 2020-11-20 2021-02-26 上海电力大学 Multi-class non-maximum inhibition method and system based on inter-class coverage ratio
CN113095257A (en) * 2021-04-20 2021-07-09 上海商汤智能科技有限公司 Abnormal behavior detection method, device, equipment and storage medium
CN113255648A (en) * 2021-06-21 2021-08-13 北博(厦门)智能科技有限公司 Sliding window framing method based on image recognition and terminal
CN113255648B (en) * 2021-06-21 2023-12-19 北博(厦门)智能科技有限公司 Sliding window frame selection method and terminal based on image recognition
CN115171110A (en) * 2022-06-30 2022-10-11 北京百度网讯科技有限公司 Text recognition method, apparatus, device, medium, and product
CN115171110B (en) * 2022-06-30 2023-08-22 北京百度网讯科技有限公司 Text recognition method and device, equipment, medium and product
CN116563521A (en) * 2023-04-14 2023-08-08 依未科技(北京)有限公司 Detection frame processing method and device for target detection and electronic equipment
CN116563521B (en) * 2023-04-14 2024-04-23 依未科技(北京)有限公司 Detection frame processing method and device for target detection and electronic equipment

Similar Documents

Publication Publication Date Title
EP4044117A1 (en) Target tracking method and apparatus, electronic device, and computer-readable storage medium
Geetha et al. Machine vision based fire detection techniques: A survey
CN111738263A (en) Target detection method and device, electronic equipment and storage medium
CN109145766B (en) Model training method and device, recognition method, electronic device and storage medium
US9098888B1 (en) Collaborative text detection and recognition
CN111833340A (en) Image detection method, image detection device, electronic equipment and storage medium
CN112990204B (en) Target detection method and device, electronic equipment and storage medium
CN112990203B (en) Target detection method and device, electronic equipment and storage medium
CN112149636A (en) Method, apparatus, electronic device and storage medium for detecting target object
CN108229481B (en) Screen content analysis method and device, computing equipment and storage medium
CN110705460A (en) Image category identification method and device
CN113537374B (en) Method for generating countermeasure sample
CN112883902B (en) Video detection method and device, electronic equipment and storage medium
CN112784760B (en) Human behavior recognition method, device, equipment and storage medium
CN112288699B (en) Method, device, equipment and medium for evaluating relative definition of image
CN111178323A (en) Video-based group behavior identification method, device, equipment and storage medium
CN111783639A (en) Image detection method and device, electronic equipment and readable storage medium
CN111783606A (en) Training method, device, equipment and storage medium of face recognition network
CN112561879A (en) Ambiguity evaluation model training method, image ambiguity evaluation method and device
CN114169425B (en) Training target tracking model and target tracking method and device
CN111950345A (en) Camera identification method and device, electronic equipment and storage medium
Singh et al. Combination of Kullback–Leibler divergence and Manhattan distance measures to detect salient objects
CN112561053B (en) Image processing method, training method and device of pre-training model and electronic equipment
CN113627298A (en) Training method of target detection model and method and device for detecting target object
CN113643260A (en) Method, apparatus, device, medium and product for detecting image quality

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201002