CN112949589A - Target detection method, device, equipment and computer readable storage medium - Google Patents

Info

Publication number
CN112949589A
CN112949589A
Authority
CN
China
Prior art keywords
target object
bounding box
boundary
image
obtaining
Prior art date
Legal status
Pending
Application number
CN202110349799.5A
Other languages
Chinese (zh)
Inventor
程龙
梁鼎
Current Assignee
Shenzhen Sensetime Technology Co Ltd
Original Assignee
Shenzhen Sensetime Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Sensetime Technology Co Ltd filed Critical Shenzhen Sensetime Technology Co Ltd
Priority to CN202110349799.5A priority Critical patent/CN112949589A/en
Publication of CN112949589A publication Critical patent/CN112949589A/en
Priority to PCT/CN2021/121279 priority patent/WO2022205816A1/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/412 Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/255 Detecting or recognising potential candidate objects based on visual cues, e.g. shapes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/414 Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Graphics (AREA)
  • Geometry (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Image Analysis (AREA)

Abstract

Disclosed are a target detection method, apparatus, device, and computer-readable storage medium. The method includes: acquiring feature information of an image to be detected; obtaining a first bounding box of a target object according to the feature information; obtaining a second bounding box based on the feature information and the first bounding box; and determining a region to be recognized according to the image region corresponding to the second bounding box, where the vertices of the second bounding box are located on the first bounding box and correspond to the corner points of the target object.

Description

Target detection method, device, equipment and computer readable storage medium
Technical Field
The present disclosure relates to computer vision technologies, and in particular, to a target detection method, apparatus, device, and computer-readable storage medium.
Background
In ticket and document detection and recognition tasks, mixed pasting of tickets is very common: tickets of the same type, tickets of different types, and tickets with different orientations may be pasted together. Differences in ticket type and orientation angle, as well as mutual occlusion between tickets, make detecting and recognizing such mixed-pasted tickets very difficult. A method that can effectively detect and recognize mixed-pasted tickets is therefore needed.
Disclosure of Invention
The embodiment of the disclosure provides a target detection scheme.
According to an aspect of the present disclosure, there is provided a target detection method, the method including: acquiring feature information of an image to be detected; obtaining a first bounding box of a target object according to the feature information; and obtaining a second bounding box based on the feature information and the first bounding box, and determining a region to be recognized according to the image region corresponding to the second bounding box, where the vertices of the second bounding box are located on the first bounding box and correspond to the corner points of the target object.
In combination with any embodiment provided by the present disclosure, the obtaining a second bounding box based on the feature information and the first bounding box includes: determining offsets of the corner points of the target object relative to the first bounding box based on the feature information and the first bounding box; obtaining position information of the corner points of the target object according to the first bounding box and the offsets; and obtaining the second bounding box according to the position information of the corner points.
In combination with any embodiment provided by the present disclosure, the method further includes: obtaining direction information of the target object according to the feature information and the position information of the corner points, where the direction information is characterized by one or more edges of the second bounding box formed by the corner points of the target object.
In combination with any embodiment provided by the present disclosure, the determining a region to be recognized according to the image region corresponding to the second bounding box includes: performing an affine transformation on the second bounding box according to the direction information of the target object to obtain a rectified second bounding box; and determining the image region corresponding to the rectified second bounding box as the region to be recognized.
In combination with any embodiment provided by the present disclosure, the method further includes: performing text recognition on the region to be recognized to obtain a text recognition result.
In combination with any embodiment provided by the present disclosure, the method further includes: obtaining a visual feature of the region to be recognized; respectively determining the similarity between the visual feature and the feature of each of a plurality of pre-acquired category objects; and determining a classification result of the target object according to the similarities.
In combination with any embodiment provided by the present disclosure, the respectively determining the similarity between the visual feature and the feature of each of the plurality of pre-acquired category objects includes: respectively acquiring the Euclidean distance between the visual feature and the feature of each of the plurality of category objects; and the determining a classification result of the target object according to the similarities includes: determining the classification result of the target object according to the object category corresponding to the smallest of the Euclidean distances.
In combination with any embodiment provided by the present disclosure, the target detection method is performed using a neural network, the neural network including a feature extraction network for obtaining the feature information of the image to be detected, a first detection network for obtaining the first bounding box of the target object, and a second detection network for obtaining the second bounding box of the target object; the method further includes: performing end-to-end training on the feature extraction network, the first detection network, and the second detection network using a sample image, where the sample image is labeled with the first bounding box of a target object and the corner points of the target object.
In combination with any embodiment provided by the present disclosure, the sample image is further labeled with direction information of the target object, the direction being characterized by one of the edges of the target object.
According to an aspect of the present disclosure, there is provided a target detection apparatus, the apparatus including: a first acquisition unit, configured to acquire feature information of an image to be detected; a second acquisition unit, configured to obtain a first bounding box of a target object according to the feature information; and a determining unit, configured to obtain a second bounding box based on the feature information and the first bounding box, and determine a region to be recognized according to the image region corresponding to the second bounding box, where the vertices of the second bounding box are located on the first bounding box and correspond to the corner points of the target object.
In combination with any embodiment provided by the present disclosure, when configured to obtain the second bounding box based on the feature information and the first bounding box, the determining unit is specifically configured to: determine offsets of the corner points of the target object relative to the first bounding box based on the feature information and the first bounding box; obtain position information of the corner points according to the first bounding box and the offsets; and obtain the second bounding box according to the position information of the corner points.
In combination with any one of the embodiments provided by the present disclosure, the apparatus further includes: a direction obtaining unit, configured to obtain direction information of the target object according to the feature information and the position information of the corner points, where the direction information is characterized by one or more edges of the second bounding box formed by the corner points of the target object.
In combination with any embodiment provided by the present disclosure, when configured to determine the region to be recognized according to the image region corresponding to the second bounding box, the determining unit is specifically configured to: perform an affine transformation on the second bounding box according to the direction information of the target object to obtain a rectified second bounding box; and determine the image region corresponding to the rectified second bounding box as the region to be recognized.
In combination with any embodiment provided by the present disclosure, the apparatus further includes a text recognition unit, configured to perform text recognition on the region to be recognized, so as to obtain a text recognition result.
In combination with any one of the embodiments provided by the present disclosure, the apparatus further includes a classification unit, configured to: obtain a visual feature of the region to be recognized; respectively determine the similarity between the visual feature and the feature of each of a plurality of pre-acquired category objects; and determine a classification result of the target object according to the similarities.
In combination with any one of the embodiments provided in the disclosure, when configured to determine the similarity between the visual feature and the feature of each of the plurality of pre-acquired category objects, the classification unit is specifically configured to: respectively acquire the Euclidean distance between the visual feature and the feature of each of the plurality of category objects; and determine the classification result of the target object according to the object category corresponding to the smallest of the Euclidean distances.
In combination with any one of the embodiments provided by the present disclosure, the target detection method is performed by using a neural network, the neural network including a feature extraction network for obtaining feature information of an image to be detected, a first detection network for obtaining a first bounding box for a target object, and a second detection network for obtaining a second bounding box for the target object; the device further comprises a training unit, configured to perform end-to-end training on the feature extraction network, the first detection network, and the second detection network by using a sample image, where the sample image is labeled with a first bounding box of a target object and a corner of the target object.
In combination with any one of the embodiments provided by the present disclosure, the sample image is further labeled with direction information of a target object, and the direction is characterized by one of the edges of the target object.
According to an aspect of the present disclosure, there is provided an electronic device, the device comprising a memory for storing computer instructions executable on a processor, the processor being configured to perform the method according to any one of the embodiments of the present disclosure.
According to an aspect of the present disclosure, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any one of the embodiments of the present disclosure.
According to an aspect of the present disclosure, there is provided a computer program product comprising a computer program, characterized in that the computer program, when executed by a processor, implements the method of any of the embodiments of the present disclosure.
According to the target detection method, apparatus, device, and computer-readable storage medium provided by the embodiments of the present disclosure, a first detection result for a target object is first obtained according to the feature information of an image to be detected; when the first detection result includes a first bounding box of the target object, the offsets of the target object's corner points relative to the first bounding box are obtained according to the feature information and the first bounding box, so that a more accurate bounding box of the target object can be obtained. In a scene where target objects of multiple angles and types are mixed together, each target object can thus be accurately detected and segmented, which facilitates subsequent text recognition.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present specification and, together with the description, serve to explain the principles of the specification.
Fig. 1 is a flowchart of a target detection method proposed by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram of target detection using a neural network according to an embodiment of the present disclosure;
Fig. 3 is a schematic diagram of the offsets of the corner points of a target object relative to the first bounding box in the target detection method proposed by an embodiment of the present disclosure;
fig. 4 is a schematic diagram of a detection result obtained by the target detection method provided by the embodiment of the disclosure;
FIG. 5 is a flow chart of a method for classifying objects according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of an object detection device in accordance with an implementation of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It is to be understood that although the terms first, second, third, etc. may be used herein to describe various information, such information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of the present disclosure. The word "if" as used herein may be interpreted as "when" or "upon" or "in response to determining", depending on the context.
At least one embodiment of the present disclosure provides a target detection method, which may be performed by an electronic device such as a terminal device or a server. The terminal device may be user equipment (UE), a mobile device, a user terminal, a cellular phone, a cordless phone, a personal digital assistant (PDA), a handheld device, a computing device, an in-vehicle device, a wearable device, or the like, and the method may be implemented by a processor calling computer-readable instructions stored in a memory.
FIG. 1 illustrates a target detection method according to at least one embodiment of the present disclosure. As shown in FIG. 1, the method includes steps 101 to 103.
In step 101, feature information of an image to be detected is acquired.
In some embodiments, the image to be detected may be an image containing at least one paper carrier, such as a ticket, a document, or a photograph. When multiple paper carriers are contained, they may be of different types and mixed-pasted at different angles. It will be appreciated by those skilled in the art that the image to be detected may also contain other types of paper carriers, not limited to those described above.
In one example, the feature information of the image to be detected may be extracted by a feature extraction network, such as a convolutional neural network; the specific structure of the feature extraction network is not limited by the embodiments of the present disclosure.
FIG. 2 is a schematic diagram of target detection using a neural network according to an embodiment of the present disclosure. As shown in FIG. 2, a feature extraction network 201 performs feature extraction on the image to be detected I_i to obtain feature information.
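By way of illustration, the following is a minimal sketch of such a feature extraction network in PyTorch. The layer configuration, channel counts, and stride are assumptions for the example only; as noted above, the embodiments do not limit the specific structure.

```python
# Minimal sketch of a convolutional feature extraction network
# (architecture is an assumption; the disclosure does not fix one).
import torch
import torch.nn as nn

class FeatureExtractor(nn.Module):
    def __init__(self, out_channels: int = 256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, out_channels, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (N, 3, H, W) -> feature map: (N, out_channels, H/8, W/8)
        return self.body(image)
```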
In step 102, a first bounding box of the target object is obtained according to the feature information.
In the embodiments of the present disclosure, a plurality of candidate regions may first be generated according to the feature information, for example using an RPN (Region Proposal Network) structure. Each candidate region may be represented by a candidate bounding box and has a confidence indicating whether a target object is present in the candidate region. A target candidate region may then be determined from the plurality of candidate regions according to their confidences, for example by Non-Maximum Suppression (NMS). The target candidate region may be indicated by the first bounding box of the target object; the first bounding box is typically the bounding rectangle of the target object, and the coordinates of its four vertices are known.
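As an illustration of the candidate-selection step, the following is a minimal sketch of standard Non-Maximum Suppression over candidate boxes with confidences; the (x1, y1, x2, y2) box layout and the threshold value are assumptions, since the disclosure names NMS without fixing an implementation.

```python
# Minimal sketch of Non-Maximum Suppression over candidate boxes.
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_thresh: float = 0.5):
    """boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,) confidences."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        # intersection of box i with the remaining boxes
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop heavily overlapping boxes
    return keep
```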
Referring to the schematic diagram of target detection using a neural network shown in FIG. 2, the first detection network 202 performs target detection on the feature information obtained in step 101 to obtain the first bounding box of the target object in the image to be detected I_i; that is, the first detection network 202 outputs the parameter information of the first bounding box of the target object, e.g., the coordinates of the four vertices of a rectangular first bounding box.
In step 103, a second bounding box is obtained based on the feature information and the first bounding box, and the region to be identified is determined according to the image region corresponding to the second bounding box.
Once the first bounding box of the target object is obtained, it may be refined according to the feature information to obtain the second bounding box of the target object. The vertices of the second bounding box are located on the first bounding box and correspond to the corner points of the target object; that is, the polygon formed by connecting the corner points of the target object in sequence matches the target object in shape, size, and position. The region to be recognized, used for subsequent text recognition, is determined according to the image region corresponding to the second bounding box.
In an embodiment of the present disclosure, the physical shape of the target object is a polygon. Taking a quadrilateral as an example, the four corner points of the target object are located on the four edges of the first bounding box respectively, and connecting the four corner points in clockwise or counterclockwise order yields the second bounding box of the target object.
Referring to the schematic diagram of target detection using a neural network shown in FIG. 2, the second detection network 203 may process the feature information obtained in step 101 together with the first bounding box output by the first detection network 202 to obtain the second bounding box of the target object. The second detection network 203 outputs the processed target image I_t, which contains the second bounding box of the target object and may also contain the first bounding box.
In the embodiments of the present disclosure, a first bounding box for a target object is obtained according to the feature information of the image to be detected; a second bounding box is obtained based on the feature information and the first bounding box; and the region to be recognized is determined according to the image region corresponding to the second bounding box. Since the vertices of the second bounding box are located on the first bounding box and correspond to the corner points of the target object, the second bounding box fits the shape of the target object more closely. Each target object can therefore be accurately detected and segmented in a scene where target objects of multiple angles and types are mixed together, which facilitates subsequent text recognition.
In some embodiments, the offsets of the corner points of the target object relative to the first bounding box may first be determined based on the feature information and the first bounding box; the position information of the corner points is then obtained according to the first bounding box and the offsets; and the second bounding box is obtained according to the position information of the corner points.
FIG. 3 is a schematic diagram of the offsets of the corner points of a target object relative to the first bounding box in the target detection method provided by the embodiments of the present disclosure. As shown in FIG. 3, for a target object whose physical shape is a quadrilateral, a rectangular first bounding box 301 is obtained in step 102, with four vertices t_1, t_2, t_3, t_4. In step 103, the predicted offsets of the corner points relative to the first bounding box 301 include the offset Δ_1 of corner point p_1 along the edge t_1-t_2, the offset Δ_2 of corner point p_2 along the edge t_2-t_3, the offset Δ_3 of corner point p_3 along the edge t_3-t_4, and the offset Δ_4 of corner point p_4 along the edge t_4-t_1. The position information of the corner points of the target object is obtained from the first detection result and the offsets of the corner points relative to the first bounding box, and from the corner positions a second bounding box that better matches the shape and pose of the target object, and fits it more closely, can be obtained.
As shown in FIG. 3, from the vertex coordinates of the first bounding box 301 and the offsets Δ_1, Δ_2, Δ_3, Δ_4, the positions of the four corner points p_1, p_2, p_3, p_4 of the target object can be determined, and the second bounding box 302 of the target object can thus be obtained.
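Using the notation of FIG. 3, the sketch below recovers the corner points p_1..p_4 from the vertices t_1..t_4 of the first bounding box and the offsets Δ_1..Δ_4. Parameterizing each offset as a fraction of the corresponding edge is an assumption for illustration; the disclosure only states that each corner lies on an edge of the first bounding box at the predicted offset.

```python
# Sketch: corner points of the second bounding box from rectangle
# vertices and per-edge offsets (offset parameterization is assumed).
import numpy as np

def corners_from_offsets(t: np.ndarray, deltas: np.ndarray) -> np.ndarray:
    """t: (4, 2) rectangle vertices t1..t4 in order; deltas: (4,) offsets
    in [0, 1], each measured along edge t_k -> t_{k+1}."""
    corners = []
    for k in range(4):
        start, end = t[k], t[(k + 1) % 4]
        corners.append(start + deltas[k] * (end - start))  # point on edge k
    return np.stack(corners)  # (4, 2): p1..p4, the second bounding box

# Example: axis-aligned first bounding box, corners halfway along each edge.
t = np.array([[0, 0], [100, 0], [100, 60], [0, 60]], dtype=float)
p = corners_from_offsets(t, np.array([0.5, 0.5, 0.5, 0.5]))
```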
In the embodiments of the present disclosure, by obtaining a second bounding box that fits the target object more closely, background image content that does not belong to the target object is kept out of the detection result, which improves the accuracy of target detection.
In some embodiments, direction information of the target object may further be obtained according to the feature information and the position information of the corner points, where the direction information may be characterized by one or more edges of the second bounding box formed by the corner points of the target object.
For target objects such as tickets and documents, the direction information of the target object is typically related to the direction of the text lines in the target object. For example, an edge that is consistent, or tends to be consistent, with the text-line direction may be determined as the edge indicating the direction information.
For a quadrilateral target object, the direction of the target object may be represented by one of the four edges of the quadrilateral second bounding box; that is, the direction of the target object essentially falls into one of four classes, and the direction information is the output of a four-way classification.
As shown in FIG. 3, the direction information of the target object may be characterized by one of the four edges p_1-p_2, p_2-p_3, p_3-p_4, p_4-p_1 of the second bounding box 302.
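A minimal sketch of how the four-way direction output could be turned into the direction edge follows; the interface (one score per edge, highest score wins) is an assumption consistent with the four-class description above, not a fixed design of the disclosure.

```python
# Sketch (assumed interface): the direction head outputs four scores,
# one per edge of the second bounding box; the highest-scoring edge is
# taken as the edge characterizing the text-line direction.
import numpy as np

def direction_edge(corners: np.ndarray, edge_scores: np.ndarray):
    """corners: (4, 2) points p1..p4; edge_scores: (4,) classifier output
    for the edges (p1-p2, p2-p3, p3-p4, p4-p1)."""
    k = int(np.argmax(edge_scores))
    return corners[k], corners[(k + 1) % 4]  # endpoints of the chosen edge
```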
FIG. 4 is a schematic diagram of a detection result obtained by the target detection method provided by the embodiments of the present disclosure, i.e., an enlarged view of the target image I_t in FIG. 2. As shown in FIG. 4, the image to be detected includes a mixed-pasted ticket region and a background region, where the background region is the image region other than the tickets, such as the region indicated by 400 in FIG. 4. From the feature information obtained by feature extraction on the image in FIG. 4, a rectangular first bounding box can be obtained for each ticket, as indicated by 401. As can be seen from the figure, because the tickets are tilted, the rectangular first bounding box includes background image content besides the ticket, which makes subsequent text detection and recognition difficult. In the embodiments of the present disclosure, the offsets of the corner points of each ticket are obtained from the feature information and the first bounding box of the ticket to determine the corner positions, yielding a second bounding box formed by the ticket's corner points that matches its shape, as indicated by 402. The second bounding box fits the ticket edges more closely and contains less background, which benefits subsequent text detection and recognition. The detection result further includes direction information for each ticket, characterized by one of the edges of the second bounding box, such as the line indicated by 403, which represents the direction of the ticket in the second bounding box 402; that is, the text direction of the ticket in the second bounding box 402 is consistent, or tends to be consistent, with the line 403.
In some embodiments, an affine transformation may be performed on the second bounding box according to the direction information of the target object to obtain a rectified second bounding box, and the image region corresponding to the rectified second bounding box is determined as the region to be recognized, on which text recognition is then performed to obtain a text recognition result. The affine transformation according to the direction information of the target object transforms the second bounding box into a regularly shaped, forward-facing bounding box, where forward-facing means that the edge carrying the direction information is parallel or perpendicular to the edges of the image to be detected.
In the embodiments of the present disclosure, text recognition is performed on the region to be recognized, i.e., on a partial image region of the whole image to be detected, which reduces the amount of data processed in text recognition and improves its efficiency.
Take the direction detection result shown in FIG. 4 as an example. Where the target object is a ticket, i.e., its physical shape is a rectangle, the second bounding box of the ticket may be a quadrilateral of arbitrary shape, as indicated by 402, with one edge 403 of the second bounding box indicating the ticket's direction information. Performing the affine transformation on the second bounding box according to the direction information of edge 403 rectifies the second bounding box into a rectangular bounding box whose direction edge is parallel to the horizontal edge of the image to be detected.
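The following sketch illustrates this rectification step. The disclosure describes it as an affine transformation driven by the direction edge; for a second bounding box that is an arbitrary quadrilateral, the general warp is a perspective transform, which is what OpenCV's getPerspectiveTransform/warpPerspective compute here. The use of OpenCV and the fixed output size are assumptions for the example.

```python
# Sketch: warp the quadrilateral second bounding box to an upright
# rectangle whose top edge is the direction edge (tooling assumed).
import cv2
import numpy as np

def rectify(image, corners, width: int, height: int):
    """corners: (4, 2) float points p1..p4 ordered so that edge p1-p2
    carries the direction information (it becomes the top edge)."""
    src = np.asarray(corners, dtype=np.float32)
    dst = np.array([[0, 0], [width, 0], [width, height], [0, height]],
                   dtype=np.float32)
    M = cv2.getPerspectiveTransform(src, dst)
    # The warped patch is the rectified region to be recognized.
    return cv2.warpPerspective(image, M, (width, height))
```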
In the embodiments of the present disclosure, because the direction information of the target object is indicated by an edge of the second bounding box, the second bounding box can be rectified efficiently according to the direction information. Moreover, the second bounding box obtained by the target detection method provided by the embodiments of the present disclosure contains little background, so the image region corresponding to the rectified second bounding box contains little image information irrelevant to the ticket; performing text detection and recognition on such an image region therefore yields more accurate results.
In the related art, a neural network is usually trained with sample images of preset categories to classify the target objects detected in an image to be detected. With this approach, however, the neural network cannot recognize, or cannot accurately classify, target objects of untrained categories or of categories with little sample data.
In order to solve the above problem, an embodiment of the present disclosure provides a method for classifying a target object. FIG. 5 shows a flow chart of the method, comprising steps 501-503.
In step 501, the visual features of the image region corresponding to the second bounding box are obtained.
In the embodiments of the present disclosure, a visual feature represented in the form of a feature vector may be obtained according to the feature information corresponding to the second bounding box. For target objects of the ticket or document type, the visual features may include semantics-related features and position-related features of text blocks, where a text block refers to one or more lines of text contained in the target object.
In step 502, the similarity between the visual feature and the feature of each of the plurality of pre-acquired category objects is determined, respectively.
In the embodiments of the present disclosure, the features of the plurality of pre-acquired category objects may be stored as feature templates in a gallery (base library). By comparing each feature template in the gallery with the visual feature to be processed and determining the similarity between them, it can be determined which category's feature template the target object is closest to in visual properties such as semantics and text distribution.
In step 503, a classification result of the target object is determined according to the similarity.
In one example, the similarity between the visual feature and each feature template may be measured by computing the Euclidean distance between the visual feature to be processed and the feature template, and the object category corresponding to the feature template with the smallest distance is determined as the classification result of the target object.
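A minimal sketch of this template-matching classification follows: it computes the Euclidean distance from the query visual feature to each feature template in the gallery and takes the nearest category. The array layout and names are assumptions for illustration.

```python
# Sketch: nearest-template classification by Euclidean distance.
import numpy as np

def classify(feature: np.ndarray, templates: np.ndarray, labels: list):
    """feature: (D,) visual feature; templates: (K, D) one feature
    template per category; labels: K category names."""
    dists = np.linalg.norm(templates - feature, axis=1)  # Euclidean
    k = int(np.argmin(dists))
    return labels[k], float(dists[k])

# Example: three (hypothetical) ticket categories in the gallery.
gallery = np.random.rand(3, 128).astype(np.float32)
label, dist = classify(np.random.rand(128).astype(np.float32),
                       gallery, ["train", "taxi", "hotel"])
```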
Taking ticket classification as an example, by obtaining the visual feature of a ticket and computing its similarity to the feature templates of each ticket type stored in the gallery, the ticket type corresponding to the most similar feature template can be determined as the final classification result. This method can classify tickets of any type, enabling efficient ticket classification. Moreover, the target detection method provided by the embodiments of the present disclosure can simultaneously detect, rectify, and classify arbitrary mixed-pasted tickets.
When the image to be detected contains mixed-pasted tickets, the target detection method of any embodiment of the present disclosure can obtain the second bounding box of each ticket in the image and determine the region to be recognized of each ticket according to the corresponding image region. Performing text recognition on each ticket's region to be recognized yields a text recognition result for each ticket, and the text recognition result may contain structured text.
In some embodiments, when the mixed-pasted tickets include at least two tickets of the same type, the text recognition results of the same-type tickets can be output together in a structured form.
Taking the image of mixed-pasted tickets shown in FIG. 4 as an example, FIG. 4 contains four tickets A, B, C, D, where tickets C and D are of the same type. The text recognition results of tickets C and D may be output together as structured text, for example in the form of a table. Taking tickets C and D to be train tickets, the output text recognition result includes the following table.
TABLE 1

                     Target object C    Target object D
Departure station    XX                 XX
Arrival station      XX                 XX
Train number         XX                 XX
Ticket price         XX                 XX
...                  ...                ...
In some embodiments, when the mixed-pasted tickets include at least two tickets of different types, an association between the different types of tickets may be established when outputting the text recognition results, so that the results of the different ticket types are linked.
Again taking the image of mixed-pasted tickets shown in FIG. 4 as an example, tickets A and B are of different types. Assuming ticket A is an air transport electronic itinerary and ticket B is a hotel accommodation receipt, the output text recognition results of tickets A and B can be associated under the topic "XXX travel bill" according to the name recognized in the air transport electronic itinerary.
In some embodiments, the structured text recognition result of each ticket among the mixed-pasted tickets may be output independently.
It will be understood by those skilled in the art that the present disclosure does not limit the output form of the text recognition results of mixed-pasted tickets; those skilled in the art may choose the output form according to the types of tickets contained in the image to be detected or the intended use of the text recognition results.
The target detection method proposed by the embodiments of the present disclosure may be performed using a neural network. Referring to FIG. 2, the neural network includes a feature extraction network 201 for acquiring the feature information of the image to be detected, a first detection network 202 for acquiring the first detection result (the first bounding box) for a target object, and a second detection network 203 for acquiring the offsets of the target object's corner points relative to the first bounding box. The feature extraction network 201, the first detection network 202, and the second detection network 203 may be trained end to end using sample images, where each sample image is labeled with the first bounding box of a target object and with the coordinates of the target object's corner points in sequence. For example, for a target object with a quadrilateral physical shape, the coordinates of the four corner points may be labeled clockwise starting from the top-left corner point, so that the offsets of the corner points relative to the first bounding box and the direction of the target object in the sample image can be computed. Target detection performed on an image to be detected with the trained neural network then yields the offsets of the corner points of the target object relative to the first bounding box.
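As an illustration of how the corner annotations could be turned into regression targets for the second detection network, the sketch below projects each labeled corner onto its edge of the first bounding box and normalizes by the edge length. This target encoding is an assumption consistent with the offsets of FIG. 3, not a parameterization fixed by the disclosure.

```python
# Sketch: offset regression targets from labeled corner coordinates
# (corners labeled clockwise from the top-left, as described above).
import numpy as np

def offset_targets(bbox_vertices: np.ndarray, corners: np.ndarray):
    """bbox_vertices: (4, 2) rectangle vertices t1..t4, clockwise;
    corners: (4, 2) labeled corner points p1..p4, with corner k lying on
    edge t_k -> t_{k+1}. Returns (4,) normalized offsets in [0, 1]."""
    deltas = []
    for k in range(4):
        start, end = bbox_vertices[k], bbox_vertices[(k + 1) % 4]
        edge = end - start
        # project the corner onto the edge and normalize by edge length
        deltas.append(np.dot(corners[k] - start, edge) / np.dot(edge, edge))
    return np.asarray(deltas)
```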
In some embodiments, the sample image is further labeled with the direction information of the target object; that is, one of the edges of the second bounding box of the target object in the sample image is labeled to indicate its orientation. By training the neural network with sample images labeled with direction information and performing target detection on an image to be detected with the trained network, the direction information of the target objects in the image to be detected can also be obtained.
FIG. 6 is a schematic diagram of a target detection apparatus according to an implementation of the present disclosure. As shown in FIG. 6, the apparatus includes: a first obtaining unit 601, configured to obtain feature information of an image to be detected; a second obtaining unit 602, configured to obtain a first bounding box of a target object according to the feature information; and a determining unit 603, configured to obtain a second bounding box based on the feature information and the first bounding box, and determine a region to be recognized according to the image region corresponding to the second bounding box, where the vertices of the second bounding box are located on the first bounding box and correspond to the corner points of the target object.
In some embodiments, when configured to obtain the second bounding box based on the feature information and the first bounding box, the determining unit is specifically configured to: determine offsets of the corner points of the target object relative to the first bounding box based on the feature information and the first bounding box; obtain position information of the corner points according to the first bounding box and the offsets; and obtain the second bounding box according to the position information of the corner points.
In some embodiments, the apparatus further includes: a direction obtaining unit, configured to obtain direction information of the target object according to the feature information and the position information of the corner points, where the direction information is characterized by one or more edges of the second bounding box formed by the corner points of the target object.
In some embodiments, when configured to determine the region to be recognized according to the image region corresponding to the second bounding box, the determining unit is specifically configured to: perform an affine transformation on the second bounding box according to the direction information of the target object to obtain a rectified second bounding box; and determine the image region corresponding to the rectified second bounding box as the region to be recognized.
In some embodiments, the apparatus further includes a text recognition unit, configured to perform text recognition on the region to be recognized, so as to obtain a text recognition result.
In some embodiments, the apparatus further includes a classification unit, configured to: obtain a visual feature of the region to be recognized; respectively determine the similarity between the visual feature and the feature of each of a plurality of pre-acquired category objects; and determine a classification result of the target object according to the similarities.
In some embodiments, when configured to determine the similarity between the visual feature and the feature of each of the plurality of pre-acquired category objects, the classification unit is specifically configured to: respectively acquire the Euclidean distance between the visual feature and the feature of each of the plurality of category objects; and determine the classification result of the target object according to the object category corresponding to the smallest of the Euclidean distances.
In some embodiments, the target detection method is performed using a neural network, the neural network including a feature extraction network for obtaining the feature information of the image to be detected, a first detection network for obtaining the first bounding box of the target object, and a second detection network for obtaining the second bounding box of the target object; the apparatus further includes a training unit, configured to perform end-to-end training on the feature extraction network, the first detection network, and the second detection network using a sample image, where the sample image is labeled with the first bounding box of a target object and the corner points of the target object.
In some embodiments, the sample image is further labeled with direction information of a target object, the direction being characterized by one of the edges of the target object.
The present disclosure also provides an electronic device, please refer to fig. 7, which shows a structure of the device, where the device includes a memory for storing computer instructions executable on a processor, and the processor is configured to implement the method according to any embodiment of the present disclosure when executing the computer instructions.
The present disclosure also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the embodiments of the present disclosure.
The present disclosure also provides a computer program product comprising a computer program which, when executed by a processor, performs the method of any of the embodiments of the present disclosure.
As will be appreciated by one skilled in the art, one or more embodiments of the present description may be provided as a method, system, or computer program product. Accordingly, one or more embodiments of the present description may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, one or more embodiments of the present description may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The embodiments in the present specification are described in a progressive manner, and the same and similar parts among the embodiments are referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the data processing apparatus embodiment, since it is substantially similar to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to part of the description of the method embodiment.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the acts or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
Embodiments of the subject matter and the functional operations described in this specification can be implemented in: digital electronic circuitry, tangibly embodied computer software or firmware, computer hardware including the structures disclosed in this specification and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions, encoded on a tangible, non-transitory program carrier for execution by, or to control the operation of, data processing apparatus. Alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode and transmit information to suitable receiver apparatus for execution by the data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special purpose logic circuitry, e.g., an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit).
Computers suitable for executing computer programs include, for example, general and/or special purpose microprocessors, or any other type of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical disks, or optical disks. However, a computer does not necessarily have such a device. Moreover, a computer may be embedded in another device, e.g., a mobile telephone, a Personal Digital Assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device such as a Universal Serial Bus (USB) flash drive, to name a few.
Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media and memory devices, including by way of example semiconductor memory devices (e.g., EPROM, EEPROM, and flash memory devices), magnetic disks (e.g., an internal hard disk or a removable disk), magneto-optical disks, and CD ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special purpose logic circuitry.
While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. In other instances, features described in connection with one embodiment may be implemented as discrete components or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or variation of a subcombination.
Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In some cases, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.
Thus, particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. In some cases, the actions recited in the claims can be performed in a different order and still achieve desirable results. Further, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some implementations, multitasking and parallel processing may be advantageous.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (13)

1. A target detection method, the method comprising:
acquiring feature information of an image to be detected;
obtaining a first bounding box of a target object according to the feature information;
and obtaining a second bounding box based on the feature information and the first bounding box, and determining a region to be recognized according to the image region corresponding to the second bounding box, wherein the vertices of the second bounding box are located on the first bounding box and correspond to the corner points of the target object.
2. The method of claim 1, wherein deriving a second bounding box based on the feature information and the first bounding box comprises:
determining offsets of the corner points of the target object relative to the first bounding box based on the feature information and the first bounding box;
obtaining position information of the corner points of the target object according to the first bounding box and the offsets of the corner points relative to the first bounding box;
and obtaining the second bounding box according to the position information of the corner points.
3. The method of claim 2, further comprising: obtaining direction information of the target object according to the feature information and the position information of the corner points, wherein the direction information is characterized by one or more edges of the second bounding box formed by the corner points of the target object.
4. The method according to claim 3, wherein determining the region to be identified according to the image region corresponding to the second bounding box comprises:
performing an affine transformation on the second bounding box according to the direction information of the target object to obtain a corrected second bounding box; and
determining the image region corresponding to the corrected second bounding box as the region to be identified.
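A minimal OpenCV sketch of the rectification step in claims 3-4, assuming the direction information fixes a clockwise corner order starting at the top-left (the output size and the use of three corners for the affine fit are likewise assumptions; an affine map is exact only when the second bounding box is a parallelogram, which matches the claim's wording of an affine rather than perspective transformation):

    import cv2
    import numpy as np

    def rectify(image, corners, out_w, out_h):
        # corners: (4, 2) array ordered top-left, top-right, bottom-right,
        # bottom-left according to the direction information
        src = np.asarray(corners[:3], dtype=np.float32)  # 3 points define an affine map
        dst = np.float32([[0, 0], [out_w - 1, 0], [out_w - 1, out_h - 1]])
        M = cv2.getAffineTransform(src, dst)
        # Corrected second bounding box -> axis-aligned region to be identified
        return cv2.warpAffine(image, M, (out_w, out_h))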
5. The method according to any one of claims 1 to 4, further comprising:
performing text recognition on the region to be identified to obtain a text recognition result.
6. The method according to any one of claims 1 to 5, further comprising:
obtaining a visual feature of the region to be identified;
determining a similarity between the visual feature and a feature of each of a plurality of pre-acquired object classes; and
determining a classification result of the target object according to the similarities.
7. The method according to claim 6, wherein determining the similarity between the visual feature and the feature of each of the plurality of pre-acquired object classes comprises:
obtaining a Euclidean distance between the visual feature and the feature of each object class;
and wherein determining the classification result of the target object according to the similarities comprises:
determining the classification result of the target object as the object class corresponding to the minimum of the Euclidean distances.
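A small NumPy sketch of the nearest-class rule in claims 6-7; the feature dimensionality and the variable names are assumptions:

    import numpy as np

    def classify(visual_feature, class_features, class_names):
        # visual_feature: (d,) vector for the region to be identified;
        # class_features: (k, d) matrix of pre-acquired per-class features
        dists = np.linalg.norm(class_features - visual_feature, axis=1)
        return class_names[int(np.argmin(dists))]  # class with the minimum Euclidean distance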
8. The method according to any one of claims 1 to 7, wherein the target detection method is performed using a neural network comprising a feature extraction network for obtaining the feature information of the image to be detected, a first detection network for obtaining the first bounding box of the target object, and a second detection network for obtaining the second bounding box of the target object; the method further comprising:
performing end-to-end training of the feature extraction network, the first detection network, and the second detection network using sample images, wherein each sample image is annotated with a first bounding box of a target object and corner points of the target object.
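An illustrative PyTorch-style training step for claim 8's end-to-end scheme; the module names and the choice of L1 losses on the first bounding box and the corner points are assumptions, since the patent does not specify the loss functions:

    import torch
    import torch.nn.functional as F

    def train_step(backbone, first_net, second_net, optimizer, image, gt_box, gt_corners):
        features = backbone(image)
        pred_box = first_net(features)
        pred_corners = second_net(features, pred_box)
        # Supervise both detection heads jointly so gradients flow end to end
        # through the second detection network, the first detection network,
        # and the feature extraction network
        loss = F.l1_loss(pred_box, gt_box) + F.l1_loss(pred_corners, gt_corners)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()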
9. The method of claim 8, wherein the sample image is further annotated with direction information of the target object, the direction information being characterized by one of the edges of the target object.
10. A target detection apparatus, the apparatus comprising:
a first obtaining unit configured to obtain feature information of an image to be detected;
a second obtaining unit configured to obtain a first bounding box of a target object according to the feature information; and
a determining unit configured to obtain a second bounding box based on the feature information and the first bounding box, and to determine a region to be identified according to an image region corresponding to the second bounding box, wherein vertices of the second bounding box are located on the first bounding box and correspond to corner points of the target object.
11. An electronic device, comprising a memory and a processor, the memory storing computer instructions executable on the processor, wherein the processor is configured to implement the method of any one of claims 1 to 9 when executing the computer instructions.
12. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 9.
13. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 9.
CN202110349799.5A 2021-03-31 2021-03-31 Target detection method, device, equipment and computer readable storage medium Pending CN112949589A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110349799.5A CN112949589A (en) 2021-03-31 2021-03-31 Target detection method, device, equipment and computer readable storage medium
PCT/CN2021/121279 WO2022205816A1 (en) 2021-03-31 2021-09-28 Target detection method and apparatus, and device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110349799.5A CN112949589A (en) 2021-03-31 2021-03-31 Target detection method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112949589A true CN112949589A (en) 2021-06-11

Family

ID=76231605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110349799.5A Pending CN112949589A (en) 2021-03-31 2021-03-31 Target detection method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112949589A (en)
WO (1) WO2022205816A1 (en)

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110414502B (en) * 2019-08-02 2022-04-01 泰康保险集团股份有限公司 Image processing method and device, electronic equipment and computer readable medium
CN112949589A (en) * 2021-03-31 2021-06-11 深圳市商汤科技有限公司 Target detection method, device, equipment and computer readable storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160328857A1 (en) * 2013-10-16 2016-11-10 3M Innovative Properties Company Note recognition for overlapping physical notes
CN110135424A (en) * 2019-05-23 2019-08-16 阳光保险集团股份有限公司 Tilt text detection model training method and ticket image Method for text detection
CN111598091A (en) * 2020-05-20 2020-08-28 北京字节跳动网络技术有限公司 Image recognition method and device, electronic equipment and computer readable storage medium
CN111738262A (en) * 2020-08-21 2020-10-02 北京易真学思教育科技有限公司 Target detection model training method, target detection model training device, target detection model detection device, target detection equipment and storage medium
CN112183038A (en) * 2020-09-23 2021-01-05 国信智能系统(广东)有限公司 Form identification and typing method, computer equipment and computer readable storage medium
CN112560852A (en) * 2020-12-15 2021-03-26 三峡大学 Single-stage target detection method with rotation adaptive capacity based on YOLOv3 network

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2022205816A1 (en) * 2021-03-31 2022-10-06 深圳市商汤科技有限公司 Target detection method and apparatus, and device and computer-readable storage medium
CN113377980A (en) * 2021-06-24 2021-09-10 上海商汤科技开发有限公司 Information labeling method and device, electronic equipment and storage medium
CN113420753A (en) * 2021-07-13 2021-09-21 杭州海康威视数字技术股份有限公司 Target object frame selection area generation method and device
CN113420753B (en) * 2021-07-13 2024-01-05 杭州海康威视数字技术股份有限公司 Method and device for generating target object frame selection area

Also Published As

Publication number Publication date
WO2022205816A1 (en) 2022-10-06

Similar Documents

Publication Publication Date Title
CN112949589A (en) Target detection method, device, equipment and computer readable storage medium
JP6354589B2 (en) Object identification device, method and program
US10936911B2 (en) Logo detection
JP5522408B2 (en) Pattern recognition device
US9171204B2 (en) Method of perspective correction for devanagari text
US8792715B2 (en) System and method for forms classification by line-art alignment
US20150213328A1 (en) Object identification apparatus, object identification method, and program
JP6278276B2 (en) Object identification device, object identification method, and program
US9025882B2 (en) Information processing apparatus and method of processing information, storage medium and program
CN110717366A (en) Text information identification method, device, equipment and storage medium
WO2022198969A1 (en) Method, apparatus, and device for recognizing seal text, and computer readable storage medium
US11631240B2 (en) Method, apparatus and system for identifying target objects
US11809525B2 (en) List generation device, photographic subject identification device, list generation method, and program
CN111160395A (en) Image recognition method and device, electronic equipment and storage medium
CN111652176A (en) Information extraction method, device, equipment and storage medium
CN108090728B (en) Express information input method and system based on intelligent terminal
CN112241736B (en) Text detection method and device
CN112287763A (en) Image processing method, apparatus, device and medium
JP2007140729A (en) Method and device detecting position and attitude of article
US11443552B2 (en) Image pattern similarity calculation device and recognition device
CN112818823B (en) Text extraction method based on bill content and position information
Elantcev et al. The Modified Method of Statistical Differentiation for Matching of an Aerial Photograph and a Satellite Image
JP6561891B2 (en) Barcode area detection apparatus, barcode reading system, barcode area detection method, and program
Park et al. Hybrid document matching method for page identification of digilog books
JP2004030340A (en) Document identifying apparatus and identifying method therefor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
REG Reference to a national code
Ref country code: HK
Ref legal event code: DE
Ref document number: 40043963
Country of ref document: HK