CN115797660A - Image detection method, image detection device, electronic equipment and storage medium

Info

Publication number: CN115797660A
Application number: CN202211415661.1A
Authority: CN
Inventors: 夏春龙
Original assignee: Apollo Zhilian Beijing Technology Co Ltd; Apollo Zhixing Technology Guangzhou Co Ltd
Current assignee: Apollo Zhilian Beijing Technology Co Ltd; Apollo Zhixing Technology Guangzhou Co Ltd
Priority date: 2022-11-11
Filing date: 2022-11-11
Publication date: 2023-03-14

Abstract

The disclosure provides an image detection method, an image detection apparatus, an electronic device and a storage medium, and relates to the technical field of artificial intelligence, in particular to the fields of automatic driving, intelligent transportation, computer vision and the like. The implementation scheme is as follows: obtaining a target image to be detected, wherein the target image comprises a plurality of objects; performing feature extraction on the target image to obtain a feature corresponding to each of a plurality of feature scales, wherein each of the plurality of feature scales corresponds to a plurality of prior boxes of a plurality of scales; for each feature in the plurality of features corresponding to the plurality of feature scales, obtaining a sub-feature corresponding to each prior box in the plurality of prior boxes corresponding to the feature; obtaining a plurality of target sub-features from the plurality of sub-features corresponding to the plurality of features; and obtaining a plurality of prediction boxes of the target image based on the plurality of target sub-features, each of the plurality of prediction boxes indicating one of the plurality of objects.

Description

Image detection method, image detection device, electronic equipment and storage medium

Technical Field

The present disclosure relates to the field of artificial intelligence, in particular to the fields of automatic driving, intelligent transportation, computer vision, and the like, and more specifically to an image detection method, apparatus, electronic device, computer-readable storage medium, and computer program product.

Background

Artificial intelligence is the discipline that studies how to make computers simulate certain human mental processes and intelligent behaviors (such as learning, reasoning, thinking, planning, etc.), and it involves both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, and the like; artificial intelligence software technologies mainly include computer vision, speech recognition, natural language processing, machine learning/deep learning, big data processing, knowledge graph technology and the like.

Image detection techniques based on artificial intelligence, which obtain image features of an image through an image detection model and obtain objects (e.g., people, vehicles, etc.) contained in the image based on the image features, have been widely used in various scenes.

The approaches described in this section are not necessarily approaches that have been previously conceived or pursued. Unless otherwise indicated, it should not be assumed that any of the approaches described in this section qualify as prior art merely by virtue of their inclusion in this section. Similarly, unless otherwise indicated, the problems mentioned in this section should not be considered as having been acknowledged in any prior art.

Disclosure of Invention

The present disclosure provides an image detection method, apparatus, electronic device, computer-readable storage medium, and computer program product.

According to an aspect of the present disclosure, there is provided an image detection method including: obtaining a target image to be detected, wherein the target image comprises a plurality of objects; performing feature extraction on the target image to obtain a feature corresponding to each of a plurality of feature scales, wherein each of the plurality of feature scales corresponds to a plurality of prior boxes of a plurality of scales; for each feature in the plurality of features corresponding to the plurality of feature scales, obtaining a sub-feature corresponding to each prior box in the plurality of prior boxes corresponding to the feature; obtaining a plurality of target sub-features from the plurality of sub-features corresponding to the plurality of features; and obtaining a plurality of prediction boxes of the target image based on the plurality of target sub-features, each of the plurality of prediction boxes indicating one of the plurality of objects.

According to another aspect of the present disclosure, there is provided an image detection apparatus including: a target image acquisition unit configured to obtain a target image to be detected, the target image including a plurality of objects; a feature extraction unit configured to perform feature extraction on a target image to be detected to obtain features corresponding to each of a plurality of feature scales, wherein each of the plurality of feature scales corresponds to a plurality of prior frames corresponding to a plurality of scales; a sub-feature obtaining unit, configured to obtain, for each of a plurality of features corresponding to the plurality of feature scales, a sub-feature corresponding to each of a plurality of prior frames corresponding to the feature; a target sub-feature obtaining unit configured to obtain a plurality of target sub-features from a plurality of sub-features corresponding to the plurality of features; and a prediction frame obtaining unit configured to obtain a plurality of prediction frames of the target image based on the plurality of target sub-features, each of the plurality of prediction frames indicating one of the plurality of objects.

According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform a method according to embodiments of the present disclosure.

According to another aspect of the present disclosure, there is provided a non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method according to the embodiments of the present disclosure.

According to another aspect of the present disclosure, a computer program product is provided, comprising a computer program, wherein the computer program, when executed by a processor, implements the method according to embodiments of the present disclosure.

According to one or more embodiments of the present disclosure, the range of object scales that can be detected in an image may be expanded, so that objects of more scales can be detected.

It should be understood that the statements in this section are not intended to identify key or critical features of the embodiments of the present disclosure, nor are they intended to limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments and, together with the description, serve to explain exemplary implementations of the embodiments. The illustrated embodiments are for purposes of illustration only and do not limit the scope of the claims. Throughout the drawings, identical reference numbers designate similar, but not necessarily identical, elements.

FIG. 1 illustrates a schematic diagram of an exemplary system in which various methods described herein may be implemented, according to an embodiment of the present disclosure;

FIG. 2 shows a flowchart of an image detection method according to an embodiment of the present disclosure;

FIG. 3 is a flowchart illustrating a process of obtaining a plurality of target sub-features from a plurality of sub-features corresponding to a plurality of features in an image detection method according to an embodiment of the present disclosure;

FIG. 4 is a flowchart illustrating a process of obtaining a plurality of prediction boxes of a target image based on a plurality of target sub-features in an image detection method according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of a process in which an image detection method according to an embodiment of the present disclosure may be implemented;

FIG. 6 is a flowchart illustrating a process of obtaining a plurality of target sub-features based on a prediction probability corresponding to each of a plurality of sub-features corresponding to each of a plurality of features in an image detection method according to an embodiment of the present disclosure;

FIG. 7 is a flowchart illustrating a process of obtaining a plurality of prediction boxes of a target image based on a plurality of target sub-features in an image detection method according to an embodiment of the present disclosure;

FIG. 8 shows a block diagram of the structure of an image detection apparatus according to an embodiment of the present disclosure;

FIG. 9 illustrates a block diagram of an exemplary electronic device that can be used to implement embodiments of the present disclosure.

Detailed Description

Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.

In the present disclosure, unless otherwise specified, the use of the terms "first", "second", and the like to describe various elements is not intended to limit the positional relationship, the temporal relationship, or the importance relationship of the elements, and such terms are used only to distinguish one element from another. In some examples, a first element and a second element may refer to the same instance of the element, and in some cases, based on the context, they may also refer to different instances.

The terminology used in the description of the various described examples in this disclosure is for the purpose of describing particular examples only and is not intended to be limiting. Unless the context clearly indicates otherwise, if the number of elements is not specifically limited, the element may be one or a plurality of. Furthermore, the term "and/or" as used in this disclosure is intended to encompass any and all possible combinations of the listed items.

Embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.

Fig. 1 illustrates a schematic diagram of an exemplary system 100 in which various methods and apparatus described herein may be implemented in accordance with embodiments of the present disclosure. Referring to fig. 1, the system 100 includes one or more client devices 101, 102, 103, 104, 105, and 106, a server 120, and one or more communication networks 110 coupling the one or more client devices to the server 120.

Client devices 101, 102, 103, 104, 105, and 106 may be configured to execute one or more applications.

In an embodiment of the present disclosure, the server 120 may run one or more services or software applications that enable the execution of the image detection method according to the present disclosure.

In some embodiments, the server 120 may also provide other services or software applications, which may include non-virtual environments and virtual environments. In some embodiments, these services may be provided as web-based services or cloud services, for example, provided to users of client devices 101, 102, 103, 104, 105, and/or 106 under a software as a service (SaaS) model.

In the configuration shown in fig. 1, server 120 may include one or more components that implement the functions performed by server 120. These components may include software components, hardware components, or a combination thereof, which may be executed by one or more processors. A user operating client devices 101, 102, 103, 104, 105, and/or 106 may, in turn, utilize one or more client applications to interact with server 120 to take advantage of the services provided by these components. It should be understood that a variety of different system configurations are possible, which may differ from system 100. Accordingly, fig. 1 is one example of a system for implementing the various methods described herein, and is not intended to be limiting.

The user may use the client device 101, 102, 103, 104, 105, and/or 106 to receive a plurality of prediction boxes obtained according to the image detection method of the present disclosure. The client device may provide an interface that enables a user of the client device to interact with the client device. The client device may also output information to the user via the interface. Although fig. 1 depicts only six client devices, those skilled in the art will appreciate that any number of client devices may be supported by the present disclosure.

Client devices 101, 102, 103, 104, 105, and/or 106 may include various types of computer devices, such as portable handheld devices, general purpose computers (such as personal computers and laptop computers), workstation computers, wearable devices, smart screen devices, self-service terminal devices, service robots, gaming systems, thin clients, various messaging devices, sensors or other sensing devices, and so forth. These computer devices may run various types and versions of software applications and operating systems, such as MICROSOFT Windows, APPLE iOS, UNIX-like operating systems, Linux, or Linux-like operating systems (e.g., GOOGLE Chrome OS); or include various mobile operating systems such as MICROSOFT Windows Mobile OS, iOS, Windows Phone, and Android. Portable handheld devices may include cellular telephones, smart phones, tablets, Personal Digital Assistants (PDAs), and the like. Wearable devices may include head-mounted displays (such as smart glasses) and other devices. The gaming system may include a variety of handheld gaming devices, internet-enabled gaming devices, and the like. The client device is capable of executing a variety of different applications, such as various Internet-related applications, communication applications (e.g., email applications), and Short Message Service (SMS) applications, and may use a variety of communication protocols.

Network 110 may be any type of network known to those skilled in the art that may support data communications using any of a variety of available protocols, including but not limited to TCP/IP, SNA, IPX, etc. By way of example only, one or more networks 110 may be a Local Area Network (LAN), an Ethernet-based network, a token ring, a Wide Area Network (WAN), the Internet, a virtual network, a Virtual Private Network (VPN), an intranet, an extranet, a blockchain network, a Public Switched Telephone Network (PSTN), an infrared network, a wireless network (e.g., Bluetooth, Wi-Fi), and/or any combination of these and/or other networks.

The server 120 may include one or more general purpose computers, special purpose server computers (e.g., PC (personal computer) servers, UNIX servers, midrange servers), blade servers, mainframe computers, server clusters, or any other suitable arrangement and/or combination. The server 120 may include one or more virtual machines running a virtual operating system, or other computing architecture involving virtualization (e.g., one or more flexible pools of logical storage that may be virtualized to maintain virtual storage for the server). In various embodiments, the server 120 may run one or more services or software applications that provide the functionality described below.

The computing units in server 120 may run one or more operating systems including any of the operating systems described above, as well as any commercially available server operating systems. The server 120 may also run any of a variety of additional server applications and/or middle tier applications, including HTTP servers, FTP servers, CGI servers, JAVA servers, database servers, and the like.

In some implementations, the server 120 may include one or more applications to analyze and consolidate data feeds and/or event updates received from users of the client devices 101, 102, 103, 104, 105, and/or 106. Server 120 may also include one or more applications to display data feeds and/or real-time events via one or more display devices of client devices 101, 102, 103, 104, 105, and/or 106.

In some embodiments, the server 120 may be a server of a distributed system, or a server incorporating a blockchain. The server 120 may also be a cloud server, or a smart cloud computing server or a smart cloud host with artificial intelligence technology. A cloud server is a host product in a cloud computing service system that addresses the high management difficulty and weak service scalability of traditional physical hosts and Virtual Private Server (VPS) services.

The system 100 may also include one or more databases 130. In some embodiments, these databases may be used to store data and other information. For example, one or more of the databases 130 may be used to store information such as audio files and video files. The database 130 may reside in various locations. For example, the database used by the server 120 may be local to the server 120, or may be remote from the server 120 and may communicate with the server 120 via a network-based or dedicated connection. The database 130 may be of different types. In certain embodiments, the database used by the server 120 may be, for example, a relational database. One or more of these databases may store, update, and retrieve data in response to commands.

In some embodiments, one or more of the databases 130 may also be used by applications to store application data. The databases used by the application may be different types of databases, such as key-value stores, object stores, or regular stores supported by a file system.

The system 100 of fig. 1 may be configured and operated in various ways to enable application of the various methods and apparatus described in accordance with the present disclosure.

In the related art, a predetermined number of prior boxes (anchors) corresponding to the objects in an image to be detected are designed in advance, and target detection is performed on the image according to these prior boxes to obtain each object in the image that corresponds to one of them. Because the number of prior boxes is limited, when the scale distribution of the objects is wide, objects in the image that have no corresponding prior box are often not detected.

According to an aspect of the present disclosure, there is provided an image detection method. As shown in fig. 2, an image detection method 200 according to some embodiments of the present disclosure includes:

step S210: obtaining a target image to be detected, wherein the target image comprises a plurality of objects;

step S220: performing feature extraction on the target image to obtain features corresponding to each of a plurality of feature scales, wherein each of the plurality of feature scales corresponds to a plurality of prior boxes corresponding to a plurality of scales;

step S230: for each of a plurality of features corresponding to the plurality of feature scales, obtaining a sub-feature corresponding to each of a plurality of prior frames corresponding to the feature;

step S240: obtaining a plurality of target sub-features from a plurality of sub-features corresponding to the plurality of features; and

step S250: based on the plurality of target sub-features, a plurality of prediction boxes of the target image are obtained, each of the plurality of prediction boxes indicating one of the plurality of objects.
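The five steps above can be read as a simple pipeline. The following is a minimal sketch of that control flow only, assuming hypothetical callables (`extract_features`, `sub_features_for`, `select_targets`, `predict_box`) that stand in for the networks described later; none of these names come from the disclosure, and the selection is shown per feature, which corresponds to only some of the embodiments.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max), an assumed layout


def detect(
    image,
    extract_features: Callable,   # S220: image -> {feature_scale: feature}
    sub_features_for: Callable,   # S230: (feature, prior_boxes) -> list of sub-features
    select_targets: Callable,     # S240: list of sub-features -> selected sub-features
    predict_box: Callable,        # S250: target sub-feature -> prediction box
    priors_by_scale: Dict[int, Sequence],
) -> List[Box]:
    """Run steps S220 to S250 on an image obtained in step S210."""
    features = extract_features(image)
    boxes: List[Box] = []
    for scale, feature in features.items():
        subs = sub_features_for(feature, priors_by_scale[scale])
        for target in select_targets(subs):
            boxes.append(predict_box(target))
    return boxes


# Toy stand-ins (purely illustrative) so the skeleton can be run end to end.
priors = {64: [(30, 30), (50, 50)], 128: [(70, 70), (110, 110)]}
toy_boxes = detect(
    image=None,
    extract_features=lambda img: {64: "f64", 128: "f128"},
    sub_features_for=lambda feat, ps: [(feat, p) for p in ps],
    select_targets=lambda subs: subs[:1],
    predict_box=lambda sub: (0.0, 0.0, float(sub[1][0]), float(sub[1][1])),
    priors_by_scale=priors,
)
print(toy_boxes)
```

Passing the stages in as callables keeps the sketch independent of any particular network architecture, which the disclosure leaves open.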

Feature extraction is performed on the target image to be detected to obtain a feature at each of a plurality of feature scales, where each feature scale corresponds to a plurality of prior boxes. During prediction, the prior boxes corresponding to the plurality of feature scales are screened to obtain the sub-features of those prior boxes from which prediction boxes are to be predicted. The sub-features corresponding to suitable prior boxes are thus selected dynamically during image detection, which expands the range of object scales that can be detected in the image and enables objects of more scales to be detected.

In some embodiments, in step S210, the target image may be any image to be detected, such as an image acquired from a camera. The object included in the target image may be any object, for example, when the target image is an image obtained from an on-vehicle camera, the included objects may include a vehicle, a pedestrian, a traffic light, a zebra crossing, a lane line, a building, a traffic cone, and the like, and is not limited herein.

In some embodiments, in step S220, feature extraction is performed on the target image using a feature extraction network. In some embodiments, the feature extraction network may be a backbone network based on an architecture such as ResNet, DenseNet, or Darknet.

The feature extraction network comprises a plurality of feature extraction layers, which obtain the feature corresponding to each of the plurality of feature scales by down-sampling the target image. The higher the down-sampling multiple, the smaller the feature scale.

It will be appreciated that the higher the down-sampling multiple, the larger the receptive field of the obtained feature, i.e., the larger the scale of the objects that can be detected from it; conversely, the lower the down-sampling multiple, the smaller the scale of the objects that can be detected from it.
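The relationship between the down-sampling multiple and the feature scale is simple arithmetic. The sketch below assumes a square 512 × 512 input and down-sampling multiples of 4, 8 and 16; these values are illustrative assumptions and are not stated in the disclosure.

```python
def feature_scale(image_size: int, multiple: int) -> int:
    """Side length of the feature obtained by down-sampling a square image."""
    return image_size // multiple


image_size = 512  # assumed input size
for multiple in (4, 8, 16):  # assumed down-sampling multiples
    side = feature_scale(image_size, multiple)
    print(f"down-sampling x{multiple}: feature scale {side} x {side}")
```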

According to an embodiment of the present disclosure, each of the plurality of feature scales corresponds to a plurality of prior boxes. The plurality of prior frames may be obtained by clustering according to different objects in advance. For example, for different objects, clustering is performed based on respective scales, obtaining a plurality of prior boxes whose aspect ratios conform to the aspect ratio of each object.
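The disclosure does not name a clustering algorithm. As one possible illustration, the sketch below runs a plain k-means over the (width, height) sizes of labelled objects with Euclidean distance and a deterministic initialisation; the function name `kmeans_priors`, the distance measure and the sample sizes are all assumptions.

```python
from typing import List, Tuple

Size = Tuple[float, float]  # (width, height)


def kmeans_priors(sizes: List[Size], k: int, iters: int = 20) -> List[Size]:
    """Cluster object sizes into k prior-box sizes with plain k-means."""
    if not 0 < k <= len(sizes):
        raise ValueError("k must be between 1 and the number of sizes")
    # Deterministic initialisation: k sizes evenly spaced in order of area.
    ordered = sorted(sizes, key=lambda s: s[0] * s[1])
    step = len(ordered) / k
    centers = [ordered[int(i * step + step / 2)] for i in range(k)]
    for _ in range(iters):
        groups: List[List[Size]] = [[] for _ in range(k)]
        for w, h in sizes:
            nearest = min(
                range(k),
                key=lambda i: (w - centers[i][0]) ** 2 + (h - centers[i][1]) ** 2,
            )
            groups[nearest].append((w, h))
        new_centers = []
        for i, group in enumerate(groups):
            if group:
                mean_w = sum(w for w, _ in group) / len(group)
                mean_h = sum(h for _, h in group) / len(group)
                new_centers.append((mean_w, mean_h))
            else:
                new_centers.append(centers[i])  # keep the center of an empty cluster
        if new_centers == centers:
            break
        centers = new_centers
    return centers


# Made-up sizes: three tall objects (e.g. pedestrians) and three wide ones (e.g. vehicles).
sizes = [(30, 60), (32, 64), (28, 56), (100, 50), (104, 52), (96, 48)]
prior_sizes = kmeans_priors(sizes, k=2)
print(prior_sizes)
```

Each resulting center keeps the aspect ratio typical of its group, which matches the stated aim that the prior boxes conform to the aspect ratio of each object.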

In some embodiments, each of the plurality of feature scales corresponds to as many prior boxes as possible to cover objects of as many scales as possible.

In some embodiments, the plurality of prior boxes for each of the plurality of feature scales is the same. That is, the scales of the prior boxes corresponding to each feature scale are the same set of scales.

In some embodiments, for a first feature scale of the plurality of feature scales and a second feature scale larger than the first feature scale, a scale of each prior box of the plurality of prior boxes to which the first feature scale corresponds is smaller than a scale of each prior box of the plurality of prior boxes to which the second feature scale corresponds.

For example, for a first feature scale of 128 × 128, the corresponding plurality of prior boxes may be prior boxes with a height of any value from 70 to 110 and a width of any value from 70 to 110. For a second feature scale of 64 × 64, the corresponding plurality of prior boxes may be prior boxes with a height of any value from 30 to 50 and a width of any value from 30 to 50.
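The example ranges above can be written down as a small lookup table. The sketch below enumerates (height, width) pairs within each range using an assumed sampling step; the disclosure gives only the ranges, so the steps and the name `PRIORS_BY_FEATURE_SCALE` are hypothetical.

```python
from typing import Dict, List, Tuple


def priors_in_range(low: int, high: int, step: int) -> List[Tuple[int, int]]:
    """All (height, width) pairs whose sides lie in [low, high], sampled every `step`."""
    values = range(low, high + 1, step)
    return [(h, w) for h in values for w in values]


# Ranges taken from the example in the text; the sampling steps are assumptions.
PRIORS_BY_FEATURE_SCALE: Dict[int, List[Tuple[int, int]]] = {
    128: priors_in_range(70, 110, 20),  # sides 70, 90, 110
    64: priors_in_range(30, 50, 10),    # sides 30, 40, 50
}

for scale, boxes in PRIORS_BY_FEATURE_SCALE.items():
    print(scale, len(boxes), boxes[0], boxes[-1])
```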

A feature with a larger feature scale has a smaller receptive field, so the prior boxes corresponding to it are set to smaller scales and smaller objects can be detected from it. A feature with a smaller feature scale has a larger receptive field, so the prior boxes corresponding to it are set to larger scales and larger objects can be detected from it. This avoids the inaccurate detection that results when smaller objects are detected from a feature with a smaller feature scale, and improves the accuracy of the detection result.

In some embodiments, in step S230, a sliding window based traversal is performed for each feature to obtain a sub-feature of the feature corresponding to each of the respective plurality of prior boxes. In some embodiments, in step S230, for each of the plurality of features, the feature is input to the overall prediction head to obtain a sub-feature of the feature corresponding to each of the respective plurality of prior boxes.

For example, for a feature with a feature scale of 128 × 128, the feature is input into the overall prediction head, and the overall prediction head matches each point of a plurality of points on the feature against each prior box in the plurality of prior boxes corresponding to the feature, so as to obtain, at each point, a sub-feature of the feature corresponding to each prior box. For each prior box in the plurality of prior boxes, the sub-feature of the feature corresponding to that prior box is then obtained from the sub-features corresponding to that prior box at the plurality of points, namely the one with the highest confidence among them.
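The last selection step, keeping for each prior box the sub-feature with the highest confidence across points, can be sketched as follows. The data layout (a list of `(confidence, sub_feature)` pairs per prior box) and the function name are assumptions made for illustration; how the confidence itself is computed is not specified here.

```python
from typing import Any, List, Sequence, Tuple

Candidate = Tuple[float, Any]  # (confidence, sub_feature) at one point of the feature


def best_sub_feature_per_prior(candidates: Sequence[Sequence[Candidate]]) -> List[Any]:
    """For each prior box, keep the sub-feature with the highest confidence over all points.

    candidates[i] holds one (confidence, sub_feature) pair per point for prior box i.
    """
    return [max(per_point, key=lambda c: c[0])[1] for per_point in candidates]


# Toy input: two prior boxes, two points each; strings stand in for real sub-features.
candidates = [
    [(0.2, "prior0@pointA"), (0.9, "prior0@pointB")],
    [(0.7, "prior1@pointA"), (0.1, "prior1@pointB")],
]
selected = best_sub_feature_per_prior(candidates)
print(selected)
```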

In some embodiments, as shown in fig. 3, in step S240, obtaining a plurality of target sub-features from a plurality of sub-features corresponding to the plurality of features includes:

step S310: for each of the plurality of features, obtaining a prediction probability corresponding to each of a plurality of sub-features corresponding to the feature, wherein the prediction probability indicates the importance degree of the sub-feature in the plurality of sub-features; and

step S320: obtaining the plurality of target sub-features based on the prediction probability corresponding to each of the plurality of sub-features corresponding to each of the plurality of features.

The prediction probability of each sub-feature corresponding to each feature is obtained, and the target sub-features are obtained based on these probabilities. Because the prediction probability indicates the importance of a sub-feature among the plurality of sub-features, the obtained target sub-features are the sub-features of higher importance, which improves prediction accuracy.

In some embodiments, in step S310, for each of the plurality of features, a fused feature is obtained based on a plurality of sub-features of the feature, and a probability corresponding to each of the plurality of sub-features is obtained based on the fused feature.

In some embodiments, in obtaining the fused feature based on the plurality of sub-features of the feature, the plurality of sub-features may be scaled to obtain a plurality of features at the same scale, and the plurality of features at the same scale may be concatenated along the channel dimension to obtain the fused feature.
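A minimal sketch of this scale-then-concatenate fusion is given below, assuming each sub-feature is a list of channels and each channel is a 2D list of numbers. Nearest-neighbour resizing is an assumed choice; the disclosure does not say how the scaling is performed.

```python
from typing import List

Channel = List[List[float]]   # H x W grid of values
Feature = List[Channel]       # C channels


def resize_nearest(channel: Channel, out_h: int, out_w: int) -> Channel:
    """Resize one channel to out_h x out_w with nearest-neighbour sampling."""
    in_h, in_w = len(channel), len(channel[0])
    return [
        [channel[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def fuse(sub_features: List[Feature], out_h: int, out_w: int) -> Feature:
    """Bring every sub-feature to the same spatial size, then concatenate the channels."""
    fused: Feature = []
    for sub in sub_features:
        for channel in sub:
            fused.append(resize_nearest(channel, out_h, out_w))
    return fused


sub_a = [[[1.0, 2.0], [3.0, 4.0]]]  # 1 channel, 2 x 2
sub_b = [[[5.0]]]                    # 1 channel, 1 x 1
fused = fuse([sub_a, sub_b], out_h=2, out_w=2)
print(len(fused), fused)
```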

In some embodiments, for each of the plurality of features, the feature is first fused with a plurality of sub-features corresponding to the plurality of features to obtain a plurality of fused features, and the plurality of fused features are concatenated along the channel dimension to obtain the fused feature.

In some embodiments, in step S310, for each of a plurality of features, a plurality of sub-features corresponding to the feature are input to the scale perception module to obtain a probability corresponding to each of the plurality of sub-features.

In some embodiments, the scale perception module includes a global average pooling network and a fully connected layer. In step S310, the fused feature obtained from the plurality of sub-features corresponding to the feature is passed through the global average pooling network to obtain a pooled feature, and the pooled feature is input to the fully connected layer to obtain a probability corresponding to each sub-feature of the plurality of sub-features.
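The pooling-then-fully-connected structure described above can be sketched in a few lines. The example assumes one output per sub-feature and applies a softmax to turn the outputs into probabilities; the softmax, the weight values and all function names are assumptions, since the text states only that a probability is produced for each sub-feature.

```python
import math
from typing import List

Channel = List[List[float]]
Feature = List[Channel]


def global_average_pool(feature: Feature) -> List[float]:
    """Average each channel over its spatial positions."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature]


def fully_connected(x: List[float], weights: List[List[float]], bias: List[float]) -> List[float]:
    """One output per row of `weights`: dot(row, x) + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax (an assumed normalisation)."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_aware_probabilities(fused: Feature, weights, bias) -> List[float]:
    """Global average pooling followed by a fully connected layer and softmax."""
    return softmax(fully_connected(global_average_pool(fused), weights, bias))


fused = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 5.0], [5.0, 5.0]]]  # 2 channels, 2 x 2
weights = [[1.0, 0.0], [0.0, 1.0]]  # made-up weights, one row per sub-feature
bias = [0.0, 0.0]
pooled = global_average_pool(fused)
probs = scale_aware_probabilities(fused, weights, bias)
print(pooled, probs)
```

In a trained model the weights would be learned; the identity weights here only make the arithmetic easy to follow.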

In some embodiments, the obtaining the plurality of target sub-features based on the prediction probability corresponding to each of the plurality of sub-features corresponding to each of the plurality of features comprises:

for each feature in the plurality of features, obtaining at least one first target sub-feature corresponding to the feature from the plurality of sub-features corresponding to the feature, wherein the prediction probability of each first target sub-feature in the at least one first target sub-feature is greater than the prediction probability of every other sub-feature in the plurality of sub-features, the other sub-features being those that differ from each first target sub-feature in the at least one first target sub-feature; and

obtaining the plurality of target sub-features based on at least one first target sub-feature corresponding to each feature of the plurality of features.

At least one first target sub-feature of highest importance is obtained from the plurality of sub-features corresponding to each feature, and the plurality of target sub-features are obtained based on these first target sub-features. The target sub-features are therefore drawn from the prior boxes corresponding to every feature and cover the receptive field of every feature, so that the range of object sizes covered by the resulting detection boxes is as wide as possible.

In some embodiments, for the plurality of sub-features corresponding to each of the plurality of features, one or more first sub-features whose prediction probabilities are greater than a preset threshold are obtained from the plurality of sub-features, and the one or more first sub-features corresponding to each feature are determined as the at least one first target sub-feature corresponding to that feature.

In some embodiments, for the plurality of sub-features corresponding to each of the plurality of features, a preset number of second sub-features with the highest prediction probabilities are obtained from the plurality of sub-features, and the preset number of second sub-features corresponding to each feature are determined as the at least one first target sub-feature corresponding to that feature.
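The two selection strategies just described, a preset threshold and a preset number of highest-probability sub-features, can be sketched as index selection over a list of prediction probabilities. The function names and the sample probabilities are illustrative assumptions.

```python
from typing import List


def select_by_threshold(probs: List[float], threshold: float) -> List[int]:
    """Indices of sub-features whose prediction probability exceeds the preset threshold."""
    return [i for i, p in enumerate(probs) if p > threshold]


def select_top_k(probs: List[float], k: int) -> List[int]:
    """Indices of the k sub-features with the highest prediction probabilities."""
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return sorted(ranked[:k])


probs = [0.1, 0.6, 0.3]  # made-up probabilities for three sub-features of one feature
print(select_by_threshold(probs, threshold=0.25))
print(select_top_k(probs, k=1))
```

The threshold variant yields a variable number of sub-features per feature, while the top-k variant fixes that number, which makes the downstream computation cost predictable.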

In some embodiments, in step S250, each of the plurality of target sub-features is input into the prediction head for prediction, so that the prediction head performs classification and regression on each of the target sub-features to obtain a class and a detection frame corresponding to each of the target sub-features.
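As a rough illustration of a prediction head that performs both classification and regression, the sketch below applies two linear layers to a flattened target sub-feature: one yields class scores, the other yields four box values. The linear form, the weights, the class names and the box layout are all assumptions; the disclosure does not specify the internals of the prediction head.

```python
from typing import List, Sequence, Tuple


def linear(x: Sequence[float], weights: Sequence[Sequence[float]], bias: Sequence[float]) -> List[float]:
    """One output per row of `weights`: dot(row, x) + bias."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def prediction_head(
    sub_feature: Sequence[float],
    cls_weights, cls_bias,
    reg_weights, reg_bias,
    class_names: Sequence[str],
) -> Tuple[str, Tuple[float, ...]]:
    """Classify a target sub-feature and regress a box from it."""
    scores = linear(sub_feature, cls_weights, cls_bias)
    best = max(range(len(scores)), key=lambda i: scores[i])
    box = tuple(linear(sub_feature, reg_weights, reg_bias))
    return class_names[best], box


# Made-up weights and a two-element sub-feature vector, purely for illustration.
label, box = prediction_head(
    sub_feature=[1.0, 2.0],
    cls_weights=[[1.0, 0.0], [0.0, 1.0]], cls_bias=[0.0, 0.0],
    reg_weights=[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]], reg_bias=[0.0] * 4,
    class_names=["vehicle", "pedestrian"],
)
print(label, box)
```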

In some embodiments, the plurality of target sub-features includes at least one first target sub-feature corresponding to each of the plurality of features, and as shown in fig. 4, the step S250 of obtaining a plurality of prediction boxes of the target image based on the plurality of target sub-features includes:

step S410: for each feature in the plurality of features, obtaining a plurality of first prediction frames corresponding to the feature based on at least one first target sub-feature corresponding to the feature; and

step S420: obtaining the plurality of prediction boxes based on a plurality of first prediction boxes corresponding to each feature of the plurality of features.

By performing regression and classification only on the at least one first target sub-feature corresponding to each of the plurality of features, a plurality of first prediction frames corresponding to the feature are obtained, which reduces the amount of computation compared with performing regression and classification on all of the sub-features.

In some embodiments, for each of the plurality of features, at least one first target sub-feature of the plurality of target sub-features corresponding to the feature is input to the prediction head to obtain a plurality of first prediction boxes corresponding to the feature.

In some embodiments, the plurality of first prediction frames corresponding to each of the plurality of features is taken as the plurality of prediction frames corresponding to the target image.
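Steps S410 and S420 can be sketched as a loop over features followed by a merge. The sketch below rests on several assumptions that are not in the disclosure: a sub-feature is represented as a plain list of four numbers, the prediction head is replaced by a toy function `toy_head` that maps one sub-feature to one box, and boxes use an `(x1, y1, x2, y2)` layout.

```python
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) layout
SubFeature = List[float]                 # stand-in for a real feature tensor


def predict_boxes(
    first_targets: Dict[str, List[SubFeature]],
    head: Callable[[SubFeature], Box],
) -> List[Box]:
    """Run the head on the first target sub-features of every feature."""
    boxes: List[Box] = []
    for _feature_name, sub_features in first_targets.items():
        # Step S410: first prediction frames for this feature.
        first_boxes = [head(s) for s in sub_features]
        # Step S420: the frames of all features together form the result.
        boxes.extend(first_boxes)
    return boxes


def toy_head(sub_feature: SubFeature) -> Box:
    """Hypothetical head: decode (cx, cy, w, h) into corner coordinates."""
    cx, cy, w, h = sub_feature
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


# Hypothetical first target sub-features, keyed by feature reference numeral.
first_targets = {"501": [[10, 10, 4, 4]], "502": [[20, 20, 8, 8], [30, 30, 2, 2]]}
boxes = predict_boxes(first_targets, toy_head)
```

Because the head only sees the screened sub-features, its cost grows with the number of selected sub-features rather than with the number of prior frames.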

Referring to fig. 5, a schematic diagram illustrating a process flow of an image detection method according to some embodiments of the present disclosure is shown.

As shown in fig. 5, first, a feature extraction network 510 including a plurality of feature extraction layers is used to perform feature extraction on a target image 500 to obtain a plurality of features 501, 502 and 503 corresponding to a plurality of feature scales. Then, each of the plurality of features 501, 502 and 503 is input to a corresponding overall prediction head 520 to obtain a plurality of sub-features corresponding to the feature, and the plurality of sub-features are input to a scale perception module 530 to obtain prediction probabilities corresponding to the plurality of sub-features. Next, for each of the plurality of features 501, 502 and 503, based on the plurality of sub-features corresponding to the feature and the prediction probabilities respectively corresponding to the plurality of sub-features, an optimal selection module 540 is used to screen the target sub-features to obtain at least one first target sub-feature corresponding to each feature. Finally, for each of the plurality of features 501, 502 and 503, a prediction head 550 is used to perform classification (cls) and regression (Reg) based on the at least one first target sub-feature corresponding to the feature to obtain a plurality of first prediction frames corresponding to the feature.

It can be understood that, in the above processing flow, the image detection model, which is composed of the feature extraction network 510 together with the overall prediction head 520, the scale perception module 530, the optimal selection module 540 and the prediction head 550 corresponding to each feature, realizes the detection of the target image 500. The training process is similar to the prediction process: a training image including a plurality of objects is input to the image detection model, and a plurality of prediction frames are output via the feature extraction network 510 and the overall prediction head 520, the scale perception module 530, the optimal selection module 540 and the prediction head 550 corresponding to each feature. A loss is then calculated based on the plurality of prediction frames and the labeling frames that label the plurality of objects in the training image, and the model parameters are updated based on the loss, thereby training the model.

In some embodiments, as shown in fig. 6, in step S240, obtaining the plurality of target sub-features based on the prediction probability corresponding to each of the plurality of sub-features corresponding to each of the plurality of features includes:

step S610: obtaining a prediction probability for each of the plurality of features based on a prediction probability for each of a plurality of sub-features for each of the plurality of features, the prediction probability indicating a degree of importance of the feature in the plurality of features;

step S620: obtaining at least one target feature of the plurality of features, a prediction probability of each of the at least one target feature being greater than a prediction probability of a first feature of the plurality of features, the first feature being distinct from each of the at least one target feature; and

step S630: obtaining a plurality of target sub-features based on a plurality of sub-features corresponding to each of the at least one target feature.

Because the number of objects in a single image is limited, the range over which the scales of those objects are distributed is also limited. To avoid computing redundant sub-features, at least one target feature is obtained from the plurality of features, that is, the features extracted by the preferred feature extraction layers, and the target sub-features are obtained from the at least one target feature, which reduces the amount of computation and the consumption of computing resources.
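Steps S620 and S630 can be sketched as ranking the features by their prediction probabilities and keeping only the sub-features of the highest-ranked ones. The sketch assumes the feature-level probabilities from step S610 are already available and that the number of target features `n` is fixed in advance; the function names and the string placeholders for sub-features are hypothetical.

```python
from typing import Dict, List


def select_target_features(feature_probs: Dict[str, float], n: int) -> List[str]:
    """Step S620: keep the n features with the highest prediction probability."""
    ranked = sorted(feature_probs, key=lambda name: feature_probs[name], reverse=True)
    return ranked[:n]


def collect_target_sub_features(
    sub_features: Dict[str, List[str]], target_features: List[str]
) -> List[str]:
    """Step S630: gather all sub-features of the selected target features."""
    return [s for name in target_features for s in sub_features[name]]


# Hypothetical feature-level probabilities, keyed by feature reference numeral.
feature_probs = {"501": 0.2, "502": 0.8, "503": 0.6}
# Hypothetical sub-features, represented here by placeholder labels.
sub_features = {"501": ["a1", "a2"], "502": ["b1", "b2"], "503": ["c1"]}

targets = select_target_features(feature_probs, n=2)
target_subs = collect_target_sub_features(sub_features, targets)
```

In this example feature 501 is dropped entirely, so none of its sub-features reach the prediction head.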

In some embodiments, in step S610, obtaining the predicted probability for each of the plurality of features based on the predicted probability for each of the plurality of sub-features corresponding to each of the plurality of features comprises:

for each of the plurality of features, obtaining at least one second target sub-feature in a plurality of sub-features corresponding to the feature, wherein the corresponding prediction probability of each of the at least one second target sub-feature is greater than the prediction probability of a second sub-feature in the plurality of sub-features, and the second sub-feature is different from each of the at least one second target sub-feature; and

for each of the plurality of features, obtaining a prediction probability corresponding to the feature based on the prediction probability corresponding to each of the at least one second target sub-feature corresponding to the feature.

At least one second target sub-feature ranked highest in importance is obtained from the plurality of sub-features corresponding to each feature, and the prediction probability corresponding to the feature is obtained based on the prediction probability of each of the at least one second target sub-feature. The prediction probability of a feature thus indicates the probability that an accurate prediction frame can be obtained based on that feature. Accurate prediction frames can therefore be obtained from the target features selected according to the prediction probability of each of the plurality of features, which improves the accuracy of the plurality of prediction frames obtained for the target image.

In some embodiments, a preset number of sub-features in the plurality of sub-features corresponding to each of the plurality of features is obtained, and the preset number of sub-features is determined as the at least one second target sub-feature corresponding to the feature.

In some embodiments, for each of the plurality of features, at least one sub-feature whose prediction probability is greater than a probability threshold is obtained from the plurality of sub-features corresponding to the feature, and the at least one sub-feature is determined as the at least one second target sub-feature corresponding to the feature.
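Step S610 can be sketched as reducing the probabilities of the highest-ranked sub-features of a feature to a single score. The disclosure does not state how the probabilities are combined; the sketch below assumes, purely for illustration, that the second target sub-features are the `k` most probable ones and that their probabilities are averaged. The function name `feature_probability` is hypothetical.

```python
from statistics import mean
from typing import Sequence


def feature_probability(sub_probs: Sequence[float], k: int) -> float:
    """Average the k highest sub-feature probabilities of one feature.

    Averaging is an assumed aggregation; the maximum or the sum would fit
    the description equally well. Requires k >= 1 and a non-empty sequence.
    """
    top_k = sorted(sub_probs, reverse=True)[:k]
    return mean(top_k)


# Hypothetical sub-feature probabilities for one feature.
score = feature_probability([0.25, 0.75, 0.5, 0.125], k=2)
```

Restricting the aggregation to the highest-ranked sub-features keeps a feature with a few strong prior frames from being penalized by its many weak ones.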

In some embodiments, as shown in fig. 7, in step S250, the plurality of target sub-features includes a plurality of sub-features corresponding to each target feature of the at least one target feature, and the obtaining the plurality of prediction boxes of the target image based on the plurality of target sub-features includes:

step S710: for each target feature of the at least one target feature, obtaining a plurality of second prediction frames corresponding to the target feature based on a plurality of sub-features corresponding to the target feature; and

step S720: obtaining the plurality of prediction boxes based on a plurality of second prediction boxes corresponding to each target feature of the at least one target feature.

By performing regression and classification on the plurality of sub-features of each target feature, a plurality of second prediction frames corresponding to the target feature are obtained, which reduces the amount of computation and the consumption of computing resources.

In some embodiments, for each target feature of the at least one target feature, a plurality of sub-features corresponding to the target feature are input to the prediction head to obtain a plurality of second prediction frames corresponding to the target feature.

In some embodiments, the plurality of second prediction frames corresponding to each target feature of the at least one target feature are taken as the plurality of prediction frames corresponding to the target image.

According to another aspect of the present disclosure, an image detection apparatus is also provided. As shown in fig. 8, the apparatus 800 includes: a target image obtaining unit 810 configured to obtain a target image to be detected, the target image including a plurality of objects; a feature extraction unit 820 configured to perform feature extraction on a target image to be detected to obtain features corresponding to each of a plurality of feature scales, wherein each of the plurality of feature scales corresponds to a plurality of prior frames corresponding to a plurality of scales; a sub-feature obtaining unit 830, configured to obtain, for each of the plurality of features corresponding to the plurality of feature scales, a sub-feature corresponding to each of a plurality of prior frames corresponding to the feature; a target sub-feature obtaining unit 840 configured to obtain a plurality of target sub-features from a plurality of sub-features corresponding to the plurality of features; and a prediction frame obtaining unit 850 configured to obtain a plurality of prediction frames of the target image based on the plurality of target sub-features, each of the plurality of prediction frames indicating one of the plurality of objects.

In some embodiments, for a first feature scale of the plurality of feature scales and a second feature scale larger than the first feature scale, the scale of each prior box of the plurality of prior boxes to which the first feature scale corresponds is smaller than the scale of each prior box of the plurality of prior boxes to which the second feature scale corresponds.
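The ordering constraint on prior boxes can be checked mechanically. The sketch below assumes that each feature scale is represented by a number and each prior box by a single scalar scale; the function name `prior_scales_are_ordered` and all numeric values are hypothetical.

```python
from typing import Dict, List


def prior_scales_are_ordered(priors: Dict[int, List[float]]) -> bool:
    """Check that every prior box of a smaller feature scale is smaller
    than every prior box of any larger feature scale.

    `priors` maps a feature scale to the scales of its prior boxes; each
    list must be non-empty. Comparing adjacent feature scales is enough,
    because the condition is transitive.
    """
    scales = sorted(priors)
    return all(
        max(priors[small]) < min(priors[large])
        for small, large in zip(scales, scales[1:])
    )


# Hypothetical configurations of prior-box scales per feature scale.
ordered = prior_scales_are_ordered({8: [16, 32], 16: [64, 128], 32: [256, 512]})
overlapping = prior_scales_are_ordered({8: [16, 96], 16: [64, 128]})
```

Here `ordered` is true, while `overlapping` is false because the prior box of scale 96 at the smaller feature scale exceeds the prior box of scale 64 at the larger one.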

In some embodiments, the target sub-feature obtaining unit includes: a prediction probability obtaining unit configured to obtain, for each of the plurality of features, a prediction probability corresponding to each of a plurality of sub-features corresponding to the feature, the prediction probability indicating a degree of importance of the sub-feature in the plurality of sub-features; and a first obtaining unit, configured to obtain the plurality of target sub-features based on the prediction probability corresponding to each of the plurality of sub-features corresponding to each of the plurality of features.

In some embodiments, the first obtaining unit includes: a first obtaining sub-unit configured to obtain, for each of the plurality of features, at least one first target sub-feature corresponding to the feature from a plurality of sub-features corresponding to the feature, where a prediction probability of each of the at least one first target sub-feature is greater than a prediction probability of a first sub-feature of the plurality of sub-features, and the first sub-feature is different from each of the at least one first target sub-feature; and a second obtaining sub-unit configured to obtain the plurality of target sub-features based on the at least one first target sub-feature corresponding to each of the plurality of features.

In some embodiments, the prediction frame obtaining unit includes: a first prediction unit configured to, for each of the plurality of features, obtain, based on at least one first target sub-feature corresponding to the feature, a plurality of first prediction frames corresponding to the feature; and a first prediction frame obtaining subunit configured to obtain the plurality of prediction frames based on a plurality of first prediction frames corresponding to each of the plurality of features.

In some embodiments, the first obtaining unit includes: a second prediction probability obtaining unit configured to obtain a prediction probability of each of the plurality of features based on a prediction probability corresponding to each of a plurality of sub-features corresponding to each of the plurality of features, the prediction probability indicating a degree of importance of the feature among the plurality of features; a third obtaining subunit, configured to obtain at least one target feature of the plurality of features, a prediction probability of each of the at least one target feature being greater than a prediction probability of a first feature of the plurality of features, the first feature being different from each of the at least one target feature; and a fourth obtaining sub-unit, configured to obtain the plurality of target sub-features based on a plurality of sub-features corresponding to each of the at least one target feature.

In some embodiments, the second prediction probability obtaining unit includes: a fifth obtaining sub-unit configured to obtain, for each of the plurality of features, at least one second target sub-feature in a plurality of sub-features corresponding to the feature, where a corresponding prediction probability of each of the at least one second target sub-feature is greater than a prediction probability of a second sub-feature in the plurality of sub-features, and the second sub-feature is different from each of the at least one second target sub-feature; and a sixth obtaining sub-unit configured to, for each of the plurality of features, obtain a prediction probability corresponding to the feature based on the prediction probability corresponding to each of the at least one second target sub-feature.

In some embodiments, the prediction frame obtaining unit includes: a second prediction unit configured to, for each target feature of the at least one target feature, obtain, based on a plurality of sub-features corresponding to the target feature, a plurality of second prediction frames corresponding to the target feature; and a second prediction frame obtaining subunit configured to obtain the plurality of prediction frames based on a plurality of second prediction frames corresponding to each of the at least one target feature.

According to an embodiment of the present disclosure, there is also provided an electronic device, a readable storage medium, and a computer program product.

Referring to fig. 9, a block diagram of the structure of an electronic device 900, which may be a server or a client of the present disclosure and which is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital electronic computer devices, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processors, cellular telephones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.

As shown in fig. 9, the electronic device 900 includes a computing unit 901 that can perform various appropriate actions and processes in accordance with a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. The RAM 903 can also store various programs and data necessary for the operation of the electronic device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other via a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.

A number of components in the electronic device 900 are connected to the I/O interface 905, including: an input unit 906, an output unit 907, a storage unit 908, and a communication unit 909. The input unit 906 may be any type of device capable of inputting information to the electronic device 900. The input unit 906 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device, and may include, but is not limited to, a mouse, a keyboard, a touch screen, a track pad, a track ball, a joystick, a microphone, and/or a remote control. The output unit 907 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. The storage unit 908 may include, but is not limited to, a magnetic disk and an optical disk. The communication unit 909 allows the electronic device 900 to exchange information/data with other devices via a computer network, such as the Internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as Bluetooth™ devices, 802.11 devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.

The computing unit 901 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various dedicated Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 901 performs the various methods and processes described above, such as the method 200. For example, in some embodiments, the method 200 may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more of the steps of the method 200 described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the method 200 by any other suitable means (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.

Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.

In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read Only Memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user may provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include Local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.

The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.

It should be understood that the flows shown above may be used in various forms, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be performed in parallel, sequentially, or in a different order, which is not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.

Although embodiments or examples of the present disclosure have been described with reference to the accompanying drawings, it is to be understood that the above-described methods, systems and apparatus are merely exemplary embodiments or examples, and that the scope of the present invention is not limited by these embodiments or examples, but only by the claims as issued and their equivalents. Various elements in the embodiments or examples may be omitted or may be replaced with equivalents thereof. Further, the steps may be performed in an order different from that described in the present disclosure. Further, various elements in the embodiments or examples may be combined in various ways. Importantly, as technology evolves, many of the elements described herein may be replaced with equivalent elements that appear after the present disclosure.

Claims

1. An image detection method, comprising:

obtaining a target image to be detected, wherein the target image comprises a plurality of objects;

performing feature extraction on the target image to obtain features corresponding to each of a plurality of feature scales, wherein each of the plurality of feature scales corresponds to a plurality of prior boxes corresponding to a plurality of scales;

for each feature in a plurality of features corresponding to the plurality of feature scales, obtaining a sub-feature corresponding to each prior frame in a plurality of prior frames corresponding to the feature;

obtaining a plurality of target sub-features from a plurality of sub-features corresponding to the plurality of features; and

obtaining a plurality of prediction boxes of the target image based on the plurality of target sub-features, each of the plurality of prediction boxes indicating one of the plurality of objects.

2. The method of claim 1, wherein for a first feature scale of the plurality of feature scales and a second feature scale larger than the first feature scale, a scale of each prior box of a plurality of prior boxes corresponding to the first feature scale is smaller than a scale of each prior box of a plurality of prior boxes corresponding to the second feature scale.

3. The method of claim 2, wherein said obtaining a plurality of target sub-features from a plurality of sub-features corresponding to the plurality of features comprises:

for each of the plurality of features, obtaining a prediction probability corresponding to each of a plurality of sub-features corresponding to the feature, wherein the prediction probability indicates the importance degree of the sub-feature in the plurality of sub-features; and

obtaining the plurality of target sub-features based on the prediction probability corresponding to each of the plurality of sub-features corresponding to each of the plurality of features.

4. The method of claim 3, wherein the obtaining the plurality of target sub-features based on the prediction probability for each of the plurality of sub-features for each of the plurality of features comprises:

for each of the plurality of features, obtaining at least one first target sub-feature corresponding to the feature from a plurality of sub-features corresponding to the feature, wherein the prediction probability of each of the at least one first target sub-feature is greater than the prediction probability of a first sub-feature of the plurality of sub-features, and the first sub-feature is different from each of the at least one first target sub-feature; and

obtaining the plurality of target sub-features based on the at least one first target sub-feature corresponding to each of the plurality of features.

5. The method of claim 4, wherein the obtaining a plurality of prediction boxes for the target image based on the plurality of target sub-features comprises:

for each feature in the plurality of features, obtaining a plurality of first prediction frames corresponding to the feature based on at least one first target sub-feature corresponding to the feature; and

obtaining the plurality of prediction boxes based on a plurality of first prediction boxes corresponding to each feature of the plurality of features.

6. The method of claim 3, wherein the obtaining the plurality of target sub-features based on the prediction probability for each of the plurality of sub-features for each of the plurality of features comprises:

obtaining a prediction probability for each of the plurality of features based on a prediction probability for each of a plurality of sub-features for each of the plurality of features, the prediction probability indicating a degree of importance of the feature in the plurality of features;

obtaining at least one target feature of the plurality of features, a prediction probability of each of the at least one target feature being greater than a prediction probability of a first feature of the plurality of features, the first feature being distinct from each of the at least one target feature; and

obtaining a plurality of target sub-features based on a plurality of sub-features corresponding to each of the at least one target feature.

7. The method of claim 6, wherein the obtaining a predicted probability for each of the plurality of features based on the predicted probability for each of the plurality of sub-features for each of the plurality of features comprises:

for each of the plurality of features,

obtaining at least one second target sub-feature in a plurality of sub-features corresponding to the feature, wherein the corresponding prediction probability of each target sub-feature in the at least one second target sub-feature is greater than the prediction probability of a second sub-feature in the plurality of sub-features, and the second sub-feature is different from each second target sub-feature in the at least one second target sub-feature; and

obtaining the prediction probability corresponding to the feature based on the prediction probability corresponding to each second target sub-feature in the at least one second target sub-feature.

8. The method of claim 6, wherein the plurality of target sub-features comprises a plurality of sub-features corresponding to each of the at least one target feature, and wherein obtaining the plurality of prediction boxes for the target image based on the plurality of target sub-features comprises:

for each target feature of the at least one target feature, obtaining a plurality of second prediction frames corresponding to the target feature based on a plurality of sub-features corresponding to the target feature; and

obtaining the plurality of prediction boxes based on a plurality of second prediction boxes corresponding to each target feature of the at least one target feature.

9. An image detection apparatus comprising:

a target image acquisition unit configured to obtain a target image to be detected, the target image including a plurality of objects;

a feature extraction unit configured to perform feature extraction on a target image to be detected to obtain features corresponding to each of a plurality of feature scales, wherein each of the plurality of feature scales corresponds to a plurality of prior frames corresponding to a plurality of scales;

a sub-feature obtaining unit, configured to obtain, for each of a plurality of features corresponding to the plurality of feature scales, a sub-feature corresponding to each of a plurality of prior frames corresponding to the feature;

a target sub-feature obtaining unit configured to obtain a plurality of target sub-features from a plurality of sub-features corresponding to the plurality of features; and

a prediction frame obtaining unit configured to obtain a plurality of prediction frames of the target image based on the plurality of target sub-features, each of the plurality of prediction frames indicating one of the plurality of objects.

10. The apparatus of claim 9, wherein, for a first feature scale of the plurality of feature scales and a second feature scale larger than the first feature scale, a scale of each prior box of a plurality of prior boxes to which the first feature scale corresponds is smaller than a scale of each prior box of a plurality of prior boxes to which the second feature scale corresponds.

11. The apparatus of claim 10, wherein the target sub-feature obtaining unit comprises:

a prediction probability obtaining unit configured to obtain, for each of the plurality of features, a prediction probability corresponding to each of a plurality of sub-features corresponding to the feature, the prediction probability indicating a degree of importance of the sub-feature in the plurality of sub-features; and

a first obtaining unit configured to obtain the plurality of target sub-features based on the prediction probability corresponding to each of the plurality of sub-features corresponding to each of the plurality of features.

12. The apparatus of claim 11, wherein the first obtaining unit comprises:

a first obtaining subunit configured to obtain, for each of the plurality of features, at least one first target sub-feature corresponding to the feature from the plurality of sub-features corresponding to the feature, wherein a prediction probability of each of the at least one first target sub-feature is greater than a prediction probability of a first sub-feature of the plurality of sub-features, and the first sub-feature is different from each of the at least one first target sub-feature; and

a second obtaining subunit configured to obtain the plurality of target sub-features based on the at least one first target sub-feature corresponding to each of the plurality of features.

13. The apparatus of claim 12, wherein the prediction box obtaining unit comprises:

a first prediction unit configured to, for each of the plurality of features, obtain a plurality of first prediction boxes corresponding to the feature based on the at least one first target sub-feature corresponding to the feature; and

a first prediction box obtaining subunit configured to obtain the plurality of prediction boxes based on the plurality of first prediction boxes corresponding to each of the plurality of features.

14. The apparatus of claim 11, wherein the first obtaining unit comprises:

a second prediction probability obtaining unit configured to obtain a prediction probability of each of the plurality of features based on the prediction probability corresponding to each of the plurality of sub-features corresponding to each of the plurality of features, the prediction probability of each feature indicating a degree of importance of that feature among the plurality of features;

a third obtaining subunit configured to obtain at least one target feature of the plurality of features, a prediction probability of each of the at least one target feature being greater than a prediction probability of a first feature of the plurality of features, the first feature being distinct from each of the at least one target feature; and

a fourth obtaining subunit configured to obtain the plurality of target sub-features based on the plurality of sub-features corresponding to each of the at least one target feature.

15. The apparatus according to claim 14, wherein the second prediction probability obtaining unit includes:

a fifth obtaining subunit configured to obtain, for each of the plurality of features, at least one second target sub-feature from the plurality of sub-features corresponding to the feature, wherein the prediction probability of each of the at least one second target sub-feature is greater than a prediction probability of a second sub-feature of the plurality of sub-features, and the second sub-feature is different from each of the at least one second target sub-feature; and

a sixth obtaining subunit configured to, for each of the plurality of features, obtain the prediction probability corresponding to the feature based on the prediction probability corresponding to each of the at least one second target sub-feature.

16. The apparatus of claim 15, wherein the prediction box obtaining unit comprises:

a second prediction unit configured to, for each target feature of the at least one target feature, obtain a plurality of second prediction boxes corresponding to the target feature based on the plurality of sub-features corresponding to the target feature; and

a second prediction box obtaining subunit configured to obtain the plurality of prediction boxes based on the plurality of second prediction boxes corresponding to each of the at least one target feature.

17. An electronic device, comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor, the instructions, when executed by the at least one processor, enabling the at least one processor to perform the method of any one of claims 1-8.

18. A non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-8.

19. A computer program product comprising a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1-8.
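
The two selection strategies recited in the apparatus claims can be illustrated with a minimal sketch. It is not the claimed implementation. All names (`Probs`, `top_k_indices`, `select_per_feature`, `select_target_features`) and the parameters `k` and `n_features` are hypothetical, and the use of the mean of the top-k sub-feature probabilities as the feature-level probability is an assumption: the claims only require that the feature-level probability be obtained "based on" those probabilities. The sketch also assumes every feature has at least one sub-feature and that `k >= 1`.

```python
from typing import Dict, List

# Hypothetical input format: feature-scale name -> one importance probability
# per sub-feature (one sub-feature per prior box of that feature).
Probs = Dict[str, List[float]]


def top_k_indices(values: List[float], k: int) -> List[int]:
    """Indices of the k largest values, highest first (ties keep input order)."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    return order[:k]


def select_per_feature(probs: Probs, k: int) -> Dict[str, List[int]]:
    """Claims 11-12 style: in every feature, keep the k sub-features whose
    prediction probability is highest."""
    return {name: top_k_indices(p, k) for name, p in probs.items()}


def select_target_features(probs: Probs, k: int, n_features: int) -> Dict[str, List[int]]:
    """Claims 14-15 style: score each feature from its top-k sub-feature
    probabilities, keep the n_features best-scoring features, and return
    all sub-feature indices of each kept feature."""
    scores: Dict[str, float] = {}
    for name, p in probs.items():
        top = top_k_indices(p, k)
        # Assumption: the feature-level probability is the mean of the top-k
        # sub-feature probabilities; the claims do not fix the aggregation.
        scores[name] = sum(p[i] for i in top) / len(top)
    best = sorted(scores, key=lambda name: scores[name], reverse=True)[:n_features]
    return {name: list(range(len(probs[name]))) for name in best}


if __name__ == "__main__":
    # Made-up probabilities for three feature scales with three prior boxes each.
    example: Probs = {
        "small": [0.9, 0.1, 0.4],
        "medium": [0.2, 0.3, 0.25],
        "large": [0.8, 0.7, 0.1],
    }
    print(select_per_feature(example, k=2))
    print(select_target_features(example, k=2, n_features=2))
```

In the first strategy every feature scale contributes sub-features to the prediction step; in the second, whole feature scales with low aggregated probability are dropped before prediction boxes are computed.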