CN116258915B - Method and device for jointly detecting multiple target parts

Method and device for jointly detecting multiple target parts

Info

Publication number: CN116258915B
Authority: CN (China)
Application number: CN202310538418.7A
Prior art keywords: feature, regression, processing, classification, stage
Other languages: Chinese (zh)
Other versions: CN116258915A
Inventor: 王夏洪
Current Assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Original Assignee: Shenzhen Xumi Yuntu Space Technology Co Ltd
Application filed by Shenzhen Xumi Yuntu Space Technology Co Ltd
Priority to CN202310538418.7A
Publication of CN116258915A; application granted and published as CN116258915B
Legal status: Active (an assumption, not a legal conclusion)

Classifications

    • G06V10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using classification, e.g. of video objects
    • G06V10/766 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using regression, e.g. by projecting features on hyperplanes
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V2201/07 Indexing scheme relating to image or video recognition or understanding: target detection
    • G06N3/08 Computing arrangements based on biological models: neural networks; learning methods
    • Y02P90/30 Climate change mitigation technologies in the production or processing of goods: computing systems specially adapted for manufacturing

Abstract

The disclosure relates to the technical field of target detection, and provides a method and a device for jointly detecting multiple target parts. The method comprises the following steps: extracting a first part stage feature and a second part stage feature of a target object in a target picture using a residual network; determining a first part feature based on the first part stage feature, and a second part feature based on the second part stage feature; performing second part regression processing and second part classification processing based on the second part feature to obtain a second part regression feature and a second part classification feature, from which a second part detection result is determined; and performing first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature, performing first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature, and determining a first part detection result from them.

Description

Method and device for jointly detecting multiple target parts
Technical Field
The disclosure relates to the technical field of target detection, and in particular to a method and a device for jointly detecting multiple target parts.
Background
In existing target detection technology, multi-target detectors such as YOLOv3 and RetinaNet are widely used. These networks detect multiple targets that are treated as independent, with no association between them. In some detection tasks, however, the targets are associated; for example, two detected parts may belong to the same target and are therefore associated through it. Existing target detection technology ignores both the inconsistent sizes of the multiple parts of the same target and the strong matching relationship among those parts, which leads to low accuracy when jointly detecting multiple parts of the same target.
In the process of implementing the disclosed concept, the inventor found that the related art has at least the following technical problem: the accuracy of jointly detecting multiple parts of the same target is low.
Disclosure of Invention
In view of the above, embodiments of the present disclosure provide a method, an apparatus, an electronic device, and a computer-readable storage medium for jointly detecting multiple target parts, so as to solve the prior-art problem of low accuracy in jointly detecting multiple parts of the same target.
In a first aspect of the embodiments of the present disclosure, a method for jointly detecting multiple target parts is provided, including: inputting a target picture to be detected into a residual network, and outputting a first part stage feature and a second part stage feature of a target object in the target picture; constructing a first part feature pyramid and a second part feature pyramid, processing the first part stage feature with the first part feature pyramid to obtain a first part feature, and processing the second part stage feature with the second part feature pyramid to obtain a second part feature; performing second part regression processing and second part classification processing based on the second part feature to obtain a second part regression feature and a second part classification feature, and determining a second part detection result for the target object based on the second part regression feature and the second part classification feature; and performing first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature, performing first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature, and determining a first part detection result for the target object based on the first part regression feature and the first part classification feature.
In a second aspect of the embodiments of the present disclosure, an apparatus for jointly detecting multiple target parts is provided, including: a residual module configured to input a target picture to be detected into a residual network and output a first part stage feature and a second part stage feature of a target object in the target picture; a pyramid module configured to construct a first part feature pyramid and a second part feature pyramid, process the first part stage feature with the first part feature pyramid to obtain a first part feature, and process the second part stage feature with the second part feature pyramid to obtain a second part feature; a second part detection module configured to perform second part regression processing and second part classification processing based on the second part feature to obtain a second part regression feature and a second part classification feature, and determine a second part detection result for the target object based on the second part regression feature and the second part classification feature; and a first part detection module configured to perform first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature, perform first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature, and determine a first part detection result for the target object based on the first part regression feature and the first part classification feature.
In a third aspect of the disclosed embodiments, an electronic device is provided, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the above method when executing the computer program.
In a fourth aspect of the disclosed embodiments, a computer-readable storage medium is provided, which stores a computer program which, when executed by a processor, implements the steps of the above-described method.
Compared with the prior art, the embodiments of the present disclosure have the following beneficial effects: a target picture to be detected is input into a residual network, which outputs a first part stage feature and a second part stage feature of a target object in the target picture; a first part feature pyramid and a second part feature pyramid are constructed, the first part stage feature is processed with the first part feature pyramid to obtain a first part feature, and the second part stage feature is processed with the second part feature pyramid to obtain a second part feature; second part regression processing and second part classification processing are performed based on the second part feature to obtain a second part regression feature and a second part classification feature, from which a second part detection result for the target object is determined; and first part regression processing is performed based on the first part feature and the second part regression feature to obtain a first part regression feature, first part classification processing is performed based on the first part feature and the second part classification feature to obtain a first part classification feature, and a first part detection result for the target object is determined from them. Because the second part regression and classification features assist the first part processing, the accuracy of jointly detecting multiple parts of the same target is improved.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings required by the embodiments or the description of the prior art are briefly introduced below. The drawings in the following description are only some embodiments of the present disclosure; other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure;
Fig. 2 is a flowchart of a method for jointly detecting multiple target parts according to an embodiment of the present disclosure;
Fig. 3 is a schematic structural diagram of an apparatus for jointly detecting multiple target parts according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth, such as particular system configurations, techniques, etc. in order to provide a thorough understanding of the disclosed embodiments. However, it will be apparent to one skilled in the art that the present disclosure may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present disclosure with unnecessary detail.
A method and apparatus for jointly detecting multiple target parts according to embodiments of the present disclosure will be described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario of an embodiment of the present disclosure. The application scenario may include terminal devices 101, 102, and 103, a server 104, and a network 105.
The terminal devices 101, 102, and 103 may be hardware or software. When they are hardware, they may be various electronic devices having a display screen and supporting communication with the server 104, including but not limited to smartphones, tablets, and laptop and desktop computers; when they are software, they may be installed in the electronic devices described above. The terminal devices 101, 102, and 103 may be implemented as multiple pieces of software or software modules, or as a single piece of software or software module, which the embodiments of the present disclosure do not limit. Further, various applications may be installed on the terminal devices 101, 102, and 103, such as data processing applications, instant messaging tools, social platform software, search applications, and shopping applications.
The server 104 may be a server providing various services, for example, a background server that receives requests transmitted from terminal devices with which a communication connection has been established; the background server may receive and analyze such a request and generate a processing result. The server 104 may be one server, a server cluster formed by several servers, or a cloud computing service center, which the embodiments of the present disclosure do not limit.
The server 104 may be hardware or software. When the server 104 is hardware, it may be any of various electronic devices providing services to the terminal devices 101, 102, and 103; when the server 104 is software, it may be multiple pieces of software or software modules, or a single piece of software or software module, providing services to the terminal devices 101, 102, and 103, which the embodiments of the present disclosure do not limit.
The network 105 may be a wired network connected by coaxial cable, twisted pair, or optical fiber, or a wireless network that interconnects communication devices without wiring, for example Bluetooth, Near Field Communication (NFC), or infrared, which the embodiments of the present disclosure do not limit.
The user can establish a communication connection with the server 104 via the network 105 through the terminal devices 101, 102, and 103 to receive or transmit information or the like. It should be noted that the specific types, numbers and combinations of the terminal devices 101, 102 and 103, the server 104 and the network 105 may be adjusted according to the actual requirements of the application scenario, which is not limited by the embodiment of the present disclosure.
Fig. 2 is a flowchart of a method for jointly detecting multiple target parts according to an embodiment of the disclosure. The method of fig. 2 may be performed by the computer or server of fig. 1, or by software on the computer or server. As shown in fig. 2, the method for jointly detecting multiple target parts includes:
s201, inputting a target picture to be detected into a residual error network, and outputting a first part stage characteristic and a second part stage characteristic of a target object in the target picture;
s202, constructing a first part feature pyramid and a second part feature pyramid, processing the first part stage features by using the first part feature pyramid to obtain first part features, and processing the second part stage features by using the second part feature pyramid to obtain second part features;
s203, performing second part regression processing and second part classification processing based on the second part features to obtain second part regression features and second part classification features, and determining a second part detection result about the target object based on the second part regression features and the second part classification features;
s204, performing first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature, performing first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature, and determining a first part detection result related to the target object based on the first part regression feature and the first part classification feature.
The second part regression processing may be regarded as a second part regression head, the second part classification processing as a second part classification head, the first part regression processing as a first part regression head, and the first part classification processing as a first part classification head. The algorithm can therefore be regarded as a joint first-part and second-part detection model whose internal networks are the residual network, the first part feature pyramid, the second part feature pyramid, the second part regression head, the second part classification head, the first part regression head, and the first part classification head; the first part feature pyramid and the second part feature pyramid are in a parallel relationship, as are the second part regression head and the second part classification head, and the first part regression head and the first part classification head.
The second part regression feature represents a detection frame of the second part of the target object in the target picture. The second part classification feature indicates whether a given position belongs to the second part of the target object in the target picture, and indicates the category of each portion of that second part; for example, if the second part is the body of the target object, the portions of the second part may be: left shoulder, right shoulder, left hand, right hand, left knee, right knee, left foot, and right foot. The first part regression feature represents a detection frame of the first part of the target object in the target picture. The first part classification feature indicates whether a given position belongs to the first part of the target object in the target picture; for example, the first part may be the face of the target object.
The "feature" in this disclosure may be understood as a "feature map".
According to the technical solution provided by the embodiments of the present disclosure, a target picture to be detected is input into a residual network, which outputs a first part stage feature and a second part stage feature of a target object in the target picture; a first part feature pyramid and a second part feature pyramid are constructed, the first part stage feature is processed with the first part feature pyramid to obtain a first part feature, and the second part stage feature is processed with the second part feature pyramid to obtain a second part feature; second part regression processing and second part classification processing are performed based on the second part feature to obtain a second part regression feature and a second part classification feature, from which a second part detection result for the target object is determined; and first part regression processing is performed based on the first part feature and the second part regression feature to obtain a first part regression feature, first part classification processing is performed based on the first part feature and the second part classification feature to obtain a first part classification feature, and a first part detection result for the target object is determined from them. The second part features thus guide the first part detection, improving the accuracy of jointly detecting multiple parts of the same target.
Inputting a target picture to be detected into a residual network and outputting a first part stage feature and a second part stage feature of a target object in the target picture comprises: inputting the target picture to be detected into the residual network, outputting a first stage feature, a second stage feature, and a third stage feature of the target object through the second stage network, the third stage network, and the fourth stage network of the residual network, respectively, and outputting a fourth stage feature and a fifth stage feature of the target object through the third stage network and the fourth stage network of the residual network, respectively; wherein the first part stage feature comprises the first stage feature, the second stage feature, and the third stage feature, and the second part stage feature comprises the fourth stage feature and the fifth stage feature.
Residual networks, such as ResNet50, include a zeroth Stage network Stage0, a first Stage network Stage1, a second Stage network Stage2, a third Stage network Stage3, and a fourth Stage network Stage4.
It should be noted that the features output by Stage0 and Stage1 are not used. In the embodiment of the disclosure, the second stage network outputs the first stage feature, the third stage network outputs the second stage feature and the fourth stage feature, and the fourth stage network outputs the third stage feature and the fifth stage feature.
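As an illustrative sketch, the stage features may be taken from a standard torchvision ResNet50, under the assumption that torchvision's layer1 to layer4 correspond to Stage1 to Stage4 and the stem corresponds to Stage0; all names here are illustrative, not from the patent.

    import torch
    from torchvision.models import resnet50

    backbone = resnet50(weights=None)  # the residual network

    def stage_features(x):
        # Stage0: the stem (conv1/bn1/relu/maxpool); its output is not used below.
        x = backbone.maxpool(backbone.relu(backbone.bn1(backbone.conv1(x))))
        x = backbone.layer1(x)    # Stage1 (output likewise unused, per the text)
        c2 = backbone.layer2(x)   # Stage2 -> first stage feature
        c3 = backbone.layer3(c2)  # Stage3 -> second stage feature and fourth stage feature
        c4 = backbone.layer4(c3)  # Stage4 -> third stage feature and fifth stage feature
        first_part_stages = (c2, c3, c4)  # first, second, third stage features
        second_part_stages = (c3, c4)     # fourth and fifth stage features
        return first_part_stages, second_part_stages

    first_stages, second_stages = stage_features(torch.randn(1, 3, 224, 224))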
Processing the first part stage feature with the first part feature pyramid to obtain the first part feature includes: performing first convolution processing on the first stage feature, the second stage feature, and the third stage feature to obtain a first feature, a second feature, and a third feature; performing second convolution processing on the third feature to obtain a fourth feature; sequentially performing the second convolution processing and linear interpolation up-sampling by a preset multiple on the fourth feature, and adding the result to the second feature to obtain a fifth feature; and sequentially performing the second convolution processing and average pooling by the preset multiple on the fifth feature to obtain a sixth feature; wherein the first part stage feature includes the first stage feature, the second stage feature, and the third stage feature, and the first part feature includes the fourth feature, the fifth feature, and the sixth feature.
Constructing the first part feature pyramid means constructing the algorithm by which the first part feature pyramid processes the first part stage feature.
To illustrate the algorithms in this disclosure more clearly and specifically, examples are as follows: the first convolution processing may be an ordinary convolution with a 1x1 convolution kernel; the second convolution processing may be an ordinary convolution with a 3x3 convolution kernel; the up-sampling by a preset multiple may be 2x up-sampling; "adding the result to the second feature" may be a corresponding matrix addition; the third convolution processing may be a deformable convolution with a 2x3 convolution kernel; the fourth convolution processing may be an ordinary convolution with a 3x3 convolution kernel and 18 channels; the fifth convolution processing may be a hole (dilated) convolution with a 2x3 convolution kernel; the sixth convolution processing may be an ordinary convolution with a 3x3 convolution kernel and 1 channel; and the seventh convolution processing may be an ordinary convolution with a 3x3 convolution kernel and 4 channels. The linear interpolation processing and the average pooling processing are common operations and are not described in detail.
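Combining the steps above, a hedged PyTorch sketch of the first part feature pyramid follows. Three assumptions are made: the translated phrase "average pooling of up-sampling of the preset multiple" is read as an average pooling by the preset multiple; the first lateral feature, which the published steps do not reference further, is simply left unused; and channel widths are illustrative.

    import torch.nn as nn
    import torch.nn.functional as F

    class FirstPartFeaturePyramid(nn.Module):
        def __init__(self, in_channels=(512, 1024, 2048), out_channels=256, scale=2):
            super().__init__()
            self.scale = scale  # the "preset multiple"
            # First convolution processing: 1x1 lateral convolutions.
            self.lateral = nn.ModuleList(
                [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
            # Second convolution processing: 3x3 convolutions.
            self.conv_top = nn.Conv2d(out_channels, out_channels, 3, padding=1)
            self.conv_up = nn.Conv2d(out_channels, out_channels, 3, padding=1)
            self.conv_extra = nn.Conv2d(out_channels, out_channels, 3, padding=1)

        def forward(self, stage_feats):
            # Lateral features from the three stage features; the published
            # steps do not reference l1 further, so it is left unused here.
            l1, l2, l3 = [conv(f) for conv, f in zip(self.lateral, stage_feats)]
            p4 = self.conv_top(l3)                              # fourth feature
            up = F.interpolate(self.conv_up(p4), scale_factor=self.scale,
                               mode="bilinear", align_corners=False)
            p5 = up + l2                                        # fifth feature
            # "Average pooling by the preset multiple" read as a stride-2 pool.
            p6 = F.avg_pool2d(self.conv_extra(p5), self.scale)  # sixth feature
            return p4, p5, p6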
Processing the second part stage feature with the second part feature pyramid to obtain the second part feature includes: performing first convolution processing on the fourth stage feature and the fifth stage feature to obtain a seventh feature and an eighth feature; performing second convolution processing on the eighth feature to obtain a ninth feature; sequentially performing the second convolution processing and linear interpolation up-sampling by the preset multiple on the ninth feature, and adding the result to the seventh feature to obtain a tenth feature; and sequentially performing the second convolution processing and average pooling by the preset multiple on the ninth feature to obtain an eleventh feature; wherein the second part stage feature includes the fourth stage feature and the fifth stage feature, and the second part feature includes the ninth feature, the tenth feature, and the eleventh feature.
Constructing the second part feature pyramid means constructing the algorithm by which the second part feature pyramid processes the second part stage feature.
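The second part feature pyramid mirrors the sketch above but takes only two stage features; note that, per the text, the eleventh feature is derived from the ninth feature rather than the tenth. A sketch under the same assumptions (default channel widths match ResNet50's Stage3 and Stage4 outputs):

    import torch.nn as nn
    import torch.nn.functional as F

    class SecondPartFeaturePyramid(nn.Module):
        def __init__(self, in_channels=(1024, 2048), out_channels=256, scale=2):
            super().__init__()
            self.scale = scale
            self.lateral = nn.ModuleList(
                [nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels])
            self.conv_top = nn.Conv2d(out_channels, out_channels, 3, padding=1)
            self.conv_up = nn.Conv2d(out_channels, out_channels, 3, padding=1)
            self.conv_extra = nn.Conv2d(out_channels, out_channels, 3, padding=1)

        def forward(self, stage_feats):
            l7, l8 = [conv(f) for conv, f in zip(self.lateral, stage_feats)]
            p9 = self.conv_top(l8)                               # ninth feature
            up = F.interpolate(self.conv_up(p9), scale_factor=self.scale,
                               mode="bilinear", align_corners=False)
            p10 = up + l7                                        # tenth feature
            # Per the text, the eleventh feature comes from the ninth feature.
            p11 = F.avg_pool2d(self.conv_extra(p9), self.scale)  # eleventh feature
            return p9, p10, p11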
Performing second part regression processing and second part classification processing based on the second part feature to obtain a second part regression feature and a second part classification feature includes: sequentially performing third convolution processing and fourth convolution processing on the second part feature to obtain the second part regression feature, wherein the loss function used in the second part regression processing is the smoothL1 loss function; and sequentially performing fifth convolution processing, sixth convolution processing, and activation function processing on the second part feature to obtain the second part classification feature, wherein the loss function used in the second part classification processing is the cross entropy loss function.
The second part regression processing may be regarded as a second part regression head that uses the smoothL1 loss function, and the second part classification processing as a second part classification head that uses the cross entropy loss function.
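A hedged sketch of the two second part heads follows, assuming PyTorch and torchvision. The translation's "2x3" kernels are read here as 3x3 kernels, the hole (dilated) convolution is given a dilation rate of 2, and hidden channel widths are illustrative; only the 18-channel fourth convolution and the 1-channel sixth convolution are fixed by the text.

    import torch
    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class SecondPartRegHead(nn.Module):
        """Third convolution (deformable) then fourth convolution (18 channels)."""
        def __init__(self, channels=256):
            super().__init__()
            self.offset = nn.Conv2d(channels, 18, 3, padding=1)  # 2*3*3 offsets
            self.deform = DeformConv2d(channels, channels, 3, padding=1)
            self.out = nn.Conv2d(channels, 18, 3, padding=1)     # fourth convolution

        def forward(self, x):
            return self.out(self.deform(x, self.offset(x)))      # second part regression feature

    class SecondPartClsHead(nn.Module):
        """Fifth convolution (dilated), sixth convolution (1 channel), sigmoid."""
        def __init__(self, channels=256):
            super().__init__()
            self.dilated = nn.Conv2d(channels, channels, 3, padding=2, dilation=2)
            self.out = nn.Conv2d(channels, 1, 3, padding=1)      # sixth convolution

        def forward(self, x):
            return torch.sigmoid(self.out(self.dilated(x)))      # second part classification feature

    # Losses named in the text: smooth L1 for regression and cross entropy for
    # classification; with a 1-channel sigmoid output, the binary form is assumed.
    regression_loss = nn.SmoothL1Loss()
    classification_loss = nn.BCELoss()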
Each of the three second part features (the ninth, tenth, and eleventh features) yields a corresponding second part regression feature and second part classification feature. A second part detection frame for the target object is determined from the plurality of second part regression features; whether a given position in the target picture belongs to the second part of the target object, and the category of each portion of the second part, are determined from the plurality of second part classification features; the second part detection result of the target object is thereby determined.
Performing first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature includes: performing third convolution processing on the first part feature to obtain a preliminary regression feature; adding the preliminary regression feature to the second part regression feature to obtain a fusion regression feature; and performing seventh convolution processing on the fusion regression feature to obtain the first part regression feature, wherein the loss function used in the first part regression processing is the smoothL1 loss function.
The first part regression processing may be regarded as a first part regression head that uses the smoothL1 loss function. Adding the preliminary regression feature to the second part regression feature may be a corresponding matrix addition (each feature, or feature map, is in fact a matrix).
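A sketch of the first part regression head under the same assumptions. Because the corresponding matrix addition requires the two summands to share a shape, the third (deformable) convolution is assumed here to output the same 18 channels as the second part regression feature; only the 4-channel seventh convolution is fixed by the text.

    import torch.nn as nn
    from torchvision.ops import DeformConv2d

    class FirstPartRegHead(nn.Module):
        def __init__(self, channels=256, fuse_channels=18):
            super().__init__()
            self.offset = nn.Conv2d(channels, 18, 3, padding=1)
            self.deform = DeformConv2d(channels, fuse_channels, 3, padding=1)  # third convolution
            self.out = nn.Conv2d(fuse_channels, 4, 3, padding=1)               # seventh convolution

        def forward(self, first_feat, second_reg_feat):
            prelim = self.deform(first_feat, self.offset(first_feat))  # preliminary regression feature
            fused = prelim + second_reg_feat                           # corresponding matrix addition
            return self.out(fused)                                     # first part regression feature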
According to the embodiments of the present disclosure, the second part regression feature assists the first part feature in the first part regression processing, which improves the accuracy of jointly detecting multiple target parts.
Performing first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature includes: performing fifth convolution processing on the first part feature to obtain a preliminary classification feature; adding the preliminary classification feature to the second part classification feature to obtain a fusion classification feature; and sequentially performing sixth convolution processing and activation function processing on the fusion classification feature to obtain the first part classification feature, wherein the loss function used in the first part classification processing is the cross entropy loss function.
The first part classification processing may be regarded as a first part classification head that uses the cross entropy loss function. The activation function processing may use the sigmoid function. Adding the preliminary classification feature to the second part classification feature may be a corresponding matrix addition.
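A matching sketch of the first part classification head. The second part classification feature has one channel, so the fifth (dilated) convolution is assumed to output one channel to permit the addition; the sigmoid serves as the activation function processing.

    import torch
    import torch.nn as nn

    class FirstPartClsHead(nn.Module):
        def __init__(self, channels=256, fuse_channels=1):
            super().__init__()
            self.dilated = nn.Conv2d(channels, fuse_channels, 3,
                                     padding=2, dilation=2)       # fifth convolution
            self.out = nn.Conv2d(fuse_channels, 1, 3, padding=1)  # sixth convolution

        def forward(self, first_feat, second_cls_feat):
            prelim = self.dilated(first_feat)       # preliminary classification feature
            fused = prelim + second_cls_feat        # corresponding matrix addition
            return torch.sigmoid(self.out(fused))   # first part classification feature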
According to the embodiments of the present disclosure, the second part classification feature assists the first part feature in the first part classification processing, which improves the accuracy of jointly detecting multiple target parts.
Each of the three first part features (the fourth, fifth, and sixth features) likewise yields a corresponding first part regression feature and first part classification feature. A first part detection frame for the target object is determined from the plurality of first part regression features; whether a given position in the target picture belongs to the first part of the target object is determined from the plurality of first part classification features; the first part detection result of the target object is thereby determined.
Any combination of the above optional solutions may be adopted to form optional embodiments of the present disclosure, and details are not repeated here.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Fig. 3 is a schematic diagram of an apparatus for jointly detecting multiple target parts according to an embodiment of the disclosure. As shown in fig. 3, the apparatus for jointly detecting multiple target parts includes:
a residual module 301 configured to input a target picture to be detected into a residual network, and output a first part stage feature and a second part stage feature about a target object in the target picture;
a pyramid module 302 configured to construct a first part feature pyramid and a second part feature pyramid, process the first part stage feature with the first part feature pyramid to obtain a first part feature, and process the second part stage feature with the second part feature pyramid to obtain a second part feature;
a second part detection module 303 configured to perform second part regression processing and second part classification processing based on the second part feature to obtain a second part regression feature and a second part classification feature, and determine a second part detection result for the target object based on the second part regression feature and the second part classification feature; and
a first part detection module 304 configured to perform first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature, perform first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature, and determine a first part detection result for the target object based on the first part regression feature and the first part classification feature.
The second part regression processing may be regarded as a second part regression head, the second part classification processing as a second part classification head, the first part regression processing as a first part regression head, and the first part classification processing as a first part classification head. The algorithm can therefore be regarded as a joint first-part and second-part detection model whose internal networks are the residual network, the first part feature pyramid, the second part feature pyramid, the second part regression head, the second part classification head, the first part regression head, and the first part classification head; the first part feature pyramid and the second part feature pyramid are in a parallel relationship, as are the second part regression head and the second part classification head, and the first part regression head and the first part classification head.
The second part regression feature represents a detection frame of the second part of the target object in the target picture. The second part classification feature indicates whether a given position belongs to the second part of the target object in the target picture, and indicates the category of each portion of that second part; for example, if the second part is the body of the target object, the portions of the second part may be: left shoulder, right shoulder, left hand, right hand, left knee, right knee, left foot, and right foot. The first part regression feature represents a detection frame of the first part of the target object in the target picture. The first part classification feature indicates whether a given position belongs to the first part of the target object in the target picture; for example, the first part may be the face of the target object.
The "feature" in this disclosure may be understood as a "feature map".
According to the technical solution provided by the embodiments of the present disclosure, a target picture to be detected is input into a residual network, which outputs a first part stage feature and a second part stage feature of a target object in the target picture; a first part feature pyramid and a second part feature pyramid are constructed, the first part stage feature is processed with the first part feature pyramid to obtain a first part feature, and the second part stage feature is processed with the second part feature pyramid to obtain a second part feature; second part regression processing and second part classification processing are performed based on the second part feature to obtain a second part regression feature and a second part classification feature, from which a second part detection result for the target object is determined; and first part regression processing is performed based on the first part feature and the second part regression feature to obtain a first part regression feature, first part classification processing is performed based on the first part feature and the second part classification feature to obtain a first part classification feature, and a first part detection result for the target object is determined from them.
Optionally, the residual module 301 is further configured to input the target picture to be detected into the residual network, output a first stage feature, a second stage feature, and a third stage feature of the target object through the second stage network, the third stage network, and the fourth stage network of the residual network, respectively, and output a fourth stage feature and a fifth stage feature of the target object through the third stage network and the fourth stage network of the residual network, respectively; wherein the first part stage feature comprises the first stage feature, the second stage feature, and the third stage feature, and the second part stage feature comprises the fourth stage feature and the fifth stage feature.
Residual networks, such as ResNet50, include a zeroth Stage network Stage0, a first Stage network Stage1, a second Stage network Stage2, a third Stage network Stage3, and a fourth Stage network Stage4.
It should be noted that the features output by Stage0 and Stage1 are not used. In the embodiment of the disclosure, the second stage network outputs the first stage feature, the third stage network outputs the second stage feature and the fourth stage feature, and the fourth stage network outputs the third stage feature and the fifth stage feature.
Optionally, the pyramid module 302 is further configured to perform first convolution processing on the first stage feature, the second stage feature, and the third stage feature to obtain a first feature, a second feature, and a third feature; perform second convolution processing on the third feature to obtain a fourth feature; sequentially perform the second convolution processing and linear interpolation up-sampling by a preset multiple on the fourth feature, and add the result to the second feature to obtain a fifth feature; and sequentially perform the second convolution processing and average pooling by the preset multiple on the fifth feature to obtain a sixth feature; wherein the first part stage feature includes the first stage feature, the second stage feature, and the third stage feature, and the first part feature includes the fourth feature, the fifth feature, and the sixth feature.
Constructing the first part feature pyramid means constructing the algorithm by which the first part feature pyramid processes the first part stage feature.
To illustrate the algorithms in this disclosure more clearly and specifically, examples are as follows: the first convolution processing may be an ordinary convolution with a 1x1 convolution kernel; the second convolution processing may be an ordinary convolution with a 3x3 convolution kernel; the up-sampling by a preset multiple may be 2x up-sampling; "adding the result to the second feature" may be a corresponding matrix addition; the third convolution processing may be a deformable convolution with a 2x3 convolution kernel; the fourth convolution processing may be an ordinary convolution with a 3x3 convolution kernel and 18 channels; the fifth convolution processing may be a hole (dilated) convolution with a 2x3 convolution kernel; the sixth convolution processing may be an ordinary convolution with a 3x3 convolution kernel and 1 channel; and the seventh convolution processing may be an ordinary convolution with a 3x3 convolution kernel and 4 channels. The linear interpolation processing and the average pooling processing are common operations and are not described in detail.
Optionally, the pyramid module 302 is further configured to perform first convolution processing on the fourth stage feature and the fifth stage feature to obtain a seventh feature and an eighth feature; perform second convolution processing on the eighth feature to obtain a ninth feature; sequentially perform the second convolution processing and linear interpolation up-sampling by the preset multiple on the ninth feature, and add the result to the seventh feature to obtain a tenth feature; and sequentially perform the second convolution processing and average pooling by the preset multiple on the ninth feature to obtain an eleventh feature; wherein the second part stage feature includes the fourth stage feature and the fifth stage feature, and the second part feature includes the ninth feature, the tenth feature, and the eleventh feature.
Constructing the second part feature pyramid means constructing the algorithm by which the second part feature pyramid processes the second part stage feature.
Optionally, the second part detection module 303 is further configured to sequentially perform third convolution processing and fourth convolution processing on the second part feature to obtain the second part regression feature, wherein the loss function used in the second part regression processing is the smoothL1 loss function; and sequentially perform fifth convolution processing, sixth convolution processing, and activation function processing on the second part feature to obtain the second part classification feature, wherein the loss function used in the second part classification processing is the cross entropy loss function.
The second part regression processing may be regarded as a second part regression head that uses the smoothL1 loss function, and the second part classification processing as a second part classification head that uses the cross entropy loss function.
Each of the three second part features (the ninth, tenth, and eleventh features) yields a corresponding second part regression feature and second part classification feature. A second part detection frame for the target object is determined from the plurality of second part regression features; whether a given position in the target picture belongs to the second part of the target object, and the category of each portion of the second part, are determined from the plurality of second part classification features; the second part detection result of the target object is thereby determined.
Optionally, the first part detection module 304 is further configured to perform third convolution processing on the first part feature to obtain a preliminary regression feature; add the preliminary regression feature to the second part regression feature to obtain a fusion regression feature; and perform seventh convolution processing on the fusion regression feature to obtain the first part regression feature, wherein the loss function used in the first part regression processing is the smoothL1 loss function.
The first part regression processing may be regarded as a first part regression head that uses the smoothL1 loss function. Adding the preliminary regression feature to the second part regression feature may be a corresponding matrix addition (each feature, or feature map, is in fact a matrix).
According to the embodiments of the present disclosure, the second part regression feature assists the first part feature in the first part regression processing, which improves the accuracy of jointly detecting multiple target parts.
Optionally, the first part detection module 304 is further configured to perform fifth convolution processing on the first part feature to obtain a preliminary classification feature; add the preliminary classification feature to the second part classification feature to obtain a fusion classification feature; and sequentially perform sixth convolution processing and activation function processing on the fusion classification feature to obtain the first part classification feature, wherein the loss function used in the first part classification processing is the cross entropy loss function.
The first part classification processing may be regarded as a first part classification head that uses the cross entropy loss function. The activation function processing may use the sigmoid function. Adding the preliminary classification feature to the second part classification feature may be a corresponding matrix addition.
According to the embodiments of the present disclosure, the second part classification feature assists the first part feature in the first part classification processing, which improves the accuracy of jointly detecting multiple target parts.
Each of the three first part features (the fourth, fifth, and sixth features) likewise yields a corresponding first part regression feature and first part classification feature. A first part detection frame for the target object is determined from the plurality of first part regression features; whether a given position in the target picture belongs to the first part of the target object is determined from the plurality of first part classification features; the first part detection result of the target object is thereby determined.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present disclosure.
Fig. 4 is a schematic diagram of an electronic device 4 provided by an embodiment of the present disclosure. As shown in fig. 4, the electronic apparatus 4 of this embodiment includes: a processor 401, a memory 402 and a computer program 403 stored in the memory 402 and executable on the processor 401. The steps of the various method embodiments described above are implemented by processor 401 when executing computer program 403. Alternatively, the processor 401, when executing the computer program 403, performs the functions of the modules/units in the above-described apparatus embodiments.
The electronic device 4 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The electronic device 4 may include, but is not limited to, the processor 401 and the memory 402. It will be appreciated by those skilled in the art that fig. 4 is merely an example of the electronic device 4 and does not limit it; the electronic device 4 may include more or fewer components than shown, or different components.
The processor 401 may be a central processing unit (Central Processing Unit, CPU) or other general purpose processor, digital signal processor (Digital Signal Processor, DSP), application specific integrated circuit (Application Specific Integrated Circuit, ASIC), field programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components, or the like.
The memory 402 may be an internal storage unit of the electronic device 4, for example, a hard disk or a memory of the electronic device 4. The memory 402 may also be an external storage device of the electronic device 4, for example, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash Card (Flash Card) or the like, which are provided on the electronic device 4. Memory 402 may also include both internal storage units and external storage devices of electronic device 4. The memory 402 is used to store computer programs and other programs and data required by the electronic device.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit.
The integrated modules/units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present disclosure may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, which may be stored in a computer readable storage medium; when executed by a processor, the computer program may implement the steps of the method embodiments described above. The computer program may comprise computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer readable medium may include: any entity or device capable of carrying computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content of the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a given jurisdiction; for example, in some jurisdictions, computer readable media do not include electrical carrier signals and telecommunication signals.
The above embodiments are merely for illustrating the technical solution of the present disclosure, and are not limiting thereof; although the present disclosure has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the disclosure, and are intended to be included in the scope of the present disclosure.

Claims (7)

1. A method for jointly detecting multiple target parts, comprising:
inputting a target picture to be detected into a residual network, and outputting a first part stage feature and a second part stage feature of a target object in the target picture;
constructing a first part feature pyramid and a second part feature pyramid, processing the first part stage feature with the first part feature pyramid to obtain a first part feature, and processing the second part stage feature with the second part feature pyramid to obtain a second part feature;
performing second part regression processing and second part classification processing based on the second part feature to obtain a second part regression feature and a second part classification feature, and determining a second part detection result for the target object based on the second part regression feature and the second part classification feature;
performing first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature, performing first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature, and determining a first part detection result for the target object based on the first part regression feature and the first part classification feature;
wherein performing the second part regression processing and the second part classification processing based on the second part feature to obtain the second part regression feature and the second part classification feature comprises: sequentially performing third convolution processing and fourth convolution processing on the second part feature to obtain the second part regression feature, wherein a loss function used in the second part regression processing is a smoothL1 loss function; and sequentially performing fifth convolution processing, sixth convolution processing, and activation function processing on the second part feature to obtain the second part classification feature, wherein a loss function used in the second part classification processing is a cross entropy loss function;
wherein performing the first part regression processing based on the first part feature and the second part regression feature to obtain the first part regression feature comprises: performing the third convolution processing on the first part feature to obtain a preliminary regression feature; adding the preliminary regression feature to the second part regression feature to obtain a fusion regression feature; and performing seventh convolution processing on the fusion regression feature to obtain the first part regression feature, wherein a loss function used in the first part regression processing is the smoothL1 loss function;
and wherein performing the first part classification processing based on the first part feature and the second part classification feature to obtain the first part classification feature comprises: performing the fifth convolution processing on the first part feature to obtain a preliminary classification feature; adding the preliminary classification feature to the second part classification feature to obtain a fusion classification feature; and sequentially performing the sixth convolution processing and the activation function processing on the fusion classification feature to obtain the first part classification feature, wherein a loss function used in the first part classification processing is the cross entropy loss function.
2. The method of claim 1, wherein inputting the target picture to be detected into the residual network and outputting the first part stage feature and the second part stage feature relating to the target object in the target picture comprises:
inputting the target picture to be detected into the residual network, outputting a first stage feature, a second stage feature and a third stage feature relating to the target object through a second-stage network, a third-stage network and a fourth-stage network of the residual network respectively, and outputting a fourth stage feature and a fifth stage feature relating to the target object through the third-stage network and the fourth-stage network of the residual network respectively;
wherein the first part stage feature comprises the first stage feature, the second stage feature and the third stage feature, and the second part stage feature comprises the fourth stage feature and the fifth stage feature.
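As a concrete, non-authoritative illustration of claim 2, the snippet below taps intermediate stages of a torchvision ResNet and splits them into the two groups of stage features. The choice of ResNet-50, the layer1–layer3 mapping onto the patent's second- to fourth-stage networks, and the input size are assumptions.

```python
import torch
from torchvision.models import resnet50
from torchvision.models.feature_extraction import create_feature_extractor

# Hypothetical mapping: torchvision's layer1/layer2/layer3 stand in for the
# patent's second-, third- and fourth-stage networks (the correspondence and
# the ResNet-50 depth are assumptions, not given by the claim).
backbone = create_feature_extractor(
    resnet50(weights=None),
    return_nodes={"layer1": "stage2", "layer2": "stage3", "layer3": "stage4"},
)

picture = torch.randn(1, 3, 640, 640)  # stand-in for the target picture
outs = backbone(picture)

# Claim 2's split: three stage features feed the first-part branch, and the
# two deeper ones are reused as the second-part stage features.
first_part_stage_feats = [outs["stage2"], outs["stage3"], outs["stage4"]]
second_part_stage_feats = [outs["stage3"], outs["stage4"]]
```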
3. The method of claim 1, wherein processing the first part stage feature using the first part feature pyramid to obtain the first part feature comprises:
performing first convolution processing on the first stage feature, the second stage feature and the third stage feature to obtain a first feature, a second feature and a third feature respectively;
performing second convolution processing on the third feature to obtain a fourth feature;
sequentially performing the second convolution processing and linear-interpolation up-sampling by a preset multiple on the fourth feature, and adding the result to the second feature to obtain a fifth feature;
sequentially performing the second convolution processing and average-pooling down-sampling by the preset multiple on the fifth feature to obtain a sixth feature;
wherein the first part stage feature comprises the first stage feature, the second stage feature and the third stage feature, and the first part feature comprises the fourth feature, the fifth feature and the sixth feature.
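One plausible PyTorch reading of this pyramid is sketched below. The 1×1 lateral convolutions for the "first convolution processing", the shared 3×3 convolution for the "second convolution processing", bilinear interpolation as the 2-D form of linear interpolation, a preset multiple of 2, and the channel counts are all assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FirstPartPyramid(nn.Module):
    """Hypothetical reading of the claim-3 pyramid (channel sizes assumed)."""

    def __init__(self, in_channels=(256, 512, 1024), out_channels=256, scale=2):
        super().__init__()
        # "First convolution processing": 1x1 lateral convolutions (assumed).
        self.lateral1 = nn.Conv2d(in_channels[0], out_channels, 1)
        self.lateral2 = nn.Conv2d(in_channels[1], out_channels, 1)
        self.lateral3 = nn.Conv2d(in_channels[2], out_channels, 1)
        # "Second convolution processing": a shared 3x3 convolution (assumed).
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.scale = scale  # the "preset multiple"

    def forward(self, stage1, stage2, stage3):
        # First convolution -> first, second and third features. (The first
        # feature is produced but, as in the claim text, not used again.)
        f1, f2, f3 = self.lateral1(stage1), self.lateral2(stage2), self.lateral3(stage3)
        f4 = self.smooth(f3)  # second convolution -> fourth feature
        # Second convolution, then up-sampling, then addition with the
        # second feature -> fifth feature.
        up = F.interpolate(self.smooth(f4), scale_factor=self.scale, mode="bilinear")
        f5 = up + f2
        # Second convolution, then average pooling (down-sampling) -> sixth feature.
        f6 = F.avg_pool2d(self.smooth(f5), kernel_size=self.scale)
        return f4, f5, f6

# Shapes matching ResNet-50's layer1/layer2/layer3 outputs for a 640x640 input:
f4, f5, f6 = FirstPartPyramid()(
    torch.randn(1, 256, 160, 160),
    torch.randn(1, 512, 80, 80),
    torch.randn(1, 1024, 40, 40),
)
```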
4. The method of claim 1, wherein processing the second part stage feature using the second part feature pyramid to obtain the second part feature comprises:
performing first convolution processing on the fourth stage feature and the fifth stage feature to obtain a seventh feature and an eighth feature respectively;
performing second convolution processing on the eighth feature to obtain a ninth feature;
sequentially performing the second convolution processing and linear-interpolation up-sampling by a preset multiple on the ninth feature, and adding the result to the seventh feature to obtain a tenth feature;
sequentially performing the second convolution processing and average-pooling down-sampling by the preset multiple on the ninth feature to obtain an eleventh feature;
wherein the second part stage feature comprises the fourth stage feature and the fifth stage feature, and the second part feature comprises the ninth feature, the tenth feature and the eleventh feature.
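The two-level analogue for the second-part branch can be sketched the same way; note the one asymmetry versus claim 3: here both the tenth and eleventh features derive from the ninth, whereas claim 3 pools the fifth. As before, kernel sizes, channel widths and the multiple of 2 are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SecondPartPyramid(nn.Module):
    """Hypothetical two-level pyramid for the second-part branch (claim 4)."""

    def __init__(self, in_channels=(512, 1024), out_channels=256, scale=2):
        super().__init__()
        self.lateral4 = nn.Conv2d(in_channels[0], out_channels, 1)  # -> seventh feature
        self.lateral5 = nn.Conv2d(in_channels[1], out_channels, 1)  # -> eighth feature
        self.smooth = nn.Conv2d(out_channels, out_channels, 3, padding=1)
        self.scale = scale

    def forward(self, stage4, stage5):
        f7, f8 = self.lateral4(stage4), self.lateral5(stage5)
        f9 = self.smooth(f8)  # ninth feature
        # Both deeper outputs branch off the ninth feature.
        up = F.interpolate(self.smooth(f9), scale_factor=self.scale, mode="bilinear")
        f10 = up + f7                                                # tenth feature
        f11 = F.avg_pool2d(self.smooth(f9), kernel_size=self.scale)  # eleventh feature
        return f9, f10, f11

f9, f10, f11 = SecondPartPyramid()(
    torch.randn(1, 512, 80, 80),
    torch.randn(1, 1024, 40, 40),
)
```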
5. An apparatus for jointly detecting multiple target parts, comprising:
a residual module configured to input a target picture to be detected into a residual network and output a first part stage feature and a second part stage feature relating to a target object in the target picture;
a pyramid module configured to construct a first part feature pyramid and a second part feature pyramid, process the first part stage feature using the first part feature pyramid to obtain a first part feature, and process the second part stage feature using the second part feature pyramid to obtain a second part feature;
a second part detection module configured to perform second part regression processing and second part classification processing based on the second part feature to obtain a second part regression feature and a second part classification feature, and determine a second part detection result regarding the target object based on the second part regression feature and the second part classification feature;
and a first part detection module configured to perform first part regression processing based on the first part feature and the second part regression feature to obtain a first part regression feature, perform first part classification processing based on the first part feature and the second part classification feature to obtain a first part classification feature, and determine a first part detection result regarding the target object based on the first part regression feature and the first part classification feature;
wherein the second part detection module is further configured to sequentially perform third convolution processing and fourth convolution processing on the second part feature to obtain the second part regression feature, the loss function used in the second part regression processing being a Smooth L1 loss function, and to sequentially perform fifth convolution processing, sixth convolution processing and activation function processing on the second part feature to obtain the second part classification feature, the loss function used in the second part classification processing being a cross-entropy loss function;
the first part detection module is further configured to perform the third convolution processing on the first part feature to obtain a preliminary regression feature, add the preliminary regression feature and the second part regression feature to obtain a fused regression feature, and perform seventh convolution processing on the fused regression feature to obtain the first part regression feature, the loss function used in the first part regression processing being a Smooth L1 loss function;
and the first part detection module is further configured to perform the fifth convolution processing on the first part feature to obtain a preliminary classification feature, add the preliminary classification feature and the second part classification feature to obtain a fused classification feature, and sequentially perform the sixth convolution processing and the activation function processing on the fused classification feature to obtain the first part classification feature, the loss function used in the first part classification processing being a cross-entropy loss function.
6. An electronic device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 4.
7. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 4.
CN202310538418.7A 2023-05-15 2023-05-15 Method and device for jointly detecting multiple target parts Active CN116258915B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310538418.7A CN116258915B (en) 2023-05-15 2023-05-15 Method and device for jointly detecting multiple target parts

Publications (2)

Publication Number Publication Date
CN116258915A (en) 2023-06-13
CN116258915B (en) 2023-08-29

Family

ID=86684653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310538418.7A Active CN116258915B (en) 2023-05-15 2023-05-15 Method and device for jointly detecting multiple target parts

Country Status (1)

Country Link
CN (1) CN116258915B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160379A * 2018-11-07 2020-05-15 Beijing Didi Infinity Technology and Development Co., Ltd. Training method and device of image detection model and target detection method and device
CN111553280A * 2020-04-28 2020-08-18 Shanghai Radio Equipment Research Institute Target part identification method based on deep learning
CN112581430A * 2020-12-03 2021-03-30 Xiamen University Deep learning-based aeroengine nondestructive testing method, device, equipment and storage medium
CN113887602A * 2021-09-27 2022-01-04 Xiamen Huiliweiye Technology Co., Ltd. Object detection and classification method and computer-readable storage medium
CN113887282A * 2021-08-30 2022-01-04 Institute of Information Engineering, Chinese Academy of Sciences Detection system and method for any-shape adjacent text in scene image
CN115761220A * 2022-12-19 2023-03-07 North China University of Technology Target detection method for enhancing detection of occluded target based on deep learning

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114118124B * 2021-09-29 2023-09-12 Beijing Baidu Netcom Science and Technology Co., Ltd. Image detection method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xie Jinheng et al. Real-time multi-face keypoint localization algorithm based on deep residual and feature pyramid networks. Journal of Computer Applications, 2019, Vol. 39, No. 12: 3659-3664. *

Similar Documents

Publication Publication Date Title
CN109144647B (en) Form design method and device, terminal equipment and storage medium
CN109255337B (en) Face key point detection method and device
CN111800513B (en) Method and device for pushing information and computer readable medium of electronic equipment
CN116403250A (en) Face recognition method and device with shielding
CN109118456B (en) Image processing method and device
US20210200971A1 (en) Image processing method and apparatus
CN115953803A (en) Training method and device for human body recognition model
EP3851961A1 (en) Method and apparatus for generating information
CN116258915B (en) Method and device for jointly detecting multiple target parts
CN116030520A (en) Face recognition method and device with shielding
CN115048430B (en) Data verification method, system, device and storage medium
CN116129501A (en) Face pose estimation method and device
CN116385328A (en) Image data enhancement method and device based on noise addition to image
CN110019531B (en) Method and device for acquiring similar object set
CN111626802A (en) Method and apparatus for processing information
CN112085733B (en) Image processing method, image processing device, electronic equipment and computer readable medium
CN113656286A (en) Software testing method and device, electronic equipment and readable storage medium
CN113160942A (en) Image data quality evaluation method and device, terminal equipment and readable storage medium
CN112788551A (en) Message processing method and device, terminal equipment and storage medium
CN116912518B (en) Image multi-scale feature processing method and device
CN115862117A (en) Face recognition method and device with occlusion
CN113819989B (en) Article packaging method, apparatus, electronic device and computer readable medium
CN115937929A (en) Training method and device of face recognition model for difficult sample
CN114012739B (en) Method and device for controlling robot based on holographic projection technology
CN114862281B (en) Method and device for generating task state diagram corresponding to accessory system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant