Disclosure of Invention
Embodiments of the present invention provide a target detection method and apparatus, a non-volatile storage medium, and an electronic device, which at least solve the technical problem in the prior art that conventional target detection models have low detection accuracy.
According to an aspect of an embodiment of the present invention, there is provided a target detection method, including: obtaining a plurality of feature layers of a target detection model, wherein the plurality of feature layers comprise a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer; performing fusion processing on the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer, wherein the fusion processing is used for enhancing the semantic information of the first feature layer; and updating the target detection model based on the target detection layer.
Optionally, performing fusion processing on the first feature layer and the second feature layer in the plurality of feature layers to obtain the target detection layer includes: enlarging a first scale of the first feature layer to a second scale by interpolation, wherein the second scale is equal to the scale of the second feature layer adjacent to the first feature layer; adding the processed first feature layer and the second feature layer to obtain a new first feature layer; enlarging a third scale of the new first feature layer to a fourth scale by interpolation, wherein the fourth scale is equal to the scale of a new second feature layer adjacent to the new first feature layer; and adding the processed new first feature layer and the new second feature layer, and so on until the fusion processing of all the first feature layers and second feature layers in the plurality of feature layers is finished, thereby obtaining the target detection layer.
Optionally, the method further includes: acquiring a real receptive field of the target detection model; adjusting an anchor frame of the target detection model based on the real receptive field, wherein the size of the anchor frame affects the anchor frame classification and the anchor frame regression of the target detection model; determining the number of samples of the target detection model according to the anchor frame, wherein the number of samples comprises: the number of positive samples and the number of negative samples.
Optionally, determining the number of samples of the target detection model according to the anchor frame includes: determining a loss function of the target detection model according to the anchor frame, wherein the loss function includes: a classification loss function and a regression loss function; the number of samples is determined based on the loss function.
Optionally, the loss function L of the target detection model is determined according to the anchor frame by using the following calculation formula:
$$L = \frac{1}{N_c}\sum_i L_c(P_i, P_i^*) + \lambda \frac{1}{N_r}\sum_i P_i^* L_r(t_i, t_i^*)$$
where i denotes the index of the anchor frame; $P_i$ denotes the predicted probability that the anchor frame is a target; $P_i^*$ denotes the true value indicating whether the anchor frame is a target, $P_i^*$ being 1 when the anchor frame is a target and 0 otherwise; $t_i$ denotes the predicted detection frame coordinate correction value, and $t_i^*$ denotes the actual coordinate value of the detection frame; $L_c$ denotes the classification loss and $L_r$ denotes the regression loss; $P_i^* L_r$ indicates that the regression loss is calculated only for anchor frames that are positive samples; $N_c$ and $N_r$ respectively denote the number of positive and negative anchor frames used in classification and the number of positive anchor frames used in regression; and $\lambda$ denotes a balance parameter for balancing the classification loss and the regression loss.
Optionally, the classification loss function is a Softmax loss function, and the regression loss function is a Smooth-L1 loss function.
According to another aspect of the embodiments of the present invention, there is also provided a target detection apparatus, including: an obtaining module, configured to obtain a plurality of feature layers of a target detection model, wherein the plurality of feature layers comprise a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer; a fusion processing module, configured to perform fusion processing on the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer, wherein the fusion processing is used for enhancing the semantic information of the first feature layer; and an updating module, configured to update the target detection model based on the target detection layer.
According to another aspect of the embodiments of the present invention, there is also provided a non-volatile storage medium storing a plurality of instructions, the instructions being adapted to be loaded by a processor and to perform any one of the above object detection methods.
According to another aspect of the embodiments of the present invention, there is also provided a processor, configured to execute a program, where the program is configured to execute any one of the above object detection methods when running.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform any one of the above object detection methods.
In the embodiments of the present invention, a plurality of feature layers of a target detection model are obtained, wherein the plurality of feature layers comprise a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer; fusion processing is performed on the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer, wherein the fusion processing is used for enhancing the semantic information of the first feature layer; and the target detection model is updated based on the target detection layer.
The first feature layer and the second feature layer of the target detection model are recursively fused, which enhances the semantic features of the first feature layer. This improves the detection accuracy of the target detection model and thereby solves the technical problem in the prior art that conventional target detection models have low detection accuracy.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a target detection method. It should be noted that the steps illustrated in the flowcharts of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is illustrated in the flowcharts, in some cases the steps illustrated or described may be performed in a different order.
Fig. 1 is a flowchart of an object detection method according to an embodiment of the present invention, as shown in fig. 1, the method includes the following steps:
step S102, obtaining a plurality of feature layers of a target detection model, wherein the plurality of feature layers comprise a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer;
step S104, performing fusion processing on the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer, wherein the fusion processing is used for enhancing the semantic information of the first feature layer;
step S106, updating the target detection model based on the target detection layer.
In the embodiments of the present invention, a plurality of feature layers of a target detection model are obtained, wherein the plurality of feature layers comprise a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer; fusion processing is performed on the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer, wherein the fusion processing is used for enhancing the semantic information of the first feature layer; and the target detection model is updated based on the target detection layer.
The shallow feature layer of the target detection model is used as a detection layer and is recursively fused with the deep feature layer, which enhances the semantic features of the shallow feature layer. This improves the detection accuracy of the target detection model and thereby solves the technical problem in the prior art that conventional target detection models have low detection accuracy.
Optionally, the target detection model may be any type of target detection model, for example, a Faster R-CNN target detection model, an SSD target detection model, a YOLO target detection model, and the like.
Taking an SSD target detection model as an example: SSD is a single-stage, efficient, general-purpose object detection framework that can detect targets in real time while maintaining accuracy. Its performance on small targets, however, is not ideal. Analysis shows two causes. First, the feature layers that the SSD target detection model uses for detection are too deep and the downsampling factor is too large, so the feature information of small targets is lost. Second, the SSD anchor frames are laid out at intervals that are too large, so small targets easily fall outside every anchor frame and are missed.
In the embodiments of the present application, to address the low recall of small-target detection, a shallow feature layer is used as a detection layer and is fused with a deep feature layer to enhance its semantic feature information. To address the extreme imbalance between positive and negative samples in small-target detection, a negative-sample screening strategy is added to balance the numbers of positive and negative samples. Together, these measures reduce the interference of negative samples with the detection results.
Optionally, the plurality of feature layers include a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer.
It should be noted that "shallow" and "deep" are relative terms. When two adjacent feature layers are compared, the lower layer is the shallow feature layer and the upper layer is the deep feature layer; however, relative to the next layer above it, that same upper layer is itself a shallow feature layer.
In an optional embodiment, fig. 2 is a flowchart of an optional target detection method according to an embodiment of the present invention, and as shown in fig. 2, the fusing the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer includes:
step S202, enlarging a first scale of the first feature layer to a second scale by interpolation, wherein the second scale is equal to the scale of the second feature layer adjacent to the first feature layer;
step S204, adding the processed first feature layer and the second feature layer to obtain a new first feature layer;
step S206, enlarging a third scale of the new first feature layer to a fourth scale by interpolation, wherein the fourth scale is equal to the scale of a new second feature layer adjacent to the new first feature layer;
step S208, adding the processed new first feature layer and the new second feature layer, and so on until the fusion processing of all the first feature layers and second feature layers in the plurality of feature layers is completed, thereby obtaining the target detection layer.
To solve the problem in the prior art, in the embodiments of the present application a detection layer may be added at a shallow feature layer. However, the semantic information of a shallow feature layer is not rich enough, which makes it difficult to identify a target. To compensate, the semantic information of the shallow feature layer may be enhanced by fusing it with deep feature layers through a feature pyramid. The specific fusion procedure is shown in fig. 3. Based on an input image, the feature layers Conv3_3, Conv4_3, Conv5_3, Conv6_2, and Conv7_2 of the SSD target detection model are obtained as the feature pyramid layers. Starting from Conv7_2, the feature layer is enlarged by interpolation to the same size as the previous layer (Conv6_2), and the two are added to obtain a new feature layer P1. P1 is then enlarged by the same interpolation operation and added to Conv5_3 to obtain P2. The operation is repeated to obtain P3 and P4, and P4 is used as the target detection layer.
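The fusion procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation of the embodiment: it assumes single-channel feature maps stored as nested lists and nearest-neighbour interpolation (the text does not specify the interpolation mode), and the helper names `upsample_nearest`, `add_maps`, `fuse_top_down`, and `constant_map` are hypothetical. The toy map sizes and values merely stand in for Conv3_3 through Conv7_2.

```python
from typing import List

FeatureMap = List[List[float]]  # single-channel H x W map (simplifying assumption)


def upsample_nearest(fmap: FeatureMap, out_h: int, out_w: int) -> FeatureMap:
    """Enlarge a feature map to (out_h, out_w) by nearest-neighbour interpolation."""
    in_h, in_w = len(fmap), len(fmap[0])
    return [
        [fmap[r * in_h // out_h][c * in_w // out_w] for c in range(out_w)]
        for r in range(out_h)
    ]


def add_maps(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Element-wise sum of two maps of identical size."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def fuse_top_down(pyramid: List[FeatureMap]) -> List[FeatureMap]:
    """Fuse a pyramid ordered shallow -> deep, starting from the deepest layer.

    Returns the fused layers in the order they are produced (P1, P2, ...);
    the last one has the scale of the shallowest layer.
    """
    fused = []
    current = pyramid[-1]  # deepest layer, e.g. Conv7_2
    for shallower in reversed(pyramid[:-1]):
        enlarged = upsample_nearest(current, len(shallower), len(shallower[0]))
        current = add_maps(enlarged, shallower)
        fused.append(current)
    return fused


def constant_map(size: int, value: float) -> FeatureMap:
    """Toy square feature map filled with a single value."""
    return [[value] * size for _ in range(size)]


# Toy stand-ins for Conv3_3, Conv4_3, Conv5_3, Conv6_2, Conv7_2 (sizes are illustrative).
toy_pyramid = [
    constant_map(size, value)
    for size, value in [(16, 1.0), (8, 2.0), (4, 3.0), (2, 4.0), (1, 5.0)]
]
p1, p2, p3, p4 = fuse_top_down(toy_pyramid)  # p4 plays the role of the target detection layer
```

In a real network the layers also differ in channel count, so a channel-matching step would typically be needed before the addition; that step is outside what the text describes and is omitted here.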
In an optional embodiment, the method further includes:
step S302, acquiring a real receptive field of the target detection model;
step S304, adjusting an anchor frame of the target detection model based on the real receptive field, wherein the size of the anchor frame influences the anchor frame classification and the anchor frame regression of the target detection model;
step S306, determining the number of samples of the target detection model according to the anchor frame, where the number of samples includes: the number of positive samples and the number of negative samples.
In the embodiments of the present application, the actual receptive field is taken to be twenty to forty percent of the theoretical receptive field, and the anchor frame size of the target detection model may accordingly be set to, but is not limited to, 16, so that the anchor frames can be classified and regressed more accurately.
It should be noted that, in the embodiments of the present application, although the anchor frames laid out by the target detection model cover small targets more comprehensively, they also produce a large number of negative samples. The number of negative samples is then far higher than the number of positive samples, the two are extremely unbalanced, and the network training result is inevitably biased toward the negative samples. For example, if only one of 100 anchor frames is a positive sample (target) and the other 99 are negative samples (background), a model that simply outputs background for every anchor frame reaches an accuracy of 99% without detecting any target.
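As an illustration of how an anchor size of 16 can follow from the twenty-to-forty-percent rule, the sketch below computes a theoretical receptive field with the standard layer-by-layer recurrence. The layer list is an assumption: it models a VGG-16-style stack of 3x3 convolutions (stride 1) and 2x2 poolings (stride 2) up to Conv3_3, which the text does not state explicitly. The function name `theoretical_receptive_field` is hypothetical.

```python
from typing import List, Tuple


def theoretical_receptive_field(layers: List[Tuple[int, int]]) -> int:
    """Receptive field of a stack of (kernel_size, stride) layers, input to output."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent output units
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


# Assumed VGG-16-style layers up to Conv3_3 (not taken from the text).
conv, pool = (3, 1), (2, 2)
layers_to_conv3_3 = [conv, conv, pool, conv, conv, pool, conv, conv, conv]

trf = theoretical_receptive_field(layers_to_conv3_3)
anchor_low, anchor_high = 0.2 * trf, 0.4 * trf  # the 20%-40% range from the text
```

Under these assumptions the theoretical receptive field is 40 pixels, so the twenty-to-forty-percent range spans 8 to 16 pixels, and 16 sits at its upper end.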
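The 100-anchor example can be checked with a few lines of arithmetic. The sketch below is only an illustration of why accuracy is misleading under such imbalance; the variable names are hypothetical.

```python
# 1 positive anchor frame (target) and 99 negative anchor frames (background).
labels = [1] + [0] * 99
# Trivial detector that labels every anchor frame as background.
predictions = [0] * 100

correct = sum(p == y for p, y in zip(predictions, labels))
accuracy = correct / len(labels)

true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)
```

Here `accuracy` is 0.99 while `recall` is 0.0: the single target is never found, which is why the numbers of positive and negative samples need to be balanced during training.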
In an alternative embodiment, determining the number of samples of the target detection model according to the anchor frame includes:
step S402, determining a loss function of the target detection model according to the anchor frame, wherein the loss function includes: a classification loss function and a regression loss function;
in step S404, the number of samples is determined based on the loss function.
To address the imbalance between positive and negative samples, the loss function of the target detection model may be determined for each anchor frame and the resulting loss values sorted; the positive samples and the negative samples are then selected according to the sorted losses so that the positive samples and the negative samples are in a proportion of 3:1.
In an alternative embodiment, the loss function includes two parts: a classification loss function for whether the anchor frame is a human face, and a regression loss function for the coordinate correction values of the detection frame when the anchor frame is a human face.
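A loss-based sample selection of this kind can be sketched as follows. The text states only that the losses are sorted and that samples are chosen in a 3:1 proportion; this sketch assumes the convention common in SSD-style training, in which all positives are kept and the highest-loss negatives are kept up to a fixed multiple of the positives. The function name `select_samples` and the parameter `neg_per_pos` are hypothetical, and the multiple is exposed as a parameter so that a different proportion can be used.

```python
from typing import List, Tuple


def select_samples(
    losses: List[float], is_positive: List[bool], neg_per_pos: int = 3
) -> Tuple[List[int], List[int]]:
    """Keep all positive anchors and the hardest (highest-loss) negative anchors.

    Returns (positive_indices, kept_negative_indices).
    """
    positives = [i for i, pos in enumerate(is_positive) if pos]
    negatives = [i for i, pos in enumerate(is_positive) if not pos]
    # Sort negatives by loss, largest first, so the hardest negatives are kept.
    negatives.sort(key=lambda i: losses[i], reverse=True)
    keep = neg_per_pos * len(positives)
    return positives, negatives[:keep]


# Toy example: anchor 0 is the only positive, the other seven are background.
toy_losses = [0.9, 0.1, 0.8, 0.05, 0.7, 0.3, 0.6, 0.2]
toy_is_positive = [True] + [False] * 7
pos_idx, neg_idx = select_samples(toy_losses, toy_is_positive)
```

In the toy example, the three negatives with the largest losses (indices 2, 4, and 6) are kept alongside the single positive, and the four easy negatives are discarded.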
In an alternative embodiment, the classification loss function is a Softmax loss function, and the regression loss function is a Smooth-L1 loss function.
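For reference, the two named losses are commonly written as below. The symbols are assumptions introduced here for illustration: $z_k$ denotes the classifier score for class $k$, $y$ the true class, and $x$ the difference between one predicted and one actual coordinate value; the text itself does not spell these definitions out.

```latex
% Softmax (cross-entropy) loss for class scores z_k and true class y
L_{\mathrm{softmax}} = -\log \frac{e^{z_y}}{\sum_k e^{z_k}}

% Smooth-L1 loss for a coordinate difference x
\mathrm{smooth}_{L_1}(x) =
\begin{cases}
0.5\,x^2, & |x| < 1 \\
|x| - 0.5, & \text{otherwise}
\end{cases}
```

The Smooth-L1 loss is quadratic near zero and linear for large differences, which makes the regression less sensitive to outlying coordinate errors than a purely quadratic loss.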
In an alternative embodiment, the loss function L of the target detection model is determined according to the anchor frame by using the following calculation formula:
$$L = \frac{1}{N_c}\sum_i L_c(P_i, P_i^*) + \lambda \frac{1}{N_r}\sum_i P_i^* L_r(t_i, t_i^*)$$
where i denotes the index of the anchor frame; $P_i$ denotes the predicted probability that the anchor frame is a target; $P_i^*$ denotes the true value indicating whether the anchor frame is a target, $P_i^*$ being 1 when the anchor frame is a target and 0 otherwise; $t_i$ denotes the predicted detection frame coordinate correction value, and $t_i^*$ denotes the actual coordinate value of the detection frame; $L_c$ denotes the classification loss and $L_r$ denotes the regression loss; $P_i^* L_r$ indicates that the regression loss is calculated only for anchor frames that are positive samples; $N_c$ and $N_r$ respectively denote the number of positive and negative anchor frames used in classification and the number of positive anchor frames used in regression; and $\lambda$ denotes a balance parameter for balancing the classification loss and the regression loss.
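The loss described above, a classification term averaged over the anchor frames used in classification plus a lambda-weighted regression term computed only for positive anchor frames, can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: the classification loss is the two-class form of the Softmax loss applied to the predicted target probability, the regression loss is Smooth-L1 summed over the coordinates, and the function names `log_loss`, `smooth_l1`, and `detection_loss` are hypothetical.

```python
import math
from typing import List, Sequence


def log_loss(p: float, label: int) -> float:
    """Two-class softmax (cross-entropy) loss given the predicted target probability."""
    eps = 1e-12  # guards against log(0)
    return -math.log(max(p if label == 1 else 1.0 - p, eps))


def smooth_l1(pred: Sequence[float], target: Sequence[float]) -> float:
    """Smooth-L1 loss summed over the coordinates of one detection frame."""
    total = 0.0
    for a, b in zip(pred, target):
        d = abs(a - b)
        total += 0.5 * d * d if d < 1.0 else d - 0.5
    return total


def detection_loss(
    probs: List[float],           # P_i: predicted probability that anchor i is a target
    labels: List[int],            # P_i*: 1 for a positive anchor, 0 otherwise
    t_pred: List[List[float]],    # t_i: predicted coordinate correction values
    t_true: List[List[float]],    # t_i*: actual coordinate values
    lam: float = 1.0,             # lambda: balance parameter
) -> float:
    n_c = len(labels)                 # anchors used in classification
    n_r = max(sum(labels), 1)         # positive anchors used in regression
    cls = sum(log_loss(p, y) for p, y in zip(probs, labels)) / n_c
    # Multiplying by the label restricts the regression loss to positive anchors.
    reg = sum(y * smooth_l1(tp, tt) for y, tp, tt in zip(labels, t_pred, t_true)) / n_r
    return cls + lam * reg


# Toy example: one positive and one negative anchor frame.
loss = detection_loss(
    probs=[0.5, 0.5],
    labels=[1, 0],
    t_pred=[[0.5, 0.0, 0.0, 0.0], [9.0, 9.0, 9.0, 9.0]],
    t_true=[[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]],
)
```

In the toy example the large coordinate error of the negative anchor frame contributes nothing, because its label is 0; only the positive anchor frame's error of 0.5 enters the regression term.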
To address the poor performance of the SSD target detection model on small targets, the shallow feature layer is used as a detection layer and fused with the deep feature layer, the anchor frame size is adjusted according to the real receptive field, and the numbers of positive and negative samples are balanced, thereby improving the detection accuracy for small targets.
Example 2
According to an embodiment of the present invention, there is further provided an apparatus embodiment for implementing the object detection method, and fig. 4 is a schematic structural diagram of an object detection apparatus according to an embodiment of the present invention, as shown in fig. 4, the object detection apparatus includes: an obtaining module 400, a fusion processing module 402, and an updating module 404, wherein:
an obtaining module 400, configured to obtain a plurality of feature layers of a target detection model, wherein the plurality of feature layers comprise a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer; a fusion processing module 402, configured to perform fusion processing on the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer, wherein the fusion processing is used for enhancing the semantic information of the first feature layer; and an updating module 404, configured to update the target detection model based on the target detection layer.
It should be noted that the above modules may be implemented by software or by hardware. In the latter case, for example, the modules may all be located in the same processor, or they may be distributed across different processors in any combination.
It should be noted here that the above-mentioned obtaining module 400, the fusion processing module 402 and the updating module 404 correspond to steps S102 to S106 in embodiment 1, and the above-mentioned modules are the same as the examples and application scenarios realized by the corresponding steps, but are not limited to what is disclosed in embodiment 1. It should be noted that the modules described above may be implemented in a computer terminal as part of an apparatus.
It should be noted that, reference may be made to the relevant description in embodiment 1 for alternative or preferred embodiments of this embodiment, and details are not described here again.
The above object detection device may further include a processor and a memory, where the above obtaining module 400, the fusion processing module 402, the updating module 404, and the like are all stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
The processor includes one or more cores, and a core calls the corresponding program unit from the memory. The memory may include, among computer-readable media, volatile memory such as random access memory (RAM) and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM), and the memory includes at least one memory chip.
According to an embodiment of the present application, there is also provided an embodiment of a non-volatile storage medium. Optionally, in this embodiment, the non-volatile storage medium includes a stored program, and when the program runs, the apparatus in which the non-volatile storage medium is located is controlled to execute any one of the above target detection methods.
Optionally, in this embodiment, the nonvolatile storage medium may be located in any one of a group of computer terminals in a computer network, or in any one of a group of mobile terminals, and the nonvolatile storage medium includes a stored program.
Optionally, when the program runs, the apparatus in which the non-volatile storage medium is located is controlled to perform the following functions: obtaining a plurality of feature layers of a target detection model, wherein the plurality of feature layers comprise a first feature layer and a second feature layer, and the semantic information of the first feature layer is less than that of the second feature layer; performing fusion processing on the first feature layer and the second feature layer in the plurality of feature layers to obtain a target detection layer, wherein the fusion processing is used for enhancing the semantic information of the first feature layer; and updating the target detection model based on the target detection layer.
Optionally, when the program runs, the apparatus in which the non-volatile storage medium is located is controlled to perform the following functions: enlarging a first scale of the first feature layer to a second scale by interpolation, wherein the second scale is equal to the scale of the second feature layer adjacent to the first feature layer; adding the processed first feature layer and the second feature layer to obtain a new first feature layer; enlarging a third scale of the new first feature layer to a fourth scale by interpolation, wherein the fourth scale is equal to the scale of a new second feature layer adjacent to the new first feature layer; and adding the processed new first feature layer and the new second feature layer, and so on until the fusion processing of all the first feature layers and second feature layers in the plurality of feature layers is finished, thereby obtaining the target detection layer.
Optionally, when the program runs, the apparatus in which the non-volatile storage medium is located is controlled to perform the following functions: acquiring a real receptive field of the target detection model; adjusting an anchor frame of the target detection model based on the real receptive field, wherein the size of the anchor frame affects the anchor frame classification and the anchor frame regression of the target detection model; and determining the number of samples of the target detection model according to the anchor frame, wherein the number of samples comprises the number of positive samples and the number of negative samples.
Optionally, when the program runs, the apparatus in which the non-volatile storage medium is located is controlled to perform the following functions: determining a loss function of the target detection model according to the anchor frame, wherein the loss function includes a classification loss function and a regression loss function; and determining the number of samples based on the loss function.
Optionally, when the program runs, the apparatus in which the non-volatile storage medium is located is controlled to perform the following function: determining the loss function L of the target detection model according to the anchor frame by using the following calculation formula:
$$L = \frac{1}{N_c}\sum_i L_c(P_i, P_i^*) + \lambda \frac{1}{N_r}\sum_i P_i^* L_r(t_i, t_i^*)$$
where i denotes the index of the anchor frame; $P_i$ denotes the predicted probability that the anchor frame is a target; $P_i^*$ denotes the true value indicating whether the anchor frame is a target, $P_i^*$ being 1 when the anchor frame is a target and 0 otherwise; $t_i$ denotes the predicted detection frame coordinate correction value, and $t_i^*$ denotes the actual coordinate value of the detection frame; $L_c$ denotes the classification loss and $L_r$ denotes the regression loss; $P_i^* L_r$ indicates that the regression loss is calculated only for anchor frames that are positive samples; $N_c$ and $N_r$ respectively denote the number of positive and negative anchor frames used in classification and the number of positive anchor frames used in regression; and $\lambda$ denotes a balance parameter for balancing the classification loss and the regression loss.
According to an embodiment of the present application, there is also provided an embodiment of a processor. Optionally, in this embodiment, the processor is configured to run a program, wherein the program, when running, executes any one of the above target detection methods.
According to an embodiment of the present application, there is also provided an embodiment of an electronic device, including a memory and a processor, where the memory stores a computer program, and the processor is configured to execute the computer program to perform any one of the above object detection methods.
According to an embodiment of the present application, there is further provided an embodiment of a computer program product which, when executed on a data processing device, is adapted to execute a program initialized with the steps of any one of the above target detection methods.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the above-described division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable non-volatile storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a non-volatile storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the methods according to the embodiments of the present invention. The aforementioned non-volatile storage medium includes: a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.