CN116109556A - Target detection method, device and storage medium based on Sv2-v3 model


Info

Publication number
CN116109556A
CN116109556A (application number CN202211434355.2A)
Authority
CN
China
Prior art keywords
feature map
model
detection
target
fusion
Prior art date
Legal status
Pending
Application number
CN202211434355.2A
Other languages
Chinese (zh)
Inventor
李献领
陶模
郑伟
吕书玉
何鹏元
柴文婷
邹海
冯毅
熊卿
邱志强
赵振兴
吴君
Current Assignee
719th Research Institute of CSIC
Original Assignee
719th Research Institute of CSIC
Priority date
Filing date
Publication date
Application filed by 719th Research Institute of CSIC
Priority to CN202211434355.2A
Publication of CN116109556A
Legal status: Pending

Classifications

    • G06T 7/0002 Image analysis; inspection of images, e.g. flaw detection
    • G06T 7/60 Analysis of geometric attributes
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G06V 10/764 Recognition using pattern recognition or machine learning; classification, e.g. of video objects
    • G06V 10/774 Generating sets of training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V 10/806 Fusion of extracted features at the sensor, preprocessing, feature extraction or classification level
    • G06V 10/82 Recognition using neural networks
    • G06T 2207/10016 Video; image sequence
    • G06T 2207/20081 Training; learning
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G06V 2201/07 Target detection
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Quality & Reliability (AREA)
  • Geometry (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a target detection method, device and storage medium based on an Sv2-v3 model, wherein the method comprises the following steps: acquiring a video image dataset containing detection targets; performing feature extraction on the video image dataset of a detection target with a first model to obtain a multi-layer feature map information set; extracting feature map information from each multi-layer feature map in the set multiple times, with a different feature map size at each extraction; fusing the extracted feature map information to obtain a multi-scale overall fusion feature map set of the detection target; and determining a bounding box dataset of the detection target from the multi-scale overall fusion feature maps, performing classification and localization regression on the bounding box dataset, and outputting the specific position of the detection target after processing to complete target detection. Multi-scale fusion, the introduction of a focal loss function, and the use of ShuffleNet v2 as the feature extraction backbone network improve the detection accuracy and detection speed of the Sv2-v3 model.

Description

Target detection method, device and storage medium based on Sv2-v3 model
Technical Field
The invention belongs to the technical field of edge intelligence, and more particularly relates to a target detection method, device and storage medium based on an Sv2-v3 model.
Background
Edge intelligence is the product of the deep fusion of a series of information technologies such as big data, cloud computing, artificial intelligence, intelligent chips, edge computing, federated learning, blockchain and 5G communication, and is a system framework for comprehensive integration across the "cloud-edge-end" multi-technology domains. Deep neural networks are an important representative of this new generation of edge intelligence. Traditional deep neural networks meet the demands of mobile application scenarios mainly through interaction between the mobile terminal and the cloud. However, because the data involved are massive, high-speed, heterogeneous and diverse, such networks are characterized by a large computational load, high storage cost and complex models. The transmission process of the application is therefore susceptible to factors such as network delay and storage, and its dependence on the edge device is strong, which limits its development and prevents effective application on lightweight mobile terminal devices.
At present, with the continuous development of information technology, the computing and storage capabilities of mobile devices have improved greatly, so a deep neural network model can be deployed on a mobile device. However, because the resources of the mobile terminal are limited, when the deep neural network model is large, its computational load, storage cost and model complexity are also high, and the target detection accuracy and speed of the network model suffer.
Disclosure of Invention
In view of the above defects of, and demands for improvement on, the prior art, the invention provides a target detection method, device and storage medium based on an Sv2-v3 model, which solve the problem of poor target detection accuracy and speed of the model.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
A target detection method based on a Sv2-v3 model comprises the following steps:
acquiring a video image dataset containing detection targets;
extracting features of the video image dataset of the detection target by using a first model to obtain a multi-layer feature map information set; and extracting feature map information from the multi-layer feature map information set multiple times, wherein the feature maps extracted each time have different sizes;
fusing the extracted feature map information to obtain a multi-scale overall fused feature map set of the detection target;
and determining a bounding box dataset of the detection target according to the multi-scale overall fusion feature map of the detection target, performing classification and localization regression on the bounding box dataset, and outputting the specific position of the detection target after processing to complete target detection.
Further, the steps of extracting and fusing the multiple feature map information are as follows:
s1: extracting a first feature map from the multi-layer feature map information, the first feature map having a first size; extracting a second feature map from the first feature map, the second feature map having a second size, the second size being smaller than the first size; by the method, an Nth feature map is extracted from the N-1 th feature map, an N+1 th feature map is extracted from the Nth feature map, and the last N+1 th feature map is extracted;
wherein the N-1 th feature map has an N-1 th dimension, the N+1 th feature map has an N+1 th dimension, and the N-1 th dimension is greater than the N-1 th dimension, and the N-1 th dimension is greater than the N+1 th dimension; n is a positive integer greater than or equal to 4;
s2: performing feature fusion on the N+1th feature map to obtain an N+1th fusion feature map; upsampling the (N+1) th feature map, and performing feature fusion on the upsampled n+1 th feature map and the Nth feature map to obtain an Nth fusion feature map; upsampling the Nth feature map after feature fusion, and performing feature fusion on the Nth feature map and the Nth-1 feature map to obtain an Nth-1 fusion feature map; and so on until a first fused feature map is obtained.
Further, the first model is based on the Sv2-v3 model for sample data training learning.
Further, the sample data training learning utilizes a focal loss function to assign different weights to negative and positive samples, so that the model pays more attention to positive samples and detection accuracy improves; the focal loss function is FL(P_t) = -(1 - P_t)^γ · log(P_t);
wherein P_t is the probability that the detection target is a positive sample, and γ is a focusing adjustment parameter greater than zero.
Further, the backbone network for feature extraction is ShuffleNet v2.
Further, the bounding box data set is processed using a K-means++ algorithm.
As another aspect of the present invention, there is also provided a terminal device comprising at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program comprising a second model; when the computer program is executed by the processing unit, the above target detection method is performed with the second model in place of the above first model;
the second model is generated by performing compression after training according to the target detection results.
Further, the compression includes quantization and pruning.
As another aspect of the present invention, there is also provided a computer-readable medium storing a computer program executable by an access authentication device, the computer program comprising a second model; when the computer program is run on the access authentication device, the target detection method described above is performed with the second model in place of the first model;
the second model is generated by performing compression after training according to the target detection results.
Further, the compression includes quantization and pruning.
In summary, compared with the prior art, the target detection method, device and storage medium based on the Sv2-v3 model provided by the invention have the following beneficial effects:
according to the target detection method, the device and the storage medium based on the Sv2-v3 model, the feature extraction and the fusion are carried out for extracting the feature map information for a plurality of times, the feature map extracted each time is different in size, the feature fusion of a plurality of scales is carried out, the deep features and the shallow features are fused to enrich the detection information of the target size, the key information of a key region is reserved, and therefore the robustness of the model is improved, and the detection precision and the detection speed of the model are guaranteed.
Second, the invention introduces a focal loss function that assigns different learning weights to negative and positive samples and reduces the learning weight of negative samples during network model training, so that the model pays more attention to positive samples; this improves the learning rate during training and the detection accuracy of the model to a certain extent.
In addition, the invention sets the backbone network for feature extraction to ShuffleNet v2, thereby reducing the computation of the convolutions; meanwhile, redundant parts of the model are removed by pruning and quantization compression, which reduces the volume and complexity of the model and improves its detection speed.
In short, the target detection method, device and storage medium based on the Sv2-v3 model not only reduce the computational load, storage volume and complexity of the model, but also greatly improve target detection accuracy and speed.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings can be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of the model structure used in the target detection method, device and storage medium based on the Sv2-v3 model.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.
The invention provides a target detection method, device and storage medium based on an Sv2-v3 model, together with its basic flow, which address the two problems of target detection accuracy and detection speed in the prior art. The method comprises the following steps:
a target detection method based on a Sv2-v3 model comprises the following steps:
firstly, acquiring a video image data set containing a detection target;
then, extracting features of the video image dataset of the detection target by using a first model to obtain a multi-layer feature map information set; extracting feature map information from each multi-layer feature map in the set multiple times, wherein the feature maps extracted each time have different sizes;
secondly, fusing the extracted feature map information to obtain a multi-scale overall fused feature map set of the detection target;
and finally, determining a bounding box dataset of the detection target according to the multi-scale overall fusion feature map, performing classification and localization on the bounding box dataset, and outputting the specific position of the detection target after processing to complete target detection.
Generally, multi-branch structures consume memory: the result of each branch must be saved, and the memory each branch occupies cannot be released until the last step; moreover, introducing multiple branch structures constrains the network structure, so the network is not easily extended. The Sv2-v3 model uses a multi-branch structure during training and a flexible single-path structure during inference; the two are decoupled by structural re-parameterization, so a good speed-up ratio is obtained even after pruning, and detection speed and accuracy are well balanced. Therefore, the first model preferably performs sample data training learning with the fast and simple Sv2-v3 model as the base model to construct an optimized network model structure. As one embodiment of the present invention, the training parameters are: batch size 16, 400 training iterations in total, momentum 0.9, initial learning rate 0.001, and decay coefficient 0.9.
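For illustration, the training configuration above can be sketched in PyTorch as follows; the stand-in network, the loop body and the per-iteration application of the 0.9 decay are assumptions of this sketch, since the patent does not publish training code.

import torch
import torch.nn as nn

# Tiny stand-in network; the real Sv2-v3 detector is not reproduced here.
model = nn.Sequential(nn.Conv2d(3, 48, 3, stride=2, padding=1), nn.ReLU())

# Hyperparameters as stated above: batch size 16, 400 iterations,
# momentum 0.9, initial learning rate 0.001, decay coefficient 0.9.
BATCH_SIZE, ITERATIONS = 16, 400
optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.9)

for it in range(ITERATIONS):
    ...  # forward pass on a batch of 16, focal loss, backward, optimizer.step()
    scheduler.step()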
First, as one embodiment of the present invention, images may be acquired through an IP network camera, and the massive image data are then made into a dataset.
Then, the obtained image dataset is input into the first model for feature extraction to obtain the multi-layer feature map information set. To improve feature extraction speed, the invention uses ShuffleNet v2 as the backbone network for feature extraction. Although element-wise operations have low floating-point operation counts (FLOPs), they carry a relatively high memory access cost (MAC); excessive MAC increases memory access time and slows the model down, so element-wise operations should be kept few to keep MAC small. The convolution layers of ShuffleNet v2 keep the numbers of input and output feature channels equal, which minimizes MAC, and use few element-wise operations, making the model the fastest; this is why the invention adopts ShuffleNet v2 as the feature extraction backbone.
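For illustration only, the multi-stage feature extraction described above can be sketched with the off-the-shelf shufflenet_v2_x1_0 from torchvision standing in for the backbone; the specific variant and the choice of stages are assumptions, not details taken from the patent.

import torch
from torchvision.models import shufflenet_v2_x1_0

backbone = shufflenet_v2_x1_0(weights=None)  # stand-in ShuffleNet v2 backbone

def extract_pyramid(x):
    # Return feature maps at several scales, shallowest first.
    x = backbone.conv1(x)      # stride-2 stem
    x = backbone.maxpool(x)    # stride-2 pooling
    c2 = backbone.stage2(x)    # 1/8 of the input resolution
    c3 = backbone.stage3(c2)   # 1/16
    c4 = backbone.stage4(c3)   # 1/32
    return c2, c3, c4

feats = extract_pyramid(torch.randn(1, 3, 224, 224))
print([tuple(f.shape) for f in feats])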
The multi-branch structure used during training is merged and converted, by re-parameterization, into a single-path structure for inference, which improves training accuracy while also accelerating the inference process. The computation density of 3×3 convolution is higher than that of other kernel sizes; therefore, to improve network computation speed in the Sv2-v3 model, a 3×3 convolution kernel is preferred. As an embodiment of the present invention, the computing libraries cuDNN and Intel MKL are utilized.
Feature map information is extracted multiple times from each multi-layer feature map in the multi-layer feature map information set, and the feature maps extracted each time have different sizes.
As an embodiment of the invention, FIG. 1 is a schematic diagram of the model structure of the target detection method, device and storage medium based on the Sv2-v3 model. The picture dataset is input, and feature extraction is performed level by level to obtain the multi-layer feature map information set. A first feature map 56×56×48 with a first size of 56×56 is extracted from the multi-layer feature map information 112×112×48; a second feature map 28×28×96 with a second size of 28×28, smaller than the first size, is extracted from the first feature map 56×56×48; a third feature map 14×14×192 with a third size of 14×14, smaller than the second size, is extracted from the second feature map 28×28×96; and a fourth feature map 7×7×1280 with a fourth size of 7×7, smaller than the third size, is extracted from the third feature map 14×14×192.
And fusing the extracted feature map information, and fusing deep features and shallow features to enrich the detection information of the target size, so as to obtain a multi-scale overall fusion feature map set of the detection target. Specifically, features of the Sv2-v3 model, which are expanded into 4 scales from bottom to top, are fused, and up-sampling is completed by transpose convolution, so that the robustness of the model is improved, and the model has better generalization capability on the suitability of the size of a detection target.
Feature fusion is performed on the fourth feature map 7×7×1280 to obtain a fourth fused feature map 7×7×255. The fourth feature map 7×7×1280 is upsampled to increase its scale and feature-fused with the third feature map 14×14×192 to obtain a third fused feature map 14×14×255; the feature-fused third feature map is upsampled and fused with the second feature map 28×28×96 to obtain a second fused feature map 28×28×255; and the feature-fused second feature map is upsampled and fused with the first feature map 56×56×48 to obtain a first fused feature map 56×56×255.
Finally, the multi-scale overall fusion feature map set of the detection target is obtained, comprising the first fused feature map 56×56×255, the second fused feature map 28×28×255, the third fused feature map 14×14×255 and the fourth fused feature map 7×7×255.
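A minimal PyTorch sketch of this top-down fusion follows. Transpose-convolution upsampling matches the text; element-wise addition as the fusion operator and the 1×1 output heads are assumptions of this sketch, since the patent does not specify the exact fusion operation.

import torch
import torch.nn as nn

class TopDownFusion(nn.Module):
    def __init__(self, channels=(48, 96, 192, 1280), out_ch=255):
        super().__init__()
        # 1x1 heads projecting each fused map to the 255-channel output
        self.heads = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in channels)
        # 2x transpose-convolution upsampling from each deeper map to the next shallower one
        self.ups = nn.ModuleList(
            nn.ConvTranspose2d(channels[i + 1], channels[i], kernel_size=2, stride=2)
            for i in range(len(channels) - 1))

    def forward(self, maps):  # maps ordered shallow -> deep
        fused = list(maps)
        for i in range(len(maps) - 2, -1, -1):
            # fuse the upsampled, already-fused deeper map into the shallower one
            fused[i] = fused[i] + self.ups[i](fused[i + 1])
        return [head(f) for head, f in zip(self.heads, fused)]

maps = [torch.randn(1, c, s, s) for c, s in ((48, 56), (96, 28), (192, 14), (1280, 7))]
outs = TopDownFusion()(maps)
print([tuple(o.shape) for o in outs])  # (1, 255, 56, 56) ... (1, 255, 7, 7)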
In the figure, 226×226×3 denotes the size of the input feature map, where 3 is the number of feature maps; in the output column, 7×7, 14×14, 28×28 and 56×56 denote the feature map sizes, and 255 denotes the number of feature maps. The number of feature maps may vary and is related to the number of convolution kernels.
Finally, the bounding box dataset of the detection target is determined from the multi-scale overall fusion feature maps, classification and localization regression are performed on the bounding box dataset using the fully connected layer, and after processing the specific position of the detection target is output to complete target detection.
Specifically, deep feature information of the different feature map regions is learned and bounding box dataset information is obtained; the bounding box dataset information is analyzed and input into the tracking model network, an NMS operation is performed on the obtained dataset information together with the deep features, IoU processing is applied to the bounding boxes with a K-means++ algorithm while the deep feature information is detected and extracted, and the specific position of the detection target is output to complete target detection.
It should be noted that the bounding boxes predicted by each scale branch in the multi-scale overall fusion feature map set correspond to different sizes; for example, the bounding boxes of the first fused feature map 56×56×255, the second fused feature map 28×28×255, the third fused feature map 14×14×255 and the fourth fused feature map 7×7×255 have different sizes. As the feature map size increases, the region perceived by each point gradually decreases, so as one embodiment of the present invention, large targets are predicted with 7×7×255 and small targets with 56×56×255. Fusing deep and shallow features enriches the detection information across target sizes, so the predicted bounding boxes become more comprehensive and the key information of key regions is retained, which greatly improves detection performance; in addition, multi-scale fusion adapts to the target size, giving high detection accuracy regardless of how large the detected target is.
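The patent names K-means++ and IoU processing without further detail; the sketch below shows the common YOLO-style reading, clustering ground-truth box widths and heights with K-means++ seeding under a (1 - IoU) distance, and should be read as an assumption rather than the patent's exact procedure.

import numpy as np

def iou_wh(box, clusters):
    # IoU between one (w, h) box and an array of (w, h) cluster boxes,
    # with all boxes anchored at a common corner.
    inter = np.minimum(box[0], clusters[:, 0]) * np.minimum(box[1], clusters[:, 1])
    return inter / (box[0] * box[1] + clusters[:, 0] * clusters[:, 1] - inter)

def kmeanspp_init(boxes, k, rng):
    # K-means++ seeding with (1 - IoU) as the distance.
    centers = [boxes[rng.integers(len(boxes))]]
    for _ in range(k - 1):
        d2 = np.array([np.min(1 - iou_wh(b, np.array(centers))) ** 2 for b in boxes])
        centers.append(boxes[rng.choice(len(boxes), p=d2 / d2.sum())])
    return np.array(centers, dtype=float)

def kmeans_anchors(boxes, k, iters=50, seed=0):
    rng = np.random.default_rng(seed)
    clusters = kmeanspp_init(boxes, k, rng)
    for _ in range(iters):
        # assign each box to the cluster with the highest IoU
        assign = np.argmax(np.stack([iou_wh(b, clusters) for b in boxes]), axis=1)
        for j in range(k):
            if np.any(assign == j):
                clusters[j] = boxes[assign == j].mean(axis=0)
    return clusters

boxes = np.abs(np.random.default_rng(1).normal(60.0, 25.0, (300, 2))) + 1.0
print(kmeans_anchors(boxes, k=4))  # four anchor (w, h) pairs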
As one embodiment of the present invention, an NVIDIA GTX 2060 GPU with 16 GB of memory is used on an x86 server under the Windows 10 operating system. The model compression tool framework PocketFlow is used; it integrates the current mainstream model compression and training algorithms and realizes full-process, automated, pipelined model compression and acceleration as well as local, efficient processing of user data. By effectively combining algorithms such as channel pruning, weight sparsification, weight quantization, network distillation and hyperparameter optimization, PocketFlow can compress and accelerate a deep learning model with a small loss of accuracy and a high degree of automation. The main software is: Python 3.6.6, PyTorch, OpenCV 4.2, CUDA 11.0, etc. This software and hardware environment is used to detect pedestrians in a video or camera stream. Specifically, the software and hardware environment required for the experiment is built, the camera resolution is adjusted, and the image size of the video frames is obtained so that it suits the model input size. As one embodiment of the present invention, the input image size is 416 pixels × 416 pixels.
It should be noted that a target detection algorithm generates many bounding boxes during detection, and most of them are negative samples (background class) while very few are positive samples. The resulting imbalance of positive and negative samples during training learning makes training inefficient: the overwhelming negative samples contribute shallow, useless information and lead to model degradation. Therefore, a focal loss function is introduced to weight the negative and positive samples, solving the sample imbalance problem during training and learning in the single-stage target detection scenario.
To improve the learning rate in detection, the invention introduces a focal loss function into the training and learning of the Sv2-v3 model, assigning different weights to negative and positive samples and reducing the learning weight of negative samples during network model training, so that the model pays more attention to positive samples; this improves the learning rate during training and the detection accuracy of the model to a certain extent. In addition, a pooling layer is used in place of a convolution layer for further operations, improving the recognition accuracy of the bounding boxes and the speed of the algorithm, and enlarging the range of detection scales.
Models are usually optimized by parameter tuning, mainly adjusting the learning rate. However, although the learning rate affects the convergence speed of model training, its effect on model optimization is limited; optimizing model performance through the loss function therefore offers much greater room for improvement.
The focal loss function builds on the classical cross-entropy loss function and reduces the learning weight of simple background samples during network training, which can greatly improve the performance of the model. Originally, the softmax classification loss function is equivalent to a standard cross-entropy loss function: the weights of all samples are the same, and the cross entropies of the individual training samples are summed directly. The cross entropy CE is shown in formula (1):
CE(p, y) = -log(p) if y = 1; CE(p, y) = -log(1 - p) otherwise  (1)
wherein p is the predicted probability that the sample belongs to class 1, and y is the sample label.
In the binary cross-entropy loss function, y takes the value 1 or -1, and P_t denotes the probability that the sample belongs to the positive class, as shown in formula (2):
P_t = p if y = 1; P_t = 1 - p otherwise  (2)
then formula (3) is obtainable from formulas (1) and (2), formula (3) being:
CE(p, y) = CE(P_t) = -log(P_t)  (3)
because the single-stage target detection model can face the problem that the number of positive and negative samples is extremely unbalanced during training, weights are added to the positive and negative samples in the loss, and the cross entropy loss function is adjusted by setting the weights, so that the positive samples can be focused, the more the number of the negative samples is, the smaller the weight is, the fewer the number of the positive samples is, and the larger the weight is, and therefore, the focusing loss function can be as follows:
FL(P_t) = -(1 - P_t)^γ · log(P_t)  (4)
wherein P is t Probability of being a positive sample for the detection target; in the formula (1-p) t ) γ Is a weight expression, and can also be called a regulating factor; gamma is a fine tuning parameter of interest, greater than zero, smoothly adjusting the weight of the negative samples, and as gamma increases, the influence of the adjustment factor also increases.
It should be noted that the focal loss function is equivalent to adding a weight to each sample, where the weight is related to the probability with which the Sv2-v3 model predicts that the sample belongs to the positive class.
If the Sv2-v3 model predicts a high probability that a sample belongs to the true class, this sample is a negative sample for the Sv2-v3 model; P_t is then close to 1 and the weight tends to 0, which reduces the loss weight of the negative sample. If the Sv2-v3 model predicts a small probability that the sample belongs to the true class, this sample is a positive sample for the Sv2-v3 model; P_t is then small, so the weight is close to 1, and the loss of positive samples is preserved to the maximum degree. The focal loss function thus distinguishes negative and positive samples for model training and focuses more on the positive samples.
The focusing parameter γ smoothly adjusts the proportion by which negative samples are down-weighted. When γ = 0, the focal loss function is the ordinary cross-entropy loss function; when γ is larger, the influence of the weighting expression is greater, and increasing γ effectively reduces the loss contribution of samples whose positive-class prediction probability is already large. As an embodiment of the present invention, γ = 2 is preferred and gives the best detection effect.
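For illustration, formula (4) with γ = 2 can be written in PyTorch as below; the binary sigmoid formulation and the numerical clamp are assumptions of this sketch, not text from the patent.

import torch

def focal_loss(logits, targets, gamma=2.0):
    # FL(P_t) = -(1 - P_t)^gamma * log(P_t); targets are 0 (negative) / 1 (positive)
    p = torch.sigmoid(logits)
    p_t = torch.where(targets == 1, p, 1 - p)  # probability of the true class
    return (-(1 - p_t) ** gamma * torch.log(p_t.clamp(min=1e-7))).mean()

logits = torch.tensor([2.0, -1.0, 0.5])
targets = torch.tensor([1.0, 0.0, 1.0])
print(focal_loss(logits, targets))  # well-classified samples contribute little loss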
As another aspect of the present invention, there is also provided a terminal device comprising at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program comprising a second model; when the computer program is executed by the processing unit, the target detection method described above is performed with the second model in place of the first model;
wherein the second model is generated by performing compression after training with the target detection results.
Specifically, the Sv2-v3 model is trained to obtain trained weight parameters, and the trained weight file is subjected to lightweight model compression before being deployed in the mobile terminal device. Quantization parameters are determined by the KL-divergence method, and quantization is performed to reduce the parameter count of the model. Pruning streamlines the number of channels in the whole model, removing non-important channels and reducing the model's volume, thereby improving prediction speed and accuracy.
The original network is compressed by reducing the number of bits required by the weight parameters, i.e., converting the 32-bit floating-point operations in the neural network into 8-bit or 16-bit fixed-point operations; this not only enables real-time operation of the network on the mobile device but also helps cloud deployment.
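As an illustration of this bit-reduction idea, the sketch below performs symmetric per-tensor 8-bit weight quantization; the patent selects the clipping threshold with a KL-divergence search, for which a simple max-abs threshold stands in here as an assumption.

import numpy as np

def quantize_int8(w, threshold=None):
    # Map float32 weights to int8 with a per-tensor scale; dequantize as q * scale.
    t = threshold if threshold is not None else np.abs(w).max()
    scale = t / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = np.random.randn(4, 4).astype(np.float32)
q, s = quantize_int8(w)
print(np.abs(w - q.astype(np.float32) * s).max())  # worst-case quantization error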
Pruning sets parameters smaller than a threshold to 0 during training, so the pruned network parameters form a sparse matrix. During training, small parameters are pruned continuously, and to keep increasing the compression rate the threshold must also be raised continuously; the algorithm must therefore be measured against both the loss of accuracy and the gain in compression rate. The parameters of the BN layer are used as the evaluation index for judging which channels are non-important and ready to be pruned. The total loss function of the channel-pruning algorithm is shown in formula (5):
L = Σ_(x,y) l(f(ω, x), y) + λ · Σ m(σ)  (5)
wherein ω is the weight parameter, λ is the balance constraint parameter, and σ is the scaling factor; Σ_(x,y) l(f(ω, x), y) is the training loss function of the network, Σ m(σ) is the regularization constraint term, and λ balances the two terms to achieve sparsity.
After training under channel-level regularization conditions, many scaling factors σ are close to zero. The sparse scaling factors are sorted, and the channels whose factors are closest to 0 are regarded as non-important channels that can be removed. Eliminating these non-vital channels reduces the volume of the network model.
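A minimal sketch of this channel selection follows: BN scaling factors near zero mark non-important channels. The global percentile threshold is an assumption; the patent only states that the factors closest to zero indicate removable channels.

import torch
import torch.nn as nn

def prune_mask(bn: nn.BatchNorm2d, prune_ratio=0.5):
    # Keep-mask per channel: True where the BN scaling factor survives pruning.
    gammas = bn.weight.detach().abs()
    thresh = torch.quantile(gammas, prune_ratio)  # prune the smallest factors
    return gammas > thresh

bn = nn.BatchNorm2d(8)
with torch.no_grad():
    bn.weight.copy_(torch.tensor([0.9, 0.01, 0.5, 0.0, 0.7, 0.02, 0.3, 0.05]))
print(prune_mask(bn))  # channels with near-zero gamma are marked for removal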
As an embodiment of the invention, the trained first model is compressed without affecting model performance and while basically maintaining target detection accuracy: compared with the original model, the parameter count is reduced by 37.4% and the overall model size is compressed by nearly 52.6%, greatly reducing the model volume while also improving detection accuracy and speed to a certain extent.
The compressed first model is installed in the mobile terminal device and operates as follows: first, video image information or a real-time surveillance video is acquired with a network camera; the collected video data are then transmitted to the terminal for processing, and the model performs video analysis to detect the position information of the target; based on the detected target position information, the camera judges the target's bearing and rotates to position the target, so that the target always stays at the center of the image captured by the camera; finally, an instruction is sent to the terminal to decide whether the target is tracked or scanned. As an embodiment of the invention, while basically maintaining target detection accuracy, the model parameter count is reduced by 37.4%, the overall model size is compressed by nearly 52.6%, and the detection speed is 67 frames/s, so the target detection task can be performed well on the mobile terminal.
As another aspect of the present invention, there is also provided a computer-readable medium storing a computer program executable by an access authentication device, the computer program comprising a second model; when the computer program is run on the access authentication device, the target detection method described above is performed with the second model in place of the first model;
wherein the second model is generated by performing compression after training with the target detection results.
The compression likewise includes quantization and pruning; as described above for the terminal device, quantization parameters are determined by the KL-divergence method and quantization is performed to reduce the parameter count of the model, while pruning streamlines the number of channels of the whole model, removes non-important channels and reduces the model volume, thereby improving prediction speed and accuracy.
In short, the invention provides a target detection method based on the Sv2-v3 model that avoids complex module design through its backbone network structure and parameter optimization strategy: simple design plus re-parameterization is enough to reach SOTA performance. It solves the problem that models with poor target detection accuracy and speed cannot be applied well on lightweight mobile terminal devices, and it improves the detection sensitivity of the detection algorithm and hence of the device. Deployed in a mobile terminal device, the model not only extracts features rapidly but also substantially improves final detection accuracy, moving further toward real-time requirements.
For the foregoing method embodiments, for simplicity of explanation, the methodologies are shown as a series of acts, but one of ordinary skill in the art will appreciate that the present invention is not limited by the order of acts, as some steps may, in accordance with the present invention, occur in other orders or concurrently. Further, those skilled in the art will also appreciate that the embodiments described in the specification are all preferred embodiments, and that the acts and modules referred to are not necessarily required for the present invention.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division into units is merely a division of logical functions, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the couplings or direct couplings or communication connections shown or discussed may be indirect couplings or communication connections through some service interface, apparatus or unit, and may be electrical or otherwise.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable memory. Based on this understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art or in whole or in part in the form of a software product stored in a memory, comprising several instructions for causing a computer device (which may be a personal computer, a server or a network device, etc.) to perform all or part of the steps of the method of the various embodiments of the present invention. And the aforementioned memory includes: a U-disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a removable hard disk, a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Those of ordinary skill in the art will appreciate that all or a portion of the steps in the various methods of the above embodiments may be performed by hardware associated with a program that is stored in a computer readable memory, which may include: flash disk, read-Only Memory (ROM), random-access Memory (Random Access Memory, RAM), magnetic or optical disk, and the like.
The foregoing is merely exemplary embodiments of the present disclosure and is not intended to limit the scope of the present disclosure. That is, equivalent changes and modifications are contemplated by the teachings of this disclosure, which fall within the scope of the present disclosure. Embodiments of the present disclosure will be readily apparent to those skilled in the art from consideration of the specification and practice of the disclosure herein. This application is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a scope and spirit of the disclosure being indicated by the claims.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
It will be readily appreciated by those skilled in the art that the foregoing description is merely a preferred embodiment of the invention and is not intended to limit the invention, but any modifications, equivalents, improvements or alternatives falling within the spirit and principles of the invention are intended to be included within the scope of the invention.

Claims (10)

1. A target detection method based on an Sv2-v3 model, the method comprising:
acquiring a video image dataset containing detection targets;
extracting features of the video image dataset of the detection target by using a first model to obtain a multi-layer feature map information set; and extracting feature map information from the multi-layer feature map information set multiple times, wherein the feature maps extracted each time have different sizes;
fusing the extracted feature map information to obtain a multi-scale overall fused feature map set of the detection target;
and determining a bounding box dataset of the detection target according to the multi-scale overall fusion feature map of the detection target, performing classification and localization regression on the bounding box dataset, and outputting the specific position of the detection target after processing to complete target detection.
2. The method for detecting targets based on the Sv2-v3 model of claim 1, wherein the steps of extracting and fusing the multiple feature map information are as follows:
s1: extracting a first feature map from the multi-layer feature map information, the first feature map having a first size; extracting a second feature map from the first feature map, the second feature map having a second size, the second size being smaller than the first size; by the method, an Nth feature map is extracted from the N-1 th feature map, an N+1 th feature map is extracted from the Nth feature map, and the last N+1 th feature map is extracted;
wherein the N-1 th feature map has an N-1 th dimension, the N+1 th feature map has an N+1 th dimension, and the N-1 th dimension is greater than the N-1 th dimension, and the N-1 th dimension is greater than the N+1 th dimension; n is a positive integer greater than or equal to 4;
s2: performing feature fusion on the N+1th feature map to obtain an N+1th fusion feature map; upsampling the (N+1) th feature map, and performing feature fusion on the upsampled n+1 th feature map and the Nth feature map to obtain an Nth fusion feature map; upsampling the Nth feature map after feature fusion, and performing feature fusion on the Nth feature map and the Nth-1 feature map to obtain an Nth-1 fusion feature map; and so on until a first fused feature map is obtained.
3. A method for detecting a target based on a Sv2-v3 model as defined in claim 1, wherein said first model is sample data training learning based on the Sv2-v3 model.
4. The target detection method based on the Sv2-v3 model of claim 3, wherein the sample data training learning utilizes a focal loss function FL(P_t) = -(1 - P_t)^γ · log(P_t);
wherein P_t is the probability that the detection target is a positive sample, and γ is a focusing adjustment parameter greater than zero.
5. The target detection method based on the Sv2-v3 model of claim 1, wherein the backbone network for feature extraction is ShuffleNet v2.
6. A method of object detection based on the Sv2-v3 model as defined in claim 1, wherein said bounding box dataset is processed using a K-means++ algorithm.
7. A terminal device, comprising at least one processing unit and at least one storage unit, wherein the storage unit stores a computer program comprising a second model; when the computer program is executed by the processing unit, the target detection method of any one of claims 1 to 6 is performed with the second model in place of the first model of claims 1 to 6;
the second model is: and after training according to the target detection result, performing compression to generate a second model.
8. A terminal device as in claim 7, wherein said compression comprises quantization and pruning.
9. A computer-readable medium storing a computer program executable by an access authentication device, the computer program comprising a second model; when the computer program is run on the access authentication device, the target detection method of any one of claims 1 to 6 is performed with the second model in place of the first model of claims 1 to 6;
the second model is: and after training according to the target detection result, performing compression to generate a second model.
10. The computer-readable medium of claim 9, wherein the compressing comprises quantization and pruning.
CN202211434355.2A 2022-11-16 2022-11-16 Target detection method, device and storage medium based on Sv2-v3 model Pending CN116109556A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211434355.2A CN116109556A (en) 2022-11-16 2022-11-16 Target detection method, device and storage medium based on Sv2-v3 model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211434355.2A CN116109556A (en) 2022-11-16 2022-11-16 Target detection method, device and storage medium based on Sv2-v3 model

Publications (1)

Publication Number Publication Date
CN116109556A true CN116109556A (en) 2023-05-12

Family

ID=86253441

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211434355.2A Pending CN116109556A (en) 2022-11-16 2022-11-16 Target detection method, device and storage medium based on Sv2-v3 model

Country Status (1)

Country Link
CN (1) CN116109556A (en)

Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination