CN116704436A - Target detection method and device, electronic equipment and storage medium - Google Patents

Target detection method and device, electronic equipment and storage medium

Info

Publication number
CN116704436A
CN116704436A
Authority
CN
China
Prior art keywords
target
detection
model
video
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310672081.9A
Other languages
Chinese (zh)
Inventor
张楠
王健宗
瞿晓阳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310672081.9A priority Critical patent/CN116704436A/en
Publication of CN116704436A publication Critical patent/CN116704436A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/761Proximity, similarity or dissimilarity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application relates to the technical field of artificial intelligence, and provides a target detection method and device, an electronic device and a storage medium. A target bounding box is generated by detecting a target in a video frame; it is then checked whether the same bounding box exists in the preceding and following video frames, and if not, a first detection bounding box is generated in those frames and the degree of correlation between the first detection bounding box and the real target bounding box is calculated; if the correlation indicates that the current video frame is a difficult case, the video frame is added to a difficult case set. The difficult case set is sent to a server to obtain model update data; after the update data are received, the model is updated, and the updated model is used to detect the video stream. The embodiment of the application can improve the target detection model's ability to recognize targets and its robustness when recognizing targets in an environment containing similar objects.

Description

Target detection method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a target detection method and apparatus, an electronic device, and a storage medium, which may be applied to a financial scenario.
Background
With the development of artificial intelligence technology, its application in the financial field is becoming more and more widespread, and the traditional financial industry is gradually shifting toward financial technology (Fintech). Target detection is an important technical application of artificial intelligence; in the security field, it can help a monitoring system discover abnormal situations in time and provide effective early warning and alarms. The financial industry, including banking, postal services, and securities, is also a key focus of security work. Security precautions are an important part of daily work in the financial industry and are of great significance for the normal operation of enterprises and for achieving good economic and social benefits.
When a target detection model is deployed on a lightweight edge device to perform target detection, the device generally computes from the pixels of the image with the aid of related algorithms. When the pixel groups of two objects are similar, existing target detection techniques may, with a certain probability, misidentify a false object that resembles the target object. That is, the target detection model on the edge device may be affected by similar objects: when an object similar to the target appears in the recognized content, the model becomes unstable and false detections occur, which introduces unreliable factors into security monitoring in the financial industry.
Disclosure of Invention
The main purpose of the embodiments of the application is to provide a target detection method and device, an electronic device and a storage medium, aiming at improving the model's ability to recognize targets and its robustness when recognizing targets in an environment with similar objects, and further improving the reliability of security monitoring in the financial industry.
In order to achieve the above object, a first aspect of an embodiment of the present application provides an object detection method, which is applied to an edge device, and the method includes:
acquiring a video stream through a camera unit;
respectively carrying out target detection on each video frame in the video stream through a target detection model positioned at the edge equipment so as to generate a target boundary frame in each video frame;
traversing each video frame in the video stream and performing the following processing on the currently traversed video frame: if a real target boundary frame and a suspected pseudo target boundary frame exist in the current traversal video frame, respectively generating first detection boundary frames matched with the suspected pseudo target boundary frames in the first M video frames and the last N video frames of the current traversal video frame, calculating the correlation degree according to the first detection boundary frames and the real target boundary frames, and if the current traversal video frame is determined to belong to a difficult case according to the correlation degree, adding the current traversal video frame into the difficult case set;
sending the difficult case set to a server, so that the server trains a copy of the target detection model according to the difficult case set to obtain model update data;
receiving model update data sent by the server;
updating the target detection model according to the model updating data to obtain an updated target detection model;
and carrying out target detection on the video stream acquired by the camera unit through the updated target detection model.
In some possible embodiments of the present application, the determining that the real target bounding box and the suspected pseudo target bounding box exist in the current traversal video frame includes:
acquiring a target boundary box existing in the current traversal video frame;
for each acquired target bounding box, determining whether the target bounding box exists in adjacent video frames respectively;
if the target bounding box exists in the adjacent video frames, determining the target bounding box as the real target bounding box;
and if the target boundary box does not exist in the adjacent video frames, determining the target boundary box as the suspected pseudo-target boundary box.
In some possible embodiments of the present application, before calculating the degree of correlation from the first detection bounding box and the real target bounding box, the method further comprises:
amplifying the first detection bounding box for each of the first M video frames and the last N video frames, and generating a second detection bounding box in the video frame;
and calculating a normalized cross-correlation NCC value according to the second detection boundary box and the suspected pseudo-target boundary box, and taking the video frame as a reserved video frame if the NCC value is greater than or equal to a first preset threshold value.
In some possible embodiments of the present application, the calculating the correlation degree according to the first detection bounding box and the real target bounding box includes:
calculating an intersection ratio of the first detection boundary box and the real target boundary box in the reserved video frame aiming at the reserved video frame;
and determining the correlation degree of the first detection boundary box and the real target boundary box according to the intersection ratio.
In some possible embodiments of the present application, the generating a first detection bounding box matched with the suspected pseudo-target bounding box in the first M video frames and the last N video frames of the current traversal video frame includes:
for each video frame in the first M video frames and the last N video frames, searching a region matched with the suspected pseudo-target boundary frame in the video frames through a template matching algorithm;
and determining the first detection bounding box according to the found region.
In some possible embodiments of the present application, the training the copy of the target detection model according to the difficult case set to obtain model update data includes:
labeling the video frames in the difficult case set by adopting a teacher model to obtain a labeling target boundary frame corresponding to the video frames;
inputting the video frames in the difficult case set into a copy of the target detection model to obtain a predicted target bounding box;
determining a loss value according to the labeling target boundary box and the prediction target boundary box corresponding to each video frame in the difficult case set;
and adjusting model parameters of the copy of the target detection model according to the loss value until a preset training ending condition is met.
In some possible embodiments of the present application, after receiving the model update data sent by the server, before updating the target detection model according to the model update data, the method further includes:
obtaining a target detection backup model;
performing model updating on the target detection backup model according to the model updating data to obtain an updated target detection backup model;
and carrying out target detection on the video stream acquired by the camera unit through the updated target detection backup model.
To achieve the above object, a second aspect of an embodiment of the present application provides an object detection apparatus, including:
the video stream acquisition module is used for acquiring a video stream through the camera unit;
the first target detection module is used for respectively carrying out target detection on each video frame in the video stream through a target detection model of the edge equipment so as to generate a target boundary frame in each video frame;
the difficult-case mining module is used for traversing each video frame in the video stream and executing the following processing on the currently traversed video frame: if a real target boundary frame and a suspected pseudo target boundary frame exist in the current traversal video frame, respectively generating first detection boundary frames matched with the suspected pseudo target boundary frames in the first M video frames and the last N video frames of the current traversal video frame, calculating the correlation degree according to the first detection boundary frames and the real target boundary frames, and if the current traversal video frame is determined to belong to a difficult case according to the correlation degree, adding the current traversal video frame into the difficult case set, wherein M is greater than or equal to 0, and N is greater than or equal to 0;
the sending module is used for sending the difficult case set to a server side so that the server side trains the copy of the target detection model according to the difficult case set to obtain model update data;
the receiving module is used for receiving the model update data sent by the server;
the updating module is used for updating the target detection model according to the model updating data to obtain an updated target detection model;
and the second target detection module is used for carrying out target detection on the video stream acquired by the camera unit through the updated target detection model.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, including a memory storing a computer program and a processor implementing the method according to the first aspect when the processor executes the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method of the first aspect.
The application provides a target detection method and device, an electronic device and a storage medium. A target bounding box is generated by detecting targets in the video frames of an acquired video stream; it is then checked whether the same bounding box exists in the preceding and following video frames, and if not, a first detection bounding box is generated in those frames and the degree of correlation between the first detection bounding box and the real target bounding box is calculated; if the correlation indicates that the current video frame is a difficult case, the frame is added to a difficult case set. The difficult case set is sent to a server to obtain the model update data sent by the server; after the update data are received, the model is updated, and the updated model is used to detect the video stream. Detection errors of the current model are obtained by detecting unstable target bounding boxes; these errors are sent to the server to retrain the model and obtain update data, and the current model is updated with the data, improving the model's ability to recognize targets and its robustness when recognizing targets in an environment containing similar objects, and further improving the reliability of security monitoring in the financial industry.
Drawings
FIG. 1 is a schematic diagram of steps of a target detection method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the substeps of step S103 in FIG. 1;
FIG. 3 is a schematic diagram of a detection process according to an embodiment of the present application;
FIG. 4 is a schematic diagram of another substep of step S103 in FIG. 1;
FIG. 5 is a schematic diagram of another substep of step S103 in FIG. 1;
FIG. 6 is a schematic diagram of another substep of step S103 in FIG. 1;
FIG. 7 is a schematic diagram of the substep of step S106 in FIG. 1;
FIG. 8 is a schematic diagram of a structure of an object detection device according to an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart. The terms first, second and the like in the description and in the claims and in the above-described figures, are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
First, several terms involved in the present application are explained:
artificial intelligence (artificial intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Normalized cross-correlation (Normalized cross correlation, NCC): a cross-correlation measure commonly used in template matching to describe the correlation between two vectors, windows, or samples of the same dimension. Its value ranges from -1 to 1, where -1 indicates that the two vectors are negatively correlated and 1 indicates that they are fully correlated; the larger the NCC value, the more similar the two vectors are, and vice versa.
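For illustration only (not part of the patent text), a minimal NumPy sketch of the zero-mean variant of NCC between two equally sized patches could look as follows:

```python
import numpy as np

def ncc(a: np.ndarray, b: np.ndarray) -> float:
    """Zero-mean normalized cross-correlation of two equally sized patches.

    Returns a value in [-1, 1]; values near 1 indicate high similarity.
    """
    a = a.astype(np.float64).ravel()
    b = b.astype(np.float64).ravel()
    a -= a.mean()
    b -= b.mean()
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0:
        return 0.0  # constant patch: correlation undefined, treat as 0
    return float(np.dot(a, b) / denom)
```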
Intersection-over-union algorithm (intersection over union, IOU): a criterion for measuring the accuracy of detecting the corresponding object in a specific dataset, and a concept used in target detection. It is the overlap rate between the generated candidate box (candidate bound) and the original marked box (ground truth bound), i.e., the ratio of their intersection to their union. This criterion measures the correlation between the ground truth and the prediction: the higher the correlation, the higher the value, and the ideal case is complete overlap, i.e., a ratio of 1.
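A minimal sketch of the IoU computation, assuming boxes given in (x1, y1, x2, y2) pixel coordinates:

```python
def iou(box_a, box_b) -> float:
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)  # intersection area
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```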
Template matching (Template Matching): a computer vision technique that identifies the portions of an image matching a predefined template. The template is moved over the entire image, and the similarity between the template and the window it covers is computed at each position. Template matching can be implemented as a two-dimensional convolution: the output value at each pixel is the sum of elementwise products of two matrices, one representing the image patch and the other the template, i.e., the convolution kernel. The principle is simple: a reference template is provided, and similar regions are located in the source image. During matching, the source images in the data set are compared with the templates in the database in index order, the comparison results are stored in a matrix, and a scoring algorithm gives the estimated optimal positions of the template within the data set.
With the development of artificial intelligence technology, its application in the financial field is becoming more and more widespread, and the traditional financial industry is gradually shifting toward financial technology (Fintech). Target detection is an important technical application of artificial intelligence; in the security field, it can help a monitoring system discover abnormal situations in time and provide effective early warning and alarms. The financial industry, including banking, postal services, and securities, is also a key focus of security work. Security precautions are an important part of daily work in the financial industry and are of great significance for the normal operation of enterprises and for achieving good economic and social benefits.
When a target detection model is deployed on a lightweight edge device to perform target detection, the device generally computes from the pixels of the image with the aid of related algorithms. When the pixel groups of two objects are similar, existing target detection techniques may, with a certain probability, misidentify a false object that resembles the target object. That is, the target detection model on the edge device may be affected by similar objects: when an object similar to the target appears in the recognized content, the model becomes unstable and false detections occur, which introduces unreliable factors into security monitoring in the financial industry.
Based on the above, the embodiment of the application provides a target detection method and device, electronic equipment and storage medium, aiming at improving the recognition capability of a model on a target and the robustness of the model when the target is recognized in an environment with similar objects.
The embodiments of the application provide a target detection method and device, an electronic device and a storage medium, which are described in detail in the following embodiments; the target detection method in the embodiments of the application is described first.
The embodiment of the application can acquire and process the related data based on the artificial intelligence technology. Among these, artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a digital computer-controlled machine to simulate, extend and extend human intelligence, sense the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a target detection method, and relates to the technical field of artificial intelligence. The target detection method provided by the embodiment of the application can be applied to a scenario comprising an edge device and a server. An edge device, i.e., an edge computing device, is hardware that brings edge computing into different industries; edge devices are close to the data source, enable real-time data processing at very high speed, and accomplish different tasks depending on the software applications or functions they are configured with. The edge computing device may be a terminal such as a smartphone, a drone, a tablet computer, a notebook computer, or a desktop computer; it may be an edge server deployed near the place where data are generated; or it may be a dedicated edge computing device. The server may be configured as an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data, and artificial intelligence platforms.
The application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
In the embodiments of the present application, when related processing is performed according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of the data comply with related laws and regulations and standards of related countries and regions. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through popup or jump to a confirmation page and the like, and after the independent permission or independent consent of the user is definitely acquired, the necessary relevant data of the user for enabling the embodiment of the application to normally operate is acquired.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating steps of a target detection method according to an embodiment of the present application, where an execution body of the target detection method is an edge device, including but not limited to steps S101 to S107.
In step S101, a video stream is acquired by an image capturing unit.
It should be understood that the camera unit here may take various forms. For example, the edge device may be a terminal with a camera unit, i.e., the camera unit is arranged in and integrated with the edge device; alternatively, the camera unit may be disposed in an external device that is communicatively connected to the edge device. A person skilled in the art can select an appropriate camera unit according to the actual situation, which the present application does not limit.
It should be understood that the acquisition modes here are various and correspond to the arrangement of the camera unit, which the present application does not limit.
In some possible embodiments of the present application, the edge device obtains a complete video stream through a camera carried by itself, and uploads the video stream to the server.
In step S102, object detection is performed on each video frame in the video stream by using the object detection model located in the edge device, so as to generate an object bounding box in each video frame.
It should be understood that a suspected pseudo target here refers to an object whose pixel values are highly similar to those of the real target. For example, in a frame in which a face and a hand appear at the same time, the face is the real target; because skin colors are similar, the pixel values of the hand have high similarity to those of the face, and the hand may be mistaken for a face. In this case, pending further analysis, the hand is the suspected pseudo target.
It should be understood that the target bounding boxes here are recorded as coordinates, and their visualization forms are various; for example, when different objects are detected, all target bounding boxes may be visualized. A person skilled in the art may select a suitable visualization form for the target bounding boxes according to actual detection and model training needs, which the present application does not limit.
It should be understood that the visualization types of target bounding boxes here are various; for example, the real target bounding box and the suspected pseudo-target bounding box may be visualized with bounding boxes of different colors, which the present application does not limit.
It should be understood that the number of each type of target bounding box is various, and may be one or more, and the number of each target bounding box acquired in the current video frame may be set by the skilled person according to the actual situation, which is not limited by the present application.
Step S103, traversing each video frame in the video stream, and performing the following processing on the currently traversed video frame: if the real target boundary frame and the suspected pseudo target boundary frame exist in the current traversal video frame, respectively generating first detection boundary frames matched with the suspected pseudo target boundary frames in the first M video frames and the last N video frames of the current traversal video frame, calculating the correlation degree according to the first detection boundary frames and the real target boundary frames, and if the current traversal video frame is determined to belong to a difficult case according to the correlation degree, adding the current traversal video frame into the difficult case set.
It should be understood that M may take various values here: if the currently traversed video frame is the first frame, M equals 0; if not, M is a preset value greater than 1, which a person skilled in the art can set according to actual needs, and the present application does not limit this.
It should be appreciated that N may likewise take various values: if the currently traversed video frame is the last frame, N equals 0; if not, N is a preset value greater than 1, which a person skilled in the art can set according to actual needs, and the present application does not limit this.
It should be understood that the algorithm for calculating the correlation degree is various, and exemplary, such as a structural similarity algorithm (Structure Similarity Index Measure, SSIM), and further, such as an cross-correlation algorithm, those skilled in the art may choose to calculate the correlation degree for a suitable algorithm according to actual needs, which is not limited by the present application.
It should be understood that determining, according to the correlation degree, whether the currently traversed video frame belongs to a difficult case refers to comparing the correlation degree with a preset threshold.
Step S104, the difficult case set is sent to the server side, so that the server side trains copies of the target detection model according to the difficult case set, and model update data are obtained.
It should be understood that, the number of video frames in the difficult case set is a preset value, when the number of video frames reaches the preset value, the difficult case set is sent to the server, and the preset value can be set by those skilled in the art according to the actual situation, which is not limited by the present application.
It should be understood that the copy here refers to a copy of the target detection model currently performing target detection, not the original model whose weight parameters have never been adjusted.
It should be understood that the model update data herein refers to updated values of some parameters in the model for the current object detection model.
It should be understood that, in addition to the difficult-case video frames themselves, the difficult case set sent to the server also includes, for each difficult case, one video frame taken from its corresponding preceding and following frames. The manner of acquiring this video frame is various; for example, it may be extracted at random, or extracted in order of correlation degree. A person skilled in the art can acquire the video frame according to actual needs, which the present application does not limit.
Step S105, receiving the model update data sent by the server.
And step S106, updating the target detection model according to the model updating data to obtain an updated target detection model.
In some embodiments, before updating the object detection model according to the model update data, the method further includes: obtaining a target detection backup model; performing model updating on the target detection backup model according to the model updating data to obtain an updated target detection backup model; and carrying out target detection on the video stream acquired by the camera unit through the updated target detection backup model.
It will be appreciated that the edge device holds two identical target detection models: a service model and a target detection backup model, the latter used to keep the target detection service running while update data are applied. During an update, the service model continues to serve while the backup model applies the update data. Once updated, the backup model is the updated target detection model; it is put into service as the new service model, the old service model is taken out of service, and the old service model is then updated with the same data to become the new backup model.
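For illustration only, a minimal sketch of this dual-model, hot-swap style update is given below; `detect` and `load_update` are hypothetical methods standing in for the device's actual inference and weight-loading interfaces:

```python
class DualModelServer:
    """Blue/green-style update: the backup model is updated and swapped in,
    so the detection service is never interrupted (hypothetical interface)."""

    def __init__(self, service_model, backup_model):
        self.service_model = service_model   # currently serving requests
        self.backup_model = backup_model     # identical copy, updated offline

    def detect(self, frame):
        return self.service_model.detect(frame)

    def apply_update(self, update_data):
        # 1. Update the idle backup while the service model keeps serving.
        self.backup_model.load_update(update_data)
        # 2. Swap roles: the updated backup becomes the new service model.
        self.service_model, self.backup_model = self.backup_model, self.service_model
        # 3. Bring the now-idle old service model up to date as the new backup.
        self.backup_model.load_update(update_data)
```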
In step S107, the video stream acquired by the image capturing unit is subjected to target detection by the updated target detection model.
It should be understood that the updated target detection model here refers to the model obtained by applying the update data to the backup model, not the model obtained by applying the update data to the service model.
It should be understood that the video stream here may take various forms: it may be the video stream of step S101, detected again with the updated target detection model, or it may differ from step S101, with the camera unit acquiring a new video stream distinct from the old one. A person skilled in the art may use the updated target detection model to detect a specific video stream according to the actual situation, which the present application does not limit.
In steps S101 to S107 of the embodiment of the application, a target bounding box is generated by detecting targets in the video frames of the acquired video stream; it is then checked whether the same bounding box exists in the preceding and following video frames, and if not, first detection bounding boxes are generated in those frames and the degree of correlation between the first detection bounding boxes and the real target bounding box is calculated; if the correlation indicates that the current video frame is a difficult case, the frame is added to the difficult case set. The difficult case set is sent to the server to obtain the model update data sent by the server; after the update data are received, the model is updated, and the updated model is used to detect the video stream. Detection errors of the current model are obtained by detecting unstable target bounding boxes; these errors are sent to the server to retrain the model and obtain update data, and the current model is updated with the data, improving the model's ability to recognize targets and its robustness when recognizing targets in an environment containing similar objects.
Referring to fig. 2, fig. 2 is a schematic diagram illustrating sub-steps of step S103 in fig. 1. In some possible embodiments of the present application, step S103 includes, but is not limited to, the following substeps.
Step S201, a target bounding box existing in the current traversal video frame is acquired.
It should be appreciated that the target bounding boxes obtained here are not yet classified into real target bounding boxes and suspected pseudo-target bounding boxes.
It should be appreciated that the number of target bounding boxes here may vary; acquiring the target bounding boxes present in the current video frame means acquiring all of them.
In step S202, for each acquired target bounding box, it is determined whether the target bounding box exists in an adjacent video frame, respectively.
It should be understood that adjacent video frames herein refer to the first M video frames and the last N video frames of the current video frame.
It should be understood that the decision of whether a target bounding box exists in adjacent video frames is based on the instability of pseudo targets: a real target is recognized in every frame in which it appears, whereas a pseudo target is recognized only with some probability. An object that is recognized as a target in the current frame but not in the preceding or following frames can therefore be judged to be a suspected pseudo target.
In step S203, if the target bounding box exists in the adjacent video frames, the target bounding box is determined as a real target bounding box.
Specifically, when a target bounding box exists in an adjacent video frame, the targets in the target bounding box appear stably in the adjacent video frame, and are regarded as real targets, and the target bounding box that frames the real target is determined as a real target bounding box.
In step S204, if the target bounding box does not exist in the adjacent video frames, the target bounding box is determined as a suspected pseudo-target bounding box.
It should be understood that "absence in adjacent video frames" here means that no corresponding target bounding box is generated there; the object framed by the target bounding box may still be present in the adjacent frames but simply not be detected as a target.
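As a sketch of the sub-steps S201 to S204 above, the following Python fragment splits the boxes of the traversed frame according to whether a matching box persists in the M preceding and N following frames; `same_box` is a hypothetical predicate (e.g., an IoU or center-distance test) that the patent does not specify:

```python
def classify_boxes(detections, t, same_box, M, N):
    """Split the target bounding boxes of frame t into real and suspected
    pseudo-target boxes (sketch). `detections` is a per-frame list of boxes;
    `same_box(a, b)` decides whether two boxes frame the same object."""
    real, suspected = [], []
    lo, hi = max(0, t - M), min(len(detections) - 1, t + N)
    neighbors = [i for i in range(lo, hi + 1) if i != t]
    for box in detections[t]:
        persists = all(
            any(same_box(box, other) for other in detections[i])
            for i in neighbors
        )
        # A box appearing in every adjacent frame is stable, hence real;
        # one vanishing in some adjacent frame is a suspected pseudo target.
        (real if persists else suspected).append(box)
    return real, suspected
```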
Referring to fig. 3, fig. 3 is a schematic diagram illustrating a detection process according to an embodiment of the application.
In fig. 3, there are three frames: the middle frame is the currently traversed video frame, the left frame is the preceding frame, and the right frame is the following frame. In this example, the hand is erroneously detected, and the head is the target to be detected. In step S201, the recognized target bounding boxes, i.e., the bounding box of the head and the bounding box of the hand, are acquired. In the pictures of the preceding and following frames, both the hand and the head appear; the bounding box detecting the head appears in the preceding and following frames, but the bounding box detecting the hand does not. Therefore, the head bounding box is determined as a real target bounding box, and the hand bounding box is determined as a suspected pseudo-target bounding box.
According to the embodiment of the application, the type of the target boundary frame is determined by whether the target boundary frame exists in the adjacent video frames, so that the mining efficiency of difficult cases is improved, the sample marking time is reduced, and the training efficiency of the model is improved.
Referring to fig. 4, fig. 4 is a schematic diagram illustrating another sub-step of step S103 in fig. 1. In some possible embodiments of the present application, after step S204, generating the first detection bounding box in step S103 includes, but is not limited to, the following substeps.
Step S301, for each of the first M video frames and the last N video frames, searching a region matching with the suspected pseudo-target bounding box in the video frames through a template matching algorithm.
It should be understood that the template used in the template matching algorithm is the target in the suspected pseudo-target bounding box from step S204.
It should be understood that searching for a region matching the boundary box of the suspected pseudo-object herein refers to searching for whether each of the first M video frames and the last N video frames has a target that is the same as or similar to the template according to the template.
Step S302, a first detection boundary box is determined according to the searched area.
It should be understood that the generation manner here is the same as that of the target bounding box in step S102.
It should be appreciated that the number of regions found here may vary, corresponding to the template's occurrences in the video frame, and the number of first detection bounding boxes corresponds to the number of regions found.
The template matching algorithm is used for searching the matched areas in the front video frame and the rear video frame, and the first detection boundary box is determined according to the matched areas, so that the efficiency of searching the suspected pseudo-target is improved, and the consumption of computing resources of the edge equipment for searching the suspected pseudo-target is reduced.
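For illustration, a sketch of this search using OpenCV's template matching is shown below; the choice of the `TM_CCOEFF_NORMED` score is an assumption, since the patent does not fix a particular matching method:

```python
import cv2
import numpy as np

def find_first_detection_box(frame: np.ndarray, template: np.ndarray):
    """Locate the region of `frame` that best matches the suspected
    pseudo-target patch `template` (sketch of one possible choice)."""
    result = cv2.matchTemplate(frame, template, cv2.TM_CCOEFF_NORMED)
    _, max_val, _, max_loc = cv2.minMaxLoc(result)  # best match and its score
    h, w = template.shape[:2]
    x, y = max_loc
    return (x, y, x + w, y + h), max_val  # box in (x1, y1, x2, y2) form
```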
Referring to fig. 5, fig. 5 is a schematic diagram illustrating another sub-step of step S103 in fig. 1. In some possible embodiments of the present application, step S103 further includes, but is not limited to, the following steps after step S302 and before calculating the degree of correlation.
Step S401, for each of the first M video frames and the last N video frames, enlarges the first detection bounding box, and generates a second detection bounding box in the video frames.
It should be appreciated that the bounding box can be enlarged in various ways; for example, the first detection bounding box may be enlarged by a preset area, or its sides may be enlarged by a preset ratio. A person skilled in the art can select a suitable enlarging method according to actual needs, which the present application does not limit.
It should be understood that the number of second detection bounding boxes corresponds to the number of first detection bounding boxes in step S302: each first detection bounding box in step S302 is enlarged to generate its own second detection bounding box.
The following description proceeds with reference to FIG. 3.
Taking a following frame as an example, the first detection bounding box is determined by the template matching algorithm and is then expanded into the second detection bounding box on that basis. The preceding frames are handled in the same way and are not described again here.
In some possible embodiments of the application, the boundary of the first detection bounding box is enlarged by 100 pixels, so that the new boundary is separated from the old boundary by a distance of 100 pixels.
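A minimal sketch of this enlargement, assuming boxes in (x1, y1, x2, y2) pixel coordinates; the clipping to the frame borders is an assumption, not stated in the patent:

```python
def enlarge_box(box, frame_w, frame_h, margin=100):
    """Expand a (x1, y1, x2, y2) box by `margin` pixels on every side,
    clipped to the frame, yielding the second detection bounding box."""
    x1, y1, x2, y2 = box
    return (max(0, x1 - margin), max(0, y1 - margin),
            min(frame_w, x2 + margin), min(frame_h, y2 + margin))
```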
Step S402, calculating normalized cross correlation NCC value according to the second detection boundary frame and the suspected pseudo-target boundary frame, and taking the video frame as the reserved video frame if the NCC value is larger than or equal to a first preset threshold value.
It should be understood that the values of the first preset threshold value herein are various, and those skilled in the art can set specific values according to actual needs, which is not limited by the present application.
It should be understood that the first preset threshold is used to decide whether the second detection bounding box and the suspected pseudo-target bounding box contain the same target. The calculated NCC value verifies whether the template matching result of step S301 is reliable: if the NCC value is greater than or equal to the first preset threshold, the target in the second detection bounding box and the target in the suspected pseudo-target bounding box are the same, i.e., both are the suspected pseudo target.
The similarity between the second detection bounding box and the suspected pseudo-target bounding box is calculated with the normalized cross-correlation algorithm, screening the useful preceding and following frames for analysis and reducing the influence on subsequent difficult-case mining of template matching errors caused by large changes between frames.
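As an illustrative sketch, this screening step can be approximated by sliding the suspected pseudo-target patch over the enlarged region and thresholding the best normalized correlation score; the 0.8 threshold and the use of `TM_CCOEFF_NORMED` are assumptions, not values from the patent:

```python
import cv2

def keep_frame(frame, second_box, template, threshold=0.8):
    """Keep the adjacent frame only if the suspected pseudo-target patch
    correlates strongly with the enlarged (second) detection region."""
    x1, y1, x2, y2 = second_box
    region = frame[y1:y2, x1:x2]  # enlarged region contains the template size
    scores = cv2.matchTemplate(region, template, cv2.TM_CCOEFF_NORMED)
    return float(scores.max()) >= threshold
```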
Referring to fig. 6, fig. 6 is a schematic diagram illustrating another sub-step of step S103 in fig. 1. In some possible embodiments of the present application, after step S402, calculating the degree of correlation in step S103 includes, but is not limited to, the following steps.
Step S501, for the reserved video frame, calculates the intersection ratio of the first detection bounding box and the real target bounding box in the reserved video frame.
It should be understood that calculating the intersection ratio here refers to calculating the intersection over union of every first detection bounding box in the same video frame with the real target bounding box; each first detection bounding box is compared with the real target bounding box once.
Step S502, determining the correlation degree of the first detection boundary frame and the real target boundary frame according to the intersection ratio.
It should be understood that the correlation degree is determined according to a preset threshold: when the intersection ratio is smaller than the threshold, the target in the first detection bounding box and the real target bounding box are not the same target, and the suspected pseudo target in the first detection bounding box is determined to be a pseudo target; when the intersection ratio is greater than the threshold, they are the same target, and the suspected pseudo target in the first detection bounding box is determined to be a real target.
The target correlation degree between the first detection bounding box and the real target bounding box is determined by calculating the intersection ratio, so that difficult cases can be identified; this improves the efficiency of acquiring difficult cases and the quality of the acquired difficult cases, and reduces, during subsequent model training, systematic recognition errors caused by similar targets in the difficult cases.
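A sketch of this decision, reusing the `iou` helper sketched after the IOU definition above; the 0.5 threshold is illustrative and not taken from the patent:

```python
def is_difficult_case(first_boxes, real_boxes, iou_threshold=0.5):
    """Mine the traversed frame as a difficult case when every first
    detection box stays below the IoU threshold against all real target
    boxes, i.e., the suspected target is judged to be a pseudo target."""
    for fb in first_boxes:
        for rb in real_boxes:
            if iou(fb, rb) >= iou_threshold:
                return False  # overlaps a real target: same object, not a pseudo target
    return True
```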
Referring to fig. 7, fig. 7 is a schematic diagram illustrating the substeps of step S106 in fig. 1. In some possible embodiments of the present application, step S106 includes, but is not limited to, the following substeps.
And step S601, labeling the video frames in the difficult case set by adopting a teacher model to obtain a labeling target boundary box corresponding to the video frames.
It should be understood that the teacher model is different from the target detection model and is only used to label the video frames in the difficult case set, adding a labeled target bounding box to each video frame.
It should be understood that various teacher models can be used, such as the DeepLabv3 model or the Xception-65 model; a person skilled in the art may select an appropriate model as the teacher model according to the actual situation, which the present application does not limit.
Step S602, inputting the video frames in the difficult case set into a copy of the target detection model to obtain a prediction target boundary box.
It should be understood that the predicted target bounding box for each video frame here is generated in the same manner as the target bounding box in step S102.
And step S603, determining a loss value according to the labeling target boundary box and the prediction target boundary box corresponding to each video frame in the difficult case set.
It should be understood that the loss value here is obtained by aggregating the loss values of all video frames: for each video frame, a loss value is computed from its labeled target bounding box and predicted target bounding box with a preset loss function, and the per-frame loss values are then combined to determine the final loss value.
It should be understood that the manner of determining the final loss value herein is various, and exemplary, the final loss value is determined by an average value of the loss values, and a person skilled in the art may select an appropriate loss value determining method according to the actual situation, which is not limited by the present application.
Step S604, adjusting model parameters of the copy of the target detection model according to the loss value until a preset training ending condition is met.
It should be understood that the model parameters herein are various, and may be coefficients in the loss function, or weight parameters in each network layer of the model, which is not limited by the present application.
It should be understood that the preset training ending conditions herein are various, and those skilled in the art may set specific training ending conditions according to actual situations, which the present application is not limited to.
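For illustration, one possible shape of this server-side training loop (steps S601 to S604) is sketched below in PyTorch; the framework, the form of `box_loss`, and the stopping rule are all assumptions, since the patent leaves them open:

```python
import torch

def train_on_difficult_cases(student, teacher, loader, optimizer, box_loss, max_epochs=10):
    """Train the model copy on difficult-case frames using teacher labels
    (sketch; `box_loss` could be, e.g., an IoU- or L1-based regression loss)."""
    teacher.eval()
    student.train()
    for epoch in range(max_epochs):
        total = 0.0
        for frames in loader:
            with torch.no_grad():
                labeled_boxes = teacher(frames)      # step S601: teacher labels
            predicted_boxes = student(frames)        # step S602: predictions
            loss = box_loss(predicted_boxes, labeled_boxes)  # step S603
            optimizer.zero_grad()
            loss.backward()                          # step S604: adjust parameters
            optimizer.step()
            total += float(loss)
        if total / len(loader) < 1e-3:               # illustrative end condition
            break
```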
The scheme of the embodiment of the application can be applied to security scenes in the financial industry. The method comprises the steps of obtaining a video stream shot by a bank monitoring camera unit, and respectively carrying out target detection on each video frame in the video stream through a target detection model of the edge equipment so as to generate a target boundary frame in each video frame; traversing each video frame in the video stream and performing the following processing on the currently traversed video frame: if a real target boundary frame and a suspected pseudo target boundary frame exist in the current traversal video frame, respectively generating first detection boundary frames matched with the suspected pseudo target boundary frames in the first M video frames and the last N video frames of the current traversal video frame, calculating the correlation degree according to the first detection boundary frames and the real target boundary frames, and if the current traversal video frame is determined to belong to a difficult case according to the correlation degree, adding the current traversal video frame into the difficult case set, wherein M is greater than or equal to 0, and N is greater than or equal to 0; the difficult case set is sent to a server, so that the server trains copies of the target detection model according to the difficult case set to obtain model update data; receiving model update data sent by the server; updating the target detection model on the edge equipment according to the model updating data to obtain an updated target detection model; and the edge equipment performs target detection on the video stream acquired by the bank monitoring camera unit through the updated target detection model so as to perform security monitoring on the bank environment.
A target bounding box is generated by detecting targets in the video frames of the acquired video stream; it is then checked whether the same bounding box exists in the preceding and following video frames, and if not, a first detection bounding box is generated in those frames and the degree of correlation between the first detection bounding box and the real target bounding box is calculated; if the correlation indicates that the current video frame is a difficult case, the frame is added to the difficult case set. The difficult case set is sent to the server to obtain the model update data sent by the server; after the update data are received, the model is updated, and the updated model is used to detect the video stream. Detection errors of the current model are obtained by detecting unstable target bounding boxes; these errors are sent to the server to retrain the model and obtain update data, and the current model is updated with the data, improving the model's ability to recognize targets and its robustness when recognizing targets in an environment containing similar objects, and further improving the reliability of security monitoring in the financial industry.
Referring to fig. 8, fig. 8 is a schematic structural diagram of an object detection device according to an embodiment of the present application, which may implement the above-mentioned object detection method, the device 700 includes:
A video stream obtaining module 701, configured to obtain a video stream through a camera unit;
a first object detection module 702, configured to perform object detection on each video frame in the video stream through an object detection model located at an edge device, so as to generate an object bounding box in each video frame;
the refractory mining module 703 is configured to traverse each video frame in the video stream, and perform the following processing on the currently traversed video frame: if a real target boundary frame and a suspected pseudo target boundary frame exist in the current traversal video frame, respectively generating first detection boundary frames matched with the suspected pseudo target boundary frames in the first M video frames and the last N video frames of the current traversal video frame, calculating the correlation degree according to the first detection boundary frames and the real target boundary frames, and if the current traversal video frame is determined to belong to a difficult case according to the correlation degree, adding the current traversal video frame into the difficult case set, wherein M is greater than or equal to 0, and N is greater than or equal to 0;
the sending module 704 is configured to send the difficult case set to a server, so that the server trains a copy of the target detection model according to the difficult case set to obtain model update data;
a receiving module 705, configured to receive model update data sent by the server;
An updating module 706, configured to update the target detection model according to the model update data, to obtain an updated target detection model;
and the second target detection module 707, configured to perform target detection, through the updated target detection model, on the video stream acquired by the camera unit.
The specific embodiments of the target detection device are substantially the same as those of the target detection method described above, and will not be repeated here.
The embodiments of the present application also provide an electronic device, which includes a memory and a processor; the memory stores a computer program, and the processor implements the above target detection method when executing the computer program. The electronic device may be any intelligent edge device, such as a tablet computer or a vehicle-mounted computer.
Referring to fig. 9, fig. 9 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and the electronic device 800 includes:
the processor 801, which may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application-Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 802, which may be implemented as a Read-Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 802 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present application are implemented in software or firmware, the relevant program code is stored in the memory 802 and invoked by the processor 801 to execute the target detection method of the embodiments of the present application;
an input/output interface 803 for implementing information input and output;
the communication interface 804, configured to implement communication between this device and other devices, either in a wired manner (e.g., USB or network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, or Bluetooth);
a bus 805 that transfers information between the various components of the device (e.g., the processor 801, the memory 802, the input/output interface 803, and the communication interface 804);
wherein the processor 801, the memory 802, the input/output interface 803, and the communication interface 804 are communicatively connected to one another within the device via the bus 805.
The embodiments of the present application also provide a computer-readable storage medium storing a computer program; the computer program, when executed by a processor, implements the above target detection method.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the present application provide a target detection method and device, an electronic device, and a storage medium. Targets in the video frames of an acquired video stream are detected to generate target bounding boxes; the preceding and following video frames are searched for a matching target bounding box, and if none exists, a first detection bounding box is generated in those frames; the correlation degree between the first detection bounding box and the real target bounding box is calculated, and if the correlation degree indicates that the current video frame is a difficult case, the frame is added to the difficult case set. The difficult case set is sent to a server to obtain model update data; once the update data is received, the model is updated and the updated model is used to detect the video stream. By detecting unstable target bounding boxes, the detection errors of the current model are exposed; these errors are sent to the server to retrain the model and obtain update data, with which the current model is updated, improving the model's ability to recognize targets and its robustness when recognizing targets in environments containing similar objects.
The embodiments described above are intended to describe the technical solutions of the embodiments of the present application more clearly, and do not constitute a limitation on those technical solutions. Those skilled in the art will appreciate that, as technology evolves and new application scenarios emerge, the technical solutions provided by the embodiments of the present application remain applicable to similar technical problems.
It will be appreciated by persons skilled in the art that the embodiments of the application are not limited by the illustrations, and that more or fewer steps than those shown may be included, or certain steps may be combined, or different steps may be included.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the application and in the above figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in the present application, "at least one (item)" means one or more, and "a plurality" means two or more. "and/or" for describing the association relationship of the association object, the representation may have three relationships, for example, "a and/or B" may represent: only a, only B and both a and B are present, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", wherein a, b, c may be single or plural.
In the several embodiments provided by the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium and including multiple instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes various media capable of storing a program, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
The preferred embodiments of the present application have been described above with reference to the accompanying drawings, which does not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions, and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of those claims.

Claims (10)

1. A method of object detection, characterized in that it is applied to an edge device, said method comprising the steps of:
acquiring a video stream through a camera unit;
performing target detection on each video frame in the video stream through a target detection model located at the edge device, so as to generate target bounding boxes in each video frame;
traversing each video frame in the video stream and performing the following processing on the currently traversed video frame: if a real target bounding box and a suspected pseudo-target bounding box exist in the currently traversed video frame, generating first detection bounding boxes matching the suspected pseudo-target bounding box in the M video frames preceding and the N video frames following the currently traversed video frame, calculating a correlation degree from the first detection bounding boxes and the real target bounding box, and, if the currently traversed video frame is determined to be a difficult case according to the correlation degree, adding it to a difficult case set, wherein M is greater than or equal to 0 and N is greater than or equal to 0;
sending the difficult case set to a server, so that the server trains a copy of the target detection model on the difficult case set to obtain model update data;
receiving the model update data sent by the server;
updating the target detection model according to the model update data to obtain an updated target detection model;
and carrying out target detection on the video stream acquired by the camera unit through the updated target detection model.
2. The target detection method according to claim 1, wherein determining that a real target bounding box and a suspected pseudo-target bounding box exist in the currently traversed video frame comprises:
acquiring the target bounding boxes existing in the currently traversed video frame;
for each acquired target bounding box, determining whether the target bounding box exists in the adjacent video frames;
if the target bounding box exists in the adjacent video frames, determining the target bounding box to be the real target bounding box;
and if the target bounding box does not exist in the adjacent video frames, determining the target bounding box to be the suspected pseudo-target bounding box.
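As one illustrative reading of claim 2, a box can be treated as real when some detection in an adjacent frame overlaps it, and as a suspected pseudo target otherwise. The pure-overlap test below is an assumption added here; the claim leaves the matching criterion open.

```python
def overlaps(a, b):
    # Axis-aligned boxes (x1, y1, x2, y2); True if the two boxes intersect.
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def classify_boxes(current_boxes, prev_boxes, next_boxes):
    """Split the current frame's boxes into real target bounding boxes
    (re-detected in an adjacent frame) and suspected pseudo-target
    bounding boxes (absent from the adjacent frames)."""
    real, suspected = [], []
    for box in current_boxes:
        if any(overlaps(box, nb) for nb in prev_boxes + next_boxes):
            real.append(box)
        else:
            suspected.append(box)
    return real, suspected
```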
3. The target detection method according to claim 1, wherein before calculating the correlation degree from the first detection bounding box and the real target bounding box, the method further comprises:
for each of the M preceding video frames and the N following video frames, enlarging the first detection bounding box to generate a second detection bounding box in that video frame;
and calculating a normalized cross-correlation (NCC) value from the second detection bounding box and the suspected pseudo-target bounding box, and, if the NCC value is greater than or equal to a first preset threshold, taking that video frame as a retained video frame.
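Claim 3 can be pictured with two small helpers: one enlarges the first detection bounding box into the second detection bounding box, the other computes the normalized cross-correlation of two grayscale patches. The enlargement factor, and the requirement that the two patches be resized to a common shape before comparison, are assumptions added here.

```python
import numpy as np

def enlarge(box, scale, img_w, img_h):
    """Grow an (x1, y1, x2, y2) box about its centre by `scale`, clipped to
    the image bounds, yielding the second detection bounding box."""
    cx, cy = (box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0
    hw, hh = (box[2] - box[0]) * scale / 2.0, (box[3] - box[1]) * scale / 2.0
    return (max(0, int(cx - hw)), max(0, int(cy - hh)),
            min(img_w, int(cx + hw)), min(img_h, int(cy + hh)))

def ncc(patch_a, patch_b):
    """Normalized cross-correlation of two equal-size grayscale patches;
    values close to 1 indicate a strong visual match."""
    a = patch_a.astype(np.float64) - patch_a.mean()
    b = patch_b.astype(np.float64) - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0
```

A video frame would then be retained whenever ncc(second_patch, suspected_patch) meets the first preset threshold.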
4. The target detection method according to claim 3, wherein calculating the correlation degree from the first detection bounding box and the real target bounding box comprises:
for each retained video frame, calculating the intersection-over-union (IoU) ratio of the first detection bounding box and the real target bounding box in that retained video frame;
and determining the correlation degree of the first detection bounding box and the real target bounding box according to the intersection-over-union ratio.
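The intersection-over-union ratio of claim 4 is a standard quantity; the sketch below is self-contained (the (x1, y1, x2, y2) box format is an assumption):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes, in [0, 1]."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Example: first detection box vs. real target box
print(iou((10, 10, 50, 50), (30, 30, 70, 70)))  # ~0.143
```

A consistently low IoU over the retained frames indicates that the first detection bounding boxes do not track the real target, which is the signal used to flag a difficult case.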
5. The target detection method according to claim 1, wherein generating first detection bounding boxes matching the suspected pseudo-target bounding box in the M video frames preceding and the N video frames following the currently traversed video frame comprises:
for each of the M preceding video frames and the N following video frames, searching, by a template matching algorithm, for a region in that video frame that matches the suspected pseudo-target bounding box;
and determining the first detection bounding box according to the found region.
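Claim 5's region search can be illustrated with OpenCV template matching, using the patch under the suspected pseudo-target bounding box as the template. Treating TM_CCOEFF_NORMED as the matching score is an assumption; the claim does not fix a particular algorithm variant.

```python
import cv2

def first_detection_box(frame_gray, template_gray):
    """Slide the template over the frame and return the best-matching
    region as an (x1, y1, x2, y2) first detection bounding box."""
    scores = cv2.matchTemplate(frame_gray, template_gray, cv2.TM_CCOEFF_NORMED)
    _, _, _, top_left = cv2.minMaxLoc(scores)  # (x, y) of the best score
    h, w = template_gray.shape[:2]
    x, y = top_left
    return (x, y, x + w, y + h)
```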
6. The target detection method according to claim 1, wherein training the copy of the target detection model on the difficult case set to obtain the model update data comprises:
labeling the video frames in the difficult case set with a teacher model to obtain the labeled target bounding box corresponding to each video frame;
inputting the video frames in the difficult case set into the copy of the target detection model to obtain predicted target bounding boxes;
determining a loss value from the labeled target bounding box and the predicted target bounding box corresponding to each video frame in the difficult case set;
and adjusting the model parameters of the copy of the target detection model according to the loss value until a preset training-end condition is met.
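The server-side training of claim 6 amounts to teacher-supervised fine-tuning of the model copy. The PyTorch sketch below assumes, for brevity, that both models map a batch of frames directly to box tensors and that a smooth-L1 regression loss is used; the actual loss, optimizer, and stopping condition are not specified by the claim.

```python
import torch
import torch.nn.functional as F

def retrain_on_hard_cases(model_copy, teacher, hard_case_loader,
                          epochs=5, lr=1e-4):
    """Fine-tune the copy of the target detection model on the difficult
    case set, using the teacher's outputs as labeled target boxes."""
    optimizer = torch.optim.Adam(model_copy.parameters(), lr=lr)
    teacher.eval()
    model_copy.train()
    for _ in range(epochs):                      # stands in for the preset
        for frames in hard_case_loader:          # training-end condition
            with torch.no_grad():
                labeled_boxes = teacher(frames)      # annotation step
            predicted_boxes = model_copy(frames)     # prediction step
            loss = F.smooth_l1_loss(predicted_boxes, labeled_boxes)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model_copy.state_dict()               # the model update data
```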
7. The target detection method according to claim 1, wherein before updating the target detection model according to the model update data, the method further comprises:
obtaining a target detection backup model;
updating the target detection backup model according to the model update data to obtain an updated target detection backup model;
and performing target detection, through the updated target detection backup model, on the video stream acquired by the camera unit.
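Claim 7 keeps detection available while the primary model is being replaced by updating a backup model first. A minimal sketch, assuming PyTorch modules and a state_dict as the model update data:

```python
def apply_update(primary, backup, update_state_dict):
    """Load the update into the backup model and serve detection from it
    while the primary model is updated, so detection never goes offline."""
    backup.load_state_dict(update_state_dict)    # updated backup model
    active = backup                              # detect with the backup
    primary.load_state_dict(update_state_dict)   # then refresh the primary
    return active
```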
8. An object detection device, the device comprising:
the video stream acquisition module is used for acquiring a video stream through the camera unit;
the first target detection module is used for performing target detection on each video frame in the video stream through a target detection model located at the edge device, so as to generate target bounding boxes in each video frame;
the difficult-case mining module is used for traversing each video frame in the video stream and performing the following processing on the currently traversed video frame: if a real target bounding box and a suspected pseudo-target bounding box exist in the currently traversed video frame, generating first detection bounding boxes matching the suspected pseudo-target bounding box in the M video frames preceding and the N video frames following the currently traversed video frame, calculating a correlation degree from the first detection bounding boxes and the real target bounding box, and, if the currently traversed video frame is determined to be a difficult case according to the correlation degree, adding it to a difficult case set, wherein M is greater than or equal to 0 and N is greater than or equal to 0;
The sending module is used for sending the difficult case set to a server side so that the server side trains the copy of the target detection model according to the difficult case set to obtain model update data;
the receiving module is used for receiving the model update data sent by the server;
the updating module is used for updating the target detection model according to the model updating data to obtain an updated target detection model;
and the second target detection module is used for carrying out target detection on the video stream acquired by the camera unit through the updated target detection model.
9. An electronic device, comprising a memory and a processor, wherein the memory stores a computer program or instructions, and the processor, when executing the computer program or instructions, implements the method of any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program or instructions which, when executed by a processor, implement the method of any one of claims 1 to 7.
CN202310672081.9A (priority date 2023-06-07, filing date 2023-06-07): Target detection method and device, electronic equipment and storage medium. Status: Pending. Published as CN116704436A (en).

Priority Applications (1)

Application Number: CN202310672081.9A; Publication: CN116704436A (en); Priority Date: 2023-06-07; Filing Date: 2023-06-07; Title: Target detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number: CN202310672081.9A; Publication: CN116704436A (en); Priority Date: 2023-06-07; Filing Date: 2023-06-07; Title: Target detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number: CN116704436A (en); Publication Date: 2023-09-05

Family

Family ID: 87836852

Family Applications (1)

Application Number: CN202310672081.9A; Publication: CN116704436A (en); Status: Pending; Title: Target detection method and device, electronic equipment and storage medium

Country Status (1)

Country: CN; Link: CN116704436A (en)

Similar Documents

Publication Publication Date Title
US20210279513A1 (en) Target detection method and apparatus, model training method and apparatus, device, and storage medium
JP6397144B2 (en) Business discovery from images
CN109977921B (en) Method for detecting hidden danger of power transmission line
JP6099793B2 (en) Method and system for automatic selection of one or more image processing algorithms
CN109101602A (en) Image encrypting algorithm training method, image search method, equipment and storage medium
WO2022104503A1 (en) Method for identifying adversarial sample, and related device
US9489582B2 (en) Video anomaly detection based upon a sparsity model
CN111488873B (en) Character level scene text detection method and device based on weak supervision learning
CN112037142B (en) Image denoising method, device, computer and readable storage medium
CN114157829A (en) Model training optimization method and device, computer equipment and storage medium
CN109376736A (en) A kind of small video target detection method based on depth convolutional neural networks
US12020510B2 (en) Person authentication apparatus, control method, and non-transitory storage medium
CN111881740A (en) Face recognition method, face recognition device, electronic equipment and medium
KR102576747B1 (en) System for local optimization of objects detector based on deep neural network and method for creating local database thereof
Sahbi Relevance feedback for satellite image change detection
CN116704436A (en) Target detection method and device, electronic equipment and storage medium
CN114067394A (en) Face living body detection method and device, electronic equipment and storage medium
CN113469138A (en) Object detection method and device, storage medium and electronic equipment
CN111368624A (en) Loop detection method and device based on generation of countermeasure network
CN111860070A (en) Method and device for identifying changed object
CN113128312B (en) Method and device for detecting position and working state of excavator
Jiang et al. Target‐Aware Deep Feature Compression for Power Intelligent Inspection Tracking
US20230368498A1 (en) System and method for detecting object in an adaptive environment using a machine learning model
CN116485888A (en) Method for positioning commodity in image, electronic equipment and computer storage medium
CN118279698A (en) Method, system, device and storage medium for defending against patches in image detection

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination