CN111126112A - Candidate region determination method and device - Google Patents

Candidate region determination method and device

Publication number
CN111126112A
CN111126112A
Authority
CN
China
Prior art keywords
video frame
monitoring video
frame
detection model
candidate area
Prior art date
Legal status
Granted
Application number
CN201811292150.9A
Other languages
Chinese (zh)
Other versions
CN111126112B (en)
Inventor
虢齐
张玉双
楚梦蝶
冯昊楠
袁益琴
Current Assignee
SF Technology Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SF Technology Co Ltd filed Critical SF Technology Co Ltd
Priority to CN201811292150.9A
Publication of CN111126112A
Application granted
Publication of CN111126112B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Abstract

The application discloses a candidate region determination method and device. The method comprises the following steps: acquiring a surveillance video frame; inputting the surveillance video frame into a pre-established target detection model and outputting a category identifier and the coordinate values of a candidate region, where the target detection model is trained based on a RefineDet network structure, the category identifier indicates whether the candidate region contains a target object, and the target object comprises a sorting-behavior subject in contact with a sorted object; and, if the category identifier indicates that the candidate region contains the target object, determining the surveillance video frame as a starting frame for analyzing violent sorting behavior. According to the technical scheme of the embodiments of the application, the RefineDet network structure can effectively improve the quality of candidate regions, thereby improving the accuracy of the violent-sorting-behavior recognition stage.

Description

Candidate region determination method and device
Technical Field
The present application relates generally to the field of computer vision, and more particularly, to a candidate region determination method and apparatus.
Background
Sorting is the operation of stacking articles of different categories in order, according to their category and their warehouse entry and exit sequence. The timeliness of sorting operations directly affects the development prospects of service providers.
Currently, in sorting scenarios, the occurrence of violent sorting behavior severely degrades a service provider's quality of service. To improve service quality, a video-based target detection and analysis method for violent sorting behavior has been proposed. Violent-sorting-behavior detection can be divided into a candidate-region generation stage and a violent-behavior recognition stage, and the quality of the generated candidate regions directly affects the recognition result. The spatial position of a candidate region is generally determined in a single video frame, after which sliding-window processing is performed along the time axis to obtain the candidate region's temporal position.
In violent-sorting-behavior detection, owing to the complexity of the video scene, the number of candidate regions grows in proportion to the number of pedestrians, which increases the false-detection rate of the violent-behavior recognition stage and makes recognition take too long.
Disclosure of Invention
In view of the above drawbacks and deficiencies of the prior art, it is desirable to provide a method, an apparatus, and a storage medium for determining candidate regions of violent sorting behavior, so as to reduce the time consumed by violent-sorting-behavior recognition and improve its accuracy.
In a first aspect, an embodiment of the present application provides a method for determining a candidate area of a violent sorting behavior, where the method includes:
acquiring a monitoring video frame;
inputting the surveillance video frame into a pre-established target detection model, and outputting a category identifier and the coordinate values of a candidate region, where the target detection model is trained based on a RefineDet network structure, the category identifier is used to indicate whether the candidate region contains a target object, and the target object comprises a sorting-behavior subject in contact with a sorted object;
and if the category identification indicates that the candidate area contains the target object, determining the monitoring video frame as a starting frame for analyzing the violent sorting behavior.
In a second aspect, an embodiment of the present application provides a violent-sorting-behavior candidate region determination apparatus, where the apparatus includes:
the video frame acquisition module is used for acquiring monitoring video frames;
the target detection module is used for inputting the surveillance video frame into a pre-established target detection model and outputting a category identifier and the coordinate values of a candidate region, where the target detection model is trained based on a RefineDet network structure, the category identifier is used to indicate whether the candidate region contains a target object, and the target object comprises a sorting-behavior subject in contact with a sorted object;
and the determining module is used for determining the monitoring video frame as a starting frame for analyzing violent sorting behaviors if the category identification indicates that the candidate area contains the target object.
According to the technical scheme for determining candidate regions of violent sorting behavior, surveillance video frames received in real time are recognized by the pre-established target detection model to obtain effective candidate regions usable for violent-sorting-behavior recognition, improving the quality of the candidate regions. When the target detection model is trained, video frames containing the target object are labeled and only candidate regions containing the target object are retained, which effectively reduces the number of candidate regions and speeds up violent-sorting-behavior recognition.
Further, by determining an end frame for analyzing violent sorting behavior, or by continuing to acquire new surveillance video frames when no target object is recognized, video-image processing efficiency can be improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:
fig. 1 is a schematic flow chart illustrating a method for determining a candidate area of violent sorting behavior according to an embodiment of the present application;
FIG. 2 is a schematic flow chart diagram illustrating a method for building an object detection model according to another embodiment of the present application;
fig. 3 is a block diagram illustrating an exemplary structure of a violent-sorting-behavior candidate region determination apparatus 300 according to an embodiment of the present application;
FIG. 4 is a block diagram illustrating an exemplary architecture of an apparatus 400 for modeling object detection provided in accordance with another embodiment of the present application;
FIG. 5 illustrates a schematic diagram of a computer system suitable for use in implementing a server according to embodiments of the present application.
Detailed Description
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the present invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict. The present application will be described in detail below with reference to the embodiments with reference to the attached drawings.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a method for determining a candidate area of a violent sorting behavior according to an embodiment of the present application.
As shown in fig. 1, the method includes:
step 110, acquiring a surveillance video frame.
In this embodiment of the application, after surveillance video data is acquired from a video storage server (or an equivalent device) or from a video capture device, the video data can be converted into a video sequence, i.e., frame-by-frame images called surveillance video frames.
The surveillance video data is collected in real time by the video capture device and relates to the sorting scene. The surveillance video frames may contain, among other things, objects, pedestrians, vehicles, etc.
And 120, inputting the monitoring video frame into a pre-established target detection model, and outputting the category identification and the coordinate value of the candidate area.
In this embodiment of the application, pedestrian detection is performed on the surveillance video frames frame by frame to identify candidate regions. Violent sorting behavior is recognized by analyzing the candidate regions; violent sorting behavior means that the time from contact to separation between the sorting-behavior subject and the sorted object is extremely short, for example an interval of only a few seconds or even one second. To capture such a short-lived motion change in the video image data, the starting point of contact between the sorting-behavior subject and the sorted object must be recognized accurately. By detecting and analyzing the video image data, this embodiment can accurately identify the starting position of violent sorting behavior, enabling more accurate recognition and improving the efficiency of the violent-sorting-behavior recognition stage.
In the embodiment of the application, the acquired monitoring video frame is input into a pre-established target detection model for target detection, and the category identification and the coordinate value of the candidate area are obtained. Wherein the category identification is used for indicating whether the candidate area contains a target object, and the target object comprises a sorting behavior body which is in contact with the sorted object.
It is contemplated that the sorting-behavior subject will typically be a sorting operator, and the object acted upon will typically be a sorted object, such as an item, a package, or an express parcel. This embodiment of the application proposes analyzing video frames with the target detection model and excluding frames that contain only sorting personnel or only sorted objects, thereby reducing the number of candidate regions and speeding up the violent-sorting-behavior recognition stage.
The target detection model is trained based on a RefineDet network structure. RefineDet is a single-shot detector that achieves higher accuracy than two-stage approaches while retaining the efficiency of single-stage approaches. The RefineDet network structure comprises an anchor refinement module, an object detection module, and transfer connection blocks. The anchor refinement module filters out unsuitable anchor boxes to reduce the classification search space; the transfer connection blocks transform the feature maps output by the anchor refinement module and feed them into the object detection module, which fuses features from different levels and then performs multi-level classification and regression. The RefineDet network structure can effectively improve candidate-region quality, thereby improving the accuracy of the violent-sorting-behavior recognition stage.
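The anchor-filtering step performed by the refinement module can be illustrated with a minimal plain-Python sketch. The 0.99 threshold follows the negative-anchor filtering rule of the original RefineDet design; the anchor boxes and confidence values below are purely hypothetical.

```python
# Minimal sketch of RefineDet-style negative anchor filtering: anchors whose
# background (negative) confidence exceeds a threshold are discarded before
# the object detection module classifies and regresses the rest.
# The 0.99 threshold follows the original RefineDet rule; data is illustrative.

def filter_anchors(anchors, neg_threshold=0.99):
    """Keep only anchors the refinement module is not confident are background."""
    return [a for a in anchors if a["neg_conf"] <= neg_threshold]

anchors = [
    {"box": (10, 20, 60, 120), "neg_conf": 0.999},   # almost surely background: drop
    {"box": (80, 40, 150, 200), "neg_conf": 0.30},   # likely foreground: keep
    {"box": (200, 10, 260, 90), "neg_conf": 0.95},   # uncertain: keep for refinement
]

kept = filter_anchors(anchors)
print(len(kept))  # 2
```

Filtering easy negatives like this shrinks the classification search space, which is exactly why the candidate regions passed onward are fewer and of higher quality.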
The category identifier is a flag indicating the attribute category of a candidate region. It may be, for example, a number, a letter, or a combination of the two, and primarily indicates the nature of the candidate region, e.g., whether the sorting-behavior subject is or is not in contact with the sorted object. For example, the label yes on a candidate region indicates that the sorting-behavior subject in the region is in contact with the sorted object, and the label no indicates that it is not.
Further, the category identifier may also include a confidence for the candidate region. The confidence may be expressed as a number and represents the model's certainty in its judgment about the region. For example, the label yes: 0.85 on a candidate region represents the confidence probability that the region belongs to the case where the sorting-behavior subject is in contact with the sorted object, and the label no: 0.95 represents the confidence probability that the region belongs to the no-contact case.
Step 130, if the category identifier indicates that the candidate region contains the target object, determining the surveillance video frame as a starting frame for analyzing violent sorting behavior.
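The label-plus-confidence output described above can be sketched as a small post-processing step. The output format (label, confidence, box coordinates) and the 0.5 confidence floor are illustrative assumptions, not the patent's specification.

```python
# Hedged sketch: turning the detection model's per-region output into
# candidate regions for the recognition stage. Field names and the 0.5
# confidence floor are assumptions for illustration only.

def select_candidates(detections, min_conf=0.5):
    """Keep regions labeled 'yes' (sorting subject in contact with a sorted object)."""
    return [d for d in detections
            if d["label"] == "yes" and d["conf"] >= min_conf]

detections = [
    {"label": "yes", "conf": 0.85, "box": (34, 60, 180, 320)},   # contact
    {"label": "no",  "conf": 0.95, "box": (200, 50, 330, 310)},  # no contact
]

candidates = select_candidates(detections)
print(len(candidates))  # 1
```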
In the embodiment of the present application, if the category identifier indicates that the candidate region includes the target object, that is, the main body of the sorting behavior contacts with the object to be sorted (or called a recipient), the surveillance video frame is determined as the starting frame for analyzing the violent sorting behavior.
Further, after determining that the surveillance video frame is a starting frame for analyzing violent sorting behavior, the method further comprises:
an end frame for analyzing violent sorting behavior is obtained. The end frame may be, for example, a video frame of a predetermined time interval sequentially extracted through a time sliding window as the end frame. For example, a fixed number of frames can be extracted from the beginning frame and back, i.e., consecutive frames are obtained from the surveillance video as objects for violent sorting activities. The fixed number of frames may be, for example, 15 frames. Or extracting the video frame in a fixed time range, and taking the last frame as an end frame or a tail frame.
The extracted candidate regions are used for identifying violent sorting behaviors, namely, the candidate regions required by the violent sorting behaviors are determined.
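The fixed-length clip extraction described above can be sketched as follows; the 15-frame window matches the example given in the text, and the code assumes the starting frame leaves at least one frame before the end of the sequence.

```python
# Sketch of fixed-length clip extraction: starting from the start frame,
# take a fixed number of consecutive frames (15 here, as in the example)
# and treat the last one as the end frame.

def clip_for_analysis(frame_indices, start_index, clip_len=15):
    """Return the consecutive frame indices to analyze and the end-frame index."""
    clip = frame_indices[start_index:start_index + clip_len]
    return clip, clip[-1]

frames = list(range(100))                 # stand-in for a surveillance frame sequence
clip, end_frame = clip_for_analysis(frames, start_index=40)
print(len(clip), end_frame)               # 15 54
```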
Further, the embodiment of the present application may further include: if the category identifier indicates that the candidate region does not contain the target object, continuing to acquire a new surveillance video frame.
If the category identifier indicates that the candidate region does not contain the target object, target detection is performed on the next video frame in the surveillance-video frame sequence. Once a surveillance video frame is determined to be the starting frame, a temporal sliding window is started to sequentially acquire a fixed number of video frames as the recognition objects for the violent-sorting-behavior recognition stage.
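The frame-by-frame scan can be sketched as a simple loop. Here `detect` is a hypothetical stand-in for the trained RefineDet model, and frames are represented by integers purely for illustration.

```python
# Minimal sketch of the frame-by-frame scan: keep detecting until a frame
# contains the target object (the start frame), then hand off a fixed-length
# window of frames to the recognition stage.

def detect(frame):
    # hypothetical stub: the real model would return True when the sorting
    # subject is in contact with a sorted object in this frame
    return frame >= 3

def find_analysis_clip(frames, window=15):
    for i, frame in enumerate(frames):
        if detect(frame):
            return frames[i:i + window]   # clip passed to the recognition stage
    return []                             # no target object found in the sequence

clip = find_analysis_clip(list(range(20)))
print(clip[0], len(clip))  # 3 15
```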
According to this embodiment of the application, the surveillance video frames are detected frame by frame with the RefineDet network structure, so the abnormal initial state of sorting behavior can be recognized effectively; the candidate regions for violent sorting behavior are determined by extracting video frames within a preset range; and recognition over these candidate regions allows violent sorting behavior to be recognized accurately and efficiently, improving both the accuracy and the speed of violent-sorting-behavior detection.
An embodiment of the present application further provides a method for training the RefineDet network structure as the target detection model. Referring to fig. 2, fig. 2 shows a flowchart of a method for building a target detection model according to another embodiment of the present application.
As shown in fig. 2, the method includes:
step 210, obtaining a historical monitoring video frame sequence;
step 220, labeling each historical monitoring video frame in the historical monitoring video frame sequence according to whether the target object is contained or not;
step 230, preprocessing the marked historical monitoring video frame sequence;
and 240, training a RefineDet network structure by utilizing the preprocessed historical monitoring video frame sequence according to a gradient descent algorithm to obtain a target detection model.
In this embodiment of the application, abnormal behaviors of sorting-behavior subjects are detected in an express-sorting scenario, so as to improve the service provider's quality of service and user satisfaction. In an express-sorting scenario there are many sorters; at any given moment some of them may not be handling sorted objects while others are in contact with them. If human-body detection were performed on all sorting-behavior subjects, a large number of candidate boxes or candidate regions to be recognized would be extracted; with so many candidate-region spatial positions, classifying the candidate regions takes longer and detection accuracy drops.
This embodiment of the application divides pedestrians into two categories: sorting personnel in contact with sorted objects, and sorting personnel not in contact with sorted objects. Candidate regions for detecting violent sorting behavior come only from the contact case.
To improve the quality of the candidate regions of violent sorting behavior, the RefineDet network structure is chosen to train the target detection model that detects the surveillance video frames collected in real time, thereby obtaining the candidate regions.
When the target detection model is trained, a large number of historical surveillance video frames or image frames are acquired and labeled; for example, the coordinates of every instance in a frame where sorting personnel are in contact with a sorted object, and of every instance where they are not, can be labeled.
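One possible annotation format for such a labeled frame is sketched below. The field names, the "contact" / "no_contact" labels, and the (x1, y1, x2, y2) coordinate convention are assumptions for illustration; the patent does not fix a format.

```python
# Hedged sketch of a per-frame annotation record: every person box is given
# a contact / no_contact label together with its coordinates. Field names
# and the coordinate convention are illustrative assumptions.

annotation = {
    "frame_id": "warehouse_cam3_000412",       # hypothetical frame identifier
    "boxes": [
        {"coords": (112, 80, 260, 410), "label": "contact"},     # handling a parcel
        {"coords": (300, 95, 440, 400), "label": "no_contact"},  # walking past
    ],
}

contact_boxes = [b for b in annotation["boxes"] if b["label"] == "contact"]
print(len(contact_boxes))  # 1
```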
After the historical surveillance video frames are labeled, preprocessing operations such as normalization and data augmentation are performed on the labeled image data.
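The normalization step mentioned above can be sketched as scaling raw 8-bit pixel values into [0, 1]. Real pipelines usually also subtract a per-channel mean; that detail is omitted here for brevity.

```python
# Sketch of pixel normalization during preprocessing: map 8-bit values
# (0..255) into the [0, 1] range expected by the network.

def normalize(pixels, max_value=255.0):
    return [p / max_value for p in pixels]

row = [0, 51, 102, 255]        # one row of 8-bit grayscale pixel values
print(normalize(row))          # [0.0, 0.2, 0.4, 1.0]
```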
After preprocessing, the RefineDet network structure is trained using the preprocessed surveillance-video frame sequence as a training set to build the model architecture;
the RefineDet network structure is then updated with a stochastic gradient descent algorithm to obtain the target detection model.
The RefineDet network structure comprises an anchor refinement module, an object detection module, and transfer connection blocks. Updating the RefineDet network structure with a stochastic gradient descent algorithm to obtain the target detection model may include, for example:
determining a training set from the preprocessed historical monitoring video frame sequence;
inputting the training set into the anchor refinement module to screen prediction boxes and obtain a first set of prediction boxes; and transmitting the feature maps of the first set of prediction boxes to the object detection module through the transfer connection blocks, where the object detection module regresses the first set of prediction boxes to obtain an initial detection model.
Iteratively updating the initial detection model by minimizing a loss function to obtain the target detection model.
In this embodiment of the application, the gradient descent algorithm may be, for example, batch gradient descent (BGD), stochastic gradient descent (SGD), or mini-batch stochastic gradient descent (MBGD). Preferably, model training is performed with a mini-batch gradient descent algorithm.
The mini-batch gradient descent algorithm is a compromise: a small batch of samples from the training set is selected to compute the gradient of the loss function, which keeps the training process more stable and reduces the severe oscillation of parameter updates seen in SGD.
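The mini-batch update rule can be illustrated on a toy least-squares problem. Everything here (the data, the learning rate, the batch size) is a made-up example; it only shows the compromise of computing each gradient over a small random batch rather than one sample (SGD) or the full set (BGD).

```python
import random

# Pure-Python sketch of mini-batch gradient descent on a toy least-squares
# problem: fit a single weight w so that w*x approximates y. Each update
# uses a small random batch, as described in the text.

def minibatch_sgd(data, w0=0.0, lr=0.01, batch_size=4, steps=200, seed=0):
    rng = random.Random(seed)
    w = w0
    for _ in range(steps):
        batch = rng.sample(data, batch_size)
        # gradient of the mean squared error 0.5*(w*x - y)^2 over the batch
        grad = sum((w * x - y) * x for x, y in batch) / batch_size
        w -= lr * grad
    return w

data = [(x, 3.0 * x) for x in range(1, 9)]   # y = 3x, so the optimum is w = 3
w = minibatch_sgd(data)
print(round(w, 3))  # converges close to 3.0
```

Averaging the gradient over the batch is what damps the update noise relative to single-sample SGD while avoiding a full pass over the data per step.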
In the embodiment of the application, the candidate region is extracted by using the RefineDet network structure, so that the quality of the candidate region can be improved, the number of the candidate regions is reduced, and the speed and the accuracy of the violent sorting behavior identification are improved.
It should be noted that while the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
With further reference to fig. 3, fig. 3 shows an exemplary structural block diagram of a forced sorting behavior candidate region determination apparatus 300 according to an embodiment of the present application.
As shown in fig. 3, the apparatus 300 includes:
the video frame acquiring module 310 is configured to acquire a surveillance video frame.
In this embodiment of the application, after surveillance video data is acquired from a video storage server (or an equivalent device) or from a video capture device, the video data can be converted into a video sequence, i.e., frame-by-frame images called surveillance video frames.
The surveillance video data is collected in real time by the video capture device and relates to the sorting scene. The surveillance video frames may contain objects, pedestrians, vehicles, etc.
And the target detection module 320 is configured to input the surveillance video frame into a pre-established target detection model, and output the category identifier and the coordinate value of the candidate region.
In this embodiment of the application, pedestrian detection is performed on the surveillance video frames frame by frame to identify candidate regions. Violent sorting behavior is recognized by analyzing the candidate regions; violent sorting behavior means that the time from contact to separation between the sorting-behavior subject and the sorted object is extremely short, for example an interval of only a few seconds or even one second. To capture such a short-lived motion change in the video image data, the starting point of contact between the sorting-behavior subject and the sorted object must be recognized accurately. By detecting and analyzing the video image data, this embodiment can accurately identify the starting position of violent sorting behavior, enabling more accurate recognition and improving the efficiency of the violent-sorting-behavior recognition stage.
In the embodiment of the application, the acquired monitoring video frame is input into a pre-established target detection model for target detection, and the category identification and the coordinate value of the candidate area are obtained. Wherein the category identification is used for indicating whether the candidate area contains a target object, and the target object comprises a sorting behavior body which is in contact with the sorted object.
It is contemplated that the sorting-behavior subject will typically be a sorting operator, and the object acted upon will typically be a sorted object, such as an item, a package, or an express parcel. This embodiment of the application proposes excluding, based on the target detection model, video frames that contain only sorting personnel or only sorted objects, thereby reducing the number of candidate regions and speeding up the violent-sorting-behavior recognition stage.
The target detection model is trained based on a RefineDet network structure. RefineDet is a single-shot detector that achieves higher accuracy than two-stage approaches while retaining the efficiency of single-stage approaches. The RefineDet network structure comprises an anchor refinement module, an object detection module, and transfer connection blocks. The anchor refinement module filters out unsuitable anchor boxes to reduce the classification search space; the transfer connection blocks transform the feature maps output by the anchor refinement module and feed them into the object detection module, which fuses features from different levels and then performs multi-level classification and regression. The RefineDet network structure can effectively improve candidate-region quality, thereby improving the accuracy of the violent-sorting-behavior recognition stage.
The category identifier is a flag indicating the attribute category of a candidate region. It may be, for example, a number, a letter, or a combination of the two, and primarily indicates the nature of the candidate region, e.g., whether the sorting-behavior subject is or is not in contact with the sorted object. For example, the label yes on a candidate region indicates that the sorting-behavior subject in the region is in contact with the sorted object, and the label no indicates that it is not.
Further, the category identifier may also include a confidence for the candidate region. The confidence may be expressed as a number and represents the model's certainty in its judgment about the region. For example, the label yes: 0.85 on a candidate region represents the confidence probability that the region belongs to the case where the sorting-behavior subject is in contact with the sorted object, and the label no: 0.95 represents the confidence probability that the region belongs to the no-contact case.
A determining module 330, configured to determine the surveillance video frame as a starting frame for analyzing the violent sorting behavior if the category identifier indicates that the candidate area contains the target object.
In the embodiment of the present application, if the category identifier indicates that the candidate region includes the target object, that is, the main body of the sorting behavior contacts with the object to be sorted (or called a recipient), the surveillance video frame is determined as the starting frame for analyzing the violent sorting behavior.
Further, the apparatus may further comprise:
an end-frame acquisition module, configured to acquire an end frame for analyzing the violent sorting behavior. The end frame may be, for example, the last of the video frames extracted over a predetermined time interval by a temporal sliding window. For example, a fixed number of frames (for example, 15) can be extracted starting from the starting frame, i.e., consecutive frames are obtained from the surveillance video as the object of violent-sorting-behavior analysis. Alternatively, the video frames within a fixed time range can be extracted and the last frame taken as the end frame.
The extracted candidate regions are used for recognizing the violent sorting behavior; that is, the candidate regions required for violent-sorting-behavior recognition are determined.
Further, another embodiment of the present application includes:
The new-video acquisition module is used to continue acquiring new surveillance video frames if the category identifier indicates that the candidate region does not contain the target object.
That is, if the category identifier indicates that the candidate region does not contain the target object, target detection is performed on the next video frame in the surveillance video frame sequence. This continues until a surveillance video frame is determined to be the starting frame, at which point a time sliding window is started to sequentially acquire a fixed number of video frames as the recognition object for the violent sorting recognition stage.
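The scan-then-window loop just described can be sketched as below. Here `detect` stands in for running the target detection model on one frame and checking its category identifier; the stub detector used in the demo is fabricated for illustration.

```python
def find_clips(frames, detect, window=15):
    """Scan frames in order; when detect() flags a starting frame, emit the
    fixed-length clip beginning there, then resume scanning after the clip."""
    clips, i = [], 0
    while i < len(frames):
        if detect(frames[i]):
            clips.append(frames[i:i + window])
            i += window
        else:
            i += 1
    return clips

# Stand-in detector: pretend frames 0 and 30 show a sorter touching a parcel.
frames = list(range(60))
clips = find_clips(frames, detect=lambda f: f in (0, 30))
print([c[0] for c in clips], [len(c) for c in clips])  # [0, 30] [15, 15]
```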
According to this embodiment of the application, the surveillance video frames are detected frame by frame through the RefineDet network structure, so the abnormal initial state of sorting behavior can be effectively identified. By extracting video frames within a preset range, the candidate region of violent sorting behavior is determined, and recognition over that candidate region allows violent sorting behavior to be identified accurately and efficiently, improving both the accuracy and the speed of violent sorting detection.
In the embodiment of the present application, a method for training a RefineDet network structure as the target detection model is further provided. Please refer to fig. 4, which shows an exemplary structural block diagram of an apparatus 400 for establishing a target detection model according to yet another embodiment of the present application.
As shown in fig. 4, the apparatus 400 includes:
a historical video frame obtaining sub-module 410, configured to obtain a historical monitoring video frame sequence;
the labeling submodule 420 is configured to label each historical surveillance video frame in the historical surveillance video frame sequence according to whether the historical surveillance video frame includes a target object;
the preprocessing submodule 430 is configured to preprocess the marked historical monitoring video frame sequence;
and the training submodule 440 is configured to train a RefineDet network structure according to a gradient descent algorithm by using the preprocessed historical monitoring video frame sequence, so as to obtain a target detection model.
In the embodiment of the application, abnormal behaviors of sorting-behavior subjects are detected in the express sorting scene, which improves the service quality of the service provider and the satisfaction of users. In an express sorting scene there are many sorters; at any given moment, some sorters may not be handling a sorted object while others are in contact with one. If human body detection were performed on all sorting-behavior subjects, a large number of candidate frames (candidate regions) to be identified would be extracted; with so many candidate-region spatial positions, classifying them takes longer and detection accuracy drops.
The embodiment of the application divides pedestrians into two categories: sorters in contact with a sorted object, and sorters not in contact with one. Candidate regions for detecting violent sorting behavior come only from the former case.
In order to improve the quality of the candidate regions for violent sorting behavior, a target detection model trained on the RefineDet network structure is selected to detect the surveillance video frames collected in real time, thereby obtaining the candidate regions.
When the target detection model is trained, a large number of historical surveillance video frames (image frames) are acquired and labeled; for example, the bounding-box coordinates of every sorter in contact with a sorted object, and of every sorter not in contact with one, can be labeled in each frame.
After the historical surveillance video frames are labeled, preprocessing operations such as normalization and data augmentation are performed on the labeled image data.
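A minimal sketch of the two preprocessing steps named above, using plain nested lists as stand-in images. A real pipeline would operate on image tensors and, when augmenting, would also transform the labeled box coordinates accordingly.

```python
def normalize(image):
    """Scale 8-bit pixel values into [0, 1]."""
    return [[px / 255.0 for px in row] for row in image]

def hflip(image):
    """Horizontal flip, one common data augmentation (the labeled boxes
    would need to be mirrored as well; omitted in this sketch)."""
    return [list(reversed(row)) for row in image]

image = [[0, 128, 255],
         [255, 128, 0]]
print(normalize(image)[0][2])  # 1.0
print(hflip(image)[0])         # [255, 128, 0]
```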
After preprocessing, the RefineDet network structure is trained using the preprocessed surveillance video frame sequence as the training set to establish the model architecture;
the RefineDet network structure is then updated with a stochastic gradient descent algorithm to obtain the target detection model.
The RefineDet network structure includes a positioning refinement module (the anchor refinement module), a target detection module (the object detection module), and transfer connection blocks, and the training submodule 440 includes:
a training set determining unit, configured to determine a training set from the preprocessed historical surveillance video frame sequence;
the model establishing unit is used for inputting the training set into the positioning refining module to carry out prediction frame screening to obtain a first type of prediction frame; and transmitting the feature graph of the first type of prediction frame to a target detection module through a transfer connection block, wherein the target detection module regresses the first type of prediction frame to obtain an initial detection model.
And the parameter updating unit is used for iteratively updating the initial detection model by utilizing the minimum loss function to obtain the target detection model.
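A hedged sketch of the two-stage flow described above: the refinement stage scores anchor boxes and discards easy negatives (yielding the first type of prediction box), and the detection stage regresses only the survivors. The scores, offsets, and the 0.3 threshold below are fabricated placeholders for what the trained network would actually predict.

```python
def refine_stage(anchors, objectness, keep_threshold=0.3):
    """Drop anchor boxes the refinement module scores as easy negatives."""
    return [a for a, s in zip(anchors, objectness) if s >= keep_threshold]

def detect_stage(boxes, offsets):
    """Regress each surviving box (x, y, w, h) by a predicted (dx, dy) shift."""
    return [(x + dx, y + dy, w, h)
            for (x, y, w, h), (dx, dy) in zip(boxes, offsets)]

anchors = [(0, 0, 10, 10), (5, 5, 10, 10), (20, 20, 10, 10)]
kept = refine_stage(anchors, objectness=[0.9, 0.1, 0.5])
boxes = detect_stage(kept, offsets=[(1, 1), (-2, 0)])
print(boxes)  # [(1, 1, 10, 10), (18, 20, 10, 10)]
```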
In the embodiment of the application, the candidate region is extracted by using the RefineDet network structure, so that the quality of the candidate region can be improved, the number of the candidate regions is reduced, and the speed and the accuracy of the violent sorting behavior identification are improved.
It should be understood that the units or modules described in the apparatuses 300 and 400 correspond to the steps of the method described with reference to figs. 1 and 2. Thus, the operations and features described above for the method apply equally to the apparatuses and the units they contain, and are not repeated here. The apparatuses 300 and 400 may be implemented in advance in a browser or other security application of the electronic device, or loaded into such an application by downloading or similar means. The corresponding units in the apparatuses can cooperate with units in the electronic device to implement the solution of the embodiment of the present application.
Referring now to FIG. 5, a block diagram of a computer system 500 suitable for use in implementing a server according to embodiments of the present application is shown.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU)501 that can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM)502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data necessary for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other via a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input portion 506 including a keyboard, a mouse, and the like; an output portion 507 including a display such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker; a storage portion 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as necessary. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as necessary, so that a computer program read from it is installed into the storage portion 508 as needed.
In particular, the processes described above with reference to fig. 1 or 2 may be implemented as computer software programs, according to embodiments of the present disclosure. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method of fig. 1 or 2. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 509, and/or installed from the removable medium 511.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor, and may be described as: a processor includes a video frame acquisition module, a target detection module, and a determination module. The names of these units or modules do not in some cases constitute a limitation on the units or modules themselves, and for example, the video frame acquisition module may also be described as a "module for acquiring surveillance video frames".
As another aspect, the present application also provides a computer-readable storage medium, which may be the computer-readable storage medium included in the device of the foregoing embodiment, or a standalone computer-readable storage medium not assembled into the device. The computer-readable storage medium stores one or more programs used by one or more processors to perform the violent sorting behavior candidate region determination method described herein.
The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the invention herein disclosed is not limited to the particular combination of features described above, but also encompasses other arrangements formed by any combination of the above features or their equivalents without departing from the spirit of the invention as defined above. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

Claims (10)

1. A method for determining candidate regions for violent sorting activities, the method comprising:
acquiring a monitoring video frame;
inputting the monitoring video frame into a pre-established target detection model, and outputting a category identifier and a coordinate value of a candidate area, wherein the target detection model is formed by training based on a RefineDet network structure, the category identifier is used for indicating whether the candidate area contains a target object, and the target object comprises a sorting behavior main body in contact with a sorted object;
and if the category identification indicates that the candidate area contains the target object, determining the monitoring video frame as a starting frame for analyzing violent sorting behaviors.
2. The method of claim 1, wherein the step of pre-establishing an object detection model comprises:
acquiring a historical monitoring video frame sequence;
labeling each historical monitoring video frame in the historical monitoring video frame sequence according to whether the target object is contained or not;
preprocessing the marked historical monitoring video frame sequence;
and training the RefineDet network structure by utilizing the preprocessed historical monitoring video frame sequence according to a gradient descent algorithm to obtain a target detection model.
3. The method of claim 1 or 2, wherein the class identification further comprises a class confidence.
4. The method of claim 3, wherein after determining that the surveillance video frame is a starting frame for analysis of violent sorting behavior, the method further comprises: an end frame for analyzing violent sorting behavior is obtained.
5. The method of claim 1, further comprising:
if the category identification indicates that the candidate area does not contain the target object, continuing to acquire a new monitoring video frame.
6. An apparatus for determining a candidate area for violent sorting behavior, comprising:
the video frame acquisition module is used for acquiring monitoring video frames;
the target detection module is used for inputting the monitoring video frame into a pre-established target detection model and outputting a category identifier and a coordinate value of a candidate area, the target detection model is formed by training based on a RefineDet network structure, the category identifier is used for indicating whether the candidate area contains a target object, and the target object comprises a sorting behavior main body which is in contact with a sorted object;
a determining module, configured to determine that the surveillance video frame is a starting frame for analyzing violent sorting behavior if the category identifier indicates that the candidate region contains the target object.
7. The apparatus of claim 6, further comprising means for pre-building the object detection model, the means comprising:
the historical video frame acquisition submodule is used for acquiring a historical monitoring video frame sequence;
the marking submodule is used for marking each historical monitoring video frame in the historical monitoring video frame sequence according to whether the target object is contained or not;
the preprocessing submodule is used for preprocessing the marked historical monitoring video frame sequence;
and the training submodule is used for training the RefineDet network structure by utilizing the preprocessed historical monitoring video frame sequence according to a gradient descent algorithm to obtain a target detection model.
8. The apparatus of claim 7, wherein the training submodule comprises:
a training set determining unit, configured to determine a training set from the preprocessed historical surveillance video frame sequence;
the model establishing unit is used for inputting the training set into the positioning refining module to carry out prediction frame screening to obtain a first type of prediction frame; transmitting the feature map of the first type of prediction frame to a target detection module through a transfer connection block, wherein the target detection module performs regression on the first type of prediction frame to obtain an initial detection model;
and the parameter updating unit is used for iteratively updating the initial detection model by utilizing a minimum loss function to obtain the target detection model.
9. The apparatus of any of claims 6-8, wherein after determining that the surveillance video frame is a starting frame for analyzing violent sorting behavior, the apparatus further comprises: and the end frame acquisition module is used for acquiring an end frame for analyzing the violent sorting behavior.
10. The apparatus of claim 6, further comprising: a new video acquisition module, configured to continue acquiring a new monitoring video frame if the category identification indicates that the candidate area does not contain the target object.
CN201811292150.9A 2018-10-31 2018-10-31 Candidate region determination method and device Active CN111126112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811292150.9A CN111126112B (en) 2018-10-31 2018-10-31 Candidate region determination method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811292150.9A CN111126112B (en) 2018-10-31 2018-10-31 Candidate region determination method and device

Publications (2)

Publication Number Publication Date
CN111126112A true CN111126112A (en) 2020-05-08
CN111126112B CN111126112B (en) 2024-04-16

Family

ID=70494471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811292150.9A Active CN111126112B (en) 2018-10-31 2018-10-31 Candidate region determination method and device

Country Status (1)

Country Link
CN (1) CN111126112B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on spatiotemporal semantic features
CN112287800A (en) * 2020-10-23 2021-01-29 北京中科模识科技有限公司 Advertisement video identification method and system under no-sample condition
CN113761993A (en) * 2020-06-24 2021-12-07 北京沃东天骏信息技术有限公司 Method and apparatus for outputting information
CN118781522A (en) * 2024-07-20 2024-10-15 广东顺融检测科技股份有限公司 A method, system, device and medium for on-site target detection and labeling

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102457705A (en) * 2010-10-19 2012-05-16 由田新技股份有限公司 Fighting behavior detection and monitoring method and system
CN104680557A (en) * 2015-03-10 2015-06-03 重庆邮电大学 Intelligent detection method for abnormal behavior in video sequence image
CN105094657A (en) * 2014-05-16 2015-11-25 上海京知信息科技有限公司 Multitouch-based interactive video object segmentation fusion method
CN108197575A (en) * 2018-01-05 2018-06-22 中国电子科技集团公司电子科学研究院 A kind of abnormal behaviour recognition methods detected based on target detection and bone point and device
CN108446669A (en) * 2018-04-10 2018-08-24 腾讯科技(深圳)有限公司 motion recognition method, device and storage medium
CN108537172A (en) * 2018-04-09 2018-09-14 北京邦天信息技术有限公司 A kind of method and apparatus of the behavior based on Machine Vision Recognition people


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHANG, S. et al.: "Single-shot refinement neural network for object detection" *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761993A (en) * 2020-06-24 2021-12-07 北京沃东天骏信息技术有限公司 Method and apparatus for outputting information
CN111797777A (en) * 2020-07-07 2020-10-20 南京大学 Sign language recognition system and method based on spatiotemporal semantic features
CN111797777B (en) * 2020-07-07 2023-10-17 南京大学 Sign language recognition system and method based on space-time semantic features
CN112287800A (en) * 2020-10-23 2021-01-29 北京中科模识科技有限公司 Advertisement video identification method and system under no-sample condition
CN118781522A (en) * 2024-07-20 2024-10-15 广东顺融检测科技股份有限公司 A method, system, device and medium for on-site target detection and labeling

Also Published As

Publication number Publication date
CN111126112B (en) 2024-04-16

Similar Documents

Publication Publication Date Title
WO2018121690A1 (en) Object attribute detection method and device, neural network training method and device, and regional detection method and device
CN111126112B (en) Candidate region determination method and device
CN108304816B (en) Identity recognition method and device, storage medium and electronic equipment
US20160098636A1 (en) Data processing apparatus, data processing method, and recording medium that stores computer program
CN109740573B (en) Video analysis method, device, equipment and server
CN110781711A (en) Target object identification method and device, electronic equipment and storage medium
CN111274926B (en) Image data screening method, device, computer equipment and storage medium
CN108960124B (en) Image processing method and device for pedestrian re-identification
KR102002024B1 (en) Method for processing labeling of object and object management server
US11164028B2 (en) License plate detection system
US10489637B2 (en) Method and device for obtaining similar face images and face image information
CN111783665A (en) Action recognition method and device, storage medium and electronic equipment
CN113361603A (en) Training method, class recognition device, electronic device and storage medium
CN113344121B (en) Method for training a sign classification model and sign classification
CN111199238A (en) Behavior identification method and equipment based on double-current convolutional neural network
WO2014193220A2 (en) System and method for multiple license plates identification
CN110659588A (en) Passenger flow volume statistical method and device and computer readable storage medium
CN111563398A (en) Method and device for determining information of target object
CN111079621A (en) Method and device for detecting object, electronic equipment and storage medium
JP2010231254A (en) Image analyzing device, method of analyzing image, and program
CN111476059A (en) Target detection method and device, computer equipment and storage medium
CN117292338A (en) Vehicle accident identification and analysis method based on video stream analysis
CN117351462A (en) Construction operation detection model training method, device, equipment and storage medium
CN111985269B (en) Detection model construction method, detection method, device, server and medium
CN112052730A (en) 3D dynamic portrait recognition monitoring device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant