CN111126112B - Candidate region determination method and device - Google Patents

Info

Publication number
CN111126112B
Authority
CN
China
Prior art keywords: video frame, monitoring video, sorting, target detection, detection model
Legal status: Active
Application number: CN201811292150.9A
Other languages: Chinese (zh)
Other versions: CN111126112A (en)
Inventors: 虢齐, 张玉双, 楚梦蝶, 冯昊楠, 袁益琴
Current Assignee: SF Technology Co Ltd
Original Assignee: SF Technology Co Ltd
Application filed by SF Technology Co Ltd
Priority to CN201811292150.9A
Publication of CN111126112A
Application granted
Publication of CN111126112B

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 - Scenes; Scene-specific elements
    • G06V 20/40 - Scenes; Scene-specific elements in video content
    • G06V 20/49 - Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/24 - Classification techniques
    • G06F 18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2413 - Classification techniques relating to the classification model, based on distances to training or reference patterns
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The application discloses a candidate region determination method and device. The method comprises the following steps: acquiring a monitoring video frame; inputting the monitoring video frame into a pre-established target detection model and outputting a category identification and coordinate values of a candidate region, wherein the target detection model is trained based on the RefineDet network structure, the category identification is used for indicating whether the candidate region contains a target object, and the target object comprises a sorting-action subject in contact with a sorted object; and, if the category identification indicates that the candidate region contains a target object, determining the monitoring video frame to be the start frame for analyzing violent sorting behavior. According to the technical scheme provided by the embodiments of the application, the RefineDet network structure can effectively improve the quality of the candidate regions, thereby improving the accuracy of the violent sorting behavior recognition stage.

Description

Candidate region determination method and device
Technical Field
The present application relates generally to the field of computer vision, and more particularly to candidate region determination methods and apparatus.
Background
Sorting is the stacking of articles according to their variety and their warehousing-in and warehousing-out order. The timeliness of sorting operations directly affects the development prospects of service providers.
Currently, in sorting scenarios, the occurrence of violent sorting behavior seriously affects a service provider's quality of service. To improve service quality, video-based target detection and analysis methods have been proposed for violent sorting behavior. Violent sorting behavior detection can be divided into two parts: candidate region generation and violent behavior recognition. The quality of candidate region generation directly influences the violent behavior recognition effect; the spatial position of a candidate region is generally determined in a single video frame, and sliding-window processing is then performed over the time sequence to obtain the candidate region's temporal position.
In violent sorting behavior detection, the complexity of video scenes means that the number of candidate regions grows in direct proportion to the number of pedestrians, which increases the false detection rate of the violent behavior recognition stage and makes recognition take too long.
Disclosure of Invention
In view of the foregoing drawbacks or shortcomings in the prior art, it is desirable to provide a method, apparatus, and storage medium for determining candidate regions of violent sorting behavior, so as to reduce the time consumed by violent sorting behavior recognition and to improve recognition accuracy.
In a first aspect, an embodiment of the present application provides a method for determining a candidate region of a violent sorting behavior, where the method includes:
Acquiring a monitoring video frame;
inputting the monitoring video frame into a pre-established target detection model, and outputting a category identification and coordinate values of a candidate region, wherein the target detection model is trained based on the RefineDet network structure, the category identification is used for indicating whether the candidate region contains a target object, and the target object comprises a sorting-action subject in contact with a sorted object;
if the category identification indicates that the candidate region contains a target object, the surveillance video frame is determined to be the starting frame for analysis of the violent sorting action.
In a second aspect, an embodiment of the present application provides a device for determining a candidate region of a violent sorting action, where the device includes:
the video frame acquisition module is used for acquiring a monitoring video frame;
The target detection module is used for inputting the monitoring video frame into a pre-established target detection model and outputting a category identification and coordinate values of a candidate region, wherein the target detection model is trained based on the RefineDet network structure, the category identification is used for indicating whether the candidate region contains a target object, and the target object comprises a sorting-action subject in contact with a sorted object;
And the determining module is used for determining the monitoring video frame as a starting frame for analyzing the violent sorting behavior if the category identification indicates that the candidate region contains the target object.
According to the technical scheme for determining candidate regions of violent sorting behavior provided by the embodiments of the application, monitoring video frames received in real time are identified by a pre-established target detection model, yielding effective candidate regions that can be used for violent sorting behavior recognition and improving candidate region quality. When training the target detection model, video frames containing the target object are labeled so that only candidate regions containing the target object are retained, which effectively reduces the number of candidate regions and increases the speed of violent sorting behavior recognition.
Further, by determining an end frame for analyzing the violent sorting behavior, or continuing to acquire a new monitoring video frame when the target object is not recognized, the video image processing efficiency can be improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
fig. 1 is a schematic flow chart of a method for determining candidate areas of violent sorting actions according to an embodiment of the present application;
FIG. 2 is a schematic flow chart of a method for establishing a target detection model according to another embodiment of the present application;
FIG. 3 illustrates an exemplary block diagram of a violent sorting behavior candidate region determination device 300 provided in one embodiment of the application;
FIG. 4 illustrates an exemplary block diagram of an apparatus 400 for building a target detection model according to yet another embodiment of the present application;
FIG. 5 shows a schematic diagram of a computer system suitable for use in implementing embodiments of the present application.
Detailed Description
The application is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the application and are not limiting of the application. It should be noted that, for convenience of description, only the portions related to the application are shown in the drawings.
It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other. The application will be described in detail below with reference to the drawings in connection with embodiments.
Referring to fig. 1, fig. 1 is a flowchart illustrating a method for determining candidate regions of violent sorting actions according to an embodiment of the present application.
As shown in fig. 1, the method includes:
step 110, a surveillance video frame is acquired.
In the embodiment of the application, after monitoring video data is acquired from a video storage server (or equivalent device) or from a video capture device, the video data can be converted into a video sequence, that is, into frame-by-frame images: the monitoring video frames.
The monitoring video data is video data of sorting scenes obtained in real time by a video capture device. A monitoring video frame may contain articles, pedestrians, vehicles, and so on.
And 120, inputting the monitoring video frame into a pre-established target detection model, and outputting the category identification and the coordinate value of the candidate region.
In the embodiment of the application, candidate regions are identified by detecting pedestrians in the monitoring video frames frame by frame. Violent sorting behavior is recognized by analyzing the candidate regions; a violent sorting action is one in which the time from contact to separation between the sorting-action subject and the sorted object is extremely short, for example only a few seconds or even one second. To capture such a short-lived motion change in video image data, the point in time at which the sorting-action subject comes into contact with the sorted object must be identified accurately. By detecting and analyzing the video image data, the embodiments of the application can accurately identify the starting position of a violent sorting action, so that violent sorting behavior can be recognized more accurately and the efficiency of the recognition stage is improved.
In the embodiment of the application, the acquired monitoring video frame is input into a pre-established target detection model for target detection, yielding the category identification and the coordinate values of each candidate region. The category identification indicates whether the candidate region contains a target object, where the target object comprises a sorting-action subject in contact with a sorted object.
The sorting-action subject is typically a sorting worker, and the recipient of the action is typically a sorted object, such as an article, a package, or an express parcel. By analyzing the video frames with the target detection model and excluding frames that contain only sorting workers or only sorted objects, the application reduces the number of candidate regions and thereby speeds up the violent sorting behavior recognition stage.
The target detection model is trained based on the RefineDet network structure. RefineDet is a single-stage detector whose accuracy exceeds that of two-stage methods while retaining the efficiency of single-stage methods. The RefineDet network structure comprises an anchor refinement module, an object detection module, and transfer connection blocks. The anchor refinement module filters out unsuitable anchor boxes to reduce the search space for classification; the transfer connection blocks transform the feature maps output by the anchor refinement module and feed them to the object detection module; and the object detection module fuses features from different levels and then performs multi-class classification and regression. The RefineDet network structure can effectively improve the quality of the candidate regions, thereby improving the accuracy of the violent sorting behavior recognition stage.
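As an illustrative sketch only (not the patent's implementation), the two-step data flow just described can be expressed in framework-free Python. Here `refine`, `transfer`, and `detect` are placeholder callables standing in for the three modules, and the 0.99 background-confidence threshold is an assumption drawn from the published RefineDet design:

```python
# Illustrative, framework-free sketch of the RefineDet data flow described above.
# `refine`, `transfer`, and `detect` are placeholder callables, not real APIs.
def refinedet_forward(anchors, refine, transfer, detect, neg_threshold=0.99):
    # Anchor refinement: discard anchors that are confidently background,
    # shrinking the search space for the classification stage.
    refined = []
    for anchor in anchors:
        box, neg_score = refine(anchor)  # coarsely adjusted box + background score
        if neg_score <= neg_threshold:
            refined.append(box)
    # Transfer connection: convert refinement features for the detection module.
    features = [transfer(box) for box in refined]
    # Object detection: multi-class classification and box regression.
    return [detect(feat) for feat in features]
```

In a real network these three callables would be convolutional sub-networks; the sketch only shows how the refinement step prunes the anchor set before classification.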
The category identification is a flag indicating the attribute category of a candidate region. It may be, for example, a number, a letter, or a combination of the two, and it is used primarily to indicate the nature of the candidate region, for example whether the sorting-action subject is in contact with the sorted object. For instance, the identification yes on a candidate region indicates that the sorting-action subject in that region is in contact with the sorted object, while no indicates that it is not.
Further, the category identification may also include a confidence level of the candidate region, which may be expressed as a number and indicates the model's ability to distinguish the candidate region. For example, the label yes: 0.85 on a candidate region represents the confidence probability that the region belongs to the case where the sorting-action subject and the sorted object are in contact, while the label no: 0.95 represents the confidence probability that the region belongs to the case where they are not in contact.
In step 130, if the category identification indicates that the candidate region contains a target object, the monitoring video frame is determined to be the start frame for analyzing violent sorting behavior.
In an embodiment of the present application, if the category identification indicates that the candidate region contains a target object, i.e., the sorting-action subject is in contact with the sorted object (also called the recipient), the monitoring video frame is determined to be the start frame for analyzing violent sorting behavior.
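A minimal sketch of this decision, assuming a hypothetical output format in which each candidate region carries a label, a confidence, and box coordinates (the field names and threshold are illustrative, not the patent's):

```python
# Hypothetical interpretation of the model output: a frame qualifies as a start
# frame if any candidate region is labeled "yes" (sorting-action subject in
# contact with a sorted object) with sufficient confidence.
def frame_contains_target(detections, min_confidence=0.5):
    return any(d["label"] == "yes" and d["confidence"] >= min_confidence
               for d in detections)

detections = [
    {"label": "no",  "confidence": 0.95, "box": (12, 40, 88, 200)},
    {"label": "yes", "confidence": 0.85, "box": (140, 35, 230, 210)},
]
print(frame_contains_target(detections))  # True
```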
Further, after determining the surveillance video frame as the starting frame for analyzing the violent sorting behavior, the method further comprises:
An end frame for analyzing the violent sorting behavior is acquired. The end frame may be obtained, for example, by sequentially extracting video frames with a time sliding window of a predetermined interval. For instance, a fixed number of frames may be extracted starting from the start frame, i.e., consecutive frames are taken from the monitoring video as the object of violent sorting behavior analysis; the fixed number may be, for example, 15 frames. Alternatively, video frames within a fixed time range may be extracted, with the last frame taken as the end frame.
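The fixed-length window described above can be sketched as a simple slice (a sketch only; the 15-frame default follows the example in the text):

```python
# Take a fixed number of consecutive frames beginning at the start frame;
# the last frame of the slice serves as the end frame for analysis.
def clip_for_analysis(frames, start_index, window=15):
    clip = frames[start_index:start_index + window]
    end_frame = clip[-1] if clip else None
    return clip, end_frame
```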
The extracted candidate regions are used for violent sorting behavior recognition; that is, only the candidate regions required for recognition are determined, which reduces the number of candidate regions, increases the speed of violent sorting behavior recognition, and at the same time improves candidate region quality.
Further, the embodiment of the application can further comprise: if the category identification indicates that the candidate region does not contain the target object, continuing to acquire a new surveillance video frame.
If the category identification indicates that the candidate region does not include the target object, target detection is performed on the next video frame in the monitoring video frame sequence, until some monitoring video frame is determined to be the start frame; a time sliding window is then started to sequentially acquire a fixed number of video frames as the recognition object of the violent sorting behavior recognition stage.
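This frame-by-frame control flow can be sketched as follows, with `has_target` standing in for a run of the target detection model on one frame (a hypothetical interface, for illustration only):

```python
# Frame-by-frame scan: keep acquiring new surveillance video frames until one
# is identified as a start frame (its candidate regions contain the target
# object). Returns the index of the start frame, or None if none is found.
def scan_for_start_frame(frames, has_target):
    for index, frame in enumerate(frames):
        if has_target(frame):
            return index  # start frame found; windowed analysis begins here
    return None  # target object never detected; keep acquiring new frames
```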
By detecting the monitoring video frames frame by frame with the RefineDet network structure, the embodiment of the application can effectively identify the starting state of abnormal sorting behavior; the candidate regions of the violent sorting behavior are then determined by extracting video frames within a preset range. Recognizing these candidate regions allows violent sorting behavior to be identified accurately and efficiently, improving both the accuracy and the speed of violent sorting behavior detection.
The embodiment of the application also provides a method for training the RefineDet network structure as a target detection model. Referring to fig. 2, fig. 2 shows a flow chart of a method for establishing a target detection model according to another embodiment of the application.
As shown in fig. 2, the method includes:
step 210, acquiring a historical monitoring video frame sequence;
step 220, labeling each historical monitoring video frame in the historical monitoring video frame sequence according to whether the historical monitoring video frame contains a target object or not;
Step 230, preprocessing the marked historical monitoring video frame sequence;
And 240, training the RefineDet network structure with the preprocessed historical monitoring video frame sequence according to a gradient descent algorithm, to obtain the target detection model.
In the embodiment of the application, abnormal behavior of sorting-action subjects is detected in an express sorting scenario, improving the service provider's quality of service and user satisfaction. In an express sorting scenario there are a large number of sorters; at any given moment, some sorters may not be operating on sorted objects while others are touching them. If human detection were performed on all sorting-action subjects, a large number of candidate boxes (candidate regions) would be extracted for recognition; with so many candidate region spatial positions, classifying the candidate regions would take a long time and detection accuracy would fall.
In the embodiment of the application, pedestrians are divided into two categories: sorters in contact with sorted objects, and sorters not in contact with sorted objects. Candidate regions for detecting violent sorting behavior come only from the case where a sorter is in contact with a sorted object.
To improve the quality of the candidate regions of violent sorting behavior, the RefineDet network structure is chosen to train the target detection model, which then detects the monitoring video frames acquired in real time to obtain the candidate regions.
When training the target detection model, a large number of historical monitoring video frames (image frames) are acquired and labeled; for example, the coordinates of all regions in a monitoring video frame where a sorter is in contact with a sorted object, and of all regions where a sorter is not, may be labeled.
After the historical monitoring video frames are labeled, preprocessing operations such as normalization and data enhancement are performed on the labeled image data.
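The preprocessing operations mentioned above might look like the following sketch. Plain Python is used for clarity (a real pipeline would use NumPy/OpenCV), and the normalization constants are assumptions, not values from the patent:

```python
# Normalization: shift pixel values by a mean and rescale them; a common
# convention maps [0, 255] roughly onto [-1, 1].
def normalize(pixels, mean=128.0, scale=1.0 / 128.0):
    return [[(p - mean) * scale for p in row] for row in pixels]

# Horizontal flip: one common data-enhancement (augmentation) operation.
def hflip(pixels):
    return [list(reversed(row)) for row in pixels]
```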
After preprocessing, the RefineDet network structure is trained using the preprocessed monitoring video frame sequence as a training set, and a model framework is established;
and the RefineDet network structure is updated using a stochastic gradient descent algorithm to obtain the target detection model.
The RefineDet network structure includes an anchor refinement module, an object detection module, and transfer connection blocks. Updating the RefineDet network structure with a stochastic gradient descent algorithm to obtain the target detection model may include, for example:
determining a training set from the preprocessed historical monitoring video frame sequence;
inputting the training set into the anchor refinement module to screen prediction boxes, obtaining a first type of prediction box; transmitting the feature maps of the first type of prediction box to the object detection module through the transfer connection blocks, the object detection module performing regression on the first type of prediction box to obtain an initial detection model;
and iteratively updating the initial detection model by minimizing a loss function, to obtain the target detection model.
In the embodiment of the application, the gradient descent algorithm may be, for example, batch gradient descent (BGD), stochastic gradient descent (SGD), or mini-batch gradient descent (MBGD). Preferably, model training is performed using a mini-batch gradient descent algorithm.
The mini-batch gradient descent algorithm is a compromise: the gradient of the loss function is computed over a small batch of samples selected from the training set, which keeps the training process more stable and reduces the severe oscillation of parameter updates seen in SGD.
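A minimal sketch of one mini-batch gradient descent step, demonstrated on a toy one-parameter least-squares problem (the interface and constants are illustrative, not the patent's training code):

```python
import random

# One mini-batch update: sample a small batch, average its gradients,
# and step against them. Loss per sample is (w - x)^2, gradient 2 * (w - x).
def minibatch_step(w, data, batch_size=4, lr=0.1):
    batch = random.sample(data, min(batch_size, len(data)))
    grad = sum(2.0 * (w - x) for x in batch) / len(batch)
    return w - lr * grad

w = 0.0
data = [3.0] * 100  # every sample says the optimum is w = 3
for _ in range(50):
    w = minibatch_step(w, data)
# after 50 steps w has converged close to 3
```

Because gradients are averaged over a batch rather than a single sample, successive updates vary less, which is the stability property described above.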
In the embodiment of the application, extracting the candidate regions with the RefineDet network structure improves candidate region quality, reduces the number of candidate regions, and increases the speed and accuracy of violent sorting behavior recognition.
It should be noted that although the operations of the method of the present invention are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may be performed in a different order. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one, and/or one step may be decomposed into multiple steps.
With further reference to fig. 3, fig. 3 illustrates an exemplary block diagram of a violence sorting action candidate region determination device 300 provided by an embodiment of the present application.
As shown in fig. 3, the apparatus 300 includes:
The video frame acquisition module 310 is configured to acquire a surveillance video frame.
In the embodiment of the application, after monitoring video data is acquired from a video storage server (or equivalent device) or from a video capture device, the video data can be converted into a video sequence, that is, into frame-by-frame images: the monitoring video frames.
The monitoring video data is video data of sorting scenes obtained in real time by a video capture device. A monitoring video frame may contain articles, pedestrians, vehicles, and so on.
The target detection module 320 is configured to input the surveillance video frame into a pre-established target detection model, and output a category identifier and a coordinate value of the candidate region.
In the embodiment of the application, candidate regions are identified by detecting pedestrians in the monitoring video frames frame by frame. Violent sorting behavior is recognized by analyzing the candidate regions; a violent sorting action is one in which the time from contact to separation between the sorting-action subject and the sorted object is extremely short, for example only a few seconds or even one second. To capture such a short-lived motion change in video image data, the point in time at which the sorting-action subject comes into contact with the sorted object must be identified accurately. By detecting and analyzing the video image data, the embodiments of the application can accurately identify the starting position of a violent sorting action, so that violent sorting behavior can be recognized more accurately and the efficiency of the recognition stage is improved.
In the embodiment of the application, the acquired monitoring video frame is input into a pre-established target detection model for target detection, yielding the category identification and the coordinate values of each candidate region. The category identification indicates whether the candidate region contains a target object, where the target object comprises a sorting-action subject in contact with a sorted object.
The sorting-action subject is typically a sorting worker, and the recipient of the action is typically a sorted object, such as an article, a package, or an express parcel. By excluding video frames that contain only sorting workers or only sorted objects based on the target detection model, the application reduces the number of candidate regions and thereby speeds up the violent sorting behavior recognition stage.
The target detection model is trained based on the RefineDet network structure. RefineDet is a single-stage detector whose accuracy exceeds that of two-stage methods while retaining the efficiency of single-stage methods. The RefineDet network structure comprises an anchor refinement module, an object detection module, and transfer connection blocks. The anchor refinement module filters out unsuitable anchor boxes to reduce the search space for classification; the transfer connection blocks transform the feature maps output by the anchor refinement module and feed them to the object detection module; and the object detection module fuses features from different levels and then performs multi-class classification and regression. The RefineDet network structure can effectively improve the quality of the candidate regions, thereby improving the accuracy of the violent sorting behavior recognition stage.
The category identification is a flag indicating the attribute category of a candidate region. It may be, for example, a number, a letter, or a combination of the two, and it is used primarily to indicate the nature of the candidate region, for example whether the sorting-action subject is in contact with the sorted object. For instance, the identification yes on a candidate region indicates that the sorting-action subject in that region is in contact with the sorted object, while no indicates that it is not.
Further, the category identification may also include a confidence level of the candidate region, which may be expressed as a number and indicates the model's ability to distinguish the candidate region. For example, the label yes: 0.85 on a candidate region represents the confidence probability that the region belongs to the case where the sorting-action subject and the sorted object are in contact, while the label no: 0.95 represents the confidence probability that the region belongs to the case where they are not in contact.
A determining module 330 is configured to determine that the surveillance video frame is a start frame for analyzing the violent sorting behavior if the category identification indicates that the candidate region contains the target object.
In an embodiment of the present application, if the category identification indicates that the candidate region contains a target object, i.e., the sorting-action subject is in contact with the sorted object (also called the recipient), the monitoring video frame is determined to be the start frame for analyzing violent sorting behavior.
Further, after the monitoring video frame is determined to be the start frame for analyzing violent sorting behavior, the apparatus further comprises:
an end frame acquisition module, used for acquiring an end frame for analyzing the violent sorting behavior. The end frame may be obtained, for example, by sequentially extracting video frames with a time sliding window of a predetermined interval. For instance, a fixed number of frames may be extracted starting from the start frame, i.e., consecutive frames are taken from the monitoring video as the object of violent sorting behavior analysis; the fixed number may be, for example, 15 frames. Alternatively, video frames within a fixed time range may be extracted, with the last frame taken as the end frame.
The extracted candidate regions are used for violent sorting behavior recognition; that is, only the candidate regions required for recognition are determined, which reduces the number of candidate regions, increases the speed of violent sorting behavior recognition, and at the same time improves candidate region quality.
Further, still another embodiment of the present application includes:
a new video acquisition module, used for continuing to acquire a new monitoring video frame if the category identification indicates that the candidate region does not contain the target object.
If the category identification indicates that the candidate region does not include the target object, target detection is performed on the next video frame in the monitoring video frame sequence. Once a monitoring video frame is determined to be the start frame, a time sliding window is started to sequentially acquire a fixed number of video frames as the input of the violent sorting behavior recognition stage.
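This detect-until-start-frame loop can be sketched as follows, with the detector stubbed out. The detector interface (a callable returning a contact flag and candidate boxes) and the window size are assumptions for illustration:

```python
def find_start_frame(frames, detect, window=15):
    """Scan the surveillance stream frame by frame. Once the detector
    reports contact between a sorting subject and a sorted object,
    return the start-frame index plus the fixed-length window of frames
    handed to the violent-sorting recognition stage."""
    for i, frame in enumerate(frames):
        contact, _boxes = detect(frame)  # category identification + candidate boxes
        if contact:
            return i, frames[i:i + window]
    return None, []  # no start frame in this sequence; keep acquiring frames

# Example with a stubbed detector that first reports contact at frame 3.
frames = list(range(30))
detect = lambda f: (f >= 3, [])
start, clip = find_start_frame(frames, detect)
```

In the embodiment, `detect` would be the trained target detection model applied to each real-time monitoring video frame.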
According to the embodiment of the present application, detecting the monitoring video frames frame by frame with the RefineDet network structure effectively identifies the starting state of abnormal sorting behavior; the candidate region of the violent sorting behavior is then determined by extracting the video frames within a preset range, and recognizing that candidate region identifies the violent sorting behavior accurately and efficiently, improving both the accuracy and the speed of violent sorting behavior detection.
In an embodiment of the present application, a method for training a RefineDet network structure as a target detection model is also provided. Referring to fig. 4, fig. 4 shows an exemplary block diagram of an apparatus 400 for building a target detection model according to another embodiment of the present application.
As shown in fig. 4, the apparatus 400 includes:
A historical video frame acquisition sub-module 410, configured to acquire a historical surveillance video frame sequence;
The labeling sub-module 420 is configured to label each historical monitoring video frame in the historical monitoring video frame sequence according to whether the historical monitoring video frame contains the target object;
the preprocessing sub-module 430 is configured to preprocess the annotated historical surveillance video frame sequence;
The training sub-module 440 is configured to train the RefineDet network structure according to a gradient descent algorithm using the preprocessed historical monitoring video frame sequence, so as to obtain the target detection model.
In the embodiment of the present application, abnormal behavior of sorting behavior subjects is detected in an express sorting scene, which improves the service quality of the service provider and the satisfaction of users. In an express sorting scene there are a large number of sorters; at a given moment, some sorters may not be operating on any sorted object, while others may be touching one. If human body detection were applied to every sorting behavior subject, a large number of candidate boxes or candidate regions to be recognized would be extracted; because these candidate regions occupy many spatial positions, classifying them takes a long time and the detection accuracy is reduced.
In the embodiment of the present application, pedestrians are divided into two categories: sorters in contact with a sorted object, and sorters not in contact with a sorted object. The candidate regions for detecting violent sorting behavior come only from the first case, in which a sorter is in contact with a sorted object.
To improve the quality of the candidate regions for violent sorting behavior, the RefineDet network structure is selected to train the target detection model, which detects the monitoring video frames acquired in real time and thereby obtains the candidate regions.
When the target detection model is trained, a large number of historical monitoring video frames or image frames are acquired and annotated; for example, all coordinates in a monitoring video frame at which a person is in contact with a sorted object, and all coordinates at which a person is not in contact with a sorted object, may be labeled.
After the historical monitoring video frames are annotated, preprocessing operations such as normalization and data enhancement are performed on the annotated image data.
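A minimal sketch of such a preprocessing step is shown below. The per-channel mean values (a convention commonly used with SSD-style detectors) and the horizontal-flip augmentation are illustrative assumptions; the application does not specify concrete values. A frame is represented as nested H x W x C lists to keep the sketch dependency-free:

```python
import random

def preprocess(frame, mean=(104.0, 117.0, 123.0), flip_prob=0.5, rng=None):
    """Normalize by per-channel mean subtraction, then randomly mirror
    each row (horizontal flip) as a simple data enhancement. In a real
    pipeline the box annotations would be mirrored too; that is omitted
    here for brevity."""
    rng = rng or random.Random(0)
    out = [[[float(px[c]) - mean[c] for c in range(len(mean))] for px in row]
           for row in frame]
    if rng.random() < flip_prob:
        out = [row[::-1] for row in out]  # mirror along the width axis
    return out

# A single-row, two-pixel toy "frame" in channel order matching `mean`.
f = [[[104, 117, 123], [0, 0, 0]]]
normed = preprocess(f, flip_prob=0.0)
```

Setting `flip_prob` to 0.0 or 1.0 makes the augmentation deterministic, which is convenient when checking the annotation-flipping logic.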
After preprocessing, the preprocessed monitoring video frame sequence is used as a training set to train the RefineDet network structure and establish the model framework.
The RefineDet network structure is then updated with a stochastic gradient descent algorithm, so as to obtain the target detection model.
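The parameter update at the heart of this step can be sketched generically. The learning rate, momentum, and the toy quadratic loss in the example are illustrative assumptions; a real training run would apply the same update to the network's weight tensors:

```python
def sgd_step(params, grads, velocity, lr=1e-3, momentum=0.9):
    """One stochastic-gradient-descent-with-momentum update; params,
    grads and velocity are parallel lists of floats."""
    new_v = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_p = [p + v for p, v in zip(params, new_v)]
    return new_p, new_v

# Minimize the toy loss L(w) = w^2 (gradient 2w) starting from w = 1.0.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = sgd_step(w, [2.0 * w[0]], v, lr=0.1, momentum=0.9)
# w is now close to the minimizer 0.0
```

The same loop applied to the loss of the initial detection model, mini-batch by mini-batch, is what "iteratively updating by minimizing a loss function" amounts to.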
The RefineDet network structure includes a localization refinement module, a target detection module, and a transfer connection block, and the training sub-module 440 includes:
A training set determining unit, configured to determine a training set from the preprocessed historical monitoring video frame sequence;
A model building unit, configured to input the training set into the localization refinement module to screen prediction boxes, obtaining a first type of prediction box; the feature maps of the first type of prediction boxes are transmitted to the target detection module through the transfer connection block, and the target detection module regresses the first type of prediction boxes to obtain an initial detection model; and
A parameter updating unit, configured to iteratively update the initial detection model by minimizing a loss function, so as to obtain the target detection model.
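The screen-then-regress flow of the model building unit can be sketched as below. The plain (dx, dy) shifts and the negative-anchor threshold stand in for the full box parameterization and are assumptions for illustration, not the actual RefineDet encoding:

```python
def cascade_refine(anchors, bg_scores, arm_shifts, odm_shifts, neg_thresh=0.99):
    """Screen-then-regress in the spirit of RefineDet: anchors that are
    almost certainly background are discarded; the survivors (the
    'first type' of prediction boxes) are coarsely shifted by the
    refinement step and then regressed again by the detection step."""
    detections = []
    for (x, y, w, h), p_bg, (dx1, dy1), (dx2, dy2) in zip(
            anchors, bg_scores, arm_shifts, odm_shifts):
        if p_bg > neg_thresh:        # anchor screening: drop easy negatives
            continue
        x, y = x + dx1, y + dy1      # coarse adjustment (refinement module)
        detections.append((x + dx2, y + dy2, w, h))  # final regression (detection module)
    return detections

boxes = cascade_refine(
    anchors=[(0, 0, 10, 10), (50, 50, 10, 10)],
    bg_scores=[0.999, 0.3],          # first anchor is near-certain background
    arm_shifts=[(1, 1), (2, 2)],
    odm_shifts=[(1, 1), (-1, -1)])
```

Discarding near-certain negatives before the second regression is what keeps the number of candidate regions small while preserving their quality.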
In the embodiment of the present application, extracting candidate regions with the RefineDet network structure improves the quality of the candidate regions and reduces their number, which increases both the speed and the accuracy of violent sorting behavior recognition.
It should be understood that the units or modules described in the apparatuses 300 and 400 correspond to the steps of the methods described with reference to figs. 1 and 2. The operations and features described above for the methods therefore apply equally to the apparatuses 300 and 400 and the units contained therein, and are not repeated here. The apparatuses 300 and 400 may be implemented in a browser or other security application of an electronic device in advance, or may be loaded into such a browser or security application by downloading or the like. The units in the apparatuses 300 and 400 may cooperate with units in the electronic device to implement the solutions of the embodiments of the present application.
Referring now to FIG. 5, there is illustrated a schematic diagram of a computer system 500 suitable for use in implementing a server of an embodiment of the present application.
As shown in fig. 5, the computer system 500 includes a Central Processing Unit (CPU) 501, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 502 or a program loaded from a storage section 508 into a Random Access Memory (RAM) 503. In the RAM 503, various programs and data required for the operation of the system 500 are also stored. The CPU 501, ROM 502, and RAM 503 are connected to each other through a bus 504. An input/output (I/O) interface 505 is also connected to bus 504.
The following components are connected to the I/O interface 505: an input section 506 including a keyboard, a mouse, and the like; an output section 507 including a cathode ray tube (CRT) or liquid crystal display (LCD), a speaker, and the like; a storage section 508 including a hard disk and the like; and a communication section 509 including a network interface card such as a LAN card or a modem. The communication section 509 performs communication processing via a network such as the Internet. A drive 510 is also connected to the I/O interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read therefrom is installed into the storage section 508 as needed.
In particular, according to embodiments of the present disclosure, the process described above with reference to fig. 1 or 2 may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method of fig. 1 or 2. In such an embodiment, the computer program may be downloaded and installed from a network via the communication section 509, and/or installed from the removable medium 511.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules involved in the embodiments of the present application may be implemented in software or in hardware. The described units or modules may also be provided in a processor, for example, as: a processor includes a video frame acquisition module, a target detection module, and a determination module. The names of these units or modules do not in any way limit the units or modules themselves, and for example, the video frame acquisition module may also be described as a "module for acquiring surveillance video frames".
As another aspect, the present application also provides a computer-readable storage medium, which may be a computer-readable storage medium contained in the foregoing apparatus in the foregoing embodiment; or may be a computer-readable storage medium, alone, that is not assembled into a device. The computer-readable storage medium stores one or more programs for use by one or more processors in performing the method of determining candidate regions of violent sorting behavior described in the present application.
The above description is only illustrative of the preferred embodiments of the present application and of the principles of the technology employed. It will be appreciated by persons skilled in the art that the scope of the application is not limited to the specific combinations of the technical features described above, and also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, solutions formed by replacing the above features with technical features of similar function disclosed in (but not limited to) the present application.

Claims (8)

1. A method for determining candidate areas of violent sorting actions, the method comprising:
Acquiring a monitoring video frame;
Inputting the monitoring video frame into a pre-established target detection model, and outputting a category identification and coordinate values of a candidate region, wherein the target detection model is trained based on a RefineDet network structure, and the category identification is used for indicating whether a sorting behavior subject in the candidate region is in contact with a sorted object;
Determining the surveillance video frame as a starting frame for analyzing violent sorting actions if the category identification indicates that the sorting action subject in the candidate area is in contact with the sorted object;
the step of pre-establishing the target detection model comprises the following steps:
Acquiring a historical monitoring video frame sequence;
Labeling each historical monitoring video frame in the historical monitoring video frame sequence according to whether the sorting behavior subject is in contact with the sorted object;
preprocessing the marked historical monitoring video frame sequence;
Training the RefineDet network structure according to a gradient descent algorithm using the preprocessed historical monitoring video frame sequence to obtain the target detection model.
2. The method of claim 1, wherein the category identification further comprises a category confidence.
3. The method of claim 2, wherein after determining that the surveillance video frame is a start frame for analyzing violent sorting behavior, the method further comprises: an end frame for analysis of the violent sorting action is acquired.
4. The method according to claim 1, characterized in that the method further comprises:
If the category identification indicates that the sorting action subject in the candidate area is not in contact with the sorted object, continuing to acquire a new monitoring video frame.
5. A violent sorting action candidate region determination device, characterized in that the device comprises:
the video frame acquisition module is used for acquiring a monitoring video frame;
A target detection module, configured to input the monitoring video frame into a pre-established target detection model and output a category identification and coordinate values of a candidate region, wherein the target detection model is trained based on a RefineDet network structure, and the category identification is used for indicating whether a sorting behavior subject in the candidate region is in contact with a sorted object;
A determining module for determining the surveillance video frame as a starting frame for analyzing violent sorting actions if the category identification indicates that the sorting action subject in the candidate area is in contact with the sorted object;
A module for pre-establishing the target detection model, wherein the module is configured for:
Acquiring a historical monitoring video frame sequence;
Labeling each historical monitoring video frame in the historical monitoring video frame sequence according to whether the sorting behavior subject is in contact with the sorted object;
preprocessing the marked historical monitoring video frame sequence;
Training the RefineDet network structure according to a gradient descent algorithm using the preprocessed historical monitoring video frame sequence to obtain the target detection model.
6. The apparatus of claim 5, wherein the training submodule comprises:
the training set determining unit is used for determining a training set from the preprocessed historical monitoring video frame sequence;
A model building unit, configured to input the training set into the localization refinement module to screen prediction boxes, obtaining a first type of prediction box; the feature maps of the first type of prediction boxes are transmitted to the target detection module through the transfer connection block, and the target detection module regresses the first type of prediction boxes to obtain an initial detection model;
And the parameter updating unit is used for iteratively updating the initial detection model by using the minimized loss function to obtain the target detection model.
7. The apparatus of any of claims 5-6, wherein after determining that the surveillance video frame is a start frame for analyzing violent sorting behavior, the apparatus further comprises: and the ending frame acquisition module is used for acquiring an ending frame for analyzing the violent sorting behavior.
8. The apparatus of claim 5, wherein the apparatus further comprises:
and the new video acquisition module is used for continuously acquiring a new monitoring video frame if the category identification indicates that the sorting action main body in the candidate area is not contacted with the sorted object.
CN201811292150.9A 2018-10-31 2018-10-31 Candidate region determination method and device Active CN111126112B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811292150.9A CN111126112B (en) 2018-10-31 2018-10-31 Candidate region determination method and device


Publications (2)

Publication Number Publication Date
CN111126112A CN111126112A (en) 2020-05-08
CN111126112B true CN111126112B (en) 2024-04-16

Family

ID=70494471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811292150.9A Active CN111126112B (en) 2018-10-31 2018-10-31 Candidate region determination method and device

Country Status (1)

Country Link
CN (1) CN111126112B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113761993A (en) * 2020-06-24 2021-12-07 北京沃东天骏信息技术有限公司 Method and apparatus for outputting information
CN111797777B (en) * 2020-07-07 2023-10-17 南京大学 Sign language recognition system and method based on space-time semantic features
CN112287800A (en) * 2020-10-23 2021-01-29 北京中科模识科技有限公司 Advertisement video identification method and system under no-sample condition

Citations (6)

Publication number Priority date Publication date Assignee Title
CN102457705A (en) * 2010-10-19 2012-05-16 由田新技股份有限公司 Method and system for detecting and monitoring fight behavior
CN104680557A (en) * 2015-03-10 2015-06-03 重庆邮电大学 Intelligent detection method for abnormal behavior in video sequence image
CN105094657A (en) * 2014-05-16 2015-11-25 上海京知信息科技有限公司 Multitouch-based interactive video object segmentation fusion method
CN108197575A (en) * 2018-01-05 2018-06-22 中国电子科技集团公司电子科学研究院 A kind of abnormal behaviour recognition methods detected based on target detection and bone point and device
CN108446669A (en) * 2018-04-10 2018-08-24 腾讯科技(深圳)有限公司 motion recognition method, device and storage medium
CN108537172A (en) * 2018-04-09 2018-09-14 北京邦天信息技术有限公司 A kind of method and apparatus of the behavior based on Machine Vision Recognition people


Non-Patent Citations (1)

Title
Zhang, S. et al. Single-Shot Refinement Neural Network for Object Detection. arXiv.org, 2018. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant