CN111178267A - Video behavior identification method for monitoring illegal fishing - Google Patents

Video behavior identification method for monitoring illegal fishing

Info

Publication number
CN111178267A
Authority
CN
China
Prior art keywords
default
frame
picture
video
frames
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911395639.3A
Other languages
Chinese (zh)
Inventor
Not disclosed (不公告发明人)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Shuzhilian Technology Co Ltd
Original Assignee
Chengdu Shuzhilian Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Shuzhilian Technology Co Ltd filed Critical Chengdu Shuzhilian Technology Co Ltd
Priority to CN201911395639.3A priority Critical patent/CN111178267A/en
Publication of CN111178267A publication Critical patent/CN111178267A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/41 - Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/40 - Scenes; Scene-specific elements in video content
    • G06V20/46 - Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video behavior recognition method for monitoring illegal fishing. The method collects video data of a monitored water area, cuts the video into groups of frames, and generates ground-truth label boxes; the frame groups are input into a convolutional neural network to extract picture features and generate feature maps; feature maps are extracted from 6 layers of an SSD network, and default boxes are generated at every point of each feature map; each default box is matched against the ground-truth label boxes; the matched default boxes are collected, screened by non-maximum suppression, and the surviving boxes are output to obtain a target-localization picture containing the final target region; the target-localization picture is then input into a behavior recognition model, which judges whether the behavior is illegal fishing. The method can effectively remove video frames irrelevant to behavior recognition from videos captured in complex environments and obtain the specific position of the target in the video, thereby ensuring that the correct region is processed for behavior recognition and improving both the accuracy and the real-time performance of recognition.

Description

Video behavior identification method for monitoring illegal fishing
Technical Field
The invention belongs to the technical field of video behavior recognition, and particularly relates to a video behavior recognition method for monitoring illegal fishing.
Background
Illegal fishing is strictly prohibited in China at present, but many people still fish illegally for their own benefit, seriously damaging the ecological environment. Video surveillance can be used to monitor illegal fishing in rivers and seas, but because of environmental constraints the detection target is usually small and strongly affected by the background, so it remains very difficult to identify small fishing targets against a complex, large background.
Existing video behavior recognition methods generally classify short videos that have been segmented in advance, whereas videos from real environments are usually unsegmented and contain a large amount of irrelevant information. In practical applications this complex environmental information strongly interferes with behavior recognition and greatly reduces its accuracy.
Disclosure of Invention
In order to solve these problems, the invention provides a video behavior recognition method for monitoring illegal fishing. The method effectively removes video frames irrelevant to behavior recognition from videos captured in complex environments and obtains the specific position of the target in the video, thereby ensuring that the correct region is processed for behavior recognition and improving both the accuracy and the real-time performance of recognition.
In order to achieve this purpose, the invention adopts the following technical scheme: a video behavior recognition method for monitoring illegal fishing, which detects the specific position of an illegal fishing target in a video and performs behavior recognition on the target, comprising the following steps:
collecting video data of the monitored water area, cutting the video data into groups of frames, marking the targets in the frame groups, and generating ground-truth label boxes;
inputting the frame groups into a convolutional neural network, extracting picture features, and generating feature maps;
generating default boxes: extracting feature maps from 6 layers of the SSD network and generating default boxes at every point of each feature map; the number of default boxes differs from layer to layer, but every point carries default boxes;
matching default boxes: matching each default box against the ground-truth label boxes;
screening default boxes: collecting the matched default boxes, screening them by non-maximum suppression, and outputting the surviving boxes to obtain a target-localization picture containing the final target region;
behavior recognition: inputting the target-localization picture into a behavior recognition model and judging whether the behavior is illegal fishing.
Further, in the default-box matching step, each default box is matched against the ground-truth label boxes and is assigned to the ground-truth box with which it has the largest IoU; each default box corresponds to exactly one ground-truth box, while one ground-truth box may correspond to several default boxes.
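As a minimal illustration of this matching rule (the [x1, y1, x2, y2] box layout and helper names are assumptions for the sketch, not the patent's own code), each default box is assigned to the ground-truth box with the largest IoU:

```python
import numpy as np

def iou(box, boxes):
    """IoU between one box and an (N, 4) array of boxes, all in [x1, y1, x2, y2] format."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def match_default_boxes(default_boxes, gt_boxes):
    """Assign every default box to the ground-truth box with which it has the largest IoU.
    One default box maps to exactly one ground-truth box; one ground-truth box may be
    matched by many default boxes."""
    gt_boxes = np.asarray(gt_boxes, dtype=float)
    matches, best = [], []
    for d in default_boxes:
        overlaps = iou(np.asarray(d, dtype=float), gt_boxes)
        matches.append(int(np.argmax(overlaps)))
        best.append(float(overlaps.max()))
    return matches, best
```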
Further, after the default boxes have been matched:
if the i-th default box is matched to the j-th ground-truth box, the softmax loss of the default box with respect to the background category is computed, where the softmax function is:

$$ S_i = \frac{e^{V_i}}{\sum_{k=1}^{C} e^{V_k}} $$

where V_i is the output of the classifier's preceding output unit, i is the category index, and C is the total number of categories; S_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, i.e. the confidence associated with the detected default box.
The confidences of all detected default boxes are obtained in this way, in preparation for screening the default boxes.
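A small sketch (illustrative, not the patent's implementation) of how the confidences S_i and the background-category loss described above can be computed from the classifier outputs V_i; treating class 0 as the background category is an assumption:

```python
import numpy as np

def softmax_confidence(logits):
    """logits: (num_boxes, C) array of classifier outputs V_i.
    Returns S_i = exp(V_i) / sum_k exp(V_k) for every box and class."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # shift for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def background_softmax_loss(logits, background_class=0):
    """Cross-entropy of each default box with respect to the assumed background class."""
    conf = softmax_confidence(logits)
    return -np.log(conf[:, background_class] + 1e-12)
```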
Further, all generated default boxes are screened with the non-maximum suppression method to obtain the final target region.
The non-maximum suppression algorithm first removes the boxes whose confidence is below a preset threshold; among the remaining boxes, it selects the box with the highest confidence, computes the intersection-over-union (IoU) between this box and each of the other boxes, and suppresses any box whose IoU exceeds a preset threshold, since it is not a local maximum; it then selects the box with the highest confidence among the boxes that remain and repeats the procedure until no further default box is suppressed; the boxes obtained at the end are the result boxes.
The position of the target in the picture is determined from the resulting target region, yielding the target-localization picture.
Further, when all generated default boxes are screened with the non-maximum suppression method, the final target region is obtained according to the formula:

$$ S_i = \begin{cases} S_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases} $$

where N_t is the suppression threshold, b_i is a detected default box, S_i is the confidence of box b_i, and M is the box with the highest confidence among the detections currently being compared.
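The following is a minimal sketch of this hard non-maximum suppression step; the threshold values and the [x1, y1, x2, y2] box format are illustrative assumptions (the patent only specifies that preset thresholds are used):

```python
import numpy as np

def non_max_suppression(boxes, scores, conf_thresh=0.5, nms_thresh=0.45):
    """boxes: (N, 4) array in [x1, y1, x2, y2]; scores: (N,) confidences S_i.
    Returns the indices of the surviving result boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)[::-1]              # highest confidence first
    order = order[scores[order] >= conf_thresh]   # drop low-confidence boxes
    keep = []
    while order.size > 0:
        m = order[0]                              # current maximum-confidence box M
        keep.append(int(m))
        if order.size == 1:
            break
        rest = order[1:]
        # IoU between M and the remaining boxes
        x1 = np.maximum(boxes[m, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[m, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[m, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[m, 3], boxes[rest, 3])
        inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
        area_m = (boxes[m, 2] - boxes[m, 0]) * (boxes[m, 3] - boxes[m, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        ious = inter / (area_m + area_r - inter + 1e-9)
        order = rest[ious <= nms_thresh]          # suppress non-maxima with IoU > N_t
    return keep
```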
Further, inputting the target-localization picture into the behavior recognition model and judging whether the behavior is illegal fishing comprises the following steps:
sorting the pictures obtained by target localization in temporal order, extracting every 16 consecutive pictures as a picture group, with 8 consecutive pictures shared between two adjacent groups;
inputting the picture groups into the behavior recognition model and extracting the picture features of each group through the convolutional layers of the network to generate feature maps;
connecting all features in the fully connected layers of the model and sending the output values to a softmax classifier;
outputting the category of the target behavior through the softmax layer of the model.
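A short sketch of the clip grouping described above, i.e. 16-picture groups in temporal order with 8 pictures shared between adjacent groups; the helper name and plain-list representation are assumptions for illustration:

```python
def make_clips(frames, clip_len=16, stride=8):
    """frames: list of target-localized pictures already sorted by time.
    Returns consecutive 16-frame clips in which adjacent clips share 8 frames."""
    clips = []
    for start in range(0, len(frames) - clip_len + 1, stride):
        clips.append(frames[start:start + clip_len])
    return clips

# Example: 40 frames produce clips starting at frames 0, 8, 16 and 24.
# print([(c[0], c[-1]) for c in make_clips(list(range(40)))])
```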
Further, the behavior recognition model comprises 8 convolutional layers, 5 pooling layers and 3 fully connected layers, and the model is trained using temporal information.
The first convolutional layer uses 1 × 3 convolution kernels with stride 0 × 1, followed by a pooling layer with kernel size 1 × 2 and stride 1 × 2; the second convolutional layer uses 3 × 3 convolution kernels with stride 1 × 1, followed by a pooling layer with kernel size 1 × 2 and stride 1 × 2; the remaining convolutional layers use 3 × 3 kernels with stride 1 × 1, followed by pooling layers with kernel size 2 × 2 and stride 2 × 2. The seventh and eighth layers are fully connected layers, and the ninth layer is a softmax layer that produces the final classification result.
Further, to preserve early temporal information, 3D pooling layers are used in the behavior recognition model: the kernel size of the first and second pooling layers is 1 × 2 with stride 1 × 2, and all remaining 3D pooling layers use 2 × 2 kernels with stride 2 × 2. The first number of each pooling kernel is the temporal depth: when the temporal depth is 1, pooling is performed within a single frame; when it is greater than 1, pooling is performed across multiple frames. The former helps preserve the temporal characteristics of the initial stage.
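A PyTorch-style sketch of a 3D convolutional network with this overall shape: 8 convolutional layers, 5 pooling layers, and 3 fully connected layers followed by a softmax output. The channel widths, the 3 × 3 × 3 kernels, and the two-class output are assumptions for illustration, not the patent's verified configuration; only the pattern of temporal-depth-1 early pooling followed by deeper temporal pooling is taken from the description above:

```python
import torch
import torch.nn as nn

class FishingBehaviorNet(nn.Module):
    """Illustrative C3D-style model: 8 conv layers, 5 pooling layers, 3 FC layers."""
    def __init__(self, num_classes=2):
        super().__init__()
        def block(cin, cout, n_convs, pool_kernel, pool_stride):
            layers = []
            for i in range(n_convs):
                layers += [nn.Conv3d(cin if i == 0 else cout, cout,
                                     kernel_size=3, padding=1),
                           nn.ReLU(inplace=True)]
            layers.append(nn.MaxPool3d(pool_kernel, pool_stride))
            return layers
        self.features = nn.Sequential(
            # temporal depth 1 in the first two pooling layers preserves early time information
            *block(3,    64, 1, (1, 2, 2), (1, 2, 2)),
            *block(64,  128, 1, (1, 2, 2), (1, 2, 2)),
            *block(128, 256, 2, (2, 2, 2), (2, 2, 2)),
            *block(256, 512, 2, (2, 2, 2), (2, 2, 2)),
            *block(512, 512, 2, (2, 2, 2), (2, 2, 2)),
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(512 * 2 * 7 * 7, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, 4096), nn.ReLU(inplace=True),
            nn.Linear(4096, num_classes),   # softmax is applied by the loss / at inference
        )
    def forward(self, x):                   # x: (batch, 3, 16, 224, 224)
        return self.classifier(self.features(x))

# Usage sketch: logits = FishingBehaviorNet()(torch.randn(1, 3, 16, 224, 224))
```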
The beneficial effects of this technical scheme are as follows:
The target is first located in the video to obtain its specific position; the located region is then analysed by the behavior recognition algorithm, and the behavior of the target is finally judged. This greatly improves the efficiency of video behavior recognition: no recognition is performed on invalid targets or invalid regions of the video, and when a valid target is present, recognition is applied only to the specific region where the target is located, which reduces the influence of complex backgrounds and greatly improves the accuracy and real-time performance of behavior recognition.
The method cuts the video into individual frames and feeds them into the network; the target regions are obtained by the detection model, the corresponding image regions are cropped, and the cropped images are fed into the back-end behavior recognition model, which is trained to complete behavior recognition. The front end of the model thus purifies the background of the region used for behavior recognition, which helps improve recognition accuracy. A high-level sketch of this pipeline is given below.
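A high-level sketch of this front-end detection plus back-end recognition pipeline; the detector, behavior_model, and crop helpers are placeholders (assumptions), not the patent's actual modules:

```python
def crop(frame, region):
    """region assumed to be (x1, y1, x2, y2) pixel coordinates of a NumPy image."""
    x1, y1, x2, y2 = region
    return frame[y1:y2, x1:x2]

def recognize_fishing(video_frames, detector, behavior_model, clip_len=16, stride=8):
    """video_frames: frames cut from the surveillance video, in time order.
    detector(frame) is assumed to return the final target region after NMS (or None);
    behavior_model(clip) is assumed to return a behavior label such as 'illegal_fishing'."""
    localized = []
    for frame in video_frames:
        region = detector(frame)        # front end: locate the target, purify the background
        if region is not None:          # frames without a valid target are discarded
            localized.append(crop(frame, region))
    labels = []
    for start in range(0, len(localized) - clip_len + 1, stride):
        clip = localized[start:start + clip_len]
        labels.append(behavior_model(clip))   # back end: classify the behavior of the clip
    return labels
```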
Drawings
Fig. 1 is a flow chart of a video behavior recognition method for monitoring illegal fishing according to the present invention.
Fig. 2 is a schematic structural diagram of a model used in the embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described with reference to the accompanying drawings.
In this embodiment, referring to fig. 1, the invention provides a video behavior recognition method for monitoring illegal fishing, which detects the specific position of an illegal fishing target in a video and performs behavior recognition on the target, comprising the following steps:
collecting video data of the monitored water area, cutting the video data into groups of frames, marking the targets in the frame groups, and generating ground-truth label boxes;
inputting the frame groups into a convolutional neural network, extracting picture features, and generating feature maps;
generating default boxes: extracting feature maps from 6 layers of the SSD network and generating default boxes at every point of each feature map; the number of default boxes differs from layer to layer, but every point carries default boxes;
matching default boxes: matching each default box against the ground-truth label boxes;
screening default boxes: collecting the matched default boxes, screening them by non-maximum suppression, and outputting the surviving boxes to obtain a target-localization picture containing the final target region;
behavior recognition: inputting the target-localization picture into a behavior recognition model and judging whether the behavior is illegal fishing.
As an optimization of the above embodiment, in the default-box matching step each default box is matched against the ground-truth label boxes and assigned to the ground-truth box with which it has the largest IoU; each default box corresponds to exactly one ground-truth box, while one ground-truth box may correspond to several default boxes.
After the default boxes have been matched:
if the i-th default box is matched to the j-th ground-truth box, the softmax loss of the default box with respect to the background category is computed, where the softmax function is:

$$ S_i = \frac{e^{V_i}}{\sum_{k=1}^{C} e^{V_k}} $$

where V_i is the output of the classifier's preceding output unit, i is the category index, and C is the total number of categories; S_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, i.e. the confidence associated with the detected default box.
The confidences of all detected default boxes are obtained in this way, in preparation for screening the default boxes.
As an optimization of the above embodiment, all generated default boxes are screened with the non-maximum suppression method to obtain the final target region.
The non-maximum suppression algorithm first removes the boxes whose confidence is below a preset threshold; among the remaining boxes, it selects the box with the highest confidence, computes the IoU between this box and each of the other boxes, and suppresses any box whose IoU exceeds a preset threshold, since it is not a local maximum; it then selects the box with the highest confidence among the boxes that remain and repeats the procedure until no further default box is suppressed; the boxes obtained at the end are the result boxes.
The position of the target in the picture is determined from the resulting target region, yielding the target-localization picture.
When all generated default boxes are screened with the non-maximum suppression method, the final target region is obtained according to the formula:

$$ S_i = \begin{cases} S_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases} $$

where N_t is the suppression threshold, b_i is a detected default box, S_i is the confidence of box b_i, and M is the box with the highest confidence among the detections currently being compared.
As an optimization of the above embodiment, inputting the target-localization picture into the behavior recognition model and judging whether the behavior is illegal fishing comprises the following steps:
sorting the pictures obtained by target localization in temporal order, extracting every 16 consecutive pictures as a picture group, with 8 consecutive pictures shared between two adjacent groups;
inputting the picture groups into the behavior recognition model and extracting the picture features of each group through the convolutional layers of the network to generate feature maps;
connecting all features in the fully connected layers of the model and sending the output values to a softmax classifier;
outputting the category of the target behavior through the softmax layer of the model.
The behavior recognition model comprises 8 convolutional layers, 5 pooling layers and 3 fully connected layers, and is trained using temporal information.
The first convolutional layer uses 1 × 3 convolution kernels with stride 0 × 1, followed by a pooling layer with kernel size 1 × 2 and stride 1 × 2; the second convolutional layer uses 3 × 3 convolution kernels with stride 1 × 1, followed by a pooling layer with kernel size 1 × 2 and stride 1 × 2; the remaining convolutional layers use 3 × 3 kernels with stride 1 × 1, followed by pooling layers with kernel size 2 × 2 and stride 2 × 2. The seventh and eighth layers are fully connected layers, and the ninth layer is a softmax layer that produces the final classification result.
To preserve early temporal information, 3D pooling layers are used in the behavior recognition model: the kernel size of the first and second pooling layers is 1 × 2 with stride 1 × 2, and all remaining 3D pooling layers use 2 × 2 kernels with stride 2 × 2. The first number of each pooling kernel is the temporal depth: when the temporal depth is 1, pooling is performed within a single frame; when it is greater than 1, pooling is performed across multiple frames. The former helps preserve the temporal characteristics of the initial stage.
The following example illustrates the video behavior recognition method for monitoring illegal fishing used by the invention, which detects the specific position of the illegal fishing target in the video and performs behavior recognition on the target. The network in fig. 2 is based on a convolutional network: it generates a fixed-size set of bounding boxes and object class scores for each box, applies a non-maximum suppression step to produce the final detections, and then performs behavior recognition on the detected target region. The specific steps are as follows:
1. A 300 × 300 × 3 picture is input into the network; picture features are extracted with a VGG-16 backbone and feature maps are generated. Feature maps are extracted from 6 convolutional layers of the network, and default boxes are generated at every point of each feature map; the number of default boxes differs from layer to layer, but every point carries default boxes. Convolutional layer 6 yields 38 × 38 × 4 = 5776 default boxes, convolutional layer 7 yields 19 × 19 × 6 = 2166, convolutional layer 8 yields 10 × 10 × 6 = 600, convolutional layer 9 yields 5 × 5 × 6 = 150, convolutional layer 10 yields 3 × 3 × 4 = 36, and convolutional layer 11 yields 1 × 1 × 4 = 4, for a total of 8732 default boxes. These default boxes are then passed to the NMS (non-maximum suppression) module to obtain the final detection result, i.e. the specific position of the target (person) in the video.
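The per-layer default-box totals quoted above can be checked with a few lines; the layer names simply follow the numbering used in this example:

```python
# (feature map side length, default boxes per point) for the six detection layers
layers = {"conv6": (38, 4), "conv7": (19, 6), "conv8": (10, 6),
          "conv9": (5, 6), "conv10": (3, 4), "conv11": (1, 4)}
counts = {name: side * side * per_point for name, (side, per_point) in layers.items()}
print(counts)                # {'conv6': 5776, 'conv7': 2166, ..., 'conv11': 4}
print(sum(counts.values()))  # 8732 default boxes in total
```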
2. The detected pictures obtained by target localization are resized from 300 × 300 × 3 to 224 × 224 × 3, so the input is a three-channel picture of 224 × 224 pixels. The first layer uses 1 × 3 convolution kernels with stride 0 × 1 and a pooling layer with kernel size 1 × 2 and stride 1 × 2; the second layer uses 3 × 3 convolution kernels with stride 1 × 1 and a pooling layer with kernel size 1 × 2 and stride 1 × 2; the remaining layers use 3 × 3 convolution kernels with stride 1 × 1 and pooling layers with kernel size 2 × 2. The sixth layer is a fully connected layer, comprising 8192 and 4096 hidden units respectively, i.e. only 4096 feature values remain after the seventh fully connected layer; finally, the eighth layer is a fully connected layer that produces the final classification result.
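A brief sketch of this resizing step, turning the 300 × 300 × 3 detector crops into a 224 × 224 × 3, 16-frame clip tensor for the 3D network; the use of OpenCV and NumPy is an assumption about tooling, not part of the patent:

```python
import cv2
import numpy as np

def frames_to_clip(frames, size=(224, 224)):
    """frames: list of 16 BGR images (e.g. 300x300x3 crops from the detector).
    Returns a float32 array of shape (3, 16, 224, 224) ready for the 3D network."""
    resized = [cv2.resize(f, size).astype(np.float32) / 255.0 for f in frames]
    clip = np.stack(resized, axis=0)          # (16, 224, 224, 3)
    return np.transpose(clip, (3, 0, 1, 2))   # (3, 16, 224, 224)
```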
The foregoing shows and describes the general principles, principal features, and advantages of the invention. Those skilled in the art will understand that the invention is not limited to the embodiments described above; the embodiments and the description merely illustrate the principle of the invention, and various changes and modifications may be made without departing from the spirit and scope of the invention, all of which fall within the scope of the claimed invention. The scope of the invention is defined by the appended claims and their equivalents.

Claims (8)

1. A video behavior identification method for monitoring illegal fishing, characterized in that the specific position of an illegal fishing target in a video is detected and behavior recognition is performed on the target, comprising the following steps:
collecting video data of the monitored water area, cutting the video data into groups of frames, marking the targets in the frame groups, and generating ground-truth label boxes;
inputting the frame groups into a convolutional neural network, extracting picture features, and generating feature maps;
generating default boxes: extracting feature maps from 6 layers of the SSD network and generating default boxes at every point of each feature map; the number of default boxes differs from layer to layer, but every point carries default boxes;
matching default boxes: matching each default box against the ground-truth label boxes;
screening default boxes: collecting the matched default boxes, screening them by non-maximum suppression, and outputting the surviving boxes to obtain a target-localization picture containing the final target region;
behavior recognition: inputting the target-localization picture into a behavior recognition model and judging whether the behavior is illegal fishing.
2. The video behavior identification method for monitoring illegal fishing according to claim 1, characterized in that in the default-box matching step, each default box is matched against the ground-truth label boxes and assigned to the ground-truth box with which it has the largest IoU; each default box corresponds to exactly one ground-truth box, while one ground-truth box may correspond to several default boxes.
3. The video behavior identification method for monitoring illegal fishing according to claim 2, characterized in that, after the default boxes have been matched:
if the i-th default box is matched to the j-th ground-truth box, the softmax loss of the default box with respect to the background category is computed, where the softmax function is:

$$ S_i = \frac{e^{V_i}}{\sum_{k=1}^{C} e^{V_k}} $$

where V_i is the output of the classifier's preceding output unit, i is the category index, and C is the total number of categories; S_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements, i.e. the confidence associated with the detected default box;
the confidences of all detected default boxes are obtained in this way, in preparation for screening the default boxes.
4. The video behavior identification method for monitoring illegal fishing according to claim 3, characterized in that all generated default boxes are screened with the non-maximum suppression method to obtain the final target region;
the non-maximum suppression algorithm first removes the boxes whose confidence is below a preset threshold; among the remaining boxes, it selects the box with the highest confidence, computes the IoU between this box and each of the other boxes, and suppresses any box whose IoU exceeds a preset threshold, since it is not a local maximum; it then selects the box with the highest confidence among the boxes that remain and repeats the procedure until no further default box is suppressed; the boxes obtained at the end are the result boxes;
the position of the target in the picture is determined from the resulting target region, yielding the target-localization picture.
5. The video behavior identification method for monitoring illegal fishing according to claim 4, characterized in that when all generated default boxes are screened with the non-maximum suppression method, the final target region is obtained according to the formula:

$$ S_i = \begin{cases} S_i, & \mathrm{IoU}(M, b_i) < N_t \\ 0, & \mathrm{IoU}(M, b_i) \ge N_t \end{cases} $$

where N_t is the suppression threshold, b_i is a detected default box, S_i is the confidence of box b_i, and M is the box with the highest confidence among the detections currently being compared.
6. The video behavior identification method for monitoring illegal fishing according to claim 1, characterized in that inputting the target-localization picture into the behavior recognition model and judging whether the behavior is illegal fishing comprises the following steps:
sorting the pictures obtained by target localization in temporal order, extracting every 16 consecutive pictures as a picture group, with 8 consecutive pictures shared between two adjacent groups;
inputting the picture groups into the behavior recognition model and extracting the picture features of each group through the convolutional layers of the network to generate feature maps;
connecting all features in the fully connected layers of the model and sending the output values to a softmax classifier;
outputting the category of the target behavior through the softmax layer of the model.
7. The video behavior identification method for monitoring illegal fishing according to claim 6, characterized in that the behavior recognition model comprises 8 convolutional layers, 5 pooling layers and 3 fully connected layers, and the model is trained using temporal information;
the first convolutional layer uses 1 × 3 convolution kernels with stride 0 × 1, followed by a pooling layer with kernel size 1 × 2 and stride 1 × 2; the second convolutional layer uses 3 × 3 convolution kernels with stride 1 × 1, followed by a pooling layer with kernel size 1 × 2 and stride 1 × 2; the remaining convolutional layers use 3 × 3 kernels with stride 1 × 1, followed by pooling layers with kernel size 2 × 2 and stride 2 × 2; the seventh and eighth layers are fully connected layers, and the ninth layer is a softmax layer that produces the final classification result.
8. The video behavior identification method for monitoring illegal fishing according to claim 7, characterized in that, to preserve early temporal information, 3D pooling layers are used in the behavior recognition model: the kernel size of the first and second pooling layers is 1 × 2 with stride 1 × 2, and all remaining 3D pooling layers use 2 × 2 kernels with stride 2 × 2; the first number of each pooling kernel is the temporal depth: when the temporal depth is 1, pooling is performed within a single frame, and when it is greater than 1, pooling is performed across multiple frames, the former helping to preserve the temporal characteristics of the initial stage.
CN201911395639.3A 2019-12-30 2019-12-30 Video behavior identification method for monitoring illegal fishing Pending CN111178267A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911395639.3A CN111178267A (en) 2019-12-30 2019-12-30 Video behavior identification method for monitoring illegal fishing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911395639.3A CN111178267A (en) 2019-12-30 2019-12-30 Video behavior identification method for monitoring illegal fishing

Publications (1)

Publication Number Publication Date
CN111178267A true CN111178267A (en) 2020-05-19

Family

ID=70652262

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911395639.3A Pending CN111178267A (en) 2019-12-30 2019-12-30 Video behavior identification method for monitoring illegal fishing

Country Status (1)

Country Link
CN (1) CN111178267A (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030117652A1 (en) * 1999-09-17 2003-06-26 Paul Lapstun Rotationally symmetric tags
CN107301376A (en) * 2017-05-26 2017-10-27 浙江大学 A kind of pedestrian detection method stimulated based on deep learning multilayer
CN107679469A (en) * 2017-09-22 2018-02-09 东南大学—无锡集成电路技术研究所 A kind of non-maxima suppression method based on deep learning
CN109147254A (en) * 2018-07-18 2019-01-04 武汉大学 A kind of video outdoor fire disaster smog real-time detection method based on convolutional neural networks
CN109409283A (en) * 2018-10-24 2019-03-01 深圳市锦润防务科技有限公司 A kind of method, system and the storage medium of surface vessel tracking and monitoring
CN109740463A (en) * 2018-12-21 2019-05-10 沈阳建筑大学 A kind of object detection method under vehicle environment
CN110309702A (en) * 2019-04-18 2019-10-08 成都数之联科技有限公司 A kind of shops's counter video surveillance management system
CN110334574A (en) * 2019-04-26 2019-10-15 武汉理工大学 A method of automatically extracting traffic accident key frame in traffic video
CN110232319A (en) * 2019-05-07 2019-09-13 杭州电子科技大学 A kind of ship Activity recognition method based on deep learning
CN110516605A (en) * 2019-08-28 2019-11-29 北京观微科技有限公司 Any direction Ship Target Detection method based on cascade neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宁煜西 et al., "Recognition of key information in flight-tracking video based on convolutional neural networks", Journal of Air Force Early Warning Academy *

Similar Documents

Publication Publication Date Title
CN109766830B (en) Ship target identification system and method based on artificial intelligence image processing
CN110119728B (en) Remote sensing image cloud detection method based on multi-scale fusion semantic segmentation network
CN107016357B (en) Video pedestrian detection method based on time domain convolutional neural network
CN111797712B (en) Remote sensing image cloud and cloud shadow detection method based on multi-scale feature fusion network
CN109145872B (en) CFAR and Fast-RCNN fusion-based SAR image ship target detection method
CN108491854B (en) Optical remote sensing image target detection method based on SF-RCNN
CN112598713A (en) Offshore submarine fish detection and tracking statistical method based on deep learning
CN112733614B (en) Pest image detection method with similar size enhanced identification
CN110334703B (en) Ship detection and identification method in day and night image
CN111160407A (en) Deep learning target detection method and system
CN109766823A (en) A kind of high-definition remote sensing ship detecting method based on deep layer convolutional neural networks
CN111046827A (en) Video smoke detection method based on convolutional neural network
CN112149591A (en) SSD-AEFF automatic bridge detection method and system for SAR image
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN108734200A (en) Human body target visible detection method and device based on BING features
CN111950357A (en) Marine water surface garbage rapid identification method based on multi-feature YOLOV3
Mu et al. Salient object detection in low contrast images via global convolution and boundary refinement
CN115700737A (en) Oil spill detection method based on video monitoring
CN117475353A (en) Video-based abnormal smoke identification method and system
Yang et al. SAR image target detection and recognition based on deep network
CN117557937A (en) Monitoring camera image anomaly detection method and system
CN110349119B (en) Pavement disease detection method and device based on edge detection neural network
CN108985216B (en) Pedestrian head detection method based on multivariate logistic regression feature fusion
CN114037737B (en) Neural network-based offshore submarine fish detection and tracking statistical method
CN107341456B (en) Weather sunny and cloudy classification method based on single outdoor color image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 610000 No. 270, floor 2, No. 8, Jinxiu street, Wuhou District, Chengdu, Sichuan

Applicant after: Chengdu shuzhilian Technology Co.,Ltd.

Address before: No.2, 4th floor, building 1, Jule road crossing, Section 1, West 1st ring road, Chengdu, Sichuan 610000

Applicant before: CHENGDU SHUZHILIAN TECHNOLOGY Co.,Ltd.