CN111552837A - Animal video tag automatic generation method based on deep learning, terminal and medium - Google Patents

Animal video tag automatic generation method based on deep learning, terminal and medium

Info

Publication number
CN111552837A
Authority
CN
China
Prior art keywords
video
key frame
animal
detected
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010382574.5A
Other languages
Chinese (zh)
Inventor
刘露
蔺昊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Inveno Technology Co ltd
Original Assignee
Shenzhen Inveno Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Inveno Technology Co ltd
Priority to CN202010382574.5A
Publication of CN111552837A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/732 Query formulation
    • G06F16/7328 Query by example, e.g. a complete video frame or video sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7837 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using objects detected or recognised in the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7847 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using low-level visual features of the video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133 Distances to prototypes
    • G06F18/24137 Distances to cluster centroïds
    • G06F18/2414 Smoothing the distance, e.g. radial basis function networks [RBFN]

Abstract

The invention provides a deep-learning-based method for automatically generating animal video tags, together with a terminal and a medium. The method comprises the following steps: extracting a plurality of key frame images from a video to be detected, and inputting the key frame images into a feature extraction model; inputting the feature information output by the feature extraction model into a trained target detection algorithm model; and recording the position and category of the target object output by the target detection algorithm model in the video to be detected, and defining the category of the target object as an animal tag of the video to be detected. The method improves recognition efficiency and recognition accuracy.

Description

Animal video tag automatic generation method based on deep learning, terminal and medium
Technical Field
The invention belongs to the technical field of video tags, and particularly relates to a method, a terminal and a medium for automatically generating an animal video tag based on deep learning.
Background
An automatic animal video tag generation system detects whether an animal is present in a video and, if so, what the animal is, and tags the video accordingly. Methods commonly used in existing automatic animal video tag generation systems include the inter-frame difference method and traditional computer-vision image processing.
Referring to fig. 1 and 2, the inter-frame difference method computes the pixel-wise difference between two images taken from adjacent frames, or frames several frames apart, takes the absolute value of the brightness difference, and then thresholds the result to extract the motion regions in the images, from which the animal regions appearing in the video are inferred. The logic is simple and processing is fast. However, the method cannot be used with a moving camera, cannot recognize static objects or objects that move very slowly or very quickly, and performs poorly when large areas of the target animal's surface have similar gray values. More importantly, it can only determine whether an animal is present in the video, not what the animal is, and cannot even guarantee that this result is correct, so its applicable scenarios are very limited.
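As an illustration only (this is not part of the claimed invention), a minimal sketch of such an inter-frame difference pipeline with OpenCV could look as follows; the frame gap and the threshold value are arbitrary assumptions.

```python
import cv2

def frame_difference_regions(video_path, gap=5, thresh=30):
    """Illustrative sketch: extract motion regions by differencing frames `gap` frames apart."""
    cap = cv2.VideoCapture(video_path)
    prev_gray = None
    regions_per_step = []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % gap == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            if prev_gray is not None:
                # absolute brightness difference, then thresholding to a binary motion mask
                diff = cv2.absdiff(gray, prev_gray)
                _, mask = cv2.threshold(diff, thresh, 255, cv2.THRESH_BINARY)
                contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
                # bounding boxes of moving areas, from which an animal region might be inferred
                regions_per_step.append([cv2.boundingRect(c) for c in contours])
            prev_gray = gray
        idx += 1
    cap.release()
    return regions_per_step
```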
Referring to fig. 3, 4, and 5, the conventional computer-vision image processing method requires hand-designed features for each animal in a training data set, after which a classifier is trained on the extracted features for recognition. Detecting an animal in a video frame requires both locating the animal in the frame image and identifying its category, so the recognition model needs a localization function in addition to classification. During training, so that the final model can handle pictures of different scales, each picture is first scaled into several versions with different aspect ratios; rectangular windows of different scales and aspect ratios are then slid across the whole image, and this exhaustive strategy yields candidate regions that may contain a target. A feature matrix is then extracted from the image of each candidate region, and the extracted feature matrices are finally used to train the classifier. After the model is trained, actual use requires extracting video frames at fixed time intervals and running the model on each frame image to identify the animal categories it contains; the recognition results of all extracted frames are then merged as the recognition result of the whole video.
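Again purely for illustration, a toy sketch of the multi-scale sliding-window stage described above, using HOG as one possible hand-designed feature and an externally trained classifier; the window size, stride, scales, and the `svm` object are assumptions rather than anything prescribed by the patent.

```python
import cv2

def sliding_window_candidates(image, scales=(1.0, 0.75, 0.5), win=(64, 128), step=32):
    """Exhaustively enumerate candidate windows at several scales (illustrative sketch)."""
    h, w = image.shape[:2]
    for s in scales:
        resized = cv2.resize(image, (int(w * s), int(h * s)))
        rh, rw = resized.shape[:2]
        for y in range(0, rh - win[1] + 1, step):
            for x in range(0, rw - win[0] + 1, step):
                # yield the window mapped back to original-image coordinates plus the patch itself
                box = (int(x / s), int(y / s), int(win[0] / s), int(win[1] / s))
                yield box, resized[y:y + win[1], x:x + win[0]]

def classify_windows(image, svm):
    """Extract a HOG feature matrix per candidate window and score it with a trained classifier."""
    hog = cv2.HOGDescriptor()                     # default 64x128 detection window
    detections = []
    for box, patch in sliding_window_candidates(image):
        feat = hog.compute(patch).reshape(1, -1)  # hand-designed feature matrix for this region
        label = svm.predict(feat)                 # `svm`: any scikit-learn-style classifier, trained offline
        if label[0] != 0:                         # 0 assumed to mean "background"
            detections.append((box, int(label[0])))
    return detections
```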
Conventional computer-vision image processing can identify the animal classes that may appear in a video. However, the sliding-window approach generates a large number of redundant windows, which adds to the burden of subsequent feature extraction and recognition and seriously degrades processing efficiency. Moreover, the feature matrices extracted with hand-designed feature templates have weak representational power, and the classifier is usually a weak one such as an SVM or AdaBoost, so the recognition accuracy of the final model is low.
Disclosure of Invention
To address these defects of the prior art, the invention provides a deep-learning-based method for automatically generating animal video tags, together with a terminal and a medium, so as to improve recognition efficiency and recognition accuracy.
In a first aspect, a method for automatically generating animal video tags based on deep learning includes the following steps:
extracting a plurality of key frame images in a video to be detected, and inputting the key frame images into a feature extraction model;
inputting the feature information output by the feature extraction model into a trained target detection algorithm model;
and recording the position and the category of the target object output by the target detection algorithm model in the video to be detected, and defining the category of the target object as an animal tag of the video to be detected.
Preferably, the feature extraction model is formed by a convolutional neural network and is trained on the ImageNet classification data set.
Preferably, the target detection algorithm model is obtained by training the following method:
acquiring a training set consisting of a plurality of training pictures, and marking the position and the category of an object in each training picture;
realizing a target detection algorithm based on TensorFlow framework programming;
training the target detection algorithm by using the training set;
and saving the trained target detection algorithm as the target detection algorithm model.
Preferably, the target detection algorithm model comprises a Faster R-CNN algorithm model.
Preferably, the extracting a plurality of key frame images in the video to be detected and inputting the key frame images into the feature extraction model specifically includes:
extracting a plurality of frame images in a video to be detected at a preset time interval, and performing de-duplication processing on the extracted frame images by using a perceptual hash algorithm to obtain the key frame images;
and inputting the key frame image into a feature extraction model.
Preferably, the object detection algorithm model comprises a YOLOv2 algorithm model.
Preferably, the extracting a plurality of key frame images in the video to be detected and inputting the key frame images into the feature extraction model specifically includes:
extracting a frame of image from a video to be detected according to a preset time interval;
comparing the new frame image with the cached key frame image by using a perceptual hash algorithm; if the comparison result is smaller than the preset difference threshold value, discarding the new frame image; if the comparison result is greater than or equal to the difference threshold value, defining a new frame image as the key frame image, and inputting the key frame image into a feature extraction model;
the key frame image is buffered.
Preferably, the recording of the position and the category of the target object in the video to be detected, which is output by the target detection algorithm model, and the defining of the category of the target object as the animal tag of the video to be detected specifically includes:
recording the position and the category of the target object in each key frame image output by the Faster R-CNN algorithm model or the YOLOv2 algorithm model;
counting the number of times each type of animal appears in all key frame images, and sorting the animal types by their frequency of occurrence in the video to be detected in descending order to obtain the animal tags of the video to be detected.
In a second aspect, a terminal comprises a processor, an input device, an output device, and a memory, the processor, the input device, the output device, and the memory being interconnected, wherein the memory is configured to store a computer program, the computer program comprising program instructions, and the processor is configured to invoke the program instructions to perform the method of the first aspect.
In a third aspect, a computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to perform the method of the first aspect.
According to the above technical solutions, the deep-learning-based animal video tag automatic generation method, terminal and medium provided by the invention improve recognition efficiency and recognition accuracy.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions in the prior art, the drawings needed for describing them are briefly introduced below. Throughout the drawings, like elements or portions are generally identified by like reference numerals. In the drawings, elements or portions are not necessarily drawn to scale.
Fig. 1 is a flowchart of an animal video detection method based on an inter-frame difference method in the background art.
Fig. 2 is a flowchart of an implementation of the animal video detection method based on the inter-frame difference method in the background art.
Fig. 3 is a flowchart of a conventional computer vision image processing method provided in the background art.
Fig. 4 is a flowchart of a method for training a model in a conventional computer vision image processing method provided in the background art.
Fig. 5 is a flowchart of a video tag generation method in a conventional computer vision image processing method provided in the background art.
Fig. 6 shows the main steps of the animal video tag automatic generation method provided by the invention.
FIG. 7 is a flowchart of a training method of a target detection model according to the present invention.
Fig. 8 is a flowchart of the tag generation system using the Faster R-CNN algorithm according to the second embodiment of the present invention.
Fig. 9 is a flowchart of the tag generation system using the YOLOv2 algorithm according to the second embodiment of the present invention.
Fig. 10 shows a frame-image animal recognition result of the Faster R-CNN algorithm according to the second embodiment of the present invention.
Fig. 11 shows a frame-image animal recognition result of the YOLOv2 algorithm according to the second embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in detail below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and therefore are only examples, and the protection scope of the present invention is not limited thereby. It is to be noted that, unless otherwise specified, technical or scientific terms used herein shall have the ordinary meaning as understood by those skilled in the art to which the invention pertains.
It will be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It is also to be understood that the terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention. As used in the specification of the present invention and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a [ described condition or event ] is detected" may be interpreted contextually to mean "upon determining" or "in response to determining" or "upon detecting [ described condition or event ]" or "in response to detecting [ described condition or event ]".
Embodiment one:
An animal video tag automatic generation method based on deep learning, referring to fig. 6, includes the following steps:
extracting a plurality of key frame images in a video to be detected, and inputting the key frame images into a feature extraction model;
inputting the feature information output by the feature extraction model into a trained target detection algorithm model;
and recording the position and the category of the target object output by the target detection algorithm model in the video to be detected, and defining the category of the target object as an animal tag of the video to be detected.
Specifically, the method for automatically generating animal video tags provided by this embodiment includes a feature extraction model and a target detection model. The feature extraction model is formed by a convolutional neural network and is obtained by training on the ImageNet classification data set; it is used to extract feature information from the key frame images of the video to be detected. The target detection model comprises two functional modules, a locator and a classifier. The locator locates the target object in the key frame image and outputs the width and height of the target object and its coordinates in the key frame image. The classifier classifies the target object located by the locator and outputs the category of the target object. This deep-learning-based method for automatically generating animal video tags improves recognition efficiency and recognition accuracy.
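As a hedged sketch (the patent specifies only that the feature extraction model is a convolutional neural network trained on the ImageNet classification data set, not a particular architecture), the feature extraction model could, for example, be an ImageNet-pretrained backbone used as a pure feature extractor with tf.keras:

```python
import tensorflow as tf

def build_feature_extractor(input_shape=(600, 600, 3)):
    """Illustrative feature extraction model: an ImageNet-pretrained CNN without its
    classification head, used only to produce feature maps for the detection model."""
    backbone = tf.keras.applications.ResNet50(
        weights="imagenet",     # trained on the ImageNet classification data set
        include_top=False,      # keep convolutional features, drop the 1000-class head
        input_shape=input_shape,
    )
    backbone.trainable = False  # the locator/classifier of the detector is trained separately
    return backbone

# Usage sketch: feature information for one key frame image (batch of 1),
# which would then be fed to the target detection model.
# extractor = build_feature_extractor()
# features = extractor(tf.keras.applications.resnet50.preprocess_input(key_frame[None, ...]))
```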
Embodiment two:
the second embodiment further defines a training method of the target detection model on the basis of the first embodiment.
Referring to fig. 7, the target detection model is trained by the following method:
acquiring a training set consisting of a plurality of training pictures, and marking the position and the category of an object in each training picture;
realizing a target detection algorithm based on TensorFlow framework programming;
training the target detection algorithm by using the training set;
and saving the trained target detection algorithm as the target detection algorithm model.
Specifically, the training pictures in the training set may be selected according to the business and usage scenario of a specific user. For example, a suitable number of pictures are screened from the animal pictures that appear in the user's business, the positions and categories of the animals in these pictures are marked, and all marked pictures are used as training pictures. During training, the method continuously adjusts the parameters of the target detection model by comparing the predicted positions and categories with the annotation information of the training pictures, thereby continuously improving the model's localization and classification ability. The method periodically saves the target detection model during training until training stops, and keeps the best model as the final result. Two choices of target detection model are described below.
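A minimal sketch of the training scaffolding described above — loading the marked training pictures, comparing predictions with the annotations, and periodically saving checkpoints so that the best model can be kept at the end. The annotation format, the `detection_model` object, and its `compute_loss` method are assumptions made for illustration; the patent does not define them.

```python
import json
import tensorflow as tf

def load_training_set(annotation_file):
    """Each record is assumed to hold an image path plus the marked positions and categories,
    e.g. [{"image": "dog1.jpg", "boxes": [[x, y, w, h], ...], "labels": [3, ...]}, ...]."""
    with open(annotation_file) as f:
        return json.load(f)

def train(detection_model, records, epochs=20, ckpt_dir="checkpoints"):
    optimizer = tf.keras.optimizers.Adam(1e-4)
    ckpt = tf.train.Checkpoint(model=detection_model, optimizer=optimizer)
    manager = tf.train.CheckpointManager(ckpt, ckpt_dir, max_to_keep=5)
    for _ in range(epochs):
        for rec in records:
            image = tf.io.decode_jpeg(tf.io.read_file(rec["image"]), channels=3)
            image = tf.image.convert_image_dtype(image, tf.float32)[None, ...]
            with tf.GradientTape() as tape:
                predictions = detection_model(image, training=True)
                # compute_loss (assumed interface) compares predicted boxes/classes
                # against the marked positions and categories of this training picture
                loss = detection_model.compute_loss(predictions, rec["boxes"], rec["labels"])
            grads = tape.gradient(loss, detection_model.trainable_variables)
            optimizer.apply_gradients(zip(grads, detection_model.trainable_variables))
        manager.save()  # periodically store the model; keep the best checkpoint after training stops
```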
1. Faster R-CNN algorithm model.
The target detection model comprises a Faster R-CNN algorithm model written on the TensorFlow framework. Referring to fig. 8, extracting a plurality of key frame images from the video to be detected and inputting the key frame images into the feature extraction model specifically includes:
extracting a plurality of frame images in a video to be detected at a preset time interval, and performing de-duplication processing on the extracted frame images by using a perceptual hash algorithm to obtain the key frame images;
and inputting the key frame image into a feature extraction model.
Specifically, the method writes a Faster R-CNN algorithm model based on the TensorFlow framework and trains it with the above training set. Although the Faster R-CNN model is accurate, its complexity is high, so recognition is slow and real-time processing cannot be achieved. When the Faster R-CNN model is used, frame images are extracted from the video to be detected at fixed time intervals, and the extracted frame images are de-duplicated with a perceptual hash algorithm so that only a series of sufficiently different key frame images remains; the feature information of these key frame images is then input into the Faster R-CNN model, which outputs all animal categories and positions in each key frame image.
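Purely as an illustrative sketch of the fixed-interval frame extraction and perceptual-hash de-duplication step (the sampling interval and Hamming-distance threshold are assumptions):

```python
import cv2
import numpy as np

def phash(image, hash_size=8):
    """64-bit perceptual hash: resize, DCT, keep the low-frequency block, threshold at its median."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    small = cv2.resize(gray, (hash_size * 4, hash_size * 4)).astype(np.float32)
    low_freq = cv2.dct(small)[:hash_size, :hash_size]
    return (low_freq > np.median(low_freq)).flatten()

def hamming(h1, h2):
    return int(np.count_nonzero(h1 != h2))

def extract_key_frames(video_path, interval_s=1.0, dist_thresh=10):
    """Sample frames at a fixed interval, then keep only frames that differ enough from all kept ones."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    step = max(1, int(fps * interval_s))
    key_frames, kept_hashes = [], []
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            h = phash(frame)
            if all(hamming(h, kept) >= dist_thresh for kept in kept_hashes):
                key_frames.append(frame)   # a key frame image to pass to the feature extraction model
                kept_hashes.append(h)
        idx += 1
    cap.release()
    return key_frames
```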
2. YOLOv2 algorithm model.
The target detection model comprises a YOLOv2 algorithm model written on the TensorFlow framework. Referring to fig. 9, extracting a plurality of key frame images from the video to be detected and inputting the key frame images into the feature extraction model specifically includes:
extracting a frame of image from the video to be detected at a preset time interval;
Comparing the new frame image with the cached key frame image by using a perceptual hash algorithm; if the comparison result is smaller than the preset difference threshold value, discarding the new frame image; if the comparison result is greater than or equal to the difference threshold value, defining a new frame image as the key frame image, and inputting the key frame image into a feature extraction model;
the key frame image is buffered.
Specifically, the method writes a YOLOv2 algorithm model based on the TensorFlow framework and then trains it with the above training set. While keeping recognition accuracy comparable to that of the Faster R-CNN model, the YOLOv2 model greatly improves recognition speed, to 40 FPS to 67 FPS; it can therefore meet the requirement of real-time video processing and can be tuned between accuracy and speed as needed. In actual video processing, adjacent frame images often differ little, and it is not necessary to recognize the animals in every frame image. Therefore, in practical application, the method only needs to cache the most recently recognized key frame image, compare the latest frame image with the cached key frame image using a perceptual hash algorithm, and skip detection on the new frame image if it differs little from the cached key frame image. If the difference is large, the YOLOv2 model is used to locate the animals contained in the new frame image and their categories.
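An illustrative sketch of this streaming variant, reusing the `phash`/`hamming` helpers sketched above; `yolo_v2_detect` is a placeholder for the trained YOLOv2 model and is an assumption, not an API defined by the patent.

```python
import cv2

def stream_detect(video_path, yolo_v2_detect, interval_s=0.5, diff_thresh=10):
    """Run the detector only when a sampled frame differs enough from the cached key frame."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25
    step = max(1, int(fps * interval_s))
    cached_hash = None
    per_frame_detections = []        # one list of (box, animal_class) per key frame
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            h = phash(frame)
            if cached_hash is None or hamming(h, cached_hash) >= diff_thresh:
                per_frame_detections.append(yolo_v2_detect(frame))  # locate animals and their categories
                cached_hash = h                                     # cache the new key frame
            # otherwise the new frame differs little from the cached key frame and is discarded
        idx += 1
    cap.release()
    return per_frame_detections
```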
Preferably, the recording of the position and the category of the target object in the video to be detected, which is output by the target detection algorithm model, and the defining of the category of the target object as the animal tag of the video to be detected specifically includes:
recording the position and the category of the target object in each key frame image output by the Faster R-CNN algorithm model or the YOLOv2 algorithm model;
counting the number of times each type of animal appears in all key frame images, and sorting the animal types by their frequency of occurrence in the video to be detected in descending order to obtain the animal tags of the video to be detected.
Specifically, the method obtains the animal tags of the video by counting how many times each type of animal appears in all frame images and sorting the animal types by their frequency of occurrence in the video to be detected in descending order.
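A short sketch of this tag aggregation step; the per-frame detection format `(box, animal_class)` is an assumption carried over from the sketches above.

```python
from collections import Counter

def aggregate_tags(per_frame_detections):
    """per_frame_detections: list of lists of (box, animal_class) tuples, one list per key frame.
    Returns the animal categories ordered by how often they appear across all key frames."""
    counts = Counter(cls for detections in per_frame_detections for _, cls in detections)
    # descending frequency order = the animal tags of the video to be detected
    return [animal for animal, _ in counts.most_common()]
```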
FIG. 10 shows a frame-image animal recognition result of the Faster R-CNN algorithm model. In fig. 10 there are two dogs; the coordinate positions of the dogs located by the model are drawn as two boxes, and the animal in each box is labeled according to the classification result, for example as a dog. In a concrete implementation, the boxes and animal types do not need to be drawn as in fig. 10; it is sufficient to record the animal types appearing in the image and the animals of each type. Fig. 11 shows a frame-image animal recognition result of the YOLOv2 model.
For the sake of brevity, for parts of the method provided by this embodiment that are not described here, reference may be made to the corresponding content in the foregoing method embodiments.
Embodiment three:
The third embodiment provides a terminal on the basis of the above embodiments.
A terminal comprising a processor, an input device, an output device and a memory, the processor, the input device, the output device and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method described above.
It should be understood that, in the embodiments of the present invention, the processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
The input device may include a touch pad, a fingerprint sensor (for collecting fingerprint information of a user and direction information of the fingerprint), a microphone, etc., and the output device may include a display (LCD, etc.), a speaker, etc.
The memory may include both read-only memory and random access memory, and provides instructions and data to the processor. A portion of the memory may also include non-volatile random access memory. For example, the memory may also store device type information.
For the sake of brevity, for parts of this embodiment that are not described here, reference may be made to the corresponding content in the foregoing method embodiments.
Embodiment four:
The fourth embodiment provides a medium on the basis of the above-described embodiments.
A computer-readable storage medium, in which a computer program is stored, the computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the above-mentioned method.
The computer readable storage medium may be an internal storage unit of the terminal according to any of the foregoing embodiments, for example, a hard disk or a memory of the terminal. The computer readable storage medium may also be an external storage device of the terminal, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the terminal. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the terminal. The computer-readable storage medium is used for storing the computer program and other programs and data required by the terminal. The computer readable storage medium may also be used to temporarily store data that has been output or is to be output.
For the sake of brevity, for parts of the medium provided by this embodiment that are not described here, reference may be made to the corresponding content in the foregoing method embodiments.
Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit and scope of the present invention, and they should be construed as falling within the scope of the claims and description of the present invention.

Claims (10)

1. An animal video tag automatic generation method based on deep learning is characterized by comprising the following steps:
extracting a plurality of key frame images in a video to be detected, and inputting the key frame images into a feature extraction model;
inputting the feature information output by the feature extraction model into a trained target detection algorithm model;
and recording the position and the category of the target object output by the target detection algorithm model in the video to be detected, and defining the category of the target object as an animal tag of the video to be detected.
2. The method for automatically generating animal video tags based on deep learning of claim 1,
the feature extraction model is formed by a convolutional neural network and is obtained by training on the ImageNet classification data set.
3. The method for automatically generating animal video tags based on deep learning of claim 1, wherein the target detection algorithm model is trained by the following method:
acquiring a training set consisting of a plurality of training pictures, and marking the position and the category of an object in each training picture;
realizing a target detection algorithm based on TensorFlow framework programming;
training the target detection algorithm by using the training set;
and saving the trained target detection algorithm as the target detection algorithm model.
4. The method for automatically generating animal video tags based on deep learning of claim 3,
the target detection algorithm model comprises a Faster R-CNN algorithm model.
5. The method for automatically generating animal video tags based on deep learning of claim 4, wherein the extracting a plurality of key frame images from the video to be detected and inputting the key frame images into the feature extraction model specifically comprises:
extracting a plurality of frame images in a video to be detected at a preset time interval, and performing de-duplication processing on the extracted frame images by using a perceptual hash algorithm to obtain the key frame images;
and inputting the key frame image into a feature extraction model.
6. The method for automatically generating animal video tags based on deep learning of claim 3,
the target detection algorithm model comprises a YOLOv2 algorithm model.
7. The method for automatically generating animal video tags based on deep learning of claim 6, wherein the extracting a plurality of key frame images from a video to be detected and inputting the key frame images into a feature extraction model specifically comprises:
extracting a frame of image from a video to be detected according to a preset time interval;
comparing the new frame image with the cached key frame image by using a perceptual hash algorithm; if the comparison result is smaller than the preset difference threshold value, discarding the new frame image; if the comparison result is greater than or equal to the difference threshold value, defining a new frame image as the key frame image, and inputting the key frame image into a feature extraction model;
the key frame image is buffered.
8. The method for automatically generating animal video tags based on deep learning according to claim 5 or 7, wherein the step of recording the position and the category of the target object output by the target detection algorithm model in the video to be detected and the step of defining the category of the target object as the animal tags of the video to be detected specifically comprises the steps of:
recording the position and the category of the target object in each key frame image output by the Faster R-CNN algorithm model or the YOLOv2 algorithm model;
counting the number of times each type of animal appears in all key frame images, and sorting the animal types by their frequency of occurrence in the video to be detected in descending order to obtain the animal tags of the video to be detected.
9. A terminal, comprising a processor, an input device, an output device, and a memory, the processor, the input device, the output device, and the memory being interconnected, wherein the memory is configured to store a computer program comprising program instructions, the processor being configured to invoke the program instructions to perform the method of any of claims 1-7.
10. A computer-readable storage medium, characterized in that the computer storage medium stores a computer program comprising program instructions that, when executed by a processor, cause the processor to perform the method according to any of claims 1-7.
CN202010382574.5A 2020-05-08 2020-05-08 Animal video tag automatic generation method based on deep learning, terminal and medium Pending CN111552837A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010382574.5A CN111552837A (en) 2020-05-08 2020-05-08 Animal video tag automatic generation method based on deep learning, terminal and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010382574.5A CN111552837A (en) 2020-05-08 2020-05-08 Animal video tag automatic generation method based on deep learning, terminal and medium

Publications (1)

Publication Number Publication Date
CN111552837A (en) 2020-08-18

Family

ID=72001892

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010382574.5A Pending CN111552837A (en) 2020-05-08 2020-05-08 Animal video tag automatic generation method based on deep learning, terminal and medium

Country Status (1)

Country Link
CN (1) CN111552837A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105718890A (en) * 2016-01-22 2016-06-29 北京大学 Method for detecting specific videos based on convolution neural network
CN110119757A (en) * 2019-03-28 2019-08-13 北京奇艺世纪科技有限公司 Model training method, video category detection method, device, electronic equipment and computer-readable medium
CN110147722A (en) * 2019-04-11 2019-08-20 平安科技(深圳)有限公司 A kind of method for processing video frequency, video process apparatus and terminal device
CN110188794A (en) * 2019-04-23 2019-08-30 深圳大学 A kind of training method, device, equipment and the storage medium of deep learning model
CN110472492A (en) * 2019-07-05 2019-11-19 平安国际智慧城市科技股份有限公司 Target organism detection method, device, computer equipment and storage medium

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114866788A (en) * 2021-02-03 2022-08-05 阿里巴巴集团控股有限公司 Video processing method and device
CN112819885A (en) * 2021-02-20 2021-05-18 深圳市英威诺科技有限公司 Animal identification method, device and equipment based on deep learning and storage medium
CN113076882A (en) * 2021-04-03 2021-07-06 国家计算机网络与信息安全管理中心 Specific mark detection method based on deep learning
CN115115822A (en) * 2022-06-30 2022-09-27 小米汽车科技有限公司 Vehicle-end image processing method and device, vehicle, storage medium and chip
CN115115822B (en) * 2022-06-30 2023-10-31 小米汽车科技有限公司 Vehicle-end image processing method and device, vehicle, storage medium and chip
CN116612494A (en) * 2023-05-05 2023-08-18 交通运输部水运科学研究所 Pedestrian target detection method and device in video monitoring based on deep learning
CN117037049A (en) * 2023-10-10 2023-11-10 武汉博特智能科技有限公司 Image content detection method and system based on YOLOv5 deep learning
CN117037049B (en) * 2023-10-10 2023-12-15 武汉博特智能科技有限公司 Image content detection method and system based on YOLOv5 deep learning

Similar Documents

Publication Publication Date Title
CN111552837A (en) Animal video tag automatic generation method based on deep learning, terminal and medium
CN111241947B (en) Training method and device for target detection model, storage medium and computer equipment
CN109492643B (en) Certificate identification method and device based on OCR, computer equipment and storage medium
CN110197146B (en) Face image analysis method based on deep learning, electronic device and storage medium
WO2019128646A1 (en) Face detection method, method and device for training parameters of convolutional neural network, and medium
CN109344727B (en) Identity card text information detection method and device, readable storage medium and terminal
US20190362193A1 (en) Eyeglass positioning method, apparatus and storage medium
US20060222243A1 (en) Extraction and scaled display of objects in an image
CN111832366B (en) Image recognition apparatus and method
CN110222582B (en) Image processing method and camera
Molina-Moreno et al. Efficient scale-adaptive license plate detection system
CN111488943A (en) Face recognition method and device
CN111368632A (en) Signature identification method and device
CN110796039B (en) Face flaw detection method and device, electronic equipment and storage medium
CN112541394A (en) Black eye and rhinitis identification method, system and computer medium
CN110298302B (en) Human body target detection method and related equipment
CN116543261A (en) Model training method for image recognition, image recognition method device and medium
CN111027526A (en) Method for improving vehicle target detection, identification and detection efficiency
CN109034032B (en) Image processing method, apparatus, device and medium
CN112686122B (en) Human body and shadow detection method and device, electronic equipment and storage medium
CN112200218A (en) Model training method and device and electronic equipment
CN110796145B (en) Multi-certificate segmentation association method and related equipment based on intelligent decision
WO2020244076A1 (en) Face recognition method and apparatus, and electronic device and storage medium
US20240087352A1 (en) System for identifying companion animal and method therefor
CN115953744A (en) Vehicle identification tracking method based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination