CN108846364B - FPGA-based video feature detection method and system
- Publication number
- CN108846364B (application CN201810653311.6A)
- Authority
- CN
- China
- Prior art keywords
- feature
- surf
- feature points
- video
- sift
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/46—Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
- G06V10/462—Salient features, e.g. scale invariant feature transforms [SIFT]
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Biophysics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Life Sciences & Earth Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Image Analysis (AREA)
Abstract
The invention provides a video feature detection method based on an FPGA (field programmable gate array), comprising the following steps: selecting a feature point cluster of a video stream in a video library; training the feature point cluster to obtain a classification network; and realizing the classification network through FPGA solidification so as to perform video feature comparison. A neural network implemented on the FPGA approximates the SIFT and SURF features and thereby realizes video feature detection. The traditional SIFT and SURF algorithms perform comparison by searching a feature library, whereas the invention completes both feature generation and comparison on the FPGA through a neural network, removing the feature-library lookup step and improving comparison efficiency. By optimizing the SIFT and SURF algorithms with deep learning, the invention suits large-scale system applications; by accelerating the computation with FPGA hardware, it avoids querying a massive feature library and improves inspection efficiency.
Description
Technical Field
The invention relates to the technical field of video processing, in particular to a video feature detection method and system based on an FPGA.
Background
With the popularization of the internet, a large number of video applications are active online. Real-time video detection under high throughput is therefore a necessary means of managing such applications.
For massive video detection under high throughput, traditional video detection methods place severe demands on computing capacity and network transmission capacity. Taking high-definition video as an example (vertical resolution 720p or 1080i), each processing unit should detect no fewer than 150 videos in real time. The standard resolution of 720p is 1280 × 720. After the received video is decompressed, the original video is restored with a color depth of typically 32 bits (8 bits for each of the red, green and blue primaries plus 8 bits of luminance). Taking the frame rate as 5 (human vision needs roughly 20 frames per second, but the rate can reasonably be reduced for detection), one second of 150-channel video amounts to:
1280 × 720 × 32 bit × 5 × 150 = 22,118,400,000 bits; converted to bytes, 22,118,400,000 / 8 = 2,764,800,000 bytes, i.e. about 2.76 GB.
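The figure can be reproduced with a few lines of Python (pure arithmetic; all parameters are exactly those stated above):

```python
# Per-second data volume for 150 channels of decompressed 720p video
# at 5 frames per second and 32-bit color depth.
width, height = 1280, 720
bits_per_pixel = 32
frames_per_second = 5
channels = 150

bits = width * height * bits_per_pixel * frames_per_second * channels
print(bits)            # 22118400000 bits
print(bits // 8)       # 2764800000 bytes
print(bits / 8 / 1e9)  # ~2.76 GB per second
```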
In addition, when an algorithm such as SIFT performs operations such as building a Gaussian pyramid, the volume of intermediate data grows by more than a factor of 10. Data at this scale imposes strict requirements on computing power and network transmission capability; server clusters with GPU processing cannot achieve whole-process pipelining and consume considerable power. Furthermore, once features are generated they must be compared against a massive feature library, and when the number of features exceeds one billion the query cost is enormous, so the traditional scheme is difficult to adopt in a large-scale system.
Disclosure of Invention
The invention aims to provide a video feature detection method and system based on an FPGA.
In one aspect, an embodiment of the present invention provides a video feature detection method based on an FPGA, including the following steps:
selecting a feature point cluster of a video stream in a video library;
training the feature point cluster to obtain a classification network;
and realizing the classification network through FPGA solidification so as to perform video feature comparison.
In the FPGA-based video feature detection method of the present invention, the step of selecting a feature point cluster of a video stream in a video library includes:
extracting a plurality of key frames of the video stream;
generating corresponding SIFT feature points and SURF feature points for each key frame;
comparing the SIFT feature points and the SURF feature points within the same frame image, and selecting the set of pixel points where the SIFT feature points and the SURF feature points coincide;
and carrying out cluster classification and labeling on the pixel point set to generate the feature point cluster.
In the FPGA-based video feature detection method of the present invention, in the step of generating corresponding SIFT feature points and SURF feature points for each of the key frames, the SIFT feature points are generated by:
performing scale space extreme point detection on the key frame to determine SIFT feature points of the key frame;
and accurately positioning the SIFT feature points, and determining the pixel coordinates of the SIFT feature points.
In the FPGA-based video feature detection method of the present invention, in the step of generating corresponding SIFT feature points and SURF feature points for each of the key frames, the SURF feature points are generated by:
constructing a Hessian matrix;
generating a scale space;
determining the SURF feature points using non-maxima suppression;
and accurately positioning the SURF feature points, and determining the pixel coordinates of the SURF feature points.
In the method for detecting video features based on the FPGA of the present invention, the step of training the feature point cluster to obtain the classification network includes:
constructing a classification network architecture based on a Darknet network architecture;
and training by taking the key frames corresponding to the pixel points in the feature point cluster set as the training set, to obtain the weights of the classification network.
Correspondingly, the invention also provides a video feature detection system based on the FPGA, which comprises:
the feature point cluster generating module is used for selecting a feature point cluster of a video stream in a video library;
the classification network generating module is used for training the feature point cluster set to obtain a classification network;
and the video feature comparison module is used for realizing the classification network through FPGA (field programmable gate array) solidification so as to perform video feature comparison.
In the video feature detection system based on FPGA of the present invention, the feature point cluster generating module includes:
an extraction unit for extracting a plurality of key frames of the video stream;
a feature point generating unit, configured to generate, for each key frame, a corresponding SIFT feature point and SURF feature point;
the comparison unit is used for comparing the SIFT feature points and the SURF feature points within the same frame image and selecting the set of pixel points where the SIFT feature points and the SURF feature points coincide;
and the feature point cluster generating unit is used for performing cluster classification and labeling on the pixel point set to generate the feature point cluster.
In the FPGA-based video feature detection system of the present invention, the feature point generating unit includes a SIFT feature point generating subunit configured to:
performing scale space extreme point detection on the key frame to determine SIFT feature points of the key frame;
and accurately positioning the SIFT feature points, and determining the pixel coordinates of the SIFT feature points.
In the FPGA-based video feature detection system of the present invention, the feature point generating unit includes a SURF feature point generating subunit configured to:
constructing a Hessian matrix;
generating a scale space;
determining the SURF feature points using non-maxima suppression;
and accurately positioning the SURF feature points, and determining the pixel coordinates of the SURF feature points.
In the FPGA-based video feature detection system of the present invention, the classification network generation module includes:
a classification network architecture construction unit, configured to construct an architecture of the classification network based on a Darknet network architecture;
and the training unit is used for training by taking the key frames corresponding to the pixel points in the feature point cluster set as the training set, to obtain the weights of the classification network.
The embodiment of the invention has the following beneficial effects: a feature point cluster of a video stream in a video library is selected; the feature point cluster is trained to obtain a classification network; and the classification network is realized through FPGA solidification to perform video feature comparison. A neural network implemented on the FPGA approximates the SIFT and SURF features and thereby realizes video feature detection. The traditional SIFT and SURF algorithms perform comparison by searching a feature library, whereas the invention completes both feature generation and comparison on the FPGA through a neural network, removing the feature-library lookup step and improving comparison efficiency. By optimizing the SIFT and SURF algorithms with deep learning, the invention suits large-scale system applications; by accelerating the computation with FPGA hardware, it avoids querying a massive feature library, improves inspection efficiency, and thereby achieves real-time, accurate internet video feature detection.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed in describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those skilled in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a method for detecting video features based on an FPGA according to an embodiment of the present invention;
fig. 2 is a flowchart of step S1 shown in fig. 1;
fig. 3 is a flowchart of step S2 shown in fig. 1;
fig. 4 is a schematic diagram of a video feature detection system based on FPGA according to a second embodiment of the present invention;
FIG. 5 is a schematic diagram of the feature point cluster generation module shown in FIG. 4;
fig. 6 is a schematic diagram of the classification network generation module shown in fig. 4.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example one
The embodiment provides a video feature detection method based on an FPGA. Referring to fig. 1, the method for detecting video features based on FPGA includes the following steps:
step S1: selecting a feature point cluster of a video stream in a video library;
at present, limited by computing power, video features are usually extracted from key frame image features of a video, and feature comparison is then implemented on the basis of the extracted features. The traditional image features are classified into global features and local features, the global features refer to the overall attributes of the image, and common global features include color features, texture features and shape features, such as intensity histograms and the like. The local features are features extracted from local regions of the image, and include edges, corners, lines, curves, regions with special attributes, and the like. Common local features include two main description modes, namely a corner class and a region class. Compared with global image features such as line features, texture features and structural features, the local image features have the characteristics of abundant content in the image, small correlation degree among the features, no influence on detection and matching of other features due to disappearance of partial features under the shielding condition and the like.
Among the many local feature descriptors, SIFT and SURF are widely applied. The core concerns of local image feature description are invariance (robustness) and distinguishability. Local descriptors are typically expected to handle various image transformations robustly, so invariance is the first consideration when constructing and designing a feature descriptor. In wide-baseline matching, the descriptor must be invariant to viewpoint change, scale change, rotation and so on; in shape recognition and object retrieval, invariance to shape must be considered. However, distinguishability often conflicts with invariance: a descriptor with many invariances is somewhat weaker at distinguishing local image contents, while a descriptor that distinguishes different local contents very easily tends to have low robustness. Multiple methods therefore need to be used together; specifically, the present application selects SIFT and SURF features.
Therefore, as shown in fig. 2, step S1 includes:
step S11: extracting a plurality of key frames of the video stream;
step S12: generating corresponding SIFT feature points and SURF feature points for each key frame;
specifically, the SIFT algorithm and the SURF algorithm are both large in calculation amount, so in order to increase the processing speed, in the present application, the feature point comparison is realized by pixel coordinates. Therefore, in the present application, the clipped SIFT and SURF algorithms are implemented, and neither feature point description is implemented.
Optionally, the SIFT feature points are generated by the following steps:
performing scale space extreme point detection on the key frame to determine SIFT feature points of the key frame;
and accurately positioning the SIFT feature points, and determining the pixel coordinates of the SIFT feature points.
Optionally, the SURF feature points are generated by:
constructing a Hessian matrix;
generating a scale space;
determining the SURF feature points using non-maxima suppression;
and accurately positioning the SURF feature points, and determining the pixel coordinates of the SURF feature points.
Step S13: comparing the SIFT feature points and the SURF feature points within the same frame image, and selecting the set of pixel points where the SIFT feature points and the SURF feature points coincide;
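As an illustration of steps S12 and S13, the sketch below detects SIFT and SURF keypoints with OpenCV and keeps only the pixel coordinates where both detectors fire. cv2.SIFT_create is in mainline OpenCV, while SURF lives in the opencv-contrib xfeatures2d module and may be absent from builds compiled without the non-free algorithms; exact-coordinate coincidence after rounding is an assumption, since the patent does not state a tolerance. Matching the truncated algorithms described above, only detection is run and no descriptors are computed.

```python
# Sketch of steps S12/S13: SIFT and SURF keypoint detection (no descriptors)
# followed by selection of coinciding pixel coordinates. Requires
# opencv-contrib-python for SURF; hessianThreshold=400 is an assumed value.
import cv2
import numpy as np

def coincident_pixels(gray_frame):
    sift = cv2.SIFT_create()
    surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
    sift_pts = {tuple(np.round(kp.pt).astype(int))
                for kp in sift.detect(gray_frame, None)}
    surf_pts = {tuple(np.round(kp.pt).astype(int))
                for kp in surf.detect(gray_frame, None)}
    return sorted(sift_pts & surf_pts)  # pixel coordinates found by both
```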
step S14: and carrying out cluster classification and labeling on the pixel point set to generate the feature point cluster.
Specifically, the K-means method is adopted to classify and label the coincident pixel points according to a 32 × 32 standard; if the number of clusters is large, the top 15, ordered by the number of coincident feature points they contain, are retained (see the sketch below).
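A minimal scikit-learn sketch of this step follows; reading the "32 × 32 standard" as 32 initial K-means clusters is an interpretation rather than a confirmed specification, and scikit-learn itself is an assumed implementation:

```python
# Hedged sketch of step S14: K-means clustering of the coincident pixel
# coordinates, keeping the 15 clusters that contain the most points.
from collections import Counter
import numpy as np
from sklearn.cluster import KMeans

def label_top_clusters(points_xy, n_clusters=32, keep=15):
    pts = np.asarray(points_xy, dtype=float)           # shape (N, 2)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(pts)
    top = [c for c, _ in Counter(labels).most_common(keep)]
    return {int(c): pts[labels == c] for c in top}     # cluster id -> points
```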
Step S2: training the feature point cluster to obtain a classification network;
according to the universal approximation theorem (universal approximation term), a feedforward neural network can approximate any Borel measurable function from one finite dimensional space to another finite dimensional space with any arbitrary precision if it has a linear output layer and at least one hidden layer of an activation function (e.g., a logical sigmoid activation function) with any kind of "squeeze" property, as long as a sufficient number of hidden units are given to the network. From this theorem, it can be seen that the image shallow feature can actually be realized by some kind of convolutional neural network. Both the SIFT and SURF algorithms are shallow features and thus can be approximated by neural networks.
Specifically, as shown in fig. 3, step S2 includes:
step S21: constructing a classification network architecture based on a Darknet network architecture;
step S22: and training by taking the key frame corresponding to the pixel point in the feature point cluster set as a training set to obtain the weight of the classification network.
Specifically, a 19-layer neural network structure is constructed on the basis of the Darknet network architecture, and the key frames in the video library labeled with coincident pixel point clusters are used as the training set to train the classification network. Weight training is performed on a GPU, which makes parameter tuning convenient; after the parameters are fixed, they are ported to the FPGA. A hedged sketch of such a network is given below.
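The patent gives no layer table for the 19-layer structure; the PyTorch sketch below follows the publicly documented Darknet-19 design (18 convolutions plus a 1 × 1 classifier convolution), so every channel count is taken from that public design rather than from the patent, and tying num_classes to the 15 retained clusters is likewise an assumption.

```python
# Darknet-19-style classifier sketch in PyTorch (assumed correspondence to
# the patent's "19-layer structure based on Darknet").
import torch
import torch.nn as nn

def conv(c_in, c_out, k):
    # conv + batch norm + leaky ReLU, the standard Darknet building block
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
        nn.BatchNorm2d(c_out),
        nn.LeakyReLU(0.1, inplace=True),
    )

class Darknet19(nn.Module):
    def __init__(self, num_classes=15):  # assumed: one class per retained cluster
        super().__init__()
        self.features = nn.Sequential(
            conv(3, 32, 3), nn.MaxPool2d(2),
            conv(32, 64, 3), nn.MaxPool2d(2),
            conv(64, 128, 3), conv(128, 64, 1), conv(64, 128, 3), nn.MaxPool2d(2),
            conv(128, 256, 3), conv(256, 128, 1), conv(128, 256, 3), nn.MaxPool2d(2),
            conv(256, 512, 3), conv(512, 256, 1), conv(256, 512, 3),
            conv(512, 256, 1), conv(256, 512, 3), nn.MaxPool2d(2),
            conv(512, 1024, 3), conv(1024, 512, 1), conv(512, 1024, 3),
            conv(1024, 512, 1), conv(512, 1024, 3),
        )
        self.classifier = nn.Conv2d(1024, num_classes, 1)  # 19th conv layer

    def forward(self, x):
        x = self.classifier(self.features(x))
        return torch.flatten(nn.functional.adaptive_avg_pool2d(x, 1), 1)

logits = Darknet19()(torch.randn(1, 3, 224, 224))  # -> shape (1, 15)
```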
Step S3: and realizing the classification network by utilizing FPGA solidification so as to compare the video characteristics.
An FPGA (Field-Programmable Gate Array) is a further development of programmable devices such as the PAL, GAL and CPLD. As a semi-custom circuit in the application-specific integrated circuit (ASIC) field, it overcomes both the inflexibility of fully custom circuits and the limited gate count of earlier programmable devices. FPGAs are generally slower than ASICs and need more silicon area for the same function, but they can be brought up quickly and, being hardware-reconfigurable, can serve as a small-batch replacement for application-specific chips. Therefore, after the classification network is generated, it is solidified into the FPGA to improve processing speed, and video feature comparison runs in an FPGA pipeline, which removes the feature-library lookup step and improves comparison efficiency.
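The patent does not describe how the trained weights are solidified into the FPGA. One common route, shown here purely as an illustrative assumption, is to quantize the GPU-trained floating-point weights to fixed-point integers before loading them into FPGA block RAM:

```python
# Illustrative fixed-point quantization sketch (an assumption, not the
# patent's stated procedure): map float32 weights to signed 8-bit values
# with a power-of-two scale, convenient for shift-based FPGA arithmetic.
import numpy as np

def quantize_fixed_point(weights, bits=8):
    max_abs = float(np.max(np.abs(weights)))
    frac_bits = bits - 1 - int(np.ceil(np.log2(max_abs + 1e-12)))
    scale = 2.0 ** frac_bits
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    q = np.clip(np.round(weights * scale), lo, hi).astype(np.int8)
    return q, frac_bits  # dequantize later as q / 2**frac_bits
```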
In summary, a feature point cluster of a video stream in a video library is selected; the feature point cluster is trained to obtain a classification network; and the classification network is realized through FPGA solidification to perform video feature comparison. A neural network implemented on the FPGA approximates the SIFT and SURF features and thereby realizes video feature detection. The traditional SIFT and SURF algorithms perform comparison by searching a feature library, whereas the invention completes both feature generation and comparison on the FPGA through a neural network, removing the feature-library lookup step and improving comparison efficiency. By optimizing the SIFT and SURF algorithms with deep learning, the invention suits large-scale system applications; by accelerating the computation with FPGA hardware, it avoids querying a massive feature library, improves inspection efficiency, and thereby achieves real-time, accurate internet video feature detection.
Example two
The embodiment provides a video feature detection system based on an FPGA. Referring to fig. 4, the FPGA-based video feature detection system includes:
a feature point cluster generating module 10, configured to select a feature point cluster of a video stream in a video library;
specifically, as described above, SIFT and SURF features are selected in the present application. Therefore, as shown in fig. 5, the feature point cluster generating module 10 includes:
an extracting unit 110, configured to extract a plurality of key frames of the video stream;
a feature point generating unit 120, configured to generate, for each key frame, a corresponding SIFT feature point and SURF feature point;
the comparison unit 130 is configured to compare the SIFT feature points and the SURF feature points in the same frame of image, and select a pixel point set where the SIFT feature points and the SURF feature points coincide;
a feature point cluster generating unit 140, configured to perform cluster classification on the pixel point set and label the pixel point set to generate the feature point cluster.
Specifically, the SIFT and SURF algorithms are both computationally expensive, so to increase processing speed the present application performs feature point comparison by pixel coordinates. Accordingly, truncated (clipped) versions of the SIFT and SURF algorithms are implemented, and neither performs feature point description (descriptor generation). Therefore, the feature point generating unit includes a SIFT feature point generating subunit and a SURF feature point generating subunit.
Further, the SIFT feature point generating subunit is configured to:
performing scale space extreme point detection on the key frame to determine SIFT feature points of the key frame;
and accurately positioning the SIFT feature points, and determining the pixel coordinates of the SIFT feature points.
Further, the SURF feature point generating subunit is configured to:
constructing a Hessian matrix;
generating a scale space;
determining the SURF feature points using non-maxima suppression;
and accurately positioning the SURF feature points, and determining the pixel coordinates of the SURF feature points.
A classification network generating module 20, configured to train the feature point cluster to obtain a classification network;
as described above, both the SIFT and SURF algorithms are shallow features and thus can be approximated by a neural network. Therefore, as shown in fig. 6, the classification network generating module 20 includes:
a classification network architecture construction unit 210, configured to construct an architecture of the classification network based on a Darknet network architecture;
and the training unit 220, configured to train by taking the key frames corresponding to the pixel points in the feature point cluster set as the training set, so as to obtain the weights of the classification network.
Specifically, a 19-layer neural network structure is constructed on the basis of the Darknet network architecture, and the key frames in the video library labeled with coincident pixel point clusters are used as the training set to train the classification network. Weight training is performed on a GPU, which makes parameter tuning convenient; after the parameters are fixed, they are ported to the FPGA.
And the video feature comparison module 30 is used for realizing the classification network through FPGA (field programmable gate array) solidification so as to perform video feature comparison.
Specifically, after the classification network is generated, it is solidified into the FPGA to improve processing speed, and video feature comparison runs in an FPGA pipeline, which removes the feature-library lookup step and improves comparison efficiency.
In summary, a feature point cluster of a video stream in a video library is selected; the feature point cluster is trained to obtain a classification network; and the classification network is realized through FPGA solidification to perform video feature comparison. A neural network implemented on the FPGA approximates the SIFT and SURF features and thereby realizes video feature detection. The traditional SIFT and SURF algorithms perform comparison by searching a feature library, whereas the invention completes both feature generation and comparison on the FPGA through a neural network, removing the feature-library lookup step and improving comparison efficiency. By optimizing the SIFT and SURF algorithms with deep learning, the invention suits large-scale system applications; by accelerating the computation with FPGA hardware, it avoids querying a massive feature library, improves inspection efficiency, and thereby achieves real-time, accurate internet video feature detection.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
While the invention has been described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (2)
1. A video feature detection method based on FPGA is characterized by comprising the following steps:
extracting a plurality of key frames of a video stream;
generating corresponding SIFT feature points and SURF feature points for each key frame;
comparing the SIFT feature points and the SURF feature points within the same frame image, and selecting the set of pixel points where the SIFT feature points and the SURF feature points coincide;
performing cluster classification on the pixel point set and labeling to generate a feature point cluster;
training the feature point cluster to obtain a classification network;
constructing a classification network architecture based on a Darknet network architecture;
training by taking a key frame corresponding to a pixel point in the feature point cluster set as a training set to obtain the weight of the classification network;
realizing the classification network through FPGA (field programmable gate array) solidification to perform video feature comparison;
in the step of generating corresponding SIFT feature points and SURF feature points for each of the key frames, the SIFT feature points are generated by:
performing scale space extreme point detection on the key frame to determine SIFT feature points of the key frame;
accurately positioning the SIFT feature points, and determining pixel coordinates of the SIFT feature points;
in the step of generating corresponding SIFT feature points and SURF feature points for each of the key frames, the SURF feature points are generated by:
constructing a Hessian matrix;
generating a scale space;
determining the SURF feature points using non-maxima suppression;
and accurately positioning the SURF feature points, and determining the pixel coordinates of the SURF feature points.
2. A video feature detection system based on FPGA is characterized by comprising:
the feature point cluster generating module is used for selecting a feature point cluster of a video stream in a video library;
the classification network generating module is used for training the feature point cluster set to obtain a classification network;
the video feature comparison module is used for realizing the classification network through FPGA (field programmable gate array) solidification so as to perform video feature comparison;
the feature point cluster generating module includes:
an extraction unit for extracting a plurality of key frames of a video stream;
a feature point generating unit, configured to generate, for each key frame, a corresponding SIFT feature point and SURF feature point;
the comparison unit is used for comparing the SIFT feature points and the SURF feature points within the same frame image and selecting the set of pixel points where the SIFT feature points and the SURF feature points coincide;
the feature point cluster generating unit is used for performing cluster classification and labeling on the pixel point set to generate a feature point cluster;
the feature point generating unit comprises a SIFT feature point generating subunit, configured to:
performing scale space extreme point detection on the key frame to determine SIFT feature points of the key frame;
accurately positioning the SIFT feature points, and determining pixel coordinates of the SIFT feature points;
the feature point generating unit includes a SURF feature point generating subunit configured to:
constructing a Hessian matrix;
generating a scale space;
determining the SURF feature points using non-maxima suppression;
accurately positioning the SURF feature points, and determining pixel coordinates of the SURF feature points;
the classification network generation module includes:
a classification network architecture construction unit, configured to construct an architecture of the classification network based on a Darknet network architecture;
and the training unit is used for training by taking the key frames corresponding to the pixel points in the feature point cluster set as a training set to obtain the weight of the classification network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810653311.6A CN108846364B (en) | 2018-06-22 | 2018-06-22 | FPGA-based video feature detection method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810653311.6A CN108846364B (en) | 2018-06-22 | 2018-06-22 | FPGA-based video feature detection method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN108846364A CN108846364A (en) | 2018-11-20 |
CN108846364B (en) | 2022-05-03
Family
ID=64203093
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810653311.6A (granted as CN108846364B, active) | FPGA-based video feature detection method and system | 2018-06-22 | 2018-06-22
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108846364B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113168372B (en) * | 2019-01-10 | 2023-12-26 | Alibaba Group Holding Limited | System and method for providing database acceleration using Programmable Logic Devices (PLDs) |
CN111860781B (en) * | 2020-07-10 | 2024-06-28 | Fengyi Technology (Shanghai) Co., Ltd. | Convolutional neural network feature decoding system based on FPGA |
CN111832720B (en) * | 2020-09-21 | 2020-12-29 | University of Electronic Science and Technology of China | Configurable neural network reasoning and online learning fusion calculation circuit |
CN117630045A (en) * | 2024-01-17 | 2024-03-01 | Beijing Jinghanyu Electronic Engineering Technology Co., Ltd. | Chip screening and detection preventing system based on machine vision |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105718890A (en) * | 2016-01-22 | 2016-06-29 | Peking University | Method for detecting specific videos based on convolution neural network
CN106708949A (en) * | 2016-11-25 | 2017-05-24 | Chengdu 30Kaitian Communication Industry Co., Ltd. | Identification method of harmful content of video
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10402697B2 (en) * | 2016-08-01 | 2019-09-03 | Nvidia Corporation | Fusing multilayer and multimodal deep neural networks for video classification |
- 2018-06-22: application CN201810653311.6A filed in China; granted as CN108846364B (status: Active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105718890A (en) * | 2016-01-22 | 2016-06-29 | Peking University | Method for detecting specific videos based on convolution neural network
CN106708949A (en) * | 2016-11-25 | 2017-05-24 | Chengdu 30Kaitian Communication Industry Co., Ltd. | Identification method of harmful content of video
Non-Patent Citations (3)
Title |
---|
Medical Image Stitching Using Hybrid Of Sift & Surf Techniques; Savita Singla et al.; International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE); 2014-08-31; Vol. 3, No. 8; pp. 838-842 *
Design and Implementation of an FPGA-based Deep Learning Accelerator; Yu Qi; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly), Automation Technology; 2016-09-15; No. 09; p. I140-49 *
Research on Face Recognition Based on Convolutional Neural Networks; Wang Shuangyin; China Master's Theses Full-text Database, Information Science and Technology Series (Monthly), Computer Software and Applications; 2018-03-15; No. 03; p. I138-2030 *
Also Published As
Publication number | Publication date |
---|---|
CN108846364A (en) | 2018-11-20 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108846364B (en) | FPGA-based video feature detection method and system | |
Chen et al. | EF-Net: A novel enhancement and fusion network for RGB-D saliency detection | |
CN104376105B (en) | The Fusion Features system and method for image low-level visual feature and text description information in a kind of Social Media | |
Chu et al. | Image Retrieval Based on a Multi‐Integration Features Model | |
CN110335233B (en) | Highway guardrail plate defect detection system and method based on image processing technology | |
Gan et al. | Video object forgery detection algorithm based on VGG-11 convolutional neural network | |
Li et al. | Real-time video-based smoke detection with high accuracy and efficiency | |
Wang et al. | A real-time multi-face detection system implemented on FPGA | |
Amisse et al. | Fine-tuning deep learning models for pedestrian detection | |
Jemilda et al. | Moving object detection and tracking using genetic algorithm enabled extreme learning machine | |
Zheng et al. | Feature enhancement for multi-scale object detection | |
Tseng et al. | Person retrieval in video surveillance using deep learning–based instance segmentation | |
Dahirou et al. | Motion Detection and Object Detection: Yolo (You Only Look Once) | |
Xiang et al. | Crowd density estimation method using deep learning for passenger flow detection system in exhibition center | |
Qin et al. | Video scene text frames categorization for text detection and recognition | |
Zhu et al. | Chinese-style plate recognition based on artificial neural network and statistics | |
Ke et al. | Vehicle logo recognition with small sample problem in complex scene based on data augmentation | |
CN117953581A (en) | Method and device for identifying actions, electronic equipment and readable storage medium | |
Abdullah et al. | Official logo recognition based on multilayer convolutional neural network model | |
Dávila-Rodríguez et al. | Decision-tree based pixel classification for real-time citrus segmentation on FPGA | |
CN108122011B (en) | Target tracking method and system based on multiple invariance mixtures | |
Mao et al. | An image authentication technology based on depth residual network | |
Kong | SIFT Feature‐Based Video Camera Boundary Detection Algorithm | |
RastegarSani et al. | Playfield extraction in soccer video based on Lab color space classification | |
Yang | [Retracted] Sports Video Athlete Detection Based on Associative Memory Neural Network |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |