CN106598226A

CN106598226A - UAV (Unmanned Aerial Vehicle) man-machine interaction method based on binocular vision and deep learning

Info

Publication number: CN106598226A
Application number: CN201611030533.XA
Authority: CN
Inventors: 侯永宏; 叶秀峰; 侯春萍; 刘春源; 陈艳芳
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2016-11-16
Filing date: 2016-11-16
Publication date: 2017-04-26
Anticipated expiration: 2036-11-16
Also published as: CN106598226B

Abstract

The invention relates to a UAV (Unmanned Aerial Vehicle) man-machine interaction method based on binocular vision and deep learning. An embedded image processing platform and a binocular camera are carried on a UAV. The embedded image processing platform is connected to a flight controller through an interface. The UAV communicates with a ground station through the platform. The platform is provided with a Graphics Processing Unit (GPU), a convolutional neural network deep learning algorithm is run and parallel computing for an image is carried out during video capture of the camera. According to the invention, a convolutional neutral network is transplanted to the embedded platform with a dedicated GPU and the run speed becomes faster through parallel computing for speeding up the computation.

Description

It is a kind of based on binocular vision and the unmanned plane man-machine interaction method of deep learning

Technical field

This method belongs to multimedia signal processing field, and in particular to the skill such as computer vision, deep learning, man-machine interaction Art, it is especially a kind of based on binocular vision and the unmanned plane man-machine interaction method of deep learning.

Background technology

Human-computer interaction technology be accompanied by the birth of computer and produce and with computer software and hardware development and by Gradually develop, the appearance of new technique constantly simplifies the flow process of man-machine interaction.In recent years, with the appearance of artificial intelligence technology And development, the continuous progressive and innovation of related software and hardware technology, how to realize that more convenient man-machine interaction becomes study hotspot, Various novel human-machine interaction technologies are continued to bring out.At the same time, the rise of relevant industries such as inexpensive SUAV (UAV) with Popularization, in the urgent need to some, more easily people reduces the threshold that unmanned plane is operated with the interactive controlling mode of unmanned plane so that Unmanned plane is increasingly widely applied.

The man-machine interaction of unmanned plane is shown and mainly controlled using special equipments such as remote control, rocking bar, ground station softwares System.Wherein the most importantly operated using remote control, although the operation difficulty of remote control is sent out with unmanned air vehicle technique Exhibition is lowered significantly, however, the remote control of heaviness still brings very big inconvenience to operator.By mobile phone, computer and specially With the earth station of software sharing in many cases, man-machine interaction is caused to be more convenient from certain depth.In recent years, it is new man-machine Exchange method emerges in an endless stream, and occurs in that and wears special auxiliary equipment, using the measured value or EEG signals of body part motion Simplify the control mode of unmanned plane as control signal.But the control mode for relying on special auxiliary equipment still suffers from flower Fei Gao, using troublesome problem.

In the middle of the unmanned plane popularized on the market, photographic head is all equipped with mostly.Photographic head it is cheap so that vision Solution becomes takes photo by plane, computer vision navigation, the first-selection of vision avoidance.The function of these photographic head is made full use of, is utilized Computer vision, unmanned plane exchange method is carried out with more universality by image recognition gesture.It is existing to be regarded based on computer Feel man-machine interaction method, due to the restriction of software and hardware, tend not to enough interact in scope remote enough, and easily receive To environmental disturbances, it is impossible to apply in outdoor scene.In order to improve the identification of the Gesture Recognition Algorithm in outdoor environment unmanned plane Precision, the present invention carries out gesture identification using the method for deep learning first, the motion of the control unmanned plane that uses gesture, and simplifies nothing Man-machine manipulation difficulty.

Traditional action recognition algorithm computation complexity is high, due to lacking necessary accelerating algorithm, recognition speed is slow, accurately Rate is low.

The content of the invention

The present invention is based on binocular camera, by carrying embedded platform on unmanned plane, constructs one based on calculating The unmanned plane man-machine interactive system of machine vision and deep learning, system is provided can be according to the avigator's for specifying on the ground The heading of gesture control aircraft.

1. hardware system is constituted

The system by carry the unmanned aerial vehicle platform of geographical position acquisition module, embedded image processing platform, photographic head, 4, face station part is constituted.

Unmanned aerial vehicle platform is multi-rotor unmanned aerial vehicle, and unmanned plane is positioned by geographical position acquisition module, flight control Device can control unmanned plane in outdoor autonomous hovering.Embedded image processing platform and photographic head, the system are carried on unmanned plane In photographic head catch for subsequent treatment high-definition picture.

Embedded platform is the platform that can provide enough operational capabilities for image procossing with graphic process unit (GPU). This platform is responsible for the acquisition process and action recognition of image, in the middle of practical application, can be served as by high-performance mobile terminal handler, The system connects flight controller by interface, and thus platform serves as the communication on unmanned plane and ground.Platform carries operating system Operation processing routine.

Earth station is responsible for monitoring the state of four-axle aircraft, for specifying avigator and checking the result of real-time operation, Can be served as by notebook or intelligent terminal.

2. action recognition framework

Action recognition framework mainly includes：Video pre-filtering, the instruction for generating color texture figure and convolutional neural networks model Practice and classify.

1) according to the one-view image of unmanned plane passback, unmanned plane avigator is selected, method is using mouse or touch Screen, body (upper body) region of avigator is irised out to come；

2) selected region is expanded into frame according to person ratio and selects the whole body region of people, while from another In the middle of viewpoint, the person region of respective regions is selected, and extracted frame by frame in the video sequence using track algorithm Go out avigator region and avigator is tracked, and according to the position of setting regions, by whole human body institute Region image segmentation out；

3) Stereo matching, according to the Stereo Matching Algorithm based on Block- matching, by the image between the left and right viewpoint for splitting Matched, the stereopsises figure after being split, the personage comprising avigator and background in the middle of disparity map；

4) disparity map for obtaining is normalized.And use threshold value wiping out background.Obtain clean character image；

5) character image of adjacent two frame is carried out into difference, obtains difference image sequence.

6) according to the picture sequence for generating, successively the difference image using different colours representative not in the same time, is encoded, The character image that the picture of about 2s or so is produced is superimposed as into color texture figure.Using the method for sliding window, take every 5 frames The picture of about 2s or so；

7) gather substantial amounts of, the color texture figure of selected action is done under various circumstances to neutral net by different operating person It is trained.Training is carried out in special purpose workstation.After the completion of training, training parameter is uploaded to into embedded image processing platform.

8) on embedded image processing platform, using the parameter for training to Real-time Collection and generate color texture Figure is classified.

9) on embedded image processing platform, tracking, segmentation, Stereo matching asks for color texture figure, and classification exists respectively Different threads is carried out simultaneously, so as to the disposal ability for substantially utilizing processor.Meanwhile, by about the calculating of image, profit Accelerated with graphic process unit so that processing speed meets the requirement of real-time.

Advantages of the present invention and beneficial effect：

1st, convolutional neural networks are transplanted to and are configured with the embedded platform with dedicated graphics processors (GPU) by the present invention On, accelerated so as to accelerate the speed of service by parallel computation.

2nd, the present invention extracts operator region using target tracking algorism from video sequence, efficiently solves nothing The problems such as man-machine in-flight camera drift and complex background are disturbed, while reducing operand.The method and method for distinguishing phase Relatively, with working range it is wide, accuracy rate is high the characteristics of.

3rd, the present invention will be accelerated about the calculating of image using graphic process unit so that processing speed meets real-time Requirement.

Description of the drawings

Fig. 1 is the hardware system connection diagram of this method；

Fig. 2 is step 2 Stereo matching design sketch in embodiment；

Fig. 3 is the image in embodiment after step 3 wiping out background；

Fig. 4 is color texture figure composition principle figure；

Fig. 5 is the process chart of this method.

Specific embodiment

Below in conjunction with the accompanying drawings and by specific embodiment the invention will be further described, and following examples are descriptive , it is not determinate, it is impossible to which protection scope of the present invention is limited with this.

It is a kind of based on binocular vision and the unmanned plane man-machine interaction method of deep learning, comprise the following steps that：

1) during system start-up, according to video camera display content, the single viewpoint frame shown by earth station selects avigator, profit Avigator is tracked with fast iterative algorithm, is extracted with avigator in the video of high-resolution according to tracking result Centered on low-resolution video sequence.

2) using the Stereo Matching Algorithm based on Block- matching, Stereo matching is carried out to the part of low resolution, this part Stereo matching, is accelerated by graphic process unit.Simultaneously the parameter of this step provides minimax parallax value D_maxWith D_min.It is three-dimensional Matching effect is shown in accompanying drawing 2.

3) result of Stereo matching is processed：The step is mainly used in wiping out background noise.First by three-dimensional Image pixel by pixel normalization after matching somebody with somebody, this step operand is big, is accelerated using graphic process unit (gpu), is calculated using multiple gpu The multithreading of core accelerates to complete.Normalization formula is as follows：

Wherein D_maxAnd D_minThe maximum and minimum value of gray value is represented respectively, and d is the gray value of current pixel.D' takes less In the maximum on the right.

Secondly, with the method for thresholding, background noise is filtered：

Wherein threadhold is threshold value, and the selection of threshold value is relevant with resolution of video camera and camera spacing, needs to pass through Statistics determination is carried out to parallax distribution.In the middle of the experimental demonstration system of the present invention, we determine that value is 225 according to experiment. There was only the character image of people information through the image of thresholding.Image after filtering is shown in accompanying drawing 3.

4) by through doing difference between two frames before and after the character image sequence of thresholding, the video sequence of difference is obtained.It is false The deep video sequence that setting tool has n frames is:d₁,d₂,…,d_n, wherein d_iRepresent the depth map of the i-th frame.Due to picture on depth image The value of vegetarian refreshments represents distance of the location of pixels relative to camera lens, therefore for two adjacent frame depth images, can be by meter The difference of pixel point value of same pixel position is calculated describing action message.Here, adjacent two frame is done the knot drawn after difference Fruit is expressed as：m₁,m₂,…,m_n-1。

5) in order in a figure, with the temporal characteristics for representing action, the present invention is using color come coding depth figure sequence Row, by HSV color spaces change colourity H, with different colors represent difference depth not in the same time.Assume h_maxAnd h_min Represent the span of the colourity in HSV color spaces in experiment.Then in i-th depth map, all difference depth that calculate Location of pixels uses colourity H_iTo be encoded：

6) in additive process, for location of pixels p (x, y), if it gets the m in depth difference_iIt is upper that there is depth Changing value z_i, according to the change in depth value sequence z of the location of pixels in whole video sequence₁,z₂,…,z_iThe depth of maximum can be obtained Degree changing value z_max=z_kSo as to specify final color allocation H of the location of pixels_k.All pixels position on to whole pictures Do the above operation after, can by whole compression of video sequence into a coloury color texture picture, wherein pixel The locus of value describe the space characteristics of action sequence, and the corresponding color value of pixel is the temporal characteristics of action sequence. The colored schematic diagram of synthesis is shown in accompanying drawing 4.

7) after obtaining color texture figure, picture is learnt by convolutional neural networks (CNN) and is classified to complete to move The identification (experiment adopts network structure for Alexnet) of work.

In order to reach the requirement of real-time, present invention utilizes time interval during camera captures video, and it is embedded The parallel processing capability of formula system, parallel computation is carried out while video is caught to image.In image procossing and neutral net Identification process in the middle of, accelerated using graphic process unit.The track algorithm adopted in the present invention is fast in order to improve operation Degree, following range is only limited to operator's face part, in subsequent treatment, further according to tracing area bigger region is intercepted.And depth Figure calculating aspect, employs the algorithm based on Block- matching of speed, and it is per second that Stereo matching frame per second can reach about 16 frames.Finally Generating classification results can be with the speed of 2 frame per second.System software structure is shown in accompanying drawing 5.

The picture that unmanned plane catches often drifts about and rocks for the picture that still camera catches with camera. The present invention needs the corresponding data collection for gathering in different environments and generating according to use environment.The demonstration system of the present invention The training dataset that system is generated, with camera drift, rocks and is walked about with background characters, if the video by acquiring 5 actions It is dry, it is that each action generates about 2000 color texture pictures, in addition, also generating a class comprising 3000 color texture figures Non-controlling instruction.

It is below the experimental result on data set of the invention and explanation：Action gesture is converted to into action command.First Neutral net is trained using large-scale work station and training result is uploaded to into embedded image processing platform.In outdoor environment Under, operator is doing respectively 20 times, 100 to each control instruction at a certain distance totally in the range of unmanned plane 6-13 rice Operational order, period walks about and violate-action with left and right.Test to system shows, in the range of 10m, the standard of system identification Really rate can reach more than 90 percent, and recognition effect is reliable effectively.Recognition result sees attached list 1.

Table 1

In the middle of the gatherer process of the color texture figure of training set, in order to avoid over-fitting cause figure action when Between it is different in size and can not correctly be identified.Color texture figure needs all to be done with the action of different time length to synthesize.Tool Body is to take different frame numbers.In the middle of the demo system of the present invention, the method that we adopt is that, respectively with 30,40,50 frames are length Degree is synthesized.Simultaneously in order to avoid over-fitting situation, also using the method for rotating image and resolution conversion to training data Collection is extended.

Above-described is only the preferred embodiment of the present invention, it is noted that for one of ordinary skill in the art For, on the premise of without departing from inventive concept, some deformations and improvement can also be made, these belong to the protection of the present invention Scope.

Claims

1. a kind of based on binocular vision and the unmanned plane man-machine interaction method of deep learning, it is characterised in that：Take on unmanned plane Embedded image processing platform and binocular camera are carried, the embedded image processing platform connects flight controller by interface, With ground by the Platform communication, platform carries graphic process unit to unmanned plane, runs convolutional neural networks deep learning algorithm, Photographic head carries out parallel computation while catching video to image.

2. according to claim 1 based on binocular vision and the unmanned plane man-machine interaction method of deep learning, its feature exists In：The exchange method is concretely comprised the following steps：

1) according to the one-view image of unmanned plane passback, unmanned plane avigator is selected, using mouse or Touch screen, will be navigated The upper body region of member is irised out and；

2) selected region is expanded into frame according to person ratio and selects the whole body region of people, while from another viewpoint It is central, the person region of respective regions is selected, and extract neck frame by frame in the video sequence using track algorithm Boat person region and avigator is tracked, and according to the position of setting regions, by whole human body location The image segmentation in domain is out；

3) Stereo matching, according to the Stereo Matching Algorithm based on Block- matching, the image between the left and right viewpoint for splitting is carried out Personage comprising avigator and background in the middle of matching, the stereopsises figure after being split, disparity map；

4) disparity map for obtaining is normalized, and uses threshold value wiping out background, obtain clean character image；

5) character image of adjacent two frame is carried out into difference, obtains difference image sequence；

6) according to the picture sequence for generating, successively the difference image using different colours representative not in the same time, is encoded, will be big The character image that the picture of about 2s or so is produced is superimposed as color texture figure；Using the method for sliding window, take about every 5 frames The picture of 2s or so；

7) gather the substantial amounts of color texture figure for doing selected action under various circumstances by different operating person is carried out to neutral net Training, training is carried out in special purpose workstation, after the completion of training, training parameter is uploaded to into embedded image processing platform；

8) on embedded image processing platform, using the parameter for training to Real-time Collection and generate color texture figure enter Row classification, identification；

9) action command that will identify that is sent to flight controller, instructs unmanned plane to move.

3. according to claim 2 based on binocular vision and the unmanned plane man-machine interaction method of deep learning, its feature exists In：The step 2) following range of track algorithm is only limited to operator's face part.

4. according to claim 2 based on binocular vision and the unmanned plane man-machine interaction method of deep learning, its feature exists In：The step 3) Stereo Matching Algorithm by graphic process unit accelerate.

5. according to claim 2 based on binocular vision and the unmanned plane man-machine interaction method of deep learning, its feature exists In：The step 4) normalization by graphic process unit accelerate.

6. according to claim 2 based on binocular vision and the unmanned plane man-machine interaction method of deep learning, its feature exists In：The step 8) classify, recognize by graphic process unit acceleration.

7. according to claim 2 based on binocular vision and the unmanned plane man-machine interaction method of deep learning, its feature exists In：The step 7) train the color texture figure of collection respectively with 30,40,50 frames are synthesized for length.

8. according to claim 2 based on binocular vision and the unmanned plane man-machine interaction method of deep learning, its feature exists In：The step 7) training dataset is extended using the method for rotating image and resolution conversion.