CN109840598A - Method and device for establishing a deep learning network model - Google Patents

Method and device for establishing a deep learning network model

Info

Publication number: CN109840598A (granted as CN109840598B)
Authority: CN (China)
Application number: CN201910352498.0A
Other languages: Chinese (zh)
Inventor: 陈海波 (Chen Haibo)
Assignee (original and current): DeepBlue AI Chips Research Institute Jiangsu Co Ltd
Application filed by DeepBlue AI Chips Research Institute Jiangsu Co Ltd
Priority: CN201910352498.0A
Legal status: Granted, Active
Prior art keywords: video frame, video, feature, network model, deep learning


Abstract

The application discloses a method and device for establishing a deep learning network model, belonging to the technical field of image processing. The method comprises: inputting each video frame in an acquired video stream into a deep learning network model for feature extraction; determining the difference between two video frames according to the extracted feature of the video frame and the feature of the previous, adjacent video frame; if the difference is determined not to be within a preset range, adjusting the parameters of the deep learning network model and returning to the feature extraction step; and, when the difference between every two adjacent video frames in the video stream is within the preset range, establishing the deep learning network model. Because the feature of the adjacent previous video frame is taken into account when the feature of each video frame is extracted, and the feature difference between the two is required to remain stable within a preset range, the accumulated error between video frames can be reduced as far as possible, which improves accuracy when the features of the video frames are used to determine the video acquisition environment.

Description

Method and device for establishing a deep learning network model
Technical field
This application relates to the field of image processing, and in particular to a method and device for establishing a deep learning network model.
Background
With the rapid development of artificial intelligence, robots are becoming more and more common. For a mobile robot, determining its pose has always been a key technology for free movement.
In the prior art, to determine the pose of a robot, the video stream captured by the robot is first obtained; the features of each video frame in the stream are then analyzed to map the acquisition environment of the video stream, and the pose of the robot is determined in combination with data collected by an inertial measurement unit (IMU) mounted on the robot. On the whole, when the robot travels a short distance, for example less than 20 meters, this scheme can determine the pose fairly well. When the travel distance is longer, however, the yaw of the robot and the surrounding environment cannot be predicted in advance, so a large amount of accumulated error inevitably arises when the acquisition environment of the video stream is determined from the features of the video stream; this accumulated error propagates into the subsequent determination of the robot pose, and the finally determined pose is therefore inaccurate.
Summary of the invention
The embodiments of the present application provide a method and device for establishing a deep learning network model, in order to reduce the accumulated error between video frames in a video stream and to improve accuracy when the features of the video stream are used to determine the video acquisition environment.
In a first aspect, an embodiment of the present application provides a method for establishing a deep learning network model, comprising:
for each video frame in an acquired video stream, inputting the video frame into a deep learning network model for feature extraction;
determining the difference between two video frames according to the extracted feature of the video frame and the feature of the previous video frame adjacent to it;
if the difference is determined not to be stable within a preset range, adjusting the parameters of the deep learning network model and returning to the step of inputting the video frame into the deep learning network model for feature extraction;
when the difference between every two adjacent video frames in the video stream is determined to be stable within the preset range, determining the current deep learning network model as the established deep learning network model.
In the embodiment of the present application, each video frame in the acquired video stream is input into the deep learning network model for feature extraction; the difference between two video frames is then determined according to the extracted feature of the video frame and the feature of the previous video frame adjacent to it. If the difference is determined not to be stable within the preset range, the parameters of the deep learning network model are adjusted and the process returns to the step of inputting the video frame into the model for feature extraction, until the difference between every two adjacent video frames in the video stream is stable within the preset range, at which point the current deep learning network model is determined as the established deep learning network model. In this scheme, when the deep learning network extracts the feature of each video frame other than the first frame in the video stream, the feature of the previous adjacent video frame is taken into account, and the feature difference between the two is required to stay stable within the preset range. In this way, the accumulated error between video frames can be reduced as far as possible, which improves accuracy when the features of the individual video frames in the video stream are used to determine the video acquisition environment.
In a second aspect, an embodiment of the present application provides a device for establishing a deep learning network model, comprising:
a feature extraction module, configured to input each video frame in an acquired video stream into a deep learning network model for feature extraction;
a difference determining module, configured to determine the difference between two video frames according to the extracted feature of the video frame and the feature of the previous video frame adjacent to it;
an adjusting module, configured to, if the difference is determined not to be stable within a preset range, adjust the parameters of the deep learning network model and return to the step of inputting the video frame into the model for feature extraction;
a determining module, configured to, when the difference between every two adjacent video frames in the video stream is determined to be stable within the preset range, determine the current deep learning network model as the established deep learning network model.
In a third aspect, an embodiment of the present application provides a method for determining a robot pose, comprising:
inputting a currently acquired video frame into an established deep learning network model for feature extraction;
comparing the extracted feature of the video frame one by one with the features of the N most recently acquired key frames, and determining the video frame to be a key frame if the feature overlap ratio between the video frame and every key frame is determined to be below a preset overlap ratio, where the initial N key frames are the first N video frames acquired and N is an integer greater than zero;
sparsifying a target video stream according to the N most recently acquired key frames, the target video stream consisting of the video frame and the N preceding video frames adjacent to it;
determining the pose of the robot according to the sparsified target video stream and the inertial measurement unit (IMU) data corresponding to each video frame in the target video stream.
In a fourth aspect, an embodiment of the present application provides a device for determining a robot pose, comprising:
a feature extraction module, configured to input a currently acquired video frame into an established deep learning network model for feature extraction;
a key frame determining module, configured to compare the extracted feature of the video frame one by one with the features of the N most recently acquired key frames, and to determine the video frame to be a key frame if the feature overlap ratio between the video frame and every key frame is determined to be below a preset overlap ratio, where the initial N key frames are the first N video frames acquired and N is an integer greater than zero;
a sparsification module, configured to sparsify a target video stream according to the N most recently acquired key frames, the target video stream consisting of the video frame and the N preceding video frames adjacent to it;
a pose determining module, configured to determine the pose of the robot according to the sparsified target video stream and the inertial measurement unit (IMU) data corresponding to each video frame in the target video stream.
In a fifth aspect, an embodiment of the present application provides an electronic device, comprising at least one processor and a memory communicatively connected to the at least one processor, wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform any of the above methods.
In a sixth aspect, an embodiment of the present application provides a computer-readable medium storing computer-executable instructions, the computer-executable instructions being used for performing any of the above methods.
In addition, for the technical effects brought by any design in the second to fourth aspects, reference may be made to the technical effects of the corresponding implementations in the first aspect; details are not repeated here.
These and other aspects of the application will be more readily apparent from the following description.
Brief description of the drawings
The drawings described herein are used to provide a further understanding of the present application and constitute a part of the present application. The illustrative embodiments of the present application and their description are used to explain the application and do not constitute an undue limitation of the application. In the drawings:
Fig. 1 is a schematic structural diagram of a computing device to which the methods provided by the embodiments of the present application are applied;
Fig. 2 is a flowchart of the method for establishing a deep learning network model provided by the embodiments of the present application;
Fig. 3 is a schematic structural diagram of a deep learning network provided by the embodiments of the present application;
Fig. 4 is a flowchart of a method for determining a robot pose provided by the embodiments of the present application;
Fig. 5 is a flowchart of another method for determining a robot pose provided by the embodiments of the present application;
Fig. 6 is a schematic diagram of the hardware structure of an electronic device for implementing the method for establishing a deep learning network model and/or the method for determining a robot pose provided by the embodiments of the present application;
Fig. 7 is a schematic structural diagram of the device for establishing a deep learning network model provided by the embodiments of the present application;
Fig. 8 is a schematic structural diagram of the device for determining a robot pose provided by the embodiments of the present application.
Detailed description of the embodiments
In order to reduce the accumulated error between video frames in a video stream and to improve accuracy when the features of the video stream are used to determine the video acquisition environment, the embodiments of the present application provide a method and device for establishing a deep learning network model.
Preferred embodiments of the present application are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described here are only used to describe and explain the application and not to limit it, and that, in the absence of conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.
The methods provided by the present application can be applied to a variety of computing devices. Fig. 1 shows a schematic structural diagram of one such computing device; the computing device 10 shown in Fig. 1 is only an example and does not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in Fig. 1, the computing device 10 takes the form of a general-purpose computing device. Its components may include, but are not limited to, at least one processing unit 101, at least one storage unit 102, and a bus 103 connecting the different system components (including the storage unit 102 and the processing unit 101).
The bus 103 represents one or more of several classes of bus structures, including a memory bus or memory controller, a peripheral bus, and a processor or local bus using any of a variety of bus structures.
The storage unit 102 may include a readable medium in the form of volatile memory, such as random-access memory (RAM) 1021 and/or cache memory 1022, and may further include read-only memory (ROM) 1023.
The storage unit 102 may also include a program/utility 1025 having a group of (at least one) program modules 1024, such program modules 1024 including but not limited to an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
The computing device 10 may also communicate with one or more external devices 104 (such as a keyboard or pointing device), with one or more devices that enable a user to interact with the computing device 10, and/or with any device (such as a router or modem) that enables the computing device 10 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 105. In addition, the computing device 10 can communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 106. As shown in Fig. 1, the network adapter 106 communicates with the other modules of the computing device 10 through the bus 103. It should be understood that, although not shown in Fig. 1, other hardware and/or software modules can be used in conjunction with the computing device 10, including but not limited to microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Those skilled in the art will appreciate that Fig. 1 is only an example of a computing device and does not constitute a limitation on computing devices; a computing device may include more or fewer components than illustrated, combine certain components, or use different components.
As shown in Fig. 2, the method for establishing a deep learning network model provided by the embodiments of the present application comprises the following steps:
S201: obtain a video stream.
In a specific implementation, for each video frame in the video stream, the IMU data corresponding to the video frame are also obtained.
S202: input each video frame in the video stream into a deep learning network model for feature extraction, and determine the difference between the video frame and the adjacent previous video frame according to their extracted features. If the difference between the two video frames is determined not to be stable within a preset range, adjust the parameters of the deep learning network model so that the difference between the two video frames stabilizes within the preset range.
In a specific implementation, for each video frame in the video stream, the pose of the video frame can be estimated according to the IMU data corresponding to the adjacent previous video frame; the three-dimensional data of the video frame are then determined according to the estimated pose and the depth map of the video frame, and the feature of the video frame is determined from its three-dimensional data, for example by directly taking the three-dimensional data as the feature of the video frame. Considering that the amount of three-dimensional data is large and operating on it can be slow, in one possible embodiment the three-dimensional data of the video frame can also be reduced in dimension to obtain two-dimensional data of the video frame, to which bilinear interpolation is then applied; the bilinearly interpolated two-dimensional data are determined as the feature of the video frame, which makes the operation considerably faster.
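As a concrete illustration of the bilinear interpolation step mentioned above, here is a minimal NumPy sketch; the function name and array shapes are assumptions for illustration, not taken from the patent:

```python
import numpy as np

def bilinear_sample(grid, x, y):
    """Sample the 2-D data `grid` (H x W) at continuous coordinates (x, y)
    by bilinearly blending the four surrounding entries."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, grid.shape[1] - 1)
    y1 = min(y0 + 1, grid.shape[0] - 1)
    fx, fy = x - x0, y - y0
    top = grid[y0, x0] * (1 - fx) + grid[y0, x1] * fx  # blend along x (upper row)
    bot = grid[y1, x0] * (1 - fx) + grid[y1, x1] * fx  # blend along x (lower row)
    return top * (1 - fy) + bot * fy                   # blend along y
```

Sampling at integer coordinates returns the grid entry itself; sampling between entries returns the weighted average, which is what smooths a projected 3-D grid into a dense 2-D feature map.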
In practical applications, the feature is generally represented as a matrix, in which each pixel in the video frame has a corresponding value. Therefore, to determine the difference between the video frame and the previous video frame, for each pixel in the video frame the feature difference between that pixel and the pixel at the same position in the previous video frame can be computed (that is, the difference between the matrix entries representing the pixel features at the same position); the difference of the two video frames at that pixel is determined according to the feature difference and the weight of the pixel, and the difference between the two video frames is then determined according to their differences at each pixel.
For example, the difference between two video frames can be determined according to a loss function of the form:

D = Σ_{p∈Ω} w_p · σ(x_p)

where σ is a softmax function pre-selected by the technician, Ω represents all pixels in the video frame, x_p denotes the feature difference between the pixels of the two video frames at the same position p (that is, the difference between the matrix entries representing the pixel features at that position), and w_p represents the weight of the pixel at that position. Since every pixel in a video frame is equally important, the weight set for each pixel is the same.
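A toy numerical reading of such a per-pixel weighted comparison follows. The exact loss used in the patent is not spelled out in this text, so the particular combination below — a softmax-normalised weighting of absolute per-pixel feature differences, scaled by the (equal) pixel weights — is an illustrative assumption:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())  # numerically stable softmax
    return e / e.sum()

def frame_difference(feat_prev, feat_cur, weights=None):
    """Scalar difference between two frame feature maps: softmax-weighted
    mean of absolute per-pixel differences x_p, scaled by pixel weights w_p
    (equal for all pixels, as in the description above)."""
    x = np.abs(feat_cur - feat_prev).ravel()  # per-pixel feature difference
    if weights is None:
        weights = np.ones_like(x)             # every pixel equally important
    return float(np.sum(weights.ravel() * softmax(x) * x))
```

Identical feature maps give a difference of zero, and the difference grows as the maps diverge, which is the property the training loop in S202 relies on.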
If it is determined that the difference between the two video frames is not stable within the preset range, the parameters of the deep learning network model can be adjusted, and feature extraction and difference determination are performed again using the adjusted model, until the difference between the two video frames stabilizes within the preset range.
S203: when it is determined that the difference between every two adjacent video frames in the video stream is stable within the preset range, input each test video frame in a test video stream into the deep learning network model for feature extraction; determine the difference between two test video frames according to the extracted feature of the test video frame and the feature of the adjacent previous test video frame; compare this difference with the predetermined difference between the two test video frames; and determine, according to the comparison result, the error of the deep learning network model in extracting the feature of the current test video frame.
S204: determine the accuracy rate of the deep learning network model according to its errors in extracting the features of the test video frames in the test video stream.
S205: judge whether the accuracy rate of the deep learning network model reaches a preset accuracy rate; if so, proceed to S207; otherwise, proceed to S206.
S206: train the deep learning network model on a new video stream, and return to S203.
S207: determine the current deep learning network model as the established deep learning network model.
S208: use the established deep learning network model to perform feature extraction on each video frame in a collected video stream.
In the embodiment of the present application, the features of the video frames are extracted with a deep learning network, so the extracted features are richer; moreover, when the deep learning network extracts the feature of each video frame other than the first frame in the video stream, the feature of the adjacent previous video frame is taken into account, and the feature difference between the two is required to stay stable within the preset range. In this way, the accumulated error between video frames can be reduced as far as possible, which improves accuracy when the features of the individual video frames in the video stream are used to determine the video acquisition environment.
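The control flow of S201–S207 can be illustrated with a deliberately tiny stand-in "model" — a single scale parameter that is shrunk until every consecutive-frame feature difference falls inside the preset range. The scalar model and all names are assumptions for illustration; the real model is the deep learning network described above:

```python
def establish_model(frames, preset_range=(-0.5, 0.5), scale=1.0, step=0.9):
    """Shrink the model parameter `scale` until the difference between every
    pair of adjacent frame 'features' lies inside preset_range (S202),
    then return the established model parameter (S207)."""
    lo, hi = preset_range
    while True:
        feats = [scale * f for f in frames]                # 'feature extraction'
        diffs = [b - a for a, b in zip(feats, feats[1:])]  # adjacent differences
        if all(lo <= d <= hi for d in diffs):
            return scale                                   # model established
        scale *= step                                      # adjust parameters
```

With frames `[0, 1, 3]` the largest raw difference is 2, so the loop shrinks `scale` until `2 * scale <= 0.5`, mirroring the "adjust and re-extract" cycle of S202.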
The above process is described below with reference to a specific embodiment.
As shown in Fig. 3, a deep learning network provided by the embodiments of the present application comprises a Siamese layer, a pose transform estimation layer, a 3D grid generation layer, a projection layer, and a bilinear interpolation layer. The input of the deep learning network model is the video stream; the Siamese layer outputs a 6-degree-of-freedom pose estimate vector for the current video frame; the pose transform estimation layer transforms the pose estimate vector into a transformation matrix; the transformation matrix is combined with the depth information of the video frame to output 3D grid information; the projection layer projects the 3D grid information down to a 2D plane; and the bilinear interpolation layer then outputs the feature of the video frame.
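The layer sequence of Fig. 3 can be sketched numerically as follows. These are NumPy stand-ins under a small-angle assumption for the rotation; the real Siamese layer is a learned network, so only the downstream, geometry-only layers are sketched:

```python
import numpy as np

def pose_vector_to_matrix(pose6):
    """Pose transform estimation layer: 6-DoF vector (rx, ry, rz, tx, ty, tz)
    -> 4x4 transformation matrix, using a first-order (small-angle) rotation."""
    rx, ry, rz, tx, ty, tz = pose6
    T = np.eye(4)
    T[:3, :3] = [[1, -rz, ry],
                 [rz, 1, -rx],
                 [-ry, rx, 1]]
    T[:3, 3] = [tx, ty, tz]
    return T

def depth_to_grid(depth):
    """3D grid generation layer: depth map (H x W) -> (H x W x 3) points
    (u, v, depth) in a simple pixel-coordinate frame."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W, dtype=float), np.arange(H, dtype=float))
    return np.stack([u, v, depth], axis=-1)

def project(grid3d):
    """Projection layer: drop the 3-D grid to a 2-D plane (keep the depth channel)."""
    return grid3d[..., 2]
```

The 2-D output of `project` would then be resampled by the bilinear interpolation layer to give the frame feature.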
Further, the feature of the video frame and the feature of the adjacent previous video frame are input to a loss function layer, which determines the difference between the two video frames. If the difference between the two video frames is determined not to be stable within the preset range, the parameters of the deep learning network model can be adjusted, feature extraction performed on the video frame again, and the difference between the two video frames recomputed, until the difference between the two video frames stabilizes within the preset range.
After the difference between every two adjacent video frames in the video stream stabilizes within the preset range, the deep learning network model can be tested.
Specifically, each test video frame in a test video stream is input into the deep learning network model for feature extraction; the difference between two test video frames is determined according to the extracted feature of the test video frame and the feature of the adjacent previous test video frame; this difference is compared with the predetermined difference between the two video frames to determine the error of the deep learning network model in extracting the feature of the current frame; the accuracy rate of the deep learning network model is then determined according to its errors in extracting the features of the individual video frames, and it is judged whether the accuracy rate reaches the preset accuracy rate. If so, the deep learning network model is determined as the established deep learning network model; if not, the deep learning network model is trained on a new video stream, so as to establish a deep learning network model that reaches the preset accuracy rate.
After it is determined that the accuracy rate of the deep learning network model reaches the preset accuracy rate, the loss function layer in Fig. 3 can be removed, yielding a deep learning network model whose accuracy in extracting video frame features meets the requirement. The established deep learning network model can then be used to perform feature extraction on each video frame in a collected video stream.
In the embodiment of the present application, when the deep learning network extracts the feature of each video frame other than the first video frame in the video stream, the feature of the adjacent previous video frame is taken into account, and the feature difference between the two is required to stay stable within the preset range. In this way, the accumulated error between video frames can be reduced as far as possible, improving accuracy when the features of the individual video frames are used to determine the acquisition environment of the video stream; moreover, the video frame features extracted with the deep learning network are richer, further approaching the true acquisition environment of the video stream.
Taking the determination of a robot pose as an example, the application of the deep learning network model established in the above process is described below.
As shown in Fig. 4, the method for determining a robot pose provided by the embodiments of the present application comprises the following steps:
S401: input the currently acquired video frame into the established deep learning network model for feature extraction.
While the robot is walking, the camera mounted on it acquires video in real time. Whenever the camera collects a video frame, the frame is input into the established deep learning network model for feature extraction. In this way, the accumulated error between video frames can be reduced to the greatest extent, improving the accuracy of mapping the acquisition environment.
S402: compare the extracted feature of the video frame one by one with the features of the N most recently acquired key frames; if the feature overlap ratio between the video frame and every key frame is determined to be below a preset overlap ratio, determine the video frame to be a key frame.
The initial N key frames are the first N video frames acquired, and N is an integer greater than zero.
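A minimal sketch of the key frame test in S402, treating each frame's feature as a set of feature identifiers and using Jaccard overlap as the overlap ratio; the set representation and the exact overlap measure are assumptions for illustration:

```python
def is_key_frame(frame_feats, latest_key_frames, preset_overlap=0.5):
    """The frame is a key frame only if its feature overlap ratio with EVERY
    one of the N most recent key frames is below the preset overlap ratio."""
    f = set(frame_feats)
    for kf in latest_key_frames:
        k = set(kf)
        overlap = len(f & k) / max(len(f | k), 1)  # Jaccard overlap ratio
        if overlap >= preset_overlap:
            return False  # too similar to an existing key frame
    return True
```

A frame that substantially repeats what the recent key frames already cover is rejected, so the key frame set keeps tracking genuinely new scenery.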
S403: sparsify the target video stream according to the N most recently acquired key frames, where the target video stream consists of the video frame and the N preceding video frames adjacent to it.
Specifically, it can be judged whether the earliest-acquired video frame in the target video stream is a key frame. If so, a feature is found that is invisible in the N most recently acquired key frames but visible in the target video stream, and the most recently acquired video frame in the target video stream that contains this feature is discarded. If not, the most recently acquired video frame in the target video stream is discarded, that is, the video frame currently acquired in S401 is discarded.
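The discard rule of S403 can be sketched as follows; frames are plain identifiers and `visible` maps each frame to the set of features visible in it, a data layout assumed here for illustration:

```python
def sparsify(window, key_frames, visible):
    """window: target video stream, oldest frame first, newest frame last.
    If the oldest frame is a key frame, drop the newest frame carrying a
    feature visible in the window but in none of the latest key frames;
    otherwise drop the newest (currently acquired) frame."""
    if window[0] in key_frames:
        kf_feats = set()
        for k in key_frames:
            kf_feats |= visible[k]                # features seen by key frames
        for frame in reversed(window):            # scan newest first
            if visible[frame] - kf_feats:         # has a feature unseen by key frames
                window.remove(frame)
                return window
    window.pop()                                  # drop the newest frame
    return window
```

Either way exactly one frame leaves the window per step, which is what keeps the subsequent optimization over the target video stream bounded in size.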
S404: determine the pose of the robot according to the sparsified target video stream and the IMU data corresponding to each video frame in the target video stream.
Specifically, an optimization objective function is constructed, which mainly comprises two parts: one part is the feature error term of each video frame in the sparsified target video stream, and the other part is the IMU error term of each video frame in the sparsified target video stream. The pose of the robot can be obtained by optimizing the objective function with the least squares method.
For example, the constructed optimization objective is:

min Σ_{i=1}^{I} Σ_{k=1}^{K} Σ_j (r_{ikj})^T W_{ikj} r_{ikj} + Σ_{k=1}^{K} (s_k)^T Λ_k s_k

where i denotes the camera index, I the total number of cameras, K the total number of video frames in the sparsified target video stream, and r_{ikj} the error of the j-th feature in the k-th video frame acquired by the i-th camera; r denotes the error term of a video frame feature, and the error term contains the pose data of the robot. (r_{ikj})^T denotes the transpose of r_{ikj}, and W_{ikj} the matrix of the j-th feature in the k-th video frame acquired by the i-th camera. s_k denotes the IMU error term for the IMU data collected at the k-th video frame, (s_k)^T its transpose, and Λ_k the information matrix calculated from the IMU data.
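A direct numerical reading of this kind of objective — a sum of quadratic forms r^T W r over feature errors plus s^T Λ s over IMU errors — can be sketched as below; flat lists replace the triple camera/frame/feature indexing for brevity, an assumption made purely for illustration:

```python
import numpy as np

def vio_cost(feature_residuals, feature_infos, imu_residuals, imu_infos):
    """Evaluate sum_j r_j^T W_j r_j + sum_k s_k^T L_k s_k. Minimising this
    over the robot poses contained in the residuals (e.g. by least squares)
    yields the pose estimate."""
    cost = 0.0
    for r, W in zip(feature_residuals, feature_infos):
        r = np.asarray(r, dtype=float)
        cost += float(r @ np.asarray(W, dtype=float) @ r)   # feature error term
    for s, L in zip(imu_residuals, imu_infos):
        s = np.asarray(s, dtype=float)
        cost += float(s @ np.asarray(L, dtype=float) @ s)   # IMU error term
    return cost
```

In a real system this scalar would be handed to a nonlinear least squares solver, with the residuals re-evaluated at each candidate pose.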
In the embodiment of the present application, the features of the video frames are extracted with a deep learning network, so the extracted features are richer; moreover, when the deep learning network extracts the feature of a video frame, the feature of the adjacent previous video frame is taken into account, and the feature difference between the two is required to stay stable within the preset range. In this way, the accumulated error between video frames can be reduced as far as possible, and the finally determined pose of the robot is more accurate.
As shown in Fig. 5, another method for determining a robot pose provided by the embodiments of the present application comprises the following steps:
S501: initialize the mobile robot system.
The initialization includes determining the camera parameters, acquiring camera images, acquiring and parsing the IMU sensing data, and initializing the global map and the local map.
S502: acquire a video stream.
S503: input each video frame in the video stream into the trained deep learning network model for feature extraction.
S504: insert the image features matched to the current video frame into the local map, with the insertion position of each feature determined from the position integrated from the IMU data; at each position, two categories of feature points are saved: those of the latest frame at the current location and those of the N most recently calculated key frames from the global past.
S505: judge whether the current video frame is a key frame.
The region spanned by the matched points is calculated from the features of the current video frame; if the feature overlap ratio between the region of the current video frame and the region of each of the N most recently acquired key frames is below 50%, the current video frame is marked as a key frame; otherwise, it is determined that the current video frame is not a key frame.
S506: sparsify the target video stream using the N most recently acquired key frames, where the target video stream consists of the current video frame and the N preceding video frames adjacent to it.
The first N video frames acquired are initially regarded as key frames; as the number of acquired frames increases, the key frames are subsequently updated dynamically.
In the specific implementation, LS-SVM sparseness is carried out to target video stream using the N number of key frame acquired the latest to refer to, it will After current video frame is inserted into target video stream, however, it is determined that the video frame of foremost is not key frame in target video stream, then Abandon current video frame;If it is determined that the video frame of foremost is key frame in target video stream, then find N number of what is acquired the latest It is invisible but in the visible feature of target video stream in key frame, abandon target video stream include this feature and shooting time most The video frame in evening, in this way, the subsequent number that can reduce video frame when determining the pose of robot using target video stream, reduction Data processing amount improves pose and determines speed.
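The sparsification rule of S506 can be sketched as follows. The data layout is an illustrative assumption: frames are modelled as dicts carrying an `is_key` flag and a set of visible `features`, and the target stream is ordered oldest first with the current frame last.

```python
def sparsify(target_stream, key_frames):
    """Sketch of the sparsification rule described above.  If the earliest
    frame is not a key frame, drop the newly inserted (current) frame;
    otherwise drop the latest-shot frame containing a feature visible in the
    stream but in none of the N latest key frames."""
    if not target_stream[0]["is_key"]:
        # earliest frame is not a key frame: discard the current frame
        return target_stream[:-1]
    key_visible = set().union(*(kf["features"] for kf in key_frames))
    stream_visible = set().union(*(f["features"] for f in target_stream))
    extra = stream_visible - key_visible   # seen in stream, unseen by key frames
    for i in range(len(target_stream) - 1, -1, -1):
        if target_stream[i]["features"] & extra:
            return target_stream[:i] + target_stream[i + 1:]
    return target_stream
```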
S507: Calculate the robot pose from the sparsified target video stream and the IMU data corresponding to each video frame in the target video stream.

Specifically, an objective function is constructed for optimization. The objective function mainly consists of two parts: one part is the feature error term of each video frame in the sparsified target video stream, and the other part is the IMU error term of each of those video frames. Optimizing this objective function by the least squares method yields the robot pose.
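The two-term least-squares objective above can be illustrated with a toy linear instance. For illustration only, the pose is reduced to a 2-D position and both residuals take the form (pose − observation), so ordinary least squares solves the problem directly; the patent does not fix the form of either error term or the pose parameterization.

```python
import numpy as np

def estimate_pose(feature_obs, imu_obs, w_feat=1.0, w_imu=1.0):
    """Toy sketch of the objective in S507: a feature error term plus an
    IMU error term per frame, minimized by least squares.  The linear
    residual forms and the 2-D pose are hypothetical simplifications."""
    A, b = [], []
    for z in feature_obs:            # feature residuals: pose - observed position
        A.append(w_feat * np.eye(2))
        b.append(w_feat * np.asarray(z, dtype=float))
    for p in imu_obs:                # IMU residuals: pose - integrated position
        A.append(w_imu * np.eye(2))
        b.append(w_imu * np.asarray(p, dtype=float))
    pose, *_ = np.linalg.lstsq(np.vstack(A), np.hstack(b), rcond=None)
    return pose
```

With one feature observation and one IMU prediction of equal weight, the optimum is their average.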
As shown in Fig. 6, which is a structural schematic diagram of an electronic device provided by the embodiments of the present application, the electronic device includes physical components such as a transceiver 601 and a processor 602, where the processor 602 may be a central processing unit (CPU), a microprocessor, an application-specific integrated circuit, a programmable logic circuit, a large-scale integrated circuit, a digital processing unit, or the like. The transceiver 601 is used by the electronic device for transmitting and receiving data with other devices.

The electronic device may also include a memory 603 for storing the software instructions executed by the processor 602, and of course may also store other data needed by the electronic device, such as identification information of the electronic device, encryption information of the electronic device, and user data. The memory 603 may be a volatile memory, such as random-access memory (RAM); it may also be a non-volatile memory, such as read-only memory (ROM), flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); or the memory 603 may be any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto. The memory 603 may also be a combination of the above memories.

The embodiments of the present application do not limit the specific connection medium between the processor 602, the memory 603, and the transceiver 601. In Fig. 6, the embodiment is illustrated with the memory 603, the processor 602, and the transceiver 601 connected only by a bus 604, indicated by a thick line in the figure; the connection manner between other components is merely schematically illustrated and is not limiting. The bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is drawn in Fig. 6, but this does not mean there is only one bus or one type of bus.

The processor 602 may be dedicated hardware or a processor that runs software. When the processor 602 runs software, it reads the software instructions stored in the memory 603 and, driven by those instructions, executes the method for building a deep learning network model and/or the robot pose determination method involved in the preceding embodiments.
When the method provided in the embodiments of the present application is implemented in software, hardware, or a combination of the two, the electronic device may include multiple functional modules, and each functional module may include software, hardware, or a combination thereof. Specifically, as shown in Fig. 7, which is a structural schematic diagram of an apparatus for building a deep learning network model provided by the embodiments of the present application, the apparatus includes a feature extraction module 701, a difference determination module 702, an adjustment module 703, and a determination module 704.

The feature extraction module 701 is configured to, for each video frame in the acquired video stream, input the video frame into the deep learning network model for feature extraction;

the difference determination module 702 is configured to determine the difference between two video frames according to the feature of the extracted video frame and the feature of the previous video frame adjacent to the video frame;

the adjustment module 703 is configured to, if it is determined that the difference is not stably within the preset range, adjust the parameters of the deep learning network model and return to the step of inputting the video frame into the deep learning network model for feature extraction;

the determination module 704 is configured to, when it is determined that the difference between any two adjacent video frames in the video stream is stably within the preset range, determine the current deep learning network model as the established deep learning network model.
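The loop carried out by the four modules above can be sketched as follows. Everything concrete here is an assumption for illustration: the model, the extraction and adjustment callables, the scalar difference, and the preset range are all stand-ins for the unspecified deep learning components.

```python
def build_model(model, video_stream, extract, diff, adjust, preset_range):
    """Sketch of modules 701-704: extract a feature per frame, compare each
    frame with its previous frame, and keep adjusting the model parameters
    until every adjacent-frame difference lies stably in the preset range."""
    lo, hi = preset_range
    while True:
        feats = [extract(model, frame) for frame in video_stream]
        diffs = [diff(a, b) for a, b in zip(feats, feats[1:])]
        if all(lo <= d <= hi for d in diffs):
            return model  # the established deep learning network model
        model = adjust(model, diffs)
```

A toy run with a scalar "model" that scales each frame, and an adjustment that halves the scale, converges once all adjacent differences fall in range.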
In a possible embodiment, if inertial measurement unit (IMU) data corresponding to each video frame in the video stream is also obtained, the feature extraction module 701 is specifically configured to extract the feature of the video frame according to the following steps:

estimate the pose of the video frame according to the IMU data corresponding to the previous video frame adjacent to the video frame;

determine the three-dimensional data of the video frame according to the pose of the video frame and the depth map of the video frame;

determine the feature of the video frame according to the three-dimensional data of the video frame.
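The middle step, turning a pose plus a depth map into three-dimensional data, can be sketched as a standard pinhole back-projection. The intrinsics `fx, fy, cx, cy` and the (R, t) pose representation are hypothetical placeholders; the text only states that the pose and the depth map yield the 3-D data.

```python
import numpy as np

def back_project(depth, R, t, fx=1.0, fy=1.0, cx=0.0, cy=0.0):
    """Sketch: lift each pixel of a depth map into 3-D using the frame's
    IMU-estimated pose (rotation R, translation t).  Pinhole intrinsics
    are illustrative assumptions."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w].astype(float)
    x = (u - cx) / fx * depth            # camera-frame X
    y = (v - cy) / fy * depth            # camera-frame Y
    pts_cam = np.stack([x, y, depth], axis=-1)
    return pts_cam @ R.T + t             # world-frame points, shape (h, w, 3)
```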
In a possible embodiment, the feature extraction module 701 is specifically configured to:

perform dimensionality reduction on the three-dimensional data of the video frame to obtain the two-dimensional data of the video frame;

perform bilinear interpolation on the two-dimensional data of the video frame;

determine the bilinearly interpolated two-dimensional data as the feature of the video frame.
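The bilinear interpolation step above ("bilinearity difference" in the machine translation) can be sketched for a single sample point; the actual map shape and sampling grid are not given in the text and are assumed here.

```python
import numpy as np

def bilinear_sample(grid, x, y):
    """Minimal bilinear interpolation over a 2-D map at point (x, y):
    blend the four surrounding grid values by their fractional offsets."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1 = min(x0 + 1, grid.shape[1] - 1)
    y1 = min(y0 + 1, grid.shape[0] - 1)
    dx, dy = x - x0, y - y0
    top = grid[y0, x0] * (1 - dx) + grid[y0, x1] * dx
    bottom = grid[y1, x0] * (1 - dx) + grid[y1, x1] * dx
    return top * (1 - dy) + bottom * dy
```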
In a possible embodiment, the difference determination module 702 is specifically configured to:

for each pixel in the video frame, calculate the feature difference between the pixel and the pixel at the same position in the previous video frame, and determine the difference between the two video frames at the pixel according to the feature difference and the weight of the pixel;

determine the difference between the two video frames according to their differences at each pixel.
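The per-pixel weighted comparison above can be sketched in a few lines. The absolute difference and the sum used for aggregation are one plausible choice; the patent does not fix either.

```python
import numpy as np

def frame_difference(feat_a, feat_b, weights):
    """Weighted per-pixel difference between two frames' feature maps:
    the feature difference at each pixel is scaled by that pixel's weight,
    then aggregated over the frame (assumptions: abs-difference, sum)."""
    per_pixel = np.abs(feat_a - feat_b) * weights
    return per_pixel.sum()
```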
In a possible embodiment, the apparatus further includes a test module 705, and the test module 705 is configured to:

before the current deep learning network model is determined as the established deep learning network model, input each test video frame in a test video stream into the deep learning network model for feature extraction; determine the difference between two test video frames according to the feature of the extracted test video frame and the feature of the previous test video frame adjacent to the test video frame; compare this difference with the predetermined difference between the two test video frames; and determine, according to the comparison result, the error of the deep learning network model when extracting the feature of the current test video frame;

determine the accuracy rate of the deep learning network model according to its error when extracting the feature of each test video frame in the test video stream;

judge whether the accuracy rate of the deep learning network model reaches a preset accuracy rate;

if so, determine the deep learning network model as the established deep learning network model; if not, train the deep learning network model with a new video stream, so as to establish a deep learning network model that reaches the preset accuracy rate.
In a possible embodiment, the feature extraction module 701 is further configured to:

after the current deep learning network model is determined as the established deep learning network model, use the deep learning network model to perform feature extraction on each video frame in the acquired video stream.
As shown in Fig. 8, which is a structural schematic diagram of a robot pose determination apparatus provided by the embodiments of the present application, the apparatus includes a feature extraction module 801, a key frame determination module 802, a sparsification module 803, and a pose determination module 804.

The feature extraction module 801 is configured to input the currently acquired video frame into the established deep learning network model for feature extraction;

the key frame determination module 802 is configured to compare the feature of the extracted video frame one by one with the features of the N most recently acquired key frames, and if it is determined that the feature overlap rate between the video frame and each key frame is lower than a preset overlap rate, determine that the video frame is a key frame, where the initial N key frames are the first N acquired video frames and N is an integer greater than zero;

the sparsification module 803 is configured to sparsify the target video stream according to the N most recently acquired key frames, where the target video stream consists of the video frame and the N preceding video frames adjacent to the video frame;

the pose determination module 804 is configured to determine the robot pose according to the sparsified target video stream and the inertial measurement unit (IMU) data corresponding to each video frame in the target video stream.
In a possible embodiment, the sparsification module 803 is specifically configured to:

judge whether the earliest-acquired video frame in the target video stream is a key frame;

if so, determine a feature that is invisible in the N key frames but visible in the target video stream, and discard the video frame in the target video stream that contains this feature and has the latest acquisition time; if not, discard the video frame in the target video stream with the latest acquisition time.
The division into modules in the embodiments of the present application is schematic and is only a division by logical function; there may be other division manners in actual implementation. In addition, the functional modules in the embodiments of the present application may be integrated in one processor, may exist separately and physically, or two or more modules may be integrated in one module. The coupling between the modules may be realized through interfaces, which are usually electrical communication interfaces, but mechanical interfaces or interfaces in other forms are not excluded. Therefore, the modules illustrated as separate components may or may not be physically separate; they may be located in one place, or distributed at different locations on the same device or on different devices. The above integrated modules may be implemented in the form of hardware or in the form of software functional modules.
The embodiments of the present application also provide a computer-readable storage medium storing the computer-executable instructions needed by the above processor, including the program to be executed by the above processor.

In some possible embodiments, the aspects of the method for building a deep learning network model provided by the present application may also be implemented in the form of a program product including program code; when the program product runs on an electronic device, the program code is used to make the electronic device execute the steps of the method for building a deep learning network model according to the various exemplary embodiments of the present application described above in this specification.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more conductors, a portable disk, a hard disk, random-access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.

The program product for building a deep learning network model according to the embodiments of the present application may employ a portable compact disc read-only memory (CD-ROM) including program code, and may run on a computing device. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that contains or stores a program, and the program may be used by or in connection with an instruction execution system, apparatus, or device.

A readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, which carries readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.

The program code contained on a readable medium may be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, or any suitable combination of the above.

The program code for carrying out the operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server. In the case involving a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, this division is only exemplary and not mandatory. In fact, according to the embodiments of the present application, the features and functions of two or more units described above may be embodied in one unit. Conversely, the features and functions of one unit described above may be further divided and embodied in multiple units.

In addition, although the operations of the method of the present application are described in a particular order in the drawings, this does not require or imply that these operations must be executed in that particular order, or that all of the illustrated operations must be executed to achieve the desired result. Additionally or alternatively, certain steps may be omitted, multiple steps may be merged into one step for execution, and/or one step may be decomposed into multiple steps for execution.

Those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, and optical storage) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, apparatus (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of guiding a computer or another programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce a manufactured article including an instruction device, and the instruction device realizes the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or another programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although the preferred embodiments of the present application have been described, those skilled in the art, once they learn of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the present application.

Obviously, those skilled in the art can make various modifications and variations to the present application without departing from the spirit and scope of the application. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include these modifications and variations.

Claims (18)

1. A method for building a deep learning network model, comprising:
for each video frame in an acquired video stream, inputting the video frame into a deep learning network model for feature extraction;
determining a difference between two video frames according to the feature of the extracted video frame and the feature of the previous video frame adjacent to the video frame;
if it is determined that the difference is not stably within a preset range, adjusting the parameters of the deep learning network model and returning to the step of inputting the video frame into the deep learning network model for feature extraction;
when it is determined that the difference between any two adjacent video frames in the video stream is stably within the preset range, determining the current deep learning network model as the established deep learning network model.
2. The method according to claim 1, wherein, if inertial measurement unit (IMU) data corresponding to each video frame in the video stream is also obtained, the deep learning network model is controlled to extract the feature of the video frame according to the following steps:
estimating the pose of the video frame according to the IMU data corresponding to the previous video frame adjacent to the video frame;
determining the three-dimensional data of the video frame according to the pose of the video frame and the depth map of the video frame;
determining the feature of the video frame according to the three-dimensional data of the video frame.
3. The method according to claim 2, wherein determining the feature of the video frame according to the three-dimensional data of the video frame comprises:
performing dimensionality reduction on the three-dimensional data of the video frame to obtain the two-dimensional data of the video frame;
performing bilinear interpolation on the two-dimensional data of the video frame;
determining the bilinearly interpolated two-dimensional data as the feature of the video frame.
4. The method according to claim 1, wherein determining the difference between two video frames according to the feature of the extracted video frame and the feature of the previous video frame adjacent to the video frame comprises:
for each pixel in the video frame, calculating the feature difference between the pixel and the pixel at the same position in the previous video frame, and determining the difference between the two video frames at the pixel according to the feature difference and the weight of the pixel;
determining the difference between the two video frames according to their differences at each pixel.
5. The method according to claim 1, further comprising, before determining the current deep learning network model as the established deep learning network model:
inputting each test video frame in a test video stream into the deep learning network model for feature extraction; determining the difference between two test video frames according to the feature of the extracted test video frame and the feature of the previous test video frame adjacent to the test video frame; comparing this difference with a predetermined difference between the two test video frames; and determining, according to the comparison result, the error of the deep learning network model when extracting the feature of the current test video frame;
determining the accuracy rate of the deep learning network model according to its error when extracting the feature of each test video frame in the test video stream;
judging whether the accuracy rate of the deep learning network model reaches a preset accuracy rate;
if so, determining the deep learning network model as the established deep learning network model; if not, training the deep learning network model with a new video stream, so as to establish a deep learning network model that reaches the preset accuracy rate.
6. The method according to claim 1, further comprising, after determining the current deep learning network model as the established deep learning network model:
performing feature extraction on each video frame in the acquired video stream using the deep learning network model.
7. A robot pose determination method, comprising:
inputting the currently acquired video frame into an established deep learning network model for feature extraction;
comparing the feature of the extracted video frame one by one with the features of the N most recently acquired key frames, and if it is determined that the feature overlap rate between the video frame and each key frame is lower than a preset overlap rate, determining that the video frame is a key frame, wherein the initial N key frames are the first N acquired video frames and N is an integer greater than zero;
sparsifying a target video stream according to the N most recently acquired key frames, wherein the target video stream consists of the video frame and the N preceding video frames adjacent to the video frame;
determining the robot pose according to the sparsified target video stream and the inertial measurement unit (IMU) data corresponding to each video frame in the target video stream.
8. The method according to claim 7, wherein sparsifying the target video stream according to the N most recently acquired key frames comprises:
judging whether the earliest-acquired video frame in the target video stream is a key frame;
if so, determining a feature that is invisible in the N key frames but visible in the target video stream, and discarding the video frame in the target video stream that contains this feature and has the latest acquisition time; if not, discarding the video frame in the target video stream with the latest acquisition time.
9. An apparatus for building a deep learning network model, comprising:
a feature extraction module, configured to, for each video frame in an acquired video stream, input the video frame into a deep learning network model for feature extraction;
a difference determination module, configured to determine a difference between two video frames according to the feature of the extracted video frame and the feature of the previous video frame adjacent to the video frame;
an adjustment module, configured to, if it is determined that the difference is not stably within a preset range, adjust the parameters of the deep learning network model and return to the step of inputting the video frame into the deep learning network model for feature extraction;
a determination module, configured to, when it is determined that the difference between any two adjacent video frames in the video stream is stably within the preset range, determine the current deep learning network model as the established deep learning network model.
10. The apparatus according to claim 9, wherein, if inertial measurement unit (IMU) data corresponding to each video frame in the video stream is also obtained, the feature extraction module is specifically configured to extract the feature of the video frame according to the following steps:
estimate the pose of the video frame according to the IMU data corresponding to the previous video frame adjacent to the video frame;
determine the three-dimensional data of the video frame according to the pose of the video frame and the depth map of the video frame;
determine the feature of the video frame according to the three-dimensional data of the video frame.
11. The apparatus according to claim 10, wherein the feature extraction module is specifically configured to:
perform dimensionality reduction on the three-dimensional data of the video frame to obtain the two-dimensional data of the video frame;
perform bilinear interpolation on the two-dimensional data of the video frame;
determine the bilinearly interpolated two-dimensional data as the feature of the video frame.
12. The apparatus according to claim 9, wherein the difference determination module is specifically configured to:
for each pixel in the video frame, calculate the feature difference between the pixel and the pixel at the same position in the previous video frame, and determine the difference between the two video frames at the pixel according to the feature difference and the weight of the pixel;
determine the difference between the two video frames according to their differences at each pixel.
13. The apparatus according to claim 9, further comprising a test module, the test module being configured to:
before the current deep learning network model is determined as the established deep learning network model, input each test video frame in a test video stream into the deep learning network model for feature extraction; determine the difference between two test video frames according to the feature of the extracted test video frame and the feature of the previous test video frame adjacent to the test video frame; compare this difference with a predetermined difference between the two test video frames; and determine, according to the comparison result, the error of the deep learning network model when extracting the feature of the current test video frame;
determine the accuracy rate of the deep learning network model according to its error when extracting the feature of each test video frame in the test video stream;
judge whether the accuracy rate of the deep learning network model reaches a preset accuracy rate;
if so, determine the deep learning network model as the established deep learning network model; if not, train the deep learning network model with a new video stream, so as to establish a deep learning network model that reaches the preset accuracy rate.
14. The apparatus according to claim 9, wherein the feature extraction module is further configured to:
after the current deep learning network model is determined as the established deep learning network model, use the deep learning network model to perform feature extraction on each video frame in the acquired video stream.
15. A robot pose determining device, comprising:
a feature extraction module, configured to input a currently collected video frame into an established deep learning network model for feature extraction;
a key frame determining module, configured to compare the extracted feature of the video frame with the features of the N most recently collected key frames one by one, and, if the feature overlap ratio between the video frame and each of the key frames is below a preset overlap ratio, determine that the video frame is a key frame, where the initial N key frames are the first N collected video frames and N is an integer greater than zero;
a sparsification module, configured to sparsify a target video stream according to the N most recently collected key frames, where the target video stream consists of the video frame and the previous N video frames adjacent to the video frame;
a pose determining module, configured to determine the pose of a robot according to each video frame in the sparsified target video stream and the inertial measurement unit (IMU) data corresponding to the target video stream.
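The key frame rule in claim 15 (initialize with the first N collected frames, then admit a new frame only when its overlap with every stored key frame falls below the preset ratio) can be sketched as follows. Representing a frame's features as a set of feature IDs and measuring overlap by set intersection are assumptions made for illustration; the claim does not fix the overlap metric.

```python
from collections import deque

def feature_overlap(feat_a, feat_b):
    """Fraction of features of feat_a that are also visible in feat_b."""
    if not feat_a:
        return 0.0
    return len(feat_a & feat_b) / len(feat_a)

class KeyFrameSelector:
    """Maintain the N most recently collected key frames."""

    def __init__(self, n, preset_overlap=0.5):
        self.keyframes = deque(maxlen=n)  # the oldest key frame drops out
        self.n = n
        self.preset_overlap = preset_overlap

    def offer(self, frame_features):
        """Return True if the offered frame becomes a key frame."""
        if len(self.keyframes) < self.n:
            # The initial N key frames are the first N collected frames.
            self.keyframes.append(frame_features)
            return True
        # A frame is a key frame only if its overlap with EVERY stored
        # key frame is below the preset overlap ratio.
        if all(feature_overlap(frame_features, kf) < self.preset_overlap
               for kf in self.keyframes):
            self.keyframes.append(frame_features)
            return True
        return False
```

The `deque(maxlen=n)` keeps exactly the N most recent key frames, matching the "acquisition time the latest" wording of the claim.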
16. The device according to claim 15, wherein the sparsification module is specifically configured to:
judge whether the earliest collected video frame in the target video stream is a key frame;
if so, determine a feature that is invisible in the N key frames but visible in the target video stream, and discard the most recently collected video frame in the target video stream that contains this feature; if not, discard the most recently collected video frame in the target video stream.
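A minimal sketch of this sparsification rule, under the same assumed feature-ID representation as above (the claim does not prescribe how visibility of a feature is tested, so the set-membership check here is an illustration):

```python
def sparsify(target_stream, keyframes, is_keyframe):
    """Sparsify a target video stream by discarding one frame.

    target_stream: list of (frame_id, feature_set) pairs, oldest first.
    keyframes:     feature sets of the N most recent key frames.
    is_keyframe:   predicate telling whether a frame id is a key frame.
    """
    oldest_id, _ = target_stream[0]
    if is_keyframe(oldest_id):
        # Features visible somewhere in the target stream but in none
        # of the key frames.
        visible_in_stream = set().union(*(f for _, f in target_stream))
        visible_in_keyframes = set().union(*keyframes) if keyframes else set()
        orphan = visible_in_stream - visible_in_keyframes
        # Discard the most recently collected frame containing such a feature.
        for i in range(len(target_stream) - 1, -1, -1):
            if target_stream[i][1] & orphan:
                return target_stream[:i] + target_stream[i + 1:]
    # Otherwise discard the most recently collected frame outright.
    return target_stream[:-1]
```

Either branch removes exactly one frame per call, which is what keeps the target video stream sparse as new frames arrive.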
17. An electronic device, comprising: at least one processor, and a memory communicatively connected to the at least one processor, wherein:
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can perform the method according to any one of claims 1 to 6 or 7 to 8.
18. A computer-readable medium storing computer-executable instructions, wherein the computer-executable instructions are configured to perform the method according to any one of claims 1 to 6 or 7 to 8.
CN201910352498.0A 2019-04-29 2019-04-29 Method and device for establishing a deep learning network model Active CN109840598B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910352498.0A CN109840598B (en) 2019-04-29 2019-04-29 A kind of method for building up and device of deep learning network model

Publications (2)

Publication Number Publication Date
CN109840598A true CN109840598A (en) 2019-06-04
CN109840598B CN109840598B (en) 2019-08-09

Family

ID=66887210

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910352498.0A Active CN109840598B (en) 2019-04-29 2019-04-29 A kind of method for building up and device of deep learning network model

Country Status (1)

Country Link
CN (1) CN109840598B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112291588A (en) * 2020-10-12 2021-01-29 北京文香信息技术有限公司 Display content sharing method and device, equipment and readable storage medium
CN113674276A (en) * 2021-10-21 2021-11-19 北京金山云网络技术有限公司 Image quality difference scoring method and device, storage medium and electronic equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101282481A (en) * 2008-05-09 2008-10-08 中国传媒大学 Method for evaluating video quality based on artificial neural net
US20090141969A1 (en) * 2007-11-29 2009-06-04 Nec Laboratories America, Inc. Transfer Learning Methods and systems for Feed-Forward Visual Recognition Systems
CN106780608A (en) * 2016-11-23 2017-05-31 北京地平线机器人技术研发有限公司 Posture information method of estimation, device and movable equipment
CN108304755A (en) * 2017-03-08 2018-07-20 腾讯科技(深圳)有限公司 The training method and device of neural network model for image procossing
CN108492316A (en) * 2018-02-13 2018-09-04 视辰信息科技(上海)有限公司 A kind of localization method and device of terminal
CN108972593A (en) * 2018-09-07 2018-12-11 顺德职业技术学院 Control method and system under a kind of industrial robot system
CN108986166A (en) * 2018-07-20 2018-12-11 山东大学 A kind of monocular vision mileage prediction technique and odometer based on semi-supervised learning
CN109214253A (en) * 2017-07-07 2019-01-15 阿里巴巴集团控股有限公司 A kind of video frame detection method and device
CN109341703A (en) * 2018-09-18 2019-02-15 北京航空航天大学 A kind of complete period uses the vision SLAM algorithm of CNNs feature detection
CN109640068A (en) * 2018-10-31 2019-04-16 百度在线网络技术(北京)有限公司 Information forecasting method, device, equipment and the storage medium of video frame


Also Published As

Publication number Publication date
CN109840598B (en) 2019-08-09

Similar Documents

Publication Publication Date Title
US11010921B2 (en) Distributed pose estimation
KR102047031B1 (en) Deep Stereo: Learning to predict new views from real world images
CN109255830A (en) Three-dimensional facial reconstruction method and device
CN105143907B (en) Alignment system and method
CN104781849B (en) Monocular vision positions the fast initialization with building figure (SLAM) simultaneously
CN109658445A (en) Network training method, increment build drawing method, localization method, device and equipment
CN107393017A (en) Image processing method, device, electronic equipment and storage medium
CN107990899A (en) A kind of localization method and system based on SLAM
CN108225348A (en) Map building and the method and apparatus of movement entity positioning
WO2021222371A9 (en) Cross reality system for large scale environments
CN107820593A (en) A kind of virtual reality exchange method, apparatus and system
CN107888828A (en) Space-location method and device, electronic equipment and storage medium
CN107341442A (en) Motion control method, device, computer equipment and service robot
CN106846485A (en) A kind of indoor three-dimensional modeling method and device
CN110457414A (en) Offline map processing, virtual objects display methods, device, medium and equipment
CN107748942B (en) Radar Echo Extrapolation prediction technique and system based on velocity field sensing network
CN109211277A (en) The state of vision inertia odometer determines method, apparatus and electronic equipment
CN109840598B (en) A kind of method for building up and device of deep learning network model
CN109461208A (en) Three-dimensional map processing method, device, medium and calculating equipment
CN108898669A (en) Data processing method, device, medium and calculating equipment
CN108027884A (en) Optimization object detects
KR20210058686A (en) Device and method of implementing simultaneous localization and mapping
CN114387319A (en) Point cloud registration method, device, equipment and storage medium
CN108292138A (en) Random map knows formula stereo vision sensor model
CN110381310A (en) A kind of method and device for the health status detecting vision system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant