CN107340852A - Gesture control method, device and terminal device

Gesture control method, device and terminal device

Info

Publication number
CN107340852A
CN107340852A
Authority
CN
China
Prior art keywords
business object
video image
gesture
hand
network model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610694510.2A
Other languages
Chinese (zh)
Inventor
栾青
钱晨
刘文韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Sensetime Technology Development Co Ltd
Original Assignee
Beijing Sensetime Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Sensetime Technology Development Co Ltd
Priority to CN201610694510.2A
Priority to PCT/CN2017/098182
Publication of CN107340852A
Legal status: Pending

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 — Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 — Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G06F18/00 — Pattern recognition
    • G06F18/20 — Analysing
    • G06F18/21 — Design or setup of recognition systems or techniques; extraction of features in feature space; blind source separation
    • G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06V — IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 — Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 — Movements or behaviour, e.g. gesture recognition
    • G06V40/28 — Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Human Computer Interaction (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present invention provide a gesture control method, apparatus, and terminal device. The method includes: performing gesture detection on a currently played video image; when a detected gesture matches a predetermined gesture, determining a display position of a business object to be shown in the video image; and drawing the business object at the display position by means of computer graphics. With the embodiments of the present invention, network resources and/or client system resources can be saved, and interest is added to the video image without disturbing the user's normal viewing, thereby reducing viewer aversion to the business object shown in the video image; the business object can also attract the audience's attention to some extent, increasing its influence.

Description

Gesture control method, device and terminal device
Technical field
The present invention relates to information processing technology, and in particular to a gesture control method, apparatus, and terminal device.
Background
With the development of Internet technology, more and more people watch video over the Internet, and Internet video thus creates business opportunities for many new services. Because Internet video can become an important traffic entrance for services, it is regarded as a high-quality resource for advertisement placement.
Existing video advertising is mainly placed by insertion: an advertisement of fixed duration is inserted at some point during video playback, or an advertisement is placed at a fixed position in or around the playback region.
However, on the one hand, this kind of video advertising occupies not only network resources but also client system resources; on the other hand, it often disturbs the audience's normal viewing experience and causes annoyance, so that the anticipated advertising effect is not achieved.
Summary of the invention
An object of the present invention is to provide a gesture control scheme.
According to one aspect of embodiments of the present invention, a gesture control method is provided. The method includes: performing gesture detection on a currently played video image; when a detected gesture matches a predetermined gesture, determining a display position of a business object to be shown in the video image; and drawing the business object at the display position by means of computer graphics.
Optionally, in any gesture control method provided by an embodiment of the present invention, determining the display position of the business object to be shown in the video image includes: extracting feature points of the hand in the hand candidate region corresponding to the detected gesture; and determining, according to the feature points of the hand, the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, in any gesture control method provided by an embodiment of the present invention, determining the display position according to the feature points of the hand includes: determining, according to the feature points of the hand and the type of the business object to be shown, the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, in any gesture control method provided by an embodiment of the present invention, determining the display position according to the feature points of the hand and the type of the business object to be shown includes: determining, according to the feature points of the hand and the type of the business object to be shown, multiple display positions in the video image of the business object to be shown corresponding to the detected gesture; and selecting at least one display position from the multiple display positions.
Optionally, in any gesture control method provided by an embodiment of the present invention, determining the display position of the business object to be shown in the video image includes: obtaining, from prestored correspondences between gestures and display positions, the target display position corresponding to the predetermined gesture as the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, in any gesture control method provided by an embodiment of the present invention, the business object is a special effect containing semantic information, and the video image is a live-streaming video image.
Optionally, in any gesture control method provided by an embodiment of the present invention, the business object includes a special effect containing advertising information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, or a particle effect.
Optionally, in any gesture control method provided by an embodiment of the present invention, the display position includes at least one of: the hair region, forehead region, cheek region, or chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where the hand is located; or a preset region in the video image.
Optionally, in any gesture control method provided by an embodiment of the present invention, the type of the business object includes at least one of: a forehead-sticker type, a cheek-sticker type, a chin-sticker type, a virtual-hat type, a virtual-clothing type, a virtual-makeup type, a virtual-headwear type, a virtual-hair-accessory type, or a virtual-jewelry type.
Optionally, in any gesture control method provided by an embodiment of the present invention, the gesture includes at least one of: waving, a scissors hand, a fist, a cupped hand, applause, an open palm, a closed palm, a thumbs-up, a finger-gun pose, a V sign, or an OK sign.
Optionally, in any gesture control method provided by an embodiment of the present invention, performing gesture detection on the currently played video image includes: detecting the video image using a pretrained first convolutional network model to obtain first feature information of the video image and prediction information of a hand candidate region, the first feature information including hand feature information; using the first feature information and the prediction information of the hand candidate region as second feature information of a pretrained second convolutional network model, and performing gesture detection on the video image according to the second feature information using the second convolutional network model to obtain a gesture detection result for the video image; wherein the second convolutional network model shares a feature extraction layer with the first convolutional network model.
Optionally, in any gesture control method provided by an embodiment of the present invention, before performing gesture detection on the currently played video image, the method further includes: training the first convolutional network model on sample images containing hand annotation information, and obtaining the first convolutional network model's prediction information of hand candidate regions for the sample images; correcting the prediction information of the hand candidate regions; and training the second convolutional network model on the corrected prediction information of the hand candidate regions and the sample images, wherein the second convolutional network model shares a feature extraction layer with the first convolutional network model and the parameters of the feature extraction layer are kept constant during training of the second convolutional network model.
Optionally, in any gesture control method provided by an embodiment of the present invention, determining the display position of the business object to be shown in the video image includes: determining the display position of the business object to be shown corresponding to the detected gesture by means of the gesture and a pretrained third convolutional network model for detecting display positions of business objects from video images.
According to another aspect of embodiments of the present invention, a gesture control apparatus is provided. The apparatus includes: a gesture detection module for performing gesture detection on a currently played video image; a display position determination module for determining, when a detected gesture matches a predetermined gesture, the display position of a business object to be shown in the video image; and a business object drawing module for drawing the business object at the display position by means of computer graphics.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the display position determination module includes: a feature point extraction unit for extracting the feature points of the hand in the hand candidate region corresponding to the detected gesture; and a display position determination unit for determining, according to the feature points of the hand, the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the display position determination unit is configured to determine, according to the feature points of the hand and the type of the business object to be shown, the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the display position determination unit is configured to determine, according to the feature points of the hand and the type of the business object to be shown, multiple display positions in the video image of the business object to be shown corresponding to the detected gesture, and to select at least one display position from the multiple display positions.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the display position determination module is configured, when it is determined that the detected gesture matches the corresponding predetermined gesture, to take the display position of the business object to be shown corresponding to the predetermined gesture as the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the display position determination module is configured to obtain, from prestored correspondences between gestures and display positions, the target display position corresponding to the predetermined gesture as the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the business object is a special effect containing semantic information, and the video image is a live-streaming video image.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the business object includes a special effect containing advertising information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, or a particle effect.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the display position includes at least one of: the hair region, forehead region, cheek region, or chin region of a person in the video image; a body region other than the head; a background region in the video image; a region within a set range centered on the region where the hand is located; or a preset region in the video image.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the type of the business object includes at least one of: a forehead-sticker type, a cheek-sticker type, a chin-sticker type, a virtual-hat type, a virtual-clothing type, a virtual-makeup type, a virtual-headwear type, a virtual-hair-accessory type, or a virtual-jewelry type.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the gesture includes at least one of: waving, a scissors hand, a fist, a cupped hand, applause, an open palm, a closed palm, a thumbs-up, a finger-gun pose, a V sign, or an OK sign.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the gesture detection module is configured to detect the video image using the pretrained first convolutional network model to obtain the first feature information of the video image and the prediction information of the hand candidate region, the first feature information including hand feature information; to use the first feature information and the prediction information of the hand candidate region as the second feature information of the pretrained second convolutional network model; and to perform gesture detection on the video image according to the second feature information using the second convolutional network model to obtain the gesture detection result of the video image, wherein the second convolutional network model shares a feature extraction layer with the first convolutional network model.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the apparatus further includes: a hand region determination module for training the first convolutional network model on sample images containing hand annotation information and obtaining the first convolutional network model's prediction information of the hand candidate regions for the sample images; a correction module for correcting the prediction information of the hand candidate regions; and a convolution model training module for training the second convolutional network model on the corrected prediction information of the hand candidate regions and the sample images, wherein the second convolutional network model shares a feature extraction layer with the first convolutional network model and the parameters of the feature extraction layer are kept constant during training of the second convolutional network model.
Optionally, in any gesture control apparatus provided by an embodiment of the present invention, the display position determination module is configured to determine the display position of the business object to be shown corresponding to the detected gesture by means of the gesture and a pretrained third convolutional network model for detecting display positions of business objects from video images.
According to another aspect of embodiments of the present invention, a terminal device is provided. The terminal device includes a processor, a memory, a communication interface, and a communication bus, through which the processor, the memory, and the communication interface communicate with one another; the memory stores at least one executable instruction that causes the processor to perform operations corresponding to the gesture control method provided above.
According to yet another aspect of embodiments of the present invention, a computer-readable storage medium is provided, storing: executable instructions for performing gesture detection on a currently played video image; executable instructions for determining, when a detected gesture matches a predetermined gesture, the display position of a business object to be shown in the video image; and executable instructions for drawing the business object at the display position by means of computer graphics.
With the gesture control method, apparatus, and terminal device provided by the embodiments of the present invention, hand and gesture detection is performed on the currently played video image, the display position corresponding to the detected gesture is determined, and the business object to be shown is then drawn at that display position in the video image by means of computer graphics. When the business object is used to show an advertisement, compared with traditional video advertising: on the one hand, the business object is combined with video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, saving network resources and/or client system resources; on the other hand, the business object is closely combined with the gesture in the video image, preserving the main appearance and actions of the video subject (e.g., an anchor) while adding interest to the video image, without disturbing the user's normal viewing. This reduces viewer aversion to the business object shown in the video image, attracts the audience's attention to some extent, and increases the influence of the business object.
Brief description of the drawings
Fig. 1 is a flowchart of a gesture control method according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a method for obtaining a first convolutional network model and a second convolutional network model according to Embodiment 2 of the present invention;
Fig. 3 is a flowchart of a gesture control method according to Embodiment 3 of the present invention;
Fig. 4 is a flowchart of a gesture control method according to Embodiment 4 of the present invention;
Fig. 5 is a structural block diagram of a gesture control apparatus according to Embodiment 5 of the present invention;
Fig. 6 is a structural block diagram of a gesture control apparatus according to Embodiment 6 of the present invention;
Fig. 7 is a schematic structural diagram of a terminal device according to Embodiment 7 of the present invention.
Detailed description of the embodiments
Exemplary embodiments of the present invention are described in detail below with reference to the accompanying drawings.
Embodiment 1
Fig. 1 is a flowchart of a gesture control method according to Embodiment 1 of the present invention. The method is performed by a computer system that includes a gesture control apparatus.
Referring to Fig. 1, in step S110, gesture detection is performed on a currently played video image.
The video image may be a live video image being streamed, a video image in a recorded video, a video image in the course of being recorded, etc. The gesture may include waving, a scissors hand, a fist, a cupped hand, a closed or open palm, and so on.
In implementation, take live streaming as an example. There are currently many live-streaming platforms, such as Huajiao Live and YY Live; each platform contains many live rooms, each live room can contain at least one anchor, and the anchor streams live video images to the fans in the room through the camera of a terminal device (e.g., a mobile phone, tablet computer, or PC). The subject of such a video image is usually a single main person (the anchor) against a simple background, and the anchor usually occupies a large region of the video image. When a business object (such as an advertisement) needs to be inserted during a live stream, the video images of the current stream can be obtained as the video images to be processed.
Alternatively, the video image may be a video image in a short video whose recording has been completed. In that case, the user plays the short video on a terminal device, and during playback the terminal device obtains each frame of video image as a video image to be processed.
Alternatively, when the video image is a video image in the course of being recorded, the terminal device obtains each recorded frame as a video image to be processed during recording.
Further, the terminal device that plays the video image, or the terminal device used by the anchor, is provided with a mechanism for detecting hands in a video image and detecting gestures in the hand candidate region where a hand is located. Through this mechanism, each currently played frame (i.e., the video image to be processed) is examined to determine whether it contains the anchor's hand information; if it does, the video image is obtained for processing, and if not, the video image may be discarded without further processing and the next frame is obtained and processed as above. The hand information may include, but is not limited to, the states and positions of the fingers and the palm, and whether the hand is closed or open.
For a video image containing hand information (i.e., a human hand), the hand candidate region where the hand is located can be detected from the video image. The hand candidate region may be the minimum rectangular region in the video image that covers the whole hand, or a region of another shape (e.g., an ellipse). One feasible process is as follows: the terminal device obtains the currently played frame as the video image to be processed; through the preset mechanism, the image containing the hand candidate region is cropped from the video image; the image of the hand candidate region is then analyzed and features are extracted, yielding feature data for its parts (fingers, palm, etc.); and by analyzing this feature data it is determined which of the gestures — waving, a scissors hand, a fist, a cupped hand, a closed or open palm, and so on — the hand candidate region in the video image contains.
In addition, in order to subsequently determine the display position of the business object to be shown in the video image more quickly and accurately, the display position of the business object may be constrained by the hand position. The hand position may be the center of the hand candidate region, or a coordinate position determined from several edge positions of the rectangular or elliptical hand candidate region, etc. For example, after the region where the hand is located is determined in the video image, the hand candidate region can be analyzed to determine its center as the hand position: if the hand candidate region is rectangular, the diagonal of the rectangle can be computed and its midpoint taken as the hand position, giving a hand position based on the hand candidate region. Besides the center of the hand candidate region, several edge positions of the rectangular or elliptical region may also serve as the hand position; the processing is similar to that for the center and is not repeated here.
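As an illustration of the hand-position computation just described, the following minimal Python sketch derives a hand position from a rectangular hand candidate region; the (x, y, w, h) box format is an assumption made for the example.

```python
# Minimal sketch: hand position from a rectangular hand candidate
# region. The (x, y, w, h) box format is an assumed convention.

def hand_position(box):
    """Return the centre of a hand candidate region.

    box: (x, y, w, h) -- top-left corner plus width and height of the
    minimum rectangle covering the detected hand.
    """
    x, y, w, h = box
    # The midpoint of the rectangle's diagonal is its centre.
    return (x + w / 2.0, y + h / 2.0)

print(hand_position((320, 180, 96, 120)))  # -> (368.0, 240.0)
```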
In step S120, when it is detected that the gesture matches a predetermined gesture, the display position of the business object to be shown in the video image is determined.
The business object to be shown is an object created according to some business demand, such as an advertisement. The display position may be the center of a designated region in the video image, or coordinate positions of several edges of that designated region, etc.
In implementation, feature data of a variety of different gestures can be prestored, and the different gestures labeled accordingly to distinguish the meaning each gesture represents. Through the processing of step S110, the hand, the hand candidate region where the hand is located, and the gesture in the hand candidate region are detected from the video image to be processed; the detected hand gesture can then be compared with each prestored gesture, and if the prestored gestures include a gesture identical to the detected one, it is determined that the detected gesture matches the corresponding predetermined gesture.
To improve matching accuracy, the matching result can be determined by computation: a matching algorithm is set to compute the matching degree between any two gestures. For example, the feature data of the detected gesture and the feature data of each prestored gesture are matched to obtain a matching-degree value between the two. The matching-degree values between the detected gesture and each prestored gesture are computed in this way, and the maximum value is selected; if the maximum matching-degree value exceeds a predetermined matching threshold, it is determined that the prestored gesture corresponding to the maximum value matches the detected hand gesture. If the maximum matching-degree value does not exceed the threshold, matching fails — the detected hand gesture is not a predetermined gesture — and the processing of step S110 continues.
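A minimal sketch of this matching step follows, with cosine similarity standing in for the unspecified matching algorithm (an assumption; the patent does not fix the metric).

```python
import numpy as np

def match_gesture(detected_feat, templates, threshold=0.8):
    """Compare detected-gesture features against prestored templates.

    templates: dict mapping gesture label -> template feature vector.
    Returns the best-matching label, or None if even the maximum
    matching degree stays below the predetermined threshold.
    """
    best_label, best_score = None, -1.0
    for label, tmpl in templates.items():
        # Cosine similarity as the matching-degree value.
        score = float(np.dot(detected_feat, tmpl) /
                      (np.linalg.norm(detected_feat) * np.linalg.norm(tmpl)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label if best_score >= threshold else None
```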
Further, when it is determined that the detected gesture matches the corresponding predetermined gesture, the meaning represented by the matched hand gesture can first be determined, and a display position related to (or corresponding to) that meaning can be chosen from multiple preset display positions as the display position of the business object to be shown in the video image. In addition, where a hand position was determined in step S110, a display position related to both the meaning and the hand position can be chosen from the preset display positions. For example, in live streaming, when the anchor's cupped-hand gesture is detected, the region above the hand candidate region can be chosen as the related display position; when the anchor's waving gesture is detected, the palm region or the background behind it can be chosen.
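The gesture-to-position lookup might then be no more than a rule keyed by gesture meaning and offset by the hand position; the gesture names and pixel offsets below are hypothetical examples of the preset display positions the text describes.

```python
def display_position(gesture, hand_xy):
    """Pick a display position from the matched gesture's meaning and
    the hand position (gesture names and offsets are illustrative)."""
    x, y = hand_xy
    if gesture == "cupped_hand":
        return (x, y - 80)       # region above the hand candidate region
    if gesture == "wave":
        return (x + 120, y)      # palm region or background beside it
    return (x, y)                # fall back to the hand position itself
```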
In step S130, the business object is drawn at the display position by means of computer graphics.
For example, in live streaming, when the anchor's cupped-hand gesture is detected, the corresponding business object (e.g., a display advertisement with a given product logo) can be drawn by computer graphics in the region above the palm in the anchor's hand candidate region in the video image. If a fan is interested in the business object, the fan can click the region where it is located; the fan's terminal device obtains the network link corresponding to the business object and enters the page related to the business object through that link, where the fan can obtain the resources related to the business object.
The drawing of the business object can be realized by appropriate graphics drawing or rendering, including but not limited to drawing based on a graphics engine such as OpenGL, OpenCL, or Unity. OpenGL and OpenCL define cross-language, cross-platform professional graphics programming interface specifications that are hardware-independent and make it convenient to draw 2D or 3D graphics. With OpenGL, OpenCL, or Unity, not only 2D effects such as 2D stickers can be drawn, but also 3D effects and particle effects.
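As a simple stand-in for the OpenGL/Unity draw call, the following sketch composites a 2D sticker effect onto a frame at the chosen display position by plain alpha blending; NumPy here is an illustrative substitute, not the rendering path the patent names.

```python
import numpy as np

def draw_sticker(frame, sticker_rgba, top_left):
    """Alpha-blend a sticker onto a video frame.

    frame: H x W x 3 uint8 image; sticker_rgba: h x w x 4 uint8 image
    with an alpha channel; top_left: (x, y) display position. The
    sticker is assumed to lie fully inside the frame.
    """
    x, y = top_left
    h, w = sticker_rgba.shape[:2]
    roi = frame[y:y + h, x:x + w].astype(np.float32)
    rgb = sticker_rgba[..., :3].astype(np.float32)
    alpha = sticker_rgba[..., 3:4].astype(np.float32) / 255.0
    frame[y:y + h, x:x + w] = (alpha * rgb + (1 - alpha) * roi).astype(np.uint8)
```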
With the gesture control method provided by this embodiment of the present invention, hand and gesture detection is performed on the currently played video image, the display position corresponding to the detected gesture is determined, and the business object to be shown is drawn at that display position in the video image by computer graphics. When the business object is used to show an advertisement, compared with traditional video advertising: on the one hand, the business object is combined with video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, saving network resources and/or client system resources; on the other hand, the business object is closely combined with the gesture in the video image, preserving the main appearance and actions of the video subject (e.g., the anchor) while adding interest, without disturbing the user's normal viewing. This reduces viewer aversion to the business object shown in the video image, attracts the audience's attention to some extent, and increases the business object's influence.
Embodiment 2
Fig. 2 is a flowchart of a method for obtaining a first convolutional network model and a second convolutional network model according to Embodiment 2 of the present invention.
The processing of step S110 in Embodiment 1 — performing gesture detection on a currently played video image — can be realized with a suitable feature extraction algorithm, or with a neural network model such as a convolutional network model. This embodiment takes convolutional network models as the example for detecting the hand candidate region and the gesture in the video image; accordingly, a first convolutional network model for detecting hand candidate regions in images and a second convolutional network model for detecting gestures from hand candidate regions can be trained in advance. The gesture includes: waving, a scissors hand, a fist, a cupped hand, applause, an open palm, a closed palm, a thumbs-up, a finger-gun pose, a V sign, and an OK sign. The business object is a special effect containing semantic information, and includes a special effect containing advertising information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, or a particle effect.
The gesture control method of this embodiment can be performed by any device having data collection, processing, and transmission functions, including but not limited to mobile terminals and PCs; the embodiments of the present invention do not limit this.
Referring to Fig. 2, in step S210, the first convolutional network model is trained on sample images containing hand annotation information, and the first convolutional network model's prediction information of the hand candidate regions of the sample images is obtained.
The sample images containing hand annotation information may come from video images of an image-capture device, composed of successive images, or be a single frame or a single picture; they may also come from other devices, with the annotation performed on the sample images afterwards. Multiple candidate regions may be annotated in a sample image. This embodiment does not limit the source of, or the means of obtaining, the sample images containing hand annotation information. In the embodiments of the present invention, the hand candidate region is the same as the hand candidate region mentioned above.
The prediction information of a hand candidate region may include: position information of the hand region in the sample image, e.g., coordinate-point information or pixel information; completeness information of the hand in the hand region, e.g., whether the hand region contains a complete hand or only a finger; specific gesture information in the hand region, e.g., the gesture type; and so on. This embodiment does not limit the specific content of the prediction information of the hand candidate region.
In implementation, the higher an image's resolution, the larger its data volume: subsequent hand-candidate-region and gesture detection then require more computing resources and run more slowly. In view of this, in one specific implementation of the present invention, the sample images may be images satisfying a preset resolution condition, for example: the longest side of the image does not exceed 640 pixels and the shortest side does not exceed 480 pixels.
After the sample images are obtained, the information of the hand candidate region and the gesture can be annotated in each sample image (e.g., manually), yielding multiple sample images annotated with hand candidate regions. The annotated hand candidate region may be the minimum rectangular region, elliptical region, etc., in the image that covers the whole hand.
The first convolutional network model may include a first input layer, a first output layer, and multiple first convolutional layers: the first input layer receives the image, the multiple first convolutional layers detect the image to obtain the hand candidate region, and the first output layer outputs the hand candidate region. The network parameters of each layer and the number of first convolutional layers can be set manually or randomly, as determined by actual demand.
Specifically, the first convolutional network model processes a sample image with the multiple first convolutional layers, i.e., performs feature extraction on the sample image. When obtaining the hand candidate region in the sample image, the first convolutional network model receives the sample image through the first input layer, extracts the features of the sample image through the first convolutional layers, determines the hand candidate region in the sample image from the extracted features, and outputs the result through the first output layer.
The annotation information of the hand regions in the sample images is obtained and used as the training reference; the sample images are fed into the initial model of the first convolutional network model, and the model can be trained with gradient descent and the backpropagation algorithm to obtain the first convolutional network model. In training, the parameters of the first input layer, the first output layer, and the multiple first convolutional layers can first be trained, and the first convolutional network model is then built from the obtained parameters.
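A minimal PyTorch sketch of such a first network — a feature-extraction trunk plus a box-prediction head trained by gradient descent and backpropagation — is given below; the layer sizes and the per-cell (x, y, w, h) target encoding are assumptions, since the patent fixes neither the layer count nor the parameters.

```python
import torch
import torch.nn as nn

# Feature-extraction layers (later shared with the second network).
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)
box_head = nn.Conv2d(32, 4, 1)   # predicts (x, y, w, h) per output cell

first_model = nn.Sequential(feature_extractor, box_head)
optimizer = torch.optim.SGD(first_model.parameters(), lr=1e-3)  # gradient descent

def train_step(images, target_boxes):
    """One backpropagation step on annotated sample images.

    images: N x 3 x H x W float tensor; target_boxes: tensor with the
    same shape as the model output, encoding the annotated hand
    candidate regions (an assumed encoding).
    """
    optimizer.zero_grad()
    loss = nn.functional.smooth_l1_loss(first_model(images), target_boxes)
    loss.backward()   # back-propagation
    optimizer.step()
    return loss.item()
```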
Specifically, sample images containing hand annotation information can be used to train the first convolutional network model. To make the trained first convolutional network model more accurate, sample images covering a variety of cases can be selected, including sample images annotated with hand information as well as sample images not annotated with hand information.
Moreover, in this embodiment the first convolutional network model may be an RPN (Region Proposal Network); of course, this is only an example, and in practice the first convolutional network model is not limited to this — it may also be, for example, a MultiBox network or YOLO.
In step S220, the prediction information of the hand candidate regions is corrected.
In this embodiment, the prediction information of the hand candidate regions of the sample images obtained by training the first convolutional network model is a rough judgment result and may have a certain error rate. Moreover, in the subsequent step the prediction information of the hand candidate regions serves as an input for training the second convolutional network model; therefore, before the second convolutional network model is trained, the rough judgment result obtained by training the first convolutional network model is corrected.
The specific correction process may be manual correction, or may introduce other convolutional network models to filter out erroneous results, etc. The purpose of the correction is to improve the accuracy of training the second convolutional network model by ensuring that its input information is accurate. This embodiment does not limit the specific correction process.
In step S230, the second convolutional network model is trained on the corrected prediction information of the hand candidate regions and the sample images.
The second convolutional network model shares a feature extraction layer with the first convolutional network model, and the parameters of the feature extraction layer are kept constant during training of the second convolutional network model.
In implementation, the second convolutional network model may include a second input layer, a second output layer, multiple second convolutional layers, and multiple fully connected layers. The second convolutional layers are mainly used for feature extraction; the fully connected layers act as a classifier, classifying the features extracted by the second convolutional layers. When the second convolutional network model obtains the gesture detection result for a sample image, it receives the hand candidate region through the second input layer, extracts the features of the hand candidate region through the second convolutional layers, performs classification on those features with the fully connected layers — determining whether the sample image contains a hand and, if so, the hand candidate region and the gesture of the hand — and finally outputs the classification result through the second output layer.
Since both the first and the second convolutional network model contain convolutional layers, to facilitate model training and reduce computation, the network parameters of the feature extraction layers of the two models can be set to the same network parameters; that is, the second convolutional network model shares a feature extraction layer with the first convolutional network model, and the parameters of the feature extraction layer are kept constant during training of the second convolutional network model.
On this basis, in this embodiment, when training the second convolutional network model, the network parameters of the input layer and of the classification layer can be trained first, the network parameters of the feature extraction layer of the first convolutional network model are then taken as the network parameters of the feature extraction layer of the second convolutional network model, and the second convolutional network model is built from the network parameters of the input layer, the classification layer, and the feature extraction layer.
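Continuing the sketch above, sharing the feature extraction layer and keeping its parameters constant might look as follows in PyTorch (the 64×64 crop size and layer widths remain assumptions):

```python
NUM_GESTURES = 11  # wave, scissors hand, fist, ... (the list above)

# Classifier (fully connected) layers of the second network; the input
# size assumes 64x64 hand-region crops, i.e. 32-channel 16x16 features.
gesture_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 128), nn.ReLU(),
    nn.Linear(128, NUM_GESTURES),
)

# Shared feature extraction layer, parameters kept constant.
for p in feature_extractor.parameters():
    p.requires_grad = False

second_model = nn.Sequential(feature_extractor, gesture_head)
# Only the classifier layers are trained.
optimizer2 = torch.optim.SGD(gesture_head.parameters(), lr=1e-3)
```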
Specifically, the corrected prediction information of the hand candidate regions and the sample images can be used to train the second convolutional network model. To make the trained second convolutional network model more accurate, sample images covering a variety of cases can be selected, including sample images annotated with gestures as well as sample images not annotated with gestures.
Moreover, the sample images in this embodiment may be sample images satisfying the above resolution condition or another resolution condition.
In the gesture control method provided by this embodiment, two convolutional network models are trained separately: the first convolutional network model is trained on sample images containing hand annotation information, yielding the first model's prediction information of hand candidate regions for the sample images; the prediction information of the hand candidate regions is corrected; and the second convolutional network model is trained on the corrected prediction information and the sample images. The two models are related as follows: the first convolutional network model shares a feature extraction layer with the second convolutional network model, and the parameters of the feature extraction layer are kept constant during training of the second convolutional network model.
Since the prediction information of the hand candidate regions obtained by training the first convolutional network model is a rough judgment result with a possible error rate, before the second convolutional network model is trained the rough result is first corrected (e.g., manually, or by introducing other convolutional network models to filter out erroneous results), and the corrected prediction information of the hand candidate regions and the sample images then serve as the input of the second convolutional network model. Ensuring that the second model's input information is accurate improves the accuracy of training the second convolutional network model.
Furthermore, because the two models share a feature extraction layer whose parameters are kept constant during training of the second model, the feature extraction layer of the second convolutional network model can directly use that of the first, which makes training the second convolutional network model convenient and reduces the computation needed to train it.
In this embodiment, the trained first and second convolutional network models facilitate subsequent hand and gesture detection on currently played video images and determination of the display position corresponding to the detected gesture, so that the business object to be shown can be drawn by computer graphics at that display position in the video image. Thus, when the business object is used to show an advertisement, compared with traditional video advertising, the business object is combined with video playback — saving network resources and/or client system resources, since no additional advertising video data unrelated to the video is transmitted over the network — and it is closely combined with the gesture in the video image, preserving the subject's (e.g., the anchor's) main appearance and actions while adding interest and without disturbing normal viewing, thereby reducing viewer aversion, attracting the audience's attention to some extent, and increasing the business object's influence.
Embodiment 3
Fig. 3 is a flowchart of a gesture control method according to Embodiment 3 of the present invention. In this embodiment, the video image is a live-streaming video image, and the business object is a special effect containing semantic information, which may specifically include a special effect containing advertising information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, a particle effect, etc.
In step S310, a currently played video image is obtained.
For the content of step S310, refer to the related content of step S110 in Embodiment 1, which is not repeated here.
In this embodiment, the hand candidate region corresponding to the hand information can be determined from the video image by the pretrained convolutional network models, and the gesture of the hand detected in the hand candidate region; see steps S320-S330 below for the corresponding processing.
In step S320, the video image is detected using the pretrained first convolutional network model, and the first feature information of the video image and the prediction information of the hand candidate region are obtained.
The first feature information includes hand feature information. The first convolutional network model can be used to detect whether each of the multiple candidate regions into which the image is divided is a hand candidate region.
In implementation, the obtained video image containing hand information can be input into the first convolutional network model trained in Embodiment 2; through the network parameters of the first convolutional network model, the video image undergoes processing such as feature extraction, mapping, and transformation, so as to perform hand-candidate-region detection on the video image and obtain the hand candidate region it contains. For the prediction information of the hand candidate region, refer to the introduction and explanation in the above embodiment, which is not repeated here.
In step S330, the first feature information and the prediction information of the hand candidate region are used as the second feature information of the pretrained second convolutional network model, and gesture detection is performed on the video image according to the second feature information using the second convolutional network model to obtain the gesture detection result of the video image.
The second convolutional network model shares a feature extraction layer with the first convolutional network model. The gesture includes at least one of: waving, a scissors hand, a fist, a cupped hand, applause, an open palm, a closed palm, a thumbs-up, a finger-gun pose, a V sign, and an OK sign.
For the processing of step S330, refer to the related content in the above embodiments, which is not repeated here.
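Chained together at inference time, the two networks of steps S320-S330 might be applied as in the sketch below (continuing the earlier PyTorch example; the crop-and-resize step is an assumed detail):

```python
import torch.nn.functional as F

def classify_region(frame_tensor, box):
    """Classify the gesture inside one hand candidate region.

    frame_tensor: 1 x 3 x H x W float tensor of the played frame;
    box: (x, y, w, h) candidate region predicted by the first network.
    """
    x, y, w, h = box
    crop = frame_tensor[:, :, y:y + h, x:x + w]
    crop = F.interpolate(crop, size=(64, 64))    # match the training crop size
    with torch.no_grad():
        return second_model(crop).argmax(dim=1)  # predicted gesture label
```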
In step S340, when it is detected that the gesture matches a predetermined gesture, the feature points of the hand in the hand candidate region corresponding to the detected gesture are extracted.
In implementation, every video image containing hand information contains certain feature points of the hand, such as finger, palm, and hand-contour feature points. Detecting the hand in the video image and determining its feature points can be realized in any appropriate manner in the related art, which the embodiments of the present invention do not limit: for example, linear feature extraction such as PCA (principal component analysis), LDA (linear discriminant analysis), or ICA (independent component analysis); nonlinear feature extraction such as kernel PCA or manifold learning; or a trained neural network model, such as the convolutional network models in the embodiments of the present invention, can be used to extract the feature points of the hand.
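For the convolutional-network option in that list, a minimal keypoint-regression head over the shared feature extractor might look like this; K = 21 points is an assumption borrowed from common hand-keypoint setups, not from the patent.

```python
K = 21  # assumed number of hand feature points

keypoint_head = nn.Sequential(
    nn.Flatten(),
    nn.Linear(32 * 16 * 16, 2 * K),   # (x, y) for every feature point
)

def hand_keypoints(crop):
    """crop: N x 3 x 64 x 64 tensor of hand candidate regions."""
    with torch.no_grad():
        feats = feature_extractor(crop)
        return keypoint_head(feats).view(-1, K, 2)
```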
Taking live streaming as an example: during a live stream, the hand is detected from the live video image and its feature points are determined; likewise, during the playback of a recorded video, the hand is detected from the played video image and its feature points determined; and during the recording of a video, the hand is detected from the recorded video image and its feature points determined.
In step S350, the display position of the business object to be shown in the video image is determined according to the feature points of the hand.
In implementation, after the feature points of the hand are determined, one or more display positions of the business object to be shown in the video image can be determined with the feature points of the hand as the basis.
In this embodiment, when determining the display position of the business object to be shown in the video image according to the feature points of the hand, feasible implementations include:
Mode 1: according to the feature points of the hand, determine the display position in the video image of the business object to be shown corresponding to the hand position, using a pretrained third convolutional network model for detecting display positions of business objects from video images. Mode 2: according to the feature points of the hand and the type of the business object to be shown, determine the display position in the video image of the business object to be shown corresponding to the detected gesture.
The two modes are described in detail below.
Mode 1
When using Mode 1 to determine the display position of the business object to be shown in the video image, a convolutional network model (the third convolutional network model) needs to be pretrained so that the trained model has the function of determining the display position of the business object in the video image; alternatively, a convolutional network model already trained by a third party with this function can be used directly.
It should be noted that this embodiment focuses on training for the business object, but those skilled in the art should understand that the third convolutional network model can also be trained for the hand at the same time as for the business object, realizing joint training of the hand and the business object.
When the third convolutional network model needs to be pretrained, one feasible training method includes the following process:
(1) characteristic vector of business object sample image to be trained is obtained.
Wherein, the positional information and/or confidence of the business object in business object sample image are included in characteristic vector Spend information.When the confidence information of business object indicates business object and is illustrated in current location, the effect that can reach is (such as quilt Pay close attention to or be clicked or watched) probability, the probability can be according to the setting of the statistic analysis result of historical data, can also Set, can also be set according to artificial experience according to the result of emulation experiment.In actual applications, can be according to actual need Will, only the positional information of business object is trained, only the confidence information of business object can also be trained, may be used also To be trained to the two.The two is trained, enables to the 3rd convolutional network model after training more effective The positional information and confidence information of business object are accurately determined, to provide foundation for the displaying of business object.
In the embodiment of the present invention, the third convolutional network model is trained on a large number of sample images; specifically, business object sample images containing business objects are used. Those skilled in the art should understand that, besides the business object, the sample images used for training should also contain hand information. In addition, the business objects in the business object sample images of the embodiment may be annotated in advance with position information, confidence information, or both. Of course, in practice this information may also be obtained through other channels. Annotating the business object with the corresponding information in advance effectively reduces the data volume and the number of interactions in data processing, improving data-processing efficiency.

Taking the business object sample images carrying the position information and/or confidence information of the business object as training samples, feature vector extraction is performed on them to obtain feature vectors containing the position information and/or confidence information of the business object.
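For illustration only, a minimal Python sketch of what such an annotated training sample might carry; all names and values here are hypothetical and not taken from the patent:

```python
from dataclasses import dataclass

@dataclass
class BusinessObjectSample:
    """One annotated training sample: an image plus business-object labels."""
    image_path: str
    center_xy: tuple     # labeled position of the business object's center point
    confidence: float    # labeled probability that this position is effective

# A hypothetical annotated sample: a sticker centered near the hand, with a
# confidence estimated, e.g., from statistical analysis of historical data.
sample = BusinessObjectSample("frame_0001.png", center_xy=(412, 305), confidence=0.83)
```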
Optionally, the hand and the business object may be trained simultaneously with the third convolutional network model; in this case, the feature vector of the business object sample image should also contain hand features.

The extraction of feature vectors can be realized in any appropriate way known in the related art, which is not repeated in the embodiment of the present invention.
(2) Perform convolution processing on the feature vector to obtain a feature vector convolution result.

In implementation, the obtained feature vector convolution result contains the position information and/or confidence information of the business object. When the hand and the business object are jointly trained, the convolution result also contains hand information.

The number of convolution operations applied to the feature vector can be set as needed; that is, the number of convolutional layers in the third convolutional network model is configured according to actual needs, which is not detailed here.

The convolution result is the outcome of feature extraction on the feature vector, and it can effectively characterize the features of the hand in the video image.

In the embodiment of the present invention, when the feature vector contains both the position information and the confidence information of the business object, that is, when both are trained, this feature vector convolution result is shared in the subsequent convergence-condition judgments for each of them. No reprocessing or recomputation is required, which reduces the resource cost of data processing and improves data-processing speed and efficiency.
(3) Judge whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the convergence condition.

The convergence condition is set appropriately by those skilled in the art according to actual requirements. When the information satisfies the convergence condition, the network parameters of the third convolutional network model can be considered appropriately set; when it does not, the network parameters can be considered inappropriately set and must be adjusted. The adjustment is an iterative process that continues until the result of convolving the feature vector with the adjusted network parameters satisfies the convergence condition.

In one feasible mode, the convergence condition can be set according to a preset standard position and/or a preset standard confidence. For example, whether the distance between the position indicated by the position information of the business object in the feature vector convolution result and the preset standard position is within a certain threshold can serve as the convergence condition for the position information; whether the difference between the confidence indicated by the confidence information of the business object in the convolution result and the preset standard confidence is within a certain threshold can serve as the convergence condition for the confidence information.

Preferably, the preset standard position can be the mean position obtained by averaging the positions of the business objects in the business object sample images to be trained, and the preset standard confidence can be the mean confidence obtained by averaging the confidences of the business objects in those sample images. Because the sample images are the data to be trained and their volume is huge, setting the standard position and/or standard confidence from the positions and/or confidences of the business objects in the sample images to be trained makes the resulting standards objective and accurate.
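A minimal sketch of deriving the preset standard position and standard confidence by averaging, assuming the labels have been collected into arrays (the arrays here are hypothetical):

```python
import numpy as np

# positions: (N, 2) labeled center points across all training samples;
# confidences: (N,) labeled confidence values. Both arrays are illustrative.
positions = np.array([[410, 300], [420, 310], [405, 298]], dtype=np.float32)
confidences = np.array([0.80, 0.85, 0.78], dtype=np.float32)

standard_position = positions.mean(axis=0)   # preset standard position
standard_confidence = confidences.mean()     # preset standard confidence
```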
When specifically judging whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the convergence condition, one feasible approach includes:

obtaining the position information of the corresponding business object in the feature vector convolution result, calculating the Euclidean distance between the position indicated by that position information and the preset standard position to obtain a first distance, and judging according to the first distance whether the position information of the corresponding business object satisfies the convergence condition;

and/or

obtaining the confidence information of the corresponding business object in the feature vector convolution result, calculating the Euclidean distance between the confidence indicated by that confidence information and the preset standard confidence to obtain a third distance, and judging according to the third distance whether the confidence information of the corresponding business object satisfies the convergence condition. Using the Euclidean distance is simple to implement and effectively indicates whether the convergence condition is satisfied, but the approach is not limited to it; other measures, such as the Mahalanobis distance or the Bhattacharyya distance, are equally applicable.

Preferably, as stated above, the preset standard position is the mean position obtained by averaging the positions of the business objects in the business object sample images to be trained; and/or the preset standard confidence is the mean confidence obtained by averaging the confidences of the business objects in those sample images.
(4) If the convergence condition is satisfied, the training of the convolutional network model is completed; if not, the network parameters of the third convolutional network model are adjusted according to the position information and/or confidence information of the corresponding business object in the feature vector convolution result, and the third convolutional network model is trained iteratively with the adjusted network parameters until the position information and/or confidence information of the business object after the iterative training satisfies the convergence condition.
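A minimal sketch of this convergence check and the surrounding iteration, using the first and third distances described above; the thresholds and all names are hypothetical:

```python
import numpy as np

def converged(pred_pos, pred_conf, std_pos, std_conf,
              pos_threshold=5.0, conf_threshold=0.05):
    # First distance: Euclidean distance between the predicted position
    # and the preset standard position.
    first_distance = np.linalg.norm(np.asarray(pred_pos) - np.asarray(std_pos))
    # Third distance: gap between the predicted and standard confidence.
    third_distance = abs(pred_conf - std_conf)
    return first_distance <= pos_threshold and third_distance <= conf_threshold

# Iterative training (schematic): adjust the third network's parameters
# until the convolution result satisfies both convergence conditions.
# while not converged(output.position, output.confidence,
#                     standard_position, standard_confidence):
#     optimizer.step()   # adjust network parameters, then re-evaluate
```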
Through the above training, the third convolutional network model can perform feature extraction and classification on the display positions of business objects displayed on the basis of the hand, and thus acquires the function of determining the display position of a business object in a video image. When there are multiple candidate display positions, the training on business-object confidence also enables the third convolutional network model to rank the display quality of the candidate positions and thereby determine the optimal display position. In subsequent applications, when a business object needs to be displayed, an effective display position can be determined from the current image of the video.
In addition, before the above training of the third convolutional network model, the business object sample images can be preprocessed in advance. The preprocessing includes: obtaining multiple business object sample images, each containing annotation information of a business object; determining the position of the business object according to the annotation information, and judging whether the distance between the determined position and a preset position is less than or equal to a set threshold; and taking the business object sample images whose business objects satisfy this condition as the business object sample images to be trained. The preset position and the set threshold can be chosen by those skilled in the art in any appropriate way, for example from a statistical analysis of data, from a relevant distance formula, or from human experience; the embodiment of the present invention imposes no limitation on this.

Preprocessing the business object sample images in advance filters out ineligible sample images, ensuring the accuracy of the training result.
Through the above process, the training of the third convolutional network model is realized, and the trained third convolutional network model can be used to determine the display position of the business object in the video image. For example, during a live broadcast, when the anchor clicks a business-object instruction to display a business object, after the third convolutional network model obtains the feature points of the anchor's hand in the live video image, it can indicate the optimal position for displaying the business object, such as the anchor's forehead, and the live application is then controlled to display the business object at that position. Alternatively, during the live broadcast, when the anchor clicks the business-object instruction to display a business object, the third convolutional network model can determine the display position of the business object directly from the live video image.
Mode two
According to the feature points of the hand and the type of the business object to be shown, the display position in the video image of the business object to be shown corresponding to the hand position is determined.

In implementation, after the feature points of the hand are obtained, the display position of the business object to be shown can be determined according to set rules. The display position of the business object to be shown includes at least one of the following: the palm area of the person in the video image, the area above the palm, the area below the palm, the background area of the palm, the body area other than the hand, the background area of the video image, an area within a set range centered on the region where the hand is located, and a preset area in the video image.

After the display position is determined, the concrete placement of the business object to be shown in the video image can be determined further. For example, the business object can be displayed with the center point of the display region corresponding to the display position as its display center; or a certain coordinate position in that display region can be taken as the center point of the display position, as sketched below. The embodiment of the present invention imposes no limitation on this.
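A minimal sketch of taking the center point of a display region as the business object's display center; the region coordinates are hypothetical:

```python
def display_center(region):
    """Return the center of a candidate display region given as
    (left, top, right, bottom) pixel coordinates."""
    left, top, right, bottom = region
    return ((left + right) / 2, (top + bottom) / 2)

# e.g. a region above the detected palm (illustrative coordinates)
palm_upper_region = (380, 120, 520, 220)
anchor_point = display_center(palm_upper_region)  # sticker is drawn around this point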
In a preferred embodiment, when determining the display position of the business object to be shown in the video image, the type of the business object to be shown is considered in addition to the feature points of the hand. The type of the business object includes at least one of: forehead sticker type, cheek sticker type, chin sticker type, virtual hat type, virtual clothing type, virtual makeup type, virtual headwear type, virtual hair accessory type, and virtual jewelry type; in addition, it may also include a virtual bottle-cap type, a virtual cup type, a text type, and so on.

Furthermore, with the feature points of the hand and the hand position as references, an appropriate display position can also be selected for the business object according to its type.
In addition, when, according to the feature points of the hand and the type of the business object to be shown, multiple display positions of the business object in the video image are obtained, at least one display position can be selected from them. For example, a text-type business object can be displayed in the background area, in the palm area of the person, in the area above the hand, and so on.

Furthermore, correspondences between gestures and display positions can be stored in advance. When the detected gesture is determined to match the corresponding predetermined gesture, the target display position corresponding to the predetermined gesture can be obtained from the stored correspondences and used as the display position in the video image of the business object to be shown. It should be noted that, although such correspondences between gestures and display positions exist, the relationship is not a necessary one: a gesture is merely one way of triggering the display of a business object, and a display position is not necessarily tied to the human hand. That is, a business object may be presented in some region of the hand, but it may also be displayed in other regions outside the hand, such as the background area of the video image. Moreover, the same gesture can trigger the display of different business objects. For example, if the anchor performs a waving gesture twice in a row, the first wave may display a two-dimensional sticker effect and the second a three-dimensional effect; the advertising content carried by the two effects may be the same or different. A minimal lookup of this kind is sketched below.
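This sketch assumes a pre-stored correspondence table and a repetition counter for the same gesture; all table entries and names are hypothetical:

```python
# Pre-stored correspondence between predetermined gestures and target
# display positions (illustrative entries only).
GESTURE_TO_POSITION = {
    "wave": "background_region",
    "ok_hand": "palm_upper_region",
}

def resolve_display(gesture, wave_count):
    position = GESTURE_TO_POSITION.get(gesture)
    if position is None:
        return None, None
    # The same gesture may trigger different effects on repetition,
    # e.g. a 2D sticker on the first wave and a 3D effect on the second.
    effect = "3d_effect" if gesture == "wave" and wave_count >= 2 else "2d_sticker"
    return position, effect
```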
In step S360, the business object to be shown is drawn at the display position using computer graphics.

When the business object is a two-dimensional sticker effect carrying semantic information, the sticker can be used for advertisement placement and display. Before the business object is drawn, its related information, such as its identifier and size, can first be obtained. After the display position is determined, the business object can be scaled and rotated according to the coordinates of the display position, and then drawn by a corresponding drawing method, such as OpenGL, as sketched below. In some cases, the advertisement can also be displayed as a three-dimensional effect, for example showing the text or logo of the advertisement through a particle effect. For instance, the name of a product can be displayed through a two-dimensional sticker effect of the virtual bottle-cap type, attracting viewers and improving the efficiency of advertisement placement and display.
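A minimal sketch of the scaling and rotation adjustment, computed as a 2D homogeneous transform that a renderer (for example an OpenGL pipeline) could apply to a sticker quad centered at the origin; the concrete values are hypothetical:

```python
import numpy as np

def sticker_transform(center_xy, scale, angle_rad):
    """3x3 homogeneous transform that scales and rotates a sticker quad
    and translates it to the display position."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    rotate_scale = np.array([[c * scale, -s * scale, 0.0],
                             [s * scale,  c * scale, 0.0],
                             [0.0,        0.0,       1.0]])
    translate = np.eye(3)
    translate[:2, 2] = center_xy
    return translate @ rotate_scale

# The resulting matrix would be uploaded (e.g. as a shader uniform) before
# the draw call that renders the sticker at the display position.
M = sticker_transform(center_xy=(412, 305), scale=1.2, angle_rad=0.1)
```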
In the gesture control method provided by the embodiment of the present invention, the display of the business object is triggered by a gesture. When the business object is used to display an advertisement, compared with traditional video advertising, on the one hand the business object is combined with the video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, saving network resources and/or the system resources of the client. On the other hand, the business object is tightly combined with the gesture in the video image: the main image and actions of the video subject (such as the anchor) are retained while interest is added to the video image, and the user's normal viewing of the video is not disturbed. This reduces the user's aversion to business objects displayed in video images, can attract the viewers' attention to a certain extent, and improves the influence of the business object.
Embodiment Four
Fig. 4 is a flowchart of a gesture control method according to Embodiment Four of the present invention.

In this embodiment, the gesture control scheme of the embodiment of the present invention is illustrated with the business object being a special effect carrying semantic information, namely a special effect containing advertising information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, or a particle effect, and specifically a two-dimensional sticker effect.
The gesture control method of this embodiment comprises the following steps:

In step S401, the first convolutional network model is trained on sample images containing human-hand annotation information, and the first convolutional network model's prediction information for the human-hand candidate regions of the sample images is obtained.

In step S402, the prediction information of the human-hand candidate regions is corrected.

In step S403, the second convolutional network model is trained on the corrected prediction information of the human-hand candidate regions and the sample images.

The second convolutional network model shares a feature-extraction layer with the first convolutional network model, and the parameters of the feature-extraction layer are kept unchanged during the training of the second convolutional network model.
For the content of steps S401 to S403, refer to the related content in the above embodiments, which is not repeated here. A minimal sketch of the shared-layer freezing in step S403 follows.
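A minimal PyTorch-style sketch of keeping the shared feature-extraction layer's parameters unchanged while the second network is trained; the layer layout is hypothetical, since the patent does not fix an architecture:

```python
import torch.nn as nn

# Feature-extraction layers shared by the first and second convolutional
# network models (illustrative module layout).
shared_features = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
)

# While training the second network, keep the shared feature-extraction
# parameters unchanged, as step S403 requires.
for p in shared_features.parameters():
    p.requires_grad = False
```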
In step S404, the feature vector of the business object sample image to be trained is obtained.

The feature vector contains the position information and/or confidence information of the business object in the business object sample image, as well as the feature vector corresponding to the gesture. The business object sample images to be trained can be the above-mentioned sample images containing human-hand annotation information.

In implementation, some sample images among the business object sample images do not meet the training standard of the third convolutional network model, and this portion of the sample images must be filtered out by preprocessing the business object sample images.

First, in this embodiment, each business object sample image contains a business object, and each business object is annotated with position information and confidence information. In one feasible implementation, the position information of the center point of the business object is taken as the position information of the business object. In this step, the sample images are filtered according to the position information of the business object alone. The coordinates of the position indicated by the position information are obtained and compared with the preset position coordinates of a business object of that type, and the position variance of the two is calculated. If the position variance is less than or equal to a set threshold, the business object sample image can serve as a sample image to be trained; if the position variance is greater than the set threshold, the business object sample image is filtered out. The preset position coordinates and the set threshold can be set appropriately by those skilled in the art according to actual conditions. For example, because the images used to train the third convolutional network model generally have the same size, the threshold can be set to 1/20 to 1/5 of the image's length or width, preferably 1/10. A minimal sketch of this filter follows.
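This sketch reads the position variance as a distance between the labeled and preset positions and uses the preferred threshold of 1/10 of the image side; the function name is hypothetical:

```python
def keep_sample(labeled_xy, preset_xy, image_width):
    """Keep a business-object sample only if its labeled position is close
    enough to the preset position (threshold: 1/10 of the image side)."""
    threshold = image_width / 10.0
    dx = labeled_xy[0] - preset_xy[0]
    dy = labeled_xy[1] - preset_xy[1]
    return (dx * dx + dy * dy) ** 0.5 <= threshold
```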
Furthermore, the positions and confidences of the business objects in the determined business object sample images to be trained can be averaged to obtain a mean position and a mean confidence, which can serve as the basis for the subsequently determined convergence condition.

Taking a two-dimensional sticker effect as an example of the business object, the business object sample images used for training in this embodiment need to be annotated with the coordinates of the optimal advertising position and the confidence of that advertising position. The optimal advertising position can be annotated at the hand, the foreground, the background, and other places, so joint training of the advertising positions at the hand feature points, the foreground, the background, and so on can be realized; compared with a scheme trained separately on a single hand-only technique, this helps save computing resources. The magnitude of the confidence indicates the probability that the advertising position is optimal; for example, if an advertising position is frequently occluded, its confidence is low.
In step S405, convolution processing is performed on the feature vector to obtain a feature vector convolution result.

In step S406, it is judged whether the position information and/or confidence information of the corresponding business object in the feature vector convolution result satisfies the convergence condition.

In step S407, if satisfied, the training of the third convolutional network model is completed; if not, the network parameters of the third convolutional network model are adjusted according to the position information and/or confidence information of the corresponding business object in the feature vector convolution result, and the third convolutional network model is trained iteratively with the adjusted network parameters until the position information and/or confidence information of the business object after the iterative training satisfies the convergence condition.

For the specific processing of steps S404 to S407, refer to the related content in the above embodiments, which is not repeated here.
Through the processing of steps S404 to S407, a trained third convolutional network model can be obtained. The structure of the third convolutional network model can refer to the structure of the first or second convolutional network model in Embodiment Two above, which is not repeated here.

The first, second, and third convolutional network models obtained by the above training can process video images accordingly, which may specifically include the following steps S408 to S413.
In step S408, the currently played video image is obtained.

In step S409, the video image is detected with the pre-trained first convolutional network to obtain the first feature information of the video image and the prediction information of the human-hand candidate regions.

In step S410, the first feature information and the prediction information of the human-hand candidate regions are taken as the second feature information of the pre-trained second convolutional network model, and the second convolutional network model performs gesture detection on the video image according to the second feature information to obtain the gesture detection result of the video image.

When the detection of human-hand candidate regions determines that the video image contains a human hand, the gesture in the human-hand candidate region can be determined in the form of a probability. For example, taking a palm-open gesture and a palm-closed gesture as examples: when the probability of the palm-open gesture is high, the video image can be considered to contain a hand with an open palm; when the probability of the palm-closed gesture is high, the video image can be considered to contain a hand with a closed palm.

Accordingly, in an optional implementation of the application, the output of the second convolutional network model can include: the probability that a human-hand candidate region contains no human hand, the probability that it contains a hand with a palm-open gesture, the probability that it contains a hand with a palm-closed gesture, and so on.

To improve detection speed, when the parameters of the first convolutional layers are consistent with those of the second convolutional layers, the second convolutional network model can, when obtaining the gesture detection result for the video image from the human-hand candidate regions and the features of the various predetermined gestures, directly take the first features of the video image extracted by the multiple first convolutional layers as the second features of the human-hand candidate regions extracted by the multiple second convolutional layers, and then classify the human-hand candidate regions through multiple fully connected layers according to these second features to obtain the gesture detection result for the video image. This greatly saves computation and improves detection speed; a minimal sketch follows.
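A minimal PyTorch-style sketch of this reuse: the first network's feature maps feed the fully connected classification head directly instead of being recomputed, and the head outputs the class probabilities described above. The layer sizes are hypothetical:

```python
import torch
import torch.nn as nn

# Fully connected classification head over shared feature maps
# (assumed feature shape: N x 32 x 8 x 8; purely illustrative).
classifier = nn.Sequential(nn.Flatten(), nn.Linear(32 * 8 * 8, 3))

def detect_gesture(first_net_features):
    # first_net_features: feature maps already extracted by the first
    # network's convolutional layers; no second feature extraction is run.
    logits = classifier(first_net_features)
    # Output probabilities: [no hand, palm-open hand, palm-closed hand]
    return torch.softmax(logits, dim=-1)
```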
In step S411, when the detected gesture is determined to match the corresponding predetermined gesture, the feature points of the hand in the human-hand candidate region corresponding to the detected gesture are extracted.

In step S412, according to the feature points of the hand, the pre-trained third convolutional network model for determining the display position of the business object in the video image is used to determine, in the video image, the display position of the business object to be shown corresponding to the hand position.

In step S413, the business object to be shown is drawn at the display position using computer graphics. The overall pipeline of steps S408 to S413 is sketched below.
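A structural sketch of steps S408 to S413 as a single per-frame pipeline; every function and method name here is hypothetical and merely stands in for the components described above:

```python
def process_frame(frame, first_net, second_net, third_net, renderer):
    """Schematic per-frame pipeline for steps S408-S413 (names hypothetical)."""
    feats, candidates = first_net(frame)        # S409: features + hand candidates
    gesture = second_net(feats, candidates)     # S410: gesture detection
    if gesture.matches_predetermined():         # S411: match against predetermined gesture
        keypoints = gesture.hand_keypoints()    #        extract hand feature points
        position = third_net(keypoints, frame)  # S412: determine display position
        renderer.draw_sticker(frame, position)  # S413: draw the business object
    return frame
```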
With the rise of internet live streaming and short-video sharing, more and more videos appear in the form of live streams or short videos. Such videos usually feature one person (or a small number of people) as the protagonist against a simple background, and viewers mainly watch them on mobile terminals such as mobile phones. In this setting, for the placement of some business objects (such as advertisements), on the one hand, because the screen display area of a mobile terminal is limited, an advertisement placed at a traditional fixed position often occupies the main user-experience region and easily causes user aversion. On the other hand, for anchor-style live applications, because of the immediacy of live streaming, inserting an advertisement of a traditional fixed duration noticeably interrupts the continuity of the exchange between the user and the anchor and affects the viewing experience. Furthermore, for short-video advertising, because the content duration of a live stream or short video is inherently short, inserting an advertisement of fixed duration in the traditional way is also difficult. With the scheme provided by this embodiment, the video images during video playback can be detected in real time, the most effective advertising position can be given, and the user's viewing experience is not affected, so the placement effect is better. By combining the business object with the video playback, no additional advertising video data unrelated to the video needs to be transmitted over the network, saving network resources and/or the system resources of the client. Moreover, the business object is tightly combined with the gesture in the video image: the main image and actions of the video subject (such as the anchor) are retained while interest is added to the video image, and the user's normal viewing is not disturbed. This reduces the user's aversion to business objects displayed in video images, can attract the viewers' attention to a certain extent, and improves the influence of the business object.
Embodiment Five

Based on the same technical concept, Fig. 5 is a block diagram of a gesture control apparatus according to Embodiment Five of the present invention. Referring to Fig. 5, the apparatus includes a gesture detection module 501, a display position determination module 502, and a business object drawing module 503.

The gesture detection module 501 is configured to perform gesture detection on the currently played video image.

The display position determination module 502 is configured to determine, when the detected gesture matches a predetermined gesture, the display position of the business object to be shown in the video image.

The business object drawing module 503 is configured to draw the business object at the display position using computer graphics.

In the gesture control apparatus provided by this embodiment, human-hand candidate region detection and gesture detection are performed on the currently played video image containing hand information, and the detected gesture is matched against the corresponding predetermined gesture. When the two match, the display position of the business object to be shown in the video image is determined from the hand position. When the business object is used to display an advertisement, compared with traditional video advertising, on the one hand the business object is combined with the video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, saving network resources and/or the system resources of the client; on the other hand, the business object is tightly combined with the gesture in the video image, retaining the main image and actions of the video subject (such as the anchor) while adding interest to the video image, without disturbing the user's normal viewing. This reduces the user's aversion to business objects displayed in video images, can attract the viewers' attention to a certain extent, and improves the influence of the business object.
Embodiment Six

Based on the same technical concept, refer to the logic diagram of the gesture control apparatus in Fig. 6.

The gesture control apparatus of this embodiment includes: a gesture detection module 501, configured to perform gesture detection on the currently played video image; a display position determination module 502, configured to determine, when the detected gesture matches a predetermined gesture, the display position of the business object to be shown in the video image; and a business object drawing module 503, configured to draw the business object at the display position using computer graphics.

Optionally, the display position determination module 502 includes: a feature point extraction unit, configured to extract the feature points of the hand in the human-hand candidate region corresponding to the detected gesture; and a display position determination unit, configured to determine, according to the feature points of the hand, the display position in the video image of the business object to be shown corresponding to the detected gesture.

Optionally, the display position determination unit is configured to determine, according to the feature points of the hand and the type of the business object to be shown, the display position in the video image of the business object to be shown corresponding to the detected gesture.

Optionally, the display position determination unit is configured to determine, according to the feature points of the hand and the type of the business object to be shown, multiple display positions in the video image of the business object to be shown corresponding to the detected gesture, and to select at least one display position from the multiple display positions.

Optionally, the display position determination module 502 is configured to, when the detected gesture is determined to match the corresponding predetermined gesture, take the display position in the video image of the business object to be shown corresponding to the predetermined gesture as the display position in the video image of the business object to be shown corresponding to the detected gesture.

Optionally, the display position determination module 502 is configured to obtain, from the pre-stored correspondences between gestures and display positions, the target display position corresponding to the predetermined gesture as the display position in the video image of the business object to be shown corresponding to the detected gesture.
Optionally, the business object is a special effect carrying semantic information, and the video image is a live-streaming video image.

Optionally, the business object includes a special effect containing advertising information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, or a particle effect.

Optionally, the display position includes at least one of: the hair area, forehead area, cheek area, chin area, or body area other than the head of the person in the video image; the background area in the video image; an area within a set range centered on the region where the hand is located in the video image; and a preset area in the video image.

Optionally, the type of the business object includes at least one of: forehead sticker type, cheek sticker type, chin sticker type, virtual hat type, virtual clothing type, virtual makeup type, virtual headwear type, virtual hair accessory type, and virtual jewelry type.

Optionally, the gesture includes at least one of: waving, scissors hand, fist, palm-up hand, applause, palm open, palm closed, thumbs-up, finger-gun pose, V sign, and OK sign.
Optionally, the gesture detection module 501 is configured to detect the video image with the pre-trained first convolutional network to obtain the first feature information of the video image and the prediction information of the human-hand candidate regions, where the first feature information includes hand feature information; and to take the first feature information and the prediction information of the human-hand candidate regions as the second feature information of the pre-trained second convolutional network model, the second convolutional network model performing gesture detection on the video image according to the second feature information to obtain the gesture detection result of the video image; where the second convolutional network model shares a feature-extraction layer with the first convolutional network model.

Optionally, the apparatus further includes: a human-hand region determination module 504, configured to train the first convolutional network model on sample images containing human-hand annotation information and obtain the first convolutional network model's prediction information for the human-hand candidate regions of the sample images; a correction module 505, configured to correct the prediction information of the human-hand candidate regions; and a convolution model training module 506, configured to train the second convolutional network model on the corrected prediction information of the human-hand candidate regions and the sample images, where the second convolutional network model shares a feature-extraction layer with the first convolutional network model and the parameters of the feature-extraction layer are kept unchanged during the training of the second convolutional network model.

Optionally, the display position determination module 502 is configured to determine the display position of the business object to be shown corresponding to the detected gesture through the gesture and the pre-trained third convolutional network model for detecting the display position of the business object from video images.

With the gesture control apparatus provided by this embodiment, human-hand candidate region detection and gesture detection are performed on the currently played video image containing hand information, and the detected gesture is matched against the corresponding predetermined gesture. When the two match, the display position of the business object to be shown in the video image is determined from the hand position. When the business object is used to display an advertisement, compared with traditional video advertising, on the one hand the business object is combined with the video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, saving network resources and/or the system resources of the client; on the other hand, the business object is tightly combined with the gesture in the video image, retaining the main image and actions of the video subject (such as the anchor) while adding interest to the video image, without disturbing the user's normal viewing, thereby reducing the user's aversion to business objects displayed in video images, attracting the viewers' attention to a certain extent, and improving the influence of the business object.
Embodiment Seven

Referring to Fig. 7, a schematic structural diagram of a terminal device according to Embodiment Seven of the present invention is shown; the specific embodiments of the present invention do not limit the specific implementation of the terminal device.

As shown in Fig. 7, the terminal device can include: a processor 702, a communications interface 704, a memory 706, and a communication bus 708.

The processor 702, the communications interface 704, and the memory 706 communicate with one another through the communication bus 708.

The communications interface 704 is used to communicate with network elements of other devices, such as other clients or servers.

The processor 702 is used to execute a program 710, and can specifically perform the relevant steps of the above method embodiments.

Specifically, the program 710 can include program code, and the program code includes computer operation instructions.

The processor 702 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), one or more integrated circuits configured to implement the embodiments of the present invention, or a graphics processing unit (GPU). The one or more processors included in the terminal device can be of the same type, such as one or more CPUs or one or more GPUs, or of different types, such as one or more CPUs together with one or more GPUs.

The memory 706 is used to store the program 710. The memory 706 may include high-speed RAM, and may also include non-volatile memory, for example at least one disk memory.

The program 710 can specifically be used to cause the processor 702 to perform the following operations: performing gesture detection on the currently played video image; when the detected gesture matches a predetermined gesture, determining the display position of the business object to be shown in the video image; and drawing the business object at the display position using computer graphics.
In an optional implementation, the program 710 is further used to cause the processor 702, when determining the display position of the business object to be shown in the video image, to: extract the feature points of the hand in the human-hand candidate region corresponding to the detected gesture; and determine, according to the feature points of the hand, the display position in the video image of the business object to be shown corresponding to the detected gesture.

In an optional implementation, the program 710 is further used to cause the processor 702, when determining according to the feature points of the hand the display position in the video image of the business object to be shown corresponding to the detected gesture, to: determine, according to the feature points of the hand and the type of the business object to be shown, the display position in the video image of the business object to be shown corresponding to the detected gesture.

In an optional implementation, the program 710 is further used to cause the processor 702, when determining according to the feature points of the hand and the type of the business object to be shown the display position in the video image of the business object to be shown corresponding to the detected gesture, to: determine, according to the feature points of the hand and the type of the business object to be shown, multiple display positions in the video image of the business object to be shown corresponding to the detected gesture; and select at least one display position from the multiple display positions.

In an optional implementation, the program 710 is further used to cause the processor 702, when determining the display position of the business object to be shown in the video image, to: obtain, from the pre-stored correspondences between gestures and display positions, the target display position corresponding to the predetermined gesture as the display position in the video image of the business object to be shown corresponding to the detected gesture.

In an optional implementation, the business object is a special effect carrying semantic information, and the video image is a live-streaming video image.

In an optional implementation, the business object includes a special effect containing advertising information in at least one of the following forms: a two-dimensional sticker effect, a three-dimensional effect, or a particle effect.

In an optional implementation, the display position includes at least one of: the hair area, forehead area, cheek area, chin area, or body area other than the head of the person in the video image; the background area in the video image; an area within a set range centered on the region where the hand is located in the video image; and a preset area in the video image.

In an optional implementation, the type of the business object includes at least one of: forehead sticker type, cheek sticker type, chin sticker type, virtual hat type, virtual clothing type, virtual makeup type, virtual headwear type, virtual hair accessory type, and virtual jewelry type.

In an optional implementation, the gesture includes at least one of: waving, scissors hand, fist, palm-up hand, applause, palm open, palm closed, thumbs-up, finger-gun pose, V sign, and OK sign.

In an optional implementation, the program 710 is further used to cause the processor 702, when performing gesture detection on the currently played video image, to: detect the video image with the pre-trained first convolutional network to obtain the first feature information of the video image and the prediction information of the human-hand candidate regions, the first feature information including hand feature information; and take the first feature information and the prediction information of the human-hand candidate regions as the second feature information of the pre-trained second convolutional network model, the second convolutional network model performing gesture detection on the video image according to the second feature information to obtain the gesture detection result of the video image; where the second convolutional network model shares a feature-extraction layer with the first convolutional network model.

In an optional implementation, the program 710 is further used to cause the processor 702, before performing gesture detection on the currently played video image, to: train the first convolutional network model on sample images containing human-hand annotation information and obtain the first convolutional network model's prediction information for the human-hand candidate regions of the sample images; correct the prediction information of the human-hand candidate regions; and train the second convolutional network model on the corrected prediction information of the human-hand candidate regions and the sample images, where the second convolutional network model shares a feature-extraction layer with the first convolutional network model and the parameters of the feature-extraction layer are kept unchanged during the training of the second convolutional network model.

In an optional implementation, the program 710 is further used to cause the processor 702, when determining the display position in the video image of the business object to be shown corresponding to the detected gesture, to: determine the display position of the business object to be shown corresponding to the detected gesture through the gesture and the pre-trained third convolutional network model for detecting the display position of the business object from video images.
With the terminal device provided by this embodiment, human-hand candidate region detection and gesture detection are performed on the currently played video image containing hand information, and the detected gesture is matched against the corresponding predetermined gesture. When the two match, the display position of the business object to be shown in the video image is determined from the hand position. When the business object is used to display an advertisement, compared with traditional video advertising, on the one hand the business object is combined with the video playback, so no additional advertising video data unrelated to the video needs to be transmitted over the network, saving network resources and/or the system resources of the client; on the other hand, the business object is tightly combined with the gesture in the video image, retaining the main image and actions of the video subject (such as the anchor) while adding interest to the video image, without disturbing the user's normal viewing, thereby reducing the user's aversion to business objects displayed in video images, attracting the viewers' attention to a certain extent, and improving the influence of the business object.
It may be noted that, as implementation requires, each step/component described in this application can be split into more steps/components, and two or more steps/components, or parts of their operations, can be combined into new steps/components to achieve the purpose of the present invention.

The above method according to the present invention can be realized in hardware or firmware, or implemented as software or computer code storable in a recording medium (such as a CD-ROM, RAM, floppy disk, hard disk, or magneto-optical disk), or implemented as computer code originally stored in a remote recording medium or non-volatile machine-readable medium, downloaded over a network, and stored in a local recording medium, so that the method described here can be processed by such software stored on a recording medium using a general-purpose computer, a special-purpose processor, or programmable or dedicated hardware (such as an ASIC or FPGA). It can be understood that a computer, processor, microprocessor controller, or programmable hardware includes storage components (e.g., RAM, ROM, flash memory, etc.) that can store or receive software or computer code; when the software or computer code is accessed and executed by the computer, processor, or hardware, the processing method described here is realized. In addition, when a general-purpose computer accesses the code for realizing the processing shown here, the execution of the code converts the general-purpose computer into a special-purpose computer for performing the processing shown here.

The above is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any change or replacement that can readily occur to those familiar with the art within the technical scope disclosed by the present invention shall be covered within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be based on the protection scope of the claims.

Claims (10)

1. A gesture control method, characterized in that the method comprises:

performing gesture detection on a currently played video image;

when it is detected that the gesture matches a predetermined gesture, determining a display position of a business object to be shown in the video image;

drawing the business object at the display position using computer graphics.
2. The method according to claim 1, characterized in that the determining of the display position of the business object to be shown in the video image comprises:

extracting feature points of the hand in the human-hand candidate region corresponding to the detected gesture;

determining, according to the feature points of the hand, the display position in the video image of the business object to be shown corresponding to the detected gesture.
3. The method according to claim 2, characterized in that the determining, according to the feature points of the hand, of the display position in the video image of the business object to be shown corresponding to the detected gesture comprises:

determining, according to the feature points of the hand and the type of the business object to be shown, the display position in the video image of the business object to be shown corresponding to the detected gesture.
4. The method according to claim 3, characterized in that the determining, according to the feature points of the hand and the type of the business object to be shown, of the display position in the video image of the business object to be shown corresponding to the detected gesture comprises:

determining, according to the feature points of the hand and the type of the business object to be shown, multiple display positions in the video image of the business object to be shown corresponding to the detected gesture;

selecting at least one display position from the multiple display positions.
5. The method according to any one of claims 1-4, characterized in that the gesture comprises at least one of: waving, scissors hand, fist, palm-up hand, applause, palm open, palm closed, thumbs-up, finger-gun pose, V sign, and OK sign.
6. The method according to any one of claims 1-5, characterized in that the performing of gesture detection on the currently played video image comprises:

detecting the video image with a pre-trained first convolutional network to obtain first feature information of the video image and prediction information of human-hand candidate regions, the first feature information including hand feature information;

taking the first feature information and the prediction information of the human-hand candidate regions as second feature information of a pre-trained second convolutional network model, and performing gesture detection on the video image with the second convolutional network model according to the second feature information to obtain a gesture detection result of the video image; wherein the second convolutional network model shares a feature-extraction layer with the first convolutional network model.
7. The method according to claim 6, characterized in that, before the performing of gesture detection on the currently played video image, the method further comprises:

training the first convolutional network model on sample images containing human-hand annotation information, and obtaining the first convolutional network model's prediction information for the human-hand candidate regions of the sample images;

correcting the prediction information of the human-hand candidate regions;

training the second convolutional network model on the corrected prediction information of the human-hand candidate regions and the sample images, wherein the second convolutional network model shares a feature-extraction layer with the first convolutional network model, and the parameters of the feature-extraction layer are kept unchanged during the training of the second convolutional network model.
8. The method according to any one of claims 1-7, characterized in that the determining of the display position of the business object to be shown in the video image comprises:

determining the display position of the business object to be shown corresponding to the detected gesture through the gesture and a pre-trained third convolutional network model for detecting the display position of the business object from video images.
9. A gesture control apparatus, characterized in that the apparatus comprises:

a gesture detection module, configured to perform gesture detection on a currently played video image;

a display position determination module, configured to determine, when the detected gesture matches a predetermined gesture, a display position of a business object to be shown in the video image;

a business object drawing module, configured to draw the business object at the display position using computer graphics.
10. A terminal device, comprising: a processor, a memory, a communications interface, and a communication bus, the processor, the memory, and the communications interface completing mutual communication through the communication bus;

the memory being used to store at least one executable instruction that causes the processor to perform the operations corresponding to the gesture control method according to any one of claims 1 to 8.
CN201610694510.2A 2016-08-19 2016-08-19 Gestural control method, device and terminal device Pending CN107340852A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201610694510.2A CN107340852A (en) 2016-08-19 2016-08-19 Gestural control method, device and terminal device
PCT/CN2017/098182 WO2018033154A1 (en) 2016-08-19 2017-08-19 Gesture control method, device, and electronic apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610694510.2A CN107340852A (en) 2016-08-19 2016-08-19 Gestural control method, device and terminal device

Publications (1)

Publication Number Publication Date
CN107340852A true CN107340852A (en) 2017-11-10

Family

ID=60223091

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610694510.2A Pending CN107340852A (en) 2016-08-19 2016-08-19 Gestural control method, device and terminal device

Country Status (1)

Country Link
CN (1) CN107340852A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030063133A1 (en) * 2001-09-28 2003-04-03 Fuji Xerox Co., Ltd. Systems and methods for providing a spatially indexed panoramic video
CN102455898A (en) * 2010-10-29 2012-05-16 张明 Cartoon expression based auxiliary entertainment system for video chatting
CN105451029A (en) * 2015-12-02 2016-03-30 广州华多网络科技有限公司 Video image processing method and device
CN105728878A (en) * 2016-04-27 2016-07-06 昆山星锐普思电子科技有限公司 Vacuum heating brazing equipment based on intermediate-frequency power eddy current magnetic fields

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108707A (en) * 2017-12-29 2018-06-01 北京奇虎科技有限公司 Gesture processing method and device based on video data, and computing device
CN108614995A (en) * 2018-03-27 2018-10-02 深圳市智能机器人研究院 Gesture dataset acquisition method, gesture recognition method and device for YOLO networks
CN108932053A (en) * 2018-05-21 2018-12-04 腾讯科技(深圳)有限公司 Gesture-based drawing method, device, storage medium and computer equipment
CN108921081B (en) * 2018-06-27 2020-10-09 百度在线网络技术(北京)有限公司 User operation detection method and device
CN108921081A (en) * 2018-06-27 2018-11-30 百度在线网络技术(北京)有限公司 User operation detection method and device
CN109327760A (en) * 2018-08-13 2019-02-12 北京中科睿芯科技有限公司 Intelligent sound box and playing control method thereof
CN109327760B (en) * 2018-08-13 2019-12-31 北京中科睿芯科技有限公司 Intelligent sound box and playing control method thereof
CN110879946A (en) * 2018-09-05 2020-03-13 武汉斗鱼网络科技有限公司 Method, storage medium, device and system for combining gesture with AR special effect
CN110971924A (en) * 2018-09-30 2020-04-07 武汉斗鱼网络科技有限公司 Method, device, storage medium and system for beautifying in live broadcast process
CN109492577B (en) * 2018-11-08 2020-09-18 北京奇艺世纪科技有限公司 Gesture recognition method and device and electronic equipment
CN109492577A (en) * 2018-11-08 2019-03-19 北京奇艺世纪科技有限公司 Gesture recognition method and device and electronic equipment
CN109614953A (en) * 2018-12-27 2019-04-12 华勤通讯技术有限公司 Control method based on image recognition, mobile unit and storage medium
CN109799905B (en) * 2018-12-28 2022-05-17 深圳云天励飞技术有限公司 Hand tracking method and advertising machine
CN109799905A (en) * 2018-12-28 2019-05-24 深圳云天励飞技术有限公司 Hand tracking method and advertising machine
CN115119004A (en) * 2019-05-13 2022-09-27 阿里巴巴集团控股有限公司 Data processing method, information display method, device, server and terminal equipment
CN115119004B (en) * 2019-05-13 2024-03-29 阿里巴巴集团控股有限公司 Data processing method, information display device, server and terminal equipment
CN110287891A (en) * 2019-06-26 2019-09-27 北京字节跳动网络技术有限公司 Gesture control method and device based on human body key points, and electronic equipment
CN110442238A (en) * 2019-07-31 2019-11-12 腾讯科技(深圳)有限公司 Method and device for determining a dynamic effect
CN110942005A (en) * 2019-11-21 2020-03-31 网易(杭州)网络有限公司 Object recognition method and device
CN111078011A (en) * 2019-12-11 2020-04-28 网易(杭州)网络有限公司 Gesture control method and device, computer readable storage medium and electronic equipment
CN111341013A (en) * 2020-02-10 2020-06-26 北京每日优鲜电子商务有限公司 Moving method, device and equipment of intelligent vending machine and storage medium
CN111625102A (en) * 2020-06-03 2020-09-04 上海商汤智能科技有限公司 Building display method and device
CN113301356A (en) * 2020-07-14 2021-08-24 阿里巴巴集团控股有限公司 Method and device for controlling video display
CN111897436A (en) * 2020-08-13 2020-11-06 北京未澜科技有限公司 Hand-grabbing object grip strength prediction method based on single RGB image
CN111931762A (en) * 2020-09-25 2020-11-13 广州佰锐网络科技有限公司 AI-based image recognition solution method, device and readable storage medium
CN112767357A (en) * 2021-01-20 2021-05-07 沈阳建筑大学 YOLOv4-based concrete structure disease detection method
CN113191403A (en) * 2021-04-16 2021-07-30 上海戏剧学院 Generation and display system of theater dynamic poster
WO2022247650A1 (en) * 2021-05-28 2022-12-01 北京字节跳动网络技术有限公司 Gesture-based interaction method and device, and client
CN116030411A (en) * 2022-12-28 2023-04-28 宁波星巡智能科技有限公司 Human privacy shielding method, device and equipment based on gesture recognition
CN116030411B (en) * 2022-12-28 2023-08-18 宁波星巡智能科技有限公司 Human privacy shielding method, device and equipment based on gesture recognition

Similar Documents

Publication Publication Date Title
CN107340852A (en) Gestural control method, device and terminal device
CN107343211B (en) Video image processing method, device and terminal device
CN107341434A (en) Video image processing method, device and terminal device
US10748324B2 (en) Generating stylized-stroke images from source images utilizing style-transfer-neural networks with non-photorealistic-rendering
CN107341435A (en) Video image processing method, device and terminal device
WO2018033156A1 (en) Video image processing method, device, and electronic apparatus
US8265351B2 (en) Method, system and computer program product for automatic and semi-automatic modification of digital images of faces
US11551337B2 (en) Boundary-aware object removal and content fill
CN107347166B (en) Video image processing method and device and terminal equipment
US8660319B2 (en) Method, system and computer program product for automatic and semi-automatic modification of digital images of faces
CN107343225B (en) Method, apparatus and terminal device for displaying a business object in a video image
CN111787242B (en) Method and apparatus for virtual fitting
CN109508681A (en) Method and apparatus for generating a human body key point detection model
WO2018033154A1 (en) Gesture control method, device, and electronic apparatus
CN108229325A (en) Face detection method and system, electronic equipment, program and medium
US20120299945A1 (en) Method, system and computer program product for automatic and semi-automatic modification of digital images of faces
CN108109010A (en) Intelligent AR advertising machine
CN107341436B (en) Gesture detection network training, gesture detection and control method, system and terminal
CN109740571A (en) Image acquisition method, image processing method and apparatus, and electronic equipment
US8019182B1 (en) Digital image modification using pyramid vignettes
CN107343220A (en) Data processing method, device and terminal device
WO2022089166A1 (en) Facial image processing method and apparatus, facial image display method and apparatus, and device
CN104852892B (en) Autonomous login method for a novel Internet of Things website system
CN111182350B (en) Image processing method, device, terminal equipment and storage medium
CN104809288A (en) Method for trying on or customizing nail art

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 2017-11-10)