CN109697416A - Video data processing method and related apparatus - Google Patents
Video data processing method and related apparatus
- Publication number
- CN109697416A (Application CN201811532116.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V40/45: Detection of the body part being alive (under G06V40/40, Spoof detection, e.g. liveness detection)
- G06V20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames (under G06V20/40, Scenes; scene-specific elements in video content)
- G06V40/168: Feature extraction; face representation (under G06V40/16, Human faces, e.g. facial parts, sketches or expressions)
- G06V40/172: Classification, e.g. identification (under G06V40/16, Human faces, e.g. facial parts, sketches or expressions)
Abstract
Embodiments of the invention disclose a video data processing method and related apparatus. The method comprises: obtaining a target video sequence; extracting, from each video frame of the target video sequence, the target object region in which a target object is located; performing keypoint localization on the target object within the target object region to obtain the target keypoints of the target object in each video frame and the location information of the target keypoints of each video frame; obtaining, based on the location information of the target keypoints of each video frame, the dynamic feature value corresponding to the target keypoints of that frame; selecting from the target video sequence the video frames whose dynamic feature values satisfy a target state, and determining the video frames under the target state as key video frames; and capturing, in a key video frame, the local image region corresponding to the target keypoints and recognizing it to obtain a target recognition result, from which the attribute of the target object is determined. The invention can improve the precision of liveness detection and thereby strengthen the identity-authentication capability of a system.
Description
Technical field
The present invention relates to the field of Internet technologies, and in particular to a video data processing method and related apparatus.
Background art

With the development of science and technology, terminals such as mobile phones, computers and attendance machines have come into common use. Many terminals now integrate a face recognition system, so that identity authentication can be performed based on face recognition and a corresponding operation (for example, unlocking the terminal) is triggered when authentication succeeds.

However, when such a terminal obtains a target image of a user through the face recognition system, it performs face recognition directly on the face in the image, regardless of whether the image actually contains the live user. In other words, existing face recognition systems recognize a single captured picture. Consequently, when a user (for example, an illegitimate user) performs identity authentication with a fake face (such as another person's photograph), the illegitimate user will be mistaken for the owner and the unlock operation triggered as soon as the facial features in the photograph match the preset target image features of the owner. This lowers the precision of face liveness detection and seriously weakens the identity-authentication capability of the system.
Summary of the invention
Embodiments of the present invention provide a video data processing method and related apparatus, which can improve the precision of liveness detection and thereby strengthen the identity-authentication capability of a system.
In one aspect, an embodiment of the present invention provides a video data processing method, comprising:

obtaining a target video sequence, and extracting, from each video frame of the target video sequence, the target object region in which a target object is located;

performing keypoint localization on the target object within the target object region to obtain the target keypoints of the target object in each video frame and the location information of the target keypoints of each video frame;

obtaining, based on the location information of the target keypoints of each video frame, the dynamic feature value corresponding to the target keypoints of each video frame;

selecting from the target video sequence the video frames whose dynamic feature values satisfy a target state, the video frames under the target state being those, filtered out of the target video sequence, in which the action is coherent and the target object is in the target state;

determining the video frames under the target state as key video frames, and capturing, according to the body part to which the target keypoints in a key video frame belong, the local image region in which that part is located in the key video frame; and

recognizing the local image region to obtain a target recognition result, and determining the attribute of the target object based on the target recognition result.
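As a concrete illustration, the method steps above can be sketched end to end. This is a minimal hypothetical sketch, assuming the two target keypoints are upper and lower eyelid points and that the target state is an eye-closing (blink) action; the patent does not fix these choices, and all names and values below are illustrative.

```python
import math

def keypoint_distance(p1, p2):
    """Euclidean distance between two keypoints given as (x, y)."""
    return math.hypot(p1[0] - p2[0], p1[1] - p2[1])

def dynamic_feature_values(frames, kp_a="eye_top", kp_b="eye_bottom"):
    """One dynamic feature value per frame: the distance between the
    two target keypoints (here, upper and lower eyelid)."""
    return [keypoint_distance(f[kp_a], f[kp_b]) for f in frames]

def pick_key_frame(values, threshold):
    """Key frame: first frame whose feature value meets the target
    state (here, eyelid distance below threshold, i.e. eye closing)."""
    candidates = [i for i, v in enumerate(values) if v < threshold]
    return candidates[0] if candidates else None

# Synthetic sequence: eyelid gap shrinking over five frames (a blink).
frames = [{"eye_top": (0, 10 - 2 * i), "eye_bottom": (0, 0)} for i in range(5)]
values = dynamic_feature_values(frames)
key = pick_key_frame(values, threshold=4.0)
print(values)  # [10.0, 8.0, 6.0, 4.0, 2.0]
print(key)     # 4
```

In a full pipeline, the frame at index `key` would then be cropped to the eyelid region and passed to the recognition network.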
Wherein, obtaining the target video sequence and extracting, from each video frame of the target video sequence, the target object region in which the target object is located comprises:

collecting video data containing the target object, parsing the video data into the target video sequence corresponding to the target object, and obtaining a first video frame and a second video frame from the target video sequence; and

obtaining the image region in which the target object is located in the first video frame as the target object region in the first video frame, and obtaining the image region in which the target object is located in the second video frame as the target object region in the second video frame.
Wherein, performing keypoint localization on the target object within the target object region to obtain the target keypoints of the target object in each video frame and the location information of the target keypoints comprises:

performing keypoint localization on the target object within the target object region of the first video frame to obtain all keypoints in the first video frame and the location information of all keypoints of the target object in the first video frame, and determining, from the obtained keypoints, two keypoints at a first position as the target keypoints of the first video frame;

tracking all keypoints of the first video frame within the target object region of the second video frame to obtain all keypoints in the second video frame and the location information of all keypoints in the second video frame; and

determining, according to the target keypoints of the first video frame, two keypoints at a second position, among all keypoints contained in the target object region of the second video frame, as the target keypoints of the second video frame.
Wherein, obtaining the image region in which the target object is located in the first video frame as the target object region in the first video frame comprises:

if the first video frame is the first frame of the target video sequence, filtering out the background region in the first video frame based on a first network model, identifying, based on the first network model, the image region of the target object in the first video frame after background removal, and taking the identified image region as the target object region of the target object in the first video frame.
Wherein, tracking all keypoints of the first video frame within the target object region of the second video frame to obtain all keypoints in the second video frame and the location information of all keypoints in the second video frame comprises:

mapping each keypoint, by tracking based on its location information in the first video frame, into the target object region of the second video frame, obtaining all keypoints in the second video frame based on the keypoints mapped into the target object region of the second video frame, and determining, in the second video frame, the location information of each keypoint of the second video frame.
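A hedged sketch of this tracking-by-mapping step, under the simplifying assumption that keypoints are propagated by the displacement of the detected target object region between the two frames; an actual implementation would more likely use an optical-flow or model-based tracker, and all coordinates below are illustrative.

```python
# Propagate keypoints from frame 1 into frame 2 by shifting them with
# the displacement of the target object region between the two frames.

def track_keypoints(keypoints, region_frame1, region_frame2):
    """keypoints: list of (x, y) in frame 1; regions: (x, y, w, h)."""
    dx = region_frame2[0] - region_frame1[0]
    dy = region_frame2[1] - region_frame1[1]
    return [(x + dx, y + dy) for (x, y) in keypoints]

kps1 = [(12, 20), (18, 20)]      # keypoints located in frame 1
region1 = (10, 10, 40, 40)       # target object region in frame 1
region2 = (14, 12, 40, 40)       # region after the object moved
tracked = track_keypoints(kps1, region1, region2)
print(tracked)  # [(16, 22), (22, 22)]
```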
Wherein, obtaining, based on the location information of the target keypoints of each video frame, the dynamic feature value corresponding to the target keypoints of each video frame comprises:

obtaining the location information of the target keypoints of each video frame, determining, according to the location information of the target keypoints of each video frame, the distance difference corresponding to the target keypoints of each video frame, and determining the determined distance difference as the dynamic feature value corresponding to the target keypoints of the corresponding video frame.
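One plausible reading of the distance-difference feature is sketched below: the per-frame distance between the two target keypoints is computed, and the frame-to-frame change in that distance serves as the dynamic feature value. The keypoint coordinates are illustrative and not from the patent.

```python
import math

def distances(kp_a_per_frame, kp_b_per_frame):
    """Per-frame Euclidean distance between the two target keypoints."""
    return [math.hypot(ax - bx, ay - by)
            for (ax, ay), (bx, by) in zip(kp_a_per_frame, kp_b_per_frame)]

def distance_differences(dists):
    """Frame-to-frame change in the keypoint distance."""
    return [round(b - a, 6) for a, b in zip(dists, dists[1:])]

a = [(0, 8), (0, 6), (0, 2)]   # e.g. upper-eyelid keypoint over 3 frames
b = [(0, 0), (0, 0), (0, 0)]   # lower-eyelid keypoint, held fixed
d = distances(a, b)
diffs = distance_differences(d)
print(d)      # [8.0, 6.0, 2.0]
print(diffs)  # [-2.0, -4.0]
```

Negative differences here would indicate a closing motion of the tracked part.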
Wherein, the target state includes a first sign state and a second sign state, and the dynamic feature value corresponding to the target keypoints of each video frame includes a dynamic feature value under the first sign state and a dynamic feature value under the second sign state;

selecting from the target video sequence the video frames whose dynamic feature values satisfy the target state comprises:

obtaining, in each video frame, the dynamic feature value under the first sign state, obtaining the first maximum dynamic feature value among the dynamic feature values under the first sign state, and determining a first target threshold based on the first maximum dynamic feature value;

obtaining, in each video frame, the dynamic feature value under the second sign state, obtaining the second maximum dynamic feature value among the dynamic feature values under the second sign state, and determining a second target threshold based on the second maximum dynamic feature value;

comparing the dynamic feature values under the first sign state with the first target threshold, and comparing the dynamic feature values under the second sign state with the second target threshold; and

determining the video frames corresponding to a plurality of consecutive dynamic feature values under the first sign state that are greater than the first target threshold, and/or dynamic feature values under the second sign state that are less than the second target threshold, as the video frames, filtered out of the target video sequence, in which the action is coherent and the target object is in the target state, thereby obtaining the video frames under the target state.
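The selection rule above can be sketched as follows; only the first-sign-state (greater-than) branch is shown. Deriving the target threshold as a fixed fraction of the maximum dynamic feature value is an assumption, since the patent only states that the threshold is determined from the maximum value. Requiring a run of consecutive frames above threshold keeps the selected action coherent rather than a single-frame spike.

```python
def select_coherent_frames(values, ratio=0.5, min_run=2):
    """Return indices of frames in runs of at least min_run consecutive
    dynamic feature values above a threshold derived from the maximum."""
    threshold = max(values) * ratio
    selected, run = [], []
    for i, v in enumerate(values):
        if v > threshold:
            run.append(i)
        else:
            if len(run) >= min_run:
                selected.extend(run)
            run = []
    if len(run) >= min_run:          # flush a run ending at the last frame
        selected.extend(run)
    return selected

vals = [0.1, 0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
sel = select_coherent_frames(vals)
print(sel)  # [3, 4, 5]
```

Note that the isolated spike at index 1 is discarded even though it exceeds the threshold, because it does not belong to a coherent run.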
Wherein, determining the video frames under the target state as key video frames and capturing, according to the body part to which the target keypoints in a key video frame belong, the local image region in which that part is located in the key video frame comprises:

taking the filtered-out video frames in which the action is coherent and the target object is in the target state as candidate video frames, performing quality evaluation on the target object region in the candidate video frames, and filtering out blurry video frames among the candidate video frames according to the quality evaluation result; and

determining, among the candidate video frames remaining after the blurry frames are filtered out, the candidate video frame with the highest resolution as the key video frame, and capturing, based on the body part to which the target keypoints in the key video frame belong, the region of that part in the key video frame as the local image region.
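A hedged sketch of the quality-evaluation step. The patent does not name a sharpness metric; the variance of a Laplacian response is a common stand-in and is used here purely as an assumption, on tiny synthetic grayscale grids.

```python
def laplacian_variance(img):
    """img: 2-D list of grayscale values; higher variance = sharper."""
    responses = []
    for y in range(1, len(img) - 1):
        for x in range(1, len(img[0]) - 1):
            lap = (img[y-1][x] + img[y+1][x] + img[y][x-1] + img[y][x+1]
                   - 4 * img[y][x])
            responses.append(lap)
    mean = sum(responses) / len(responses)
    return sum((r - mean) ** 2 for r in responses) / len(responses)

sharp = [[0] * 4 for _ in range(4)]
sharp[1][1] = 255                      # strong local contrast
blurry = [[100] * 4 for _ in range(4)] # flat region, no detail
best = max([sharp, blurry], key=laplacian_variance)
print(best is sharp)  # True
```

Candidate frames scoring below some cutoff would be discarded as blurry, and the sharpest survivor retained as the key video frame.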
Wherein, recognizing the local image region to obtain the target recognition result and determining the attribute of the target object based on the target recognition result comprises:

determining the local image region as a region to be processed, and performing feature extraction on the region to be processed based on a second network model to obtain the image feature corresponding to the region to be processed;

obtaining, in the second network model, the matching degrees between the image feature and a plurality of attribute type features in the second network model; and

associating the matching degrees obtained by the second network model with the label information corresponding to the plurality of attribute type features in the second network model to obtain the target recognition result corresponding to the second network model, and determining the attribute corresponding to the target object based on the target recognition result.
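A hedged sketch of mapping the second network model's output to an attribute: the matching degrees are modelled here as softmax scores over the attribute type features, and the label associated with the highest score is taken as the target recognition result. The labels and raw scores are illustrative, not from the patent.

```python
import math

def softmax(scores):
    """Normalize raw scores into matching degrees that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def recognize(raw_scores, labels):
    """Associate each matching degree with its label; return the best."""
    degrees = softmax(raw_scores)
    best = max(range(len(labels)), key=lambda i: degrees[i])
    return labels[best], degrees[best]

labels = ["live", "non-live"]        # living / non-living body attribute
attr, degree = recognize([2.0, 0.5], labels)
print(attr)  # live
```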
Wherein, the method further comprises:

obtaining a sample set associated with the target object, determining the sample data carrying first label information in the sample set as positive samples, and determining the sample data carrying second label information in the sample set as negative samples, wherein a positive sample is sample data whose target object attribute is a living-body attribute and a negative sample is sample data whose target object attribute is a non-living-body attribute; and

in the sample set, scaling the image data corresponding to the positive samples to the same size, and training the second network model based on the first label information corresponding to the scaled positive samples and the second label information corresponding to the negative samples.
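A hedged sketch of the training-data preparation described above: image samples are rescaled to one common size (nearest-neighbour resampling here, as an assumption) and paired with their labels before training the second network model. Label values 1 (living body, first label information) and 0 (non-living body, second label information) are illustrative.

```python
def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize of a 2-D list to (out_h, out_w)."""
    in_h, in_w = len(img), len(img[0])
    return [[img[y * in_h // out_h][x * in_w // out_w]
             for x in range(out_w)] for y in range(out_h)]

positives = [[[1, 2], [3, 4]],                 # 2x2 live sample
             [[5] * 4 for _ in range(4)]]      # 4x4 live sample
negatives = [[[0] * 3 for _ in range(3)]]      # 3x3 non-live sample

# Scale every sample to a common 2x2 size and attach its label.
dataset = ([(resize_nearest(img, 2, 2), 1) for img in positives] +
           [(resize_nearest(img, 2, 2), 0) for img in negatives])
print([len(img) for img, _ in dataset])  # [2, 2, 2]
```

The resulting `(image, label)` pairs would then feed a standard supervised training loop for the second network model.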
Wherein, optionally, the body part includes first sign information and second sign information;

recognizing the local image region to obtain the target recognition result and determining the attribute of the target object based on the target recognition result comprises:

determining, in the local image region, the region in which the first sign information is located as a first image region and the region in which the second sign information is located as a second image region, and inputting the first image region and the second image region into a cascade network model to extract a first image feature of the first image region and a second image feature of the second image region;

inputting the first image feature into a first classifier in the cascade network model, and outputting the first matching degrees between the first image feature and the plurality of attribute type features of the first classifier in the second network model;

inputting the second image feature into a second classifier in the cascade network model, and outputting the second matching degrees between the second image feature and the plurality of attribute type features of the second classifier in the cascade network model, the second classifier being a classifier cascaded with the first classifier; and

fusing the first matching degrees with the second matching degrees based on the weight of the first classifier and the weight of the second classifier to obtain the target recognition result corresponding to the cascade network model, and determining the attribute corresponding to the target object based on the target recognition result.
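A hedged sketch of the cascaded fusion step: each classifier produces matching degrees for its sign region, and the two sets are combined with per-classifier weights before the final attribute is read off. The weights, scores and region names are illustrative assumptions, not values fixed by the patent.

```python
def fuse(m1, m2, w1=0.6, w2=0.4):
    """Weighted fusion of per-attribute matching degrees from the two
    cascaded classifiers."""
    return [round(w1 * a + w2 * b, 6) for a, b in zip(m1, m2)]

labels = ["live", "non-live"]
m_first = [0.9, 0.1]    # first sign region (e.g. eyes), first classifier
m_second = [0.7, 0.3]   # second sign region (e.g. mouth), second classifier
fused = fuse(m_first, m_second)
attribute = labels[max(range(len(fused)), key=lambda i: fused[i])]
print(fused)      # [0.82, 0.18]
print(attribute)  # live
```

Weighting lets the model trust one sign region (say, the eye region) more than the other when deciding the living-body attribute.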
In one aspect, an embodiment of the present invention provides a video data processing apparatus, comprising:

an acquisition module, configured to obtain a target video sequence and extract, from each video frame of the target video sequence, the target object region in which a target object is located;

a keypoint localization module, configured to perform keypoint localization on the target object within the target object region to obtain the target keypoints of the target object in each video frame and the location information of the target keypoints of each video frame;

a feature value acquisition module, configured to obtain, based on the location information of the target keypoints of each video frame, the dynamic feature value corresponding to the target keypoints of each video frame;

a video frame selection module, configured to select from the target video sequence the video frames whose dynamic feature values satisfy a target state, the video frames under the target state being those, filtered out of the target video sequence, in which the action is coherent and the target object is in the target state;

a key frame determination module, configured to determine the video frames under the target state as key video frames and capture, according to the body part to which the target keypoints in a key video frame belong, the local image region in which that part is located in the key video frame; and

a local recognition module, configured to recognize the local image region to obtain a target recognition result and determine the attribute of the target object based on the target recognition result.
Wherein, the acquisition module comprises:

a data parsing unit, configured to collect video data containing the target object, parse the video data into the target video sequence corresponding to the target object, and obtain a first video frame and a second video frame from the target video sequence; and

a region determination unit, configured to obtain the image region in which the target object is located in the first video frame as the target object region in the first video frame, and to obtain the image region in which the target object is located in the second video frame as the target object region in the second video frame.
Wherein, the keypoint localization module comprises:

a keypoint localization unit, configured to perform keypoint localization on the target object within the target object region of the first video frame, obtain all keypoints of the target object in the first video frame and the location information of all keypoints in the first video frame, and determine, from the obtained keypoints, two keypoints at a first position as the target keypoints of the first video frame;

a keypoint tracking unit, configured to track all keypoints of the first video frame within the target object region of the second video frame to obtain all keypoints in the second video frame and the location information of all keypoints in the second video frame; and

a keypoint determination unit, configured to determine, according to the target keypoints of the first video frame, two keypoints at a second position, among all keypoints contained in the target object region of the second video frame, as the target keypoints of the second video frame.
Wherein, the region determination unit is specifically configured to: if the first video frame is the first frame of the target video sequence, filter out the background region in the first video frame based on a first network model, identify, based on the first network model, the image region of the target object in the first video frame after background removal, and take the identified image region as the target object region of the target object in the first video frame.
Wherein, the keypoint tracking unit is specifically configured to map each keypoint, by tracking based on its location information in the first video frame, into the target object region of the second video frame, obtain all keypoints in the second video frame based on the keypoints mapped into the target object region of the second video frame, and determine, in the second video frame, the location information of each keypoint of the second video frame.
Wherein, the feature value acquisition module is specifically configured to obtain the location information of the target keypoints of each video frame, determine, according to the location information of the target keypoints of each video frame, the distance difference corresponding to the target keypoints of each video frame, and determine the determined distance difference as the dynamic feature value corresponding to the target keypoints of the corresponding video frame.
Wherein, the target state includes a first sign state and a second sign state, and the dynamic feature value corresponding to the target keypoints of each video frame includes a dynamic feature value under the first sign state and a dynamic feature value under the second sign state;

the video frame selection module comprises:

a first threshold determination unit, configured to obtain, in each video frame, the dynamic feature value under the first sign state, obtain the first maximum dynamic feature value among the dynamic feature values under the first sign state, and determine a first target threshold based on the first maximum dynamic feature value;

a second threshold determination unit, configured to obtain, in each video frame, the dynamic feature value under the second sign state, obtain the second maximum dynamic feature value among the dynamic feature values under the second sign state, and determine a second target threshold based on the second maximum dynamic feature value;

a threshold comparison unit, configured to compare the dynamic feature values under the first sign state with the first target threshold, and compare the dynamic feature values under the second sign state with the second target threshold; and

a video frame selection unit, configured to determine the video frames corresponding to a plurality of consecutive dynamic feature values under the first sign state that are greater than the first target threshold, and/or dynamic feature values under the second sign state that are less than the second target threshold, as the video frames, filtered out of the target video sequence, in which the action is coherent and the target object is in the target state, thereby obtaining the video frames under the target state.
Wherein, the key frame determination module comprises:

a quality evaluation unit, configured to take the filtered-out video frames in which the action is coherent and the target object is in the target state as candidate video frames, perform quality evaluation on the target object region in the candidate video frames, and filter out blurry video frames among the candidate video frames according to the quality evaluation result; and

a key frame determination unit, configured to determine, among the candidate video frames remaining after the blurry frames are filtered out, the candidate video frame with the highest resolution as the key video frame, and capture, based on the body part to which the target keypoints in the key video frame belong, the region of that part in the key video frame as the local image region.
Wherein, the local recognition module comprises:

a feature extraction unit, configured to determine the local image region as a region to be processed and perform feature extraction on the region to be processed based on a second network model to obtain the image feature corresponding to the region to be processed;

a feature matching unit, configured to obtain, in the second network model, the matching degrees between the image feature and a plurality of attribute type features in the second network model; and

an attribute determination unit, configured to associate the matching degrees obtained by the second network model with the label information corresponding to the plurality of attribute type features in the second network model, obtain the target recognition result corresponding to the second network model, and determine the attribute corresponding to the target object based on the target recognition result.
Wherein, the local recognition module further comprises:

a sample acquisition unit, configured to obtain a sample set associated with the target object, determine the sample data carrying first label information in the sample set as positive samples, and determine the sample data carrying second label information in the sample set as negative samples, wherein a positive sample is sample data whose target object attribute is a living-body attribute and a negative sample is sample data whose target object attribute is a non-living-body attribute; and

a model training unit, configured to scale, in the sample set, the image data corresponding to the positive samples to the same size, and train the second network model based on the first label information corresponding to the scaled positive samples and the second label information corresponding to the negative samples.
Wherein, the body part includes first sign information and second sign information;

the local recognition module comprises:

an image region determination unit, configured to determine, in the local image region, the region in which the first sign information is located as a first image region and the region in which the second sign information is located as a second image region, and to input the first image region and the second image region into a cascade network model to extract a first image feature of the first image region and a second image feature of the second image region;

a first matching unit, configured to input the first image feature into a first classifier in the cascade network model and output the first matching degrees between the first image feature and the plurality of attribute type features of the first classifier in the second network model;

a second matching unit, configured to input the second image feature into a second classifier in the cascade network model and output the second matching degrees between the second image feature and the plurality of attribute type features of the second classifier in the cascade network model, the second classifier being a classifier cascaded with the first classifier; and

a matching fusion unit, configured to fuse the first matching degrees with the second matching degrees based on the weight of the first classifier and the weight of the second classifier, obtain the target recognition result corresponding to the cascade network model, and determine the attribute corresponding to the target object based on the target recognition result.
In one aspect, an embodiment of the present invention provides a video data processing apparatus comprising a processor and a memory, the processor being connected to the memory, wherein the memory is configured to store program code and the processor is configured to invoke the program code to perform the method in the first aspect of the embodiments of the present invention.
In one aspect, an embodiment of the present invention provides a computer storage medium storing a computer program, the computer program comprising program instructions which, when executed by a processor, perform the method in the first aspect of the embodiments of the present invention.
In the embodiments of the present invention, when the target video sequence corresponding to a target object is obtained, the target object in the target video sequence can first be detected, so that the target keypoints of each video frame of the target video sequence can subsequently be found and the location information at which the target keypoints appear in each video frame can be captured; the dynamic feature value corresponding to the target keypoints in each video frame can then be computed from that location information. For example, taking keypoint A and keypoint B as the target keypoints, the difference in the distance between keypoint A and keypoint B across the video frames can be computed, thereby obtaining the dynamic feature value corresponding to the target keypoints of each video frame. Then, from the dynamic feature values corresponding to the target keypoints in each video frame, the video frames under a particular state can be filtered out, that is, the video frames whose dynamic feature values satisfy the target state, and the key video frame can be determined from the filtered-out frames, which improves the efficiency of liveness detection while ensuring its accuracy. Next, the body part to which the target keypoints belong can be determined in the key video frame, and the local image region in which that part is located can be captured from the key video frame, which improves the efficiency of image recognition. Finally, the part in the local image region under the particular state can be recognized by a trained liveness detection model, which improves the precision of liveness detection under that state and thereby strengthens the identity-authentication capability of the system.
Brief description of the drawings
In order to explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the accompanying drawings in the following description show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a target video sequence according to an embodiment of the present invention;

Fig. 3 is a schematic flowchart of a video data processing method according to an embodiment of the present invention;

Fig. 4 is a schematic diagram of obtaining video data according to an embodiment of the present invention;

Fig. 5 is a schematic diagram of obtaining keypoints according to an embodiment of the present invention;

Fig. 6 is a schematic diagram of tracking keypoints according to an embodiment of the present invention;

Fig. 7 is a schematic diagram of obtaining the dynamic feature value of target keypoints according to an embodiment of the present invention;

Fig. 8 is a schematic diagram of capturing a local image region according to an embodiment of the present invention;

Fig. 9 is a schematic flowchart of another video data processing method according to an embodiment of the present invention;

Fig. 10 is a schematic diagram of obtaining a target object region according to an embodiment of the present invention;

Fig. 11 is a schematic diagram of obtaining a key video frame according to an embodiment of the present invention;

Fig. 12 is a schematic structural diagram of a video data processing apparatus according to an embodiment of the present invention;

Fig. 13 is a schematic structural diagram of another video data processing apparatus according to an embodiment of the present invention.
Description of embodiments
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the accompanying drawings in the embodiments of the present invention. Apparently, the described embodiments are merely some rather than all of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative efforts shall fall within the protection scope of the present invention.
Refer to Figure 1, which is a schematic structural diagram of a network architecture according to an embodiment of the present invention. As shown in Fig. 1, the network architecture may include a cloud server 2000 and a user terminal cluster. The user terminal cluster may include multiple user terminals; as shown in Fig. 1, it specifically includes a user terminal 3000a, a user terminal 3000b, ..., and a user terminal 3000n. As shown in Fig. 1, the user terminal 3000a, the user terminal 3000b, ..., and the user terminal 3000n can each establish a network connection with the cloud server 2000.
For ease of understanding, in this embodiment of the present invention, one user terminal may be selected from the multiple user terminals shown in Fig. 1 as a target user terminal; for example, the user terminal 3000a shown in Fig. 1 may serve as the target user terminal. The target user terminal may include a smart terminal with a camera function, such as a smartphone, a tablet computer, a desktop computer, or a smart television. When the target user terminal (for example, a smartphone) detects that a camera of the terminal (for example, a front-facing camera) is turned on, it may use a video recording function on a data acquisition interface corresponding to the camera, to obtain a video stream containing a target object (for example, a face), that is, to obtain video data containing the target object. The video stream may consist of multiple video frames containing the face. Therefore, by parsing the acquired video stream, a target video sequence corresponding to the target object can be obtained. Target prompt information may be displayed on the data acquisition interface. The target prompt information is used to instruct the user to perform a specific action (for example, closing the eyes), so that a key video frame under a particular state (for example, the eye-closed state) can subsequently be captured by counting the positional changes of target key points (for example, two key points in the eye region) across the video frames. A local image region corresponding to the target key points can then be determined from the key video frame; in other words, the eye region under the eye-closed state can be extracted from the key video frame as the local image region, so that whether the face is a living body can be judged from the local image features within the local image region.
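As an illustrative sketch only (the function name, padding value, and coordinate convention are assumptions for illustration and do not appear in the patent text), extracting the local image region around the located key points can be expressed as computing a padded bounding box that is then used to crop the key video frame:

```python
# Hypothetical sketch: given the key points of a body part (e.g. the eye
# region) in a key video frame, compute a padded bounding box usable for
# cropping the local image region. Coordinates are (x, y) pixels.

def local_region_bbox(keypoints, frame_w, frame_h, pad=10):
    """Return (x0, y0, x1, y1) enclosing the key points, padded and clamped
    to the frame boundaries."""
    xs = [x for x, _ in keypoints]
    ys = [y for _, y in keypoints]
    x0 = max(min(xs) - pad, 0)
    y0 = max(min(ys) - pad, 0)
    x1 = min(max(xs) + pad, frame_w)
    y1 = min(max(ys) + pad, frame_h)
    return (x0, y0, x1, y1)
```

Clamping to the frame boundaries ensures the crop stays valid even when the part sits near the edge of the frame.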
Face liveness judgment is a very important link in a face verification system (which may also be referred to as a face identity authentication system). For example, through the face verification system, a user can remotely perform operations such as account opening on an application platform, remote authentication, and account-unblocking appeals. To improve the security of the face verification system, real-name, real-person authentication needs to be performed on the target user currently using the target user terminal, to ensure that the target user is a genuine, legitimate user. In other words, the target user terminal may acquire video data containing the face of the target user and transmit the video data to the cloud server 2000 in the embodiment corresponding to Fig. 1, so that the cloud server 2000 can use its powerful computing capability to perform liveness judgment on the face in the target video sequence corresponding to the video data. Optionally, when acquiring the video data containing the face of the target user, the target user terminal may alternatively perform liveness judgment locally on the face in the target video sequence corresponding to the video data.
In view of this, when the cloud server 2000 integrates an identity verification system capable of verifying the liveness attribute of the target object, the above video stream may be parsed into the target video sequence on the cloud server 2000. In other words, when the cloud server 2000 receives the video data transmitted by the target user terminal, it may parse the video data to obtain the target video sequence corresponding to the target object. Optionally, when the identity verification system is integrated in the target user terminal, the video stream may be parsed on the target user terminal to obtain the target video sequence corresponding to the target object.
For ease of understanding, this embodiment of the present invention takes the case where the identity verification system is integrated in the target user terminal as an example, to describe how the target user terminal obtains the target video sequence, how it filters out the video frames under a target state from the target video sequence based on the dynamic feature values of the target key points in each video frame, and how it performs liveness recognition based on the filtered video frames.
For ease of understanding, further refer to Fig. 2, which is a schematic diagram of a target video sequence according to an embodiment of the present invention. As shown in Fig. 2, when the front-facing camera of the target user terminal (for example, a smartphone) is turned on, the target user may record a video based on the target prompt information displayed on a data acquisition interface 100a, to obtain video data containing the target user (the video data may include multiple video frames). It should be understood that the video frames constituting the video data can be distributed sequentially along the time axis shown in Fig. 2. Therefore, when the target user terminal parses the acquired video data, it can obtain a target video sequence corresponding to the face (that is, the target object) of the target user, and the video frames in the target video sequence can be distributed sequentially in the time order shown in Fig. 2. Any two adjacent video frames on the time axis shown in Fig. 2 may serve as a first video frame and a second video frame in the target video sequence. For example, the video frame corresponding to the 1st moment on the time axis may be referred to as the first video frame in the target video sequence, and the video frame corresponding to the 2nd moment on the time axis may be referred to as the second video frame. Optionally, the video frame corresponding to the 2nd moment on the time axis may also be referred to as the first video frame, and the video frame corresponding to the 3rd moment on the time axis may be referred to as the second video frame. By taking any two consecutive video frames in the target video sequence as the first video frame and the second video frame, the target key points appearing in each video frame can be found quickly.
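The sliding pairing of adjacent frames described above can be sketched as follows (a minimal illustration; the function name is an assumption, and a "frame" here is any opaque value standing in for decoded image data):

```python
def adjacent_frame_pairs(sequence):
    """Pair every video frame with its successor, so that each adjacent
    (first_video_frame, second_video_frame) pair can be processed in turn."""
    return list(zip(sequence, sequence[1:]))
```

For a sequence of N frames this yields N-1 overlapping pairs, matching the scheme in which the frame at moment 2 is the second frame of one pair and the first frame of the next.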
For ease of understanding, this embodiment of the present invention takes, as the target key points, two key points at specific positions in the mouth region, to describe the detailed process of determining the key video frame under the mouth-open state based on these two key points. The two key points may be referred to as a first key point and a second key point. The first key point may be a key point on the upper lip within the mouth region, and the second key point may be the key point on the lower lip within the mouth region corresponding to the first key point; these two key points may then be collectively referred to as the target key points in a video frame. Each video frame in the target video sequence may contain such target key points. Therefore, by counting the position information of the target key points in each video frame, the dynamic feature value of the target key points in each video frame can be calculated, the video frames whose dynamic feature values meet the target state can then be quickly filtered out based on the dynamic feature value corresponding to the target key points in each video frame, and the key video frame can be found among the filtered video frames, improving the efficiency of image recognition. In other words, by counting the position-change rule of the target key points (that is, the first key point and the second key point) in each video frame, the distance difference between the first key point and the second key point in each video frame can be calculated, and each calculated distance difference can be referred to as the dynamic feature value corresponding to the target key points in the corresponding video frame. In view of this, in the target video sequence shown in Fig. 2, a key video frame that has a higher resolution and is in the mouth-open state can be found based on the dynamic feature value corresponding to the target key points of each video frame, so that liveness judgment can be performed on the mouth region (that is, the local image region) in the found key video frame, improving recognition efficiency. It should be understood that the purpose of liveness judgment is to confirm that the video data acquired by the target user terminal is valid data rather than invalid data obtained by flat re-shooting (using a photo, a certificate, or the like), which is aggressive toward the face verification system. In view of this, the key video frame can be quickly found through the statistics of the dynamic feature values, improving recognition efficiency and strengthening the authentication capability of the system.
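The distance difference between the first key point (upper lip) and the second key point (lower lip) can be sketched as a Euclidean distance; this is one natural reading of "distance difference" and is offered as an assumption for illustration, not as the patent's mandated formula:

```python
import math

def dynamic_feature_value(first_kp, second_kp):
    """Distance difference between the two target key points of one frame,
    e.g. upper-lip and lower-lip key points, used as the frame's dynamic
    feature value (mouth-open depth). Key points are (x, y) pixel tuples."""
    (x1, y1), (x2, y2) = first_kp, second_kp
    return math.hypot(x2 - x1, y2 - y1)
```

A closed mouth yields a small value; the wider the mouth opens, the larger the value, which is what lets the mouth-open state be filtered on this quantity.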
For the detailed process in which the target user terminal obtains the target video sequence, determines the dynamic feature values corresponding to the target key points, and extracts the local image region, refer to the implementations provided in the embodiments corresponding to Fig. 3 to Figure 11 below.
Further, refer to Fig. 3, which is a schematic flowchart of a video data processing method according to an embodiment of the present invention. As shown in Fig. 3, the method provided in this embodiment of the present invention may include the following steps.
Step S101: Obtain a target video sequence, and extract, from each video frame of the target video sequence, a target object region where a target object is located.
Specifically, the target user terminal may acquire video data containing the target object, parse the video data into the target video sequence corresponding to the target object, and obtain a first video frame and a second video frame from the target video sequence. Further, the terminal obtains the image region where the target object is located in the first video frame as the target object region in the first video frame, and obtains the image region where the target object is located in the second video frame as the target object region in the second video frame.
The target user terminal may be the target user terminal in the embodiment corresponding to Fig. 1; for example, it may be the user terminal 3000a in the embodiment corresponding to Fig. 1. The target user terminal may acquire the video data containing the target object when the camera is turned on. For example, when the front-facing camera is turned on, the target user terminal may display target prompt information on the data acquisition interface 100a in the embodiment corresponding to Fig. 2. The target prompt information may be used to instruct the target user (for example, a user A) of the target user terminal to make corresponding actions according to the target prompt information (for example, specific actions such as opening the mouth, closing the eyes, or raising the eyebrows), so that the target user terminal can acquire a video stream containing these actions, that is, obtain the video data containing the target object (for example, the face of the user A).
Optionally, in some face verification systems (for example, smart access control or remote authentication), when a rear camera is turned on, the target user terminal may also play the target prompt information displayed on the above data acquisition interface by means of voice broadcast, to acquire the video data of another user different from the user A. Further, refer to Fig. 4, which is a schematic diagram of acquiring video data according to an embodiment of the present invention. As shown in Fig. 4, a user B may receive the target prompt information (for example, "Please move closer to the camera terminal") displayed on a data acquisition interface 200a and broadcast by the target user terminal (that is, a smartphone), and may, according to the target prompt information, slowly move from a geographic location A to a geographic location B to draw closer to the target user terminal. The target user terminal can thereby acquire a video stream of the user B during the execution of the specific action, that is, obtain the video data containing the target object (for example, the face of the user B).
Further, the target user terminal may perform data parsing on the acquired video data to obtain the target video sequence corresponding to the target object. The target video sequence may be the target video sequence in the embodiment corresponding to Fig. 2 and contains N video frames. For ease of understanding, this embodiment of the present invention takes only two consecutive adjacent video frames in the target video sequence as an example, to describe the detailed process of determining the target key points from these two video frames. The two video frames may be the video frame at the 1st moment and the video frame at the 2nd moment in the embodiment corresponding to Fig. 2, where the video frame at the 1st moment may be referred to as the first video frame, and the video frame at the 2nd moment may be referred to as the second video frame.
Step S102: Perform key point positioning on the target object in the target object region, to obtain the target key points of the target object in each video frame and the position information of the target key points of each video frame.
Specifically, the target user terminal may perform key point positioning on the target object in the target object region in the first video frame, to obtain all key points of the target object in the first video frame and the position information of all the key points in the first video frame, and determine, from all the obtained key points, the two key points at a first position as the target key points of the first video frame. Further, the target user terminal may track all the key points in the first video frame within the target object region in the second video frame, to obtain all key points in the second video frame and the position information of all the key points in the second video frame. Further, according to the target key points in the first video frame, the target user terminal may determine, among all the key points contained in the target object region of the second video frame, the two key points at a second position as the target key points of the second video frame.
It should be understood that, for the two consecutive video frames (that is, the first video frame and the second video frame) obtained in step S101, the target user terminal can position all the key points in the target object region of each of the two video frames, and can then locate, in each video frame, the target key points of that video frame and the position information of the target key points. In other words, the target user terminal may perform face detection on the input video frames based on a first network model (for example, a convolutional neural network model in the first network model) to get the face position of each video frame (the face position may be referred to as a face frame, or as a face region), and may then determine, in each video frame based on the found face frame, all the key points associated with the target object and the position information of all the key points. The target key points are the two key points at a preset position in each video frame (for example, a key point on the upper eyelid and a corresponding key point on the lower eyelid), and these two key points are a part of all the key points in the corresponding video frame. For the continuously acquired video frames containing a specific action, the positional change of the target key points across the video frames can be counted, and under normal circumstances the positional change follows a certain change rule.
To save face detection time and greatly improve processing speed, face detection may be performed only in the initial stage. In other words, if the first video frame is the first video frame of the target video sequence, the target user terminal may, based on the first network model, filter out the background region in the first video frame, identify, based on the first network model, the image region of the target object in the first video frame after the background region is filtered out, and use the identified image region as the target object region of the target object in the first video frame. Further, the target user terminal may position the key points of the target object within the target object region in the first video frame, thereby obtaining all the key points of the target object in the first video frame and the position information of all the key points in the first video frame, and may determine, from all the obtained key points, the two key points at the first position as the target key points of the first video frame.
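The "detect once, then track" control flow described above can be sketched as follows. This is a hedged illustration: `detect` and `track` are placeholders for the patent's (unnamed) detection model and key point tracking algorithm, injected as callables so the flow itself is testable:

```python
def locate_keypoints(frames, detect, track):
    """Run the expensive detector only on the first frame of the sequence,
    then propagate the key points to every later frame with the cheap
    tracker. Returns one key point list per frame."""
    keypoints_per_frame = [detect(frames[0])]
    for prev_frame, frame in zip(frames, frames[1:]):
        keypoints_per_frame.append(
            track(prev_frame, frame, keypoints_per_frame[-1]))
    return keypoints_per_frame
```

The design choice this illustrates is exactly the one the passage motivates: detection runs once, so per-frame cost after the initial stage is only that of tracking.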
For example, face detection may first be performed on the first video frame of the target video sequence to obtain a face frame, after which multiple key points (for example, 94 key points) can be extracted from the face frame. Then, based on the extracted key points and the position information of these key points, the target user terminal may use a key point tracking algorithm in the subsequent frame (that is, the second video frame), and may use the mapped key points tracked in the second video frame as the key points in the corresponding frame. It should be understood that each mapped key point in the second video frame (which may also be referred to as each key point in the second video frame) has a one-to-one mapping relationship with the corresponding key point in the first video frame.
For ease of understanding, further refer to Fig. 5, which is a schematic diagram of obtaining key points according to an embodiment of the present invention. When the first video frame is the first video frame of the target video sequence, the target user terminal may filter out the background region of the first video frame based on a convolutional neural network model (for example, a model A) in the first network model, to obtain the target object region containing the target object (that is, the face). In other words, the target user terminal may filter out the background region in the first video frame, perform face recognition on the image region in the first video frame after the background region is filtered out, and determine the identified image region as the region where the face is located, that is, obtain the target object region in the first video frame shown in Fig. 5. Then, the target user terminal may input the region where the face is located (that is, the target object region) into another convolutional neural network model (for example, a model B) in the first network model. The model B is used to identify each part (that is, the facial contour and the facial features) in the face region, so that the facial contour and the facial features can be represented by corresponding numbers of key points in the target object region shown in Fig. 5. In other words, the target user terminal may perform key point positioning on the target object in the target object region shown in Fig. 5, to extract from the target object region multiple key points associated with each part. The target user terminal may further refer to these extracted key points collectively as all the key points of the target object in the first video frame, and may obtain the position information of these key points from the first video frame, so that all the key points in the first video frame can be tracked in subsequent video frames, and all the key points of the corresponding video frames and the position information of these key points can be obtained stably in the subsequent frames.
Further, refer to Fig. 6, which is a schematic diagram of tracking key points according to an embodiment of the present invention. Based on all the key points in the first video frame obtained in the embodiment corresponding to Fig. 5, a schematic diagram of the key points in the first video frame as shown in Fig. 6 can be obtained. Optionally, the target user terminal may further add all the obtained key points in the first video frame to a key point set M corresponding to the first video frame, and all the key points in the key point set M may be displayed in the first video frame as shown in Fig. 6. If the specific action during data acquisition is opening the mouth, the target user terminal may, after obtaining the multiple key points in the first video frame described in Fig. 6, further take the two key points at the preset position (that is, the first position in the first video frame) in a display interface 100b shown in Fig. 6 (a key point A and a key point B) as a first key point pair. In addition, the target user terminal may also track all the key points in the first video frame, to obtain the mapped key point of each key point in the second video frame (each mapped key point in the second video frame may be referred to as a key point in the second video frame). In other words, the target user terminal may also determine the position information of all the key points in the first video frame, map all the key points in the first video frame to the target object region in the second video frame according to a key point tracking algorithm and the position information of each key point in the first video frame, and may further obtain, based on the key points mapped into the target object region of the second video frame, all the key points in the second video frame as shown in Fig. 6. After obtaining all the key points in the second video frame, the terminal may add these mapped key points to a key point set N corresponding to the second video frame, so that each key point in the key point set N has a one-to-one mapping relationship with each key point in the key point set M. It should be understood that, for the same target object, after all the key points in the first video frame are mapped, the mapped key point of each key point can be found in the second video frame; that is, there is a one-to-one mapping relationship between each key point in the first video frame and the corresponding mapped key point in the second video frame. In view of this, each key point in the first video frame and each mapped key point in the second video frame may be collectively referred to as the key points in the two video frames. Further, the target user terminal may determine a key point B' and a key point A' obtained through mapping in a display interface 200b described in Fig. 6 as a second key point pair in the second video frame. Further, the target user terminal may collectively refer to the first key point pair in the first video frame and the second key point pair in the second video frame as the target key points in the corresponding video frames. It should be understood that, when the target user terminal finds the key point pairs constituted by the two key points at the corresponding positions in each video frame of the target video sequence, these found key point pairs may be collectively referred to as the target key points in the corresponding video frames of the target video sequence, and the positional changes of these key points in each video frame can then be counted, so that step S103 can be further performed.
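The index-preserving mapping from key point set M to key point set N can be sketched as follows. This is a minimal illustration under stated assumptions: `motion` stands in for the real tracking algorithm (for example, sparse optical flow), and the function names are not from the patent:

```python
def map_keypoint_set(set_m, motion):
    """Map every key point of set M (first video frame) into set N (second
    video frame). Indices are preserved, so the mapping is one-to-one: the
    i-th point of N is the tracked image of the i-th point of M."""
    return [motion(pt) for pt in set_m]

def target_pair(keypoint_set, pair_indices):
    """Select the target key point pair (e.g. key points A and B) by index;
    the same indices pick A'/B' in the mapped set."""
    i, j = pair_indices
    return keypoint_set[i], keypoint_set[j]
```

Because the mapping preserves indices, the pair (A, B) chosen in frame one automatically identifies the pair (A', B') in frame two, which is what makes per-frame re-selection of the target key points unnecessary.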
It can be understood that the first network model can be understood as a multitask neural network model. The multitask neural network model can identify, according to the obtained target video sequence under the specific action, the target object in each video frame of the target video sequence, and use the identified image region as the target object region. For each identified target object region, the target user terminal may further identify, based on the multitask neural network model, the body part to which the target key points belong from the target object region, and may then capture, in the local image region where the part is located, the target key points associated with the specific action and the position information of the target key points (for example, when the specific action is opening the mouth, the target key points corresponding to the mouth region may be captured in the display interface 100b described above). The captured target key points can then be tracked through the key point tracking algorithm, to quickly find the target key points corresponding to the part in the second video frame. The target key points in the target video sequence may include the first key point pair and the second key point pair in the embodiment corresponding to Fig. 6.
It should be understood that, through the key point tracking algorithm, the second video frame described above may be taken as a new first video frame, and the video frame at the 3rd moment adjacent to the new first video frame in the embodiment corresponding to Fig. 2 may be referred to as a new second video frame. Then, the target user terminal may use all the key points in the second video frame in the embodiment corresponding to Fig. 6 as the key point set in the new first video frame, and may use the second key point pair in the embodiment corresponding to Fig. 6 as the first key point pair determined from that key point set in the new first video frame. To distinguish it from the first key point pair in the first video frame in the embodiment corresponding to Fig. 6, the first key point pair in the new first video frame may be referred to as a third key point pair. Further, the target user terminal may track all the key points in the new first video frame to obtain all the key points in the new second video frame, and may then determine a new second key point pair among all the key points in the new second video frame according to the third key point pair in the new first video frame. It can be understood that, to distinguish it from the second key point pair in the second video frame in the embodiment corresponding to Fig. 6, the new second key point pair determined in the new second video frame may be referred to as a fourth key point pair. Similarly, the third key point pair and the fourth key point pair may be collectively referred to as the target key points in the target video sequence. It should be understood that the fourth key point pair determined in each subsequent video frame can be understood as the second key point pair determined in each subsequent video frame. Therefore, the target user terminal may determine the first key point pair in the first video frame of the target video sequence, together with the second key point pair determined in each video frame after the first video frame, as the target key points of each video frame in the target video sequence.
In view of this, all the key points in the first video frame are mapped into the subsequent video frames, so that the mapped key points of these key points can be found in the subsequent video frames, and each mapped key point in a subsequent video frame may be referred to as a key point in the corresponding video frame. It can be understood that, for the detailed process of obtaining the target key points in each subsequent video frame, refer to the detailed process of obtaining the second key point pair in the second video frame described above; details are not repeated here.
Step S103: Obtain, based on the position information of the target key points of each video frame, the dynamic feature value corresponding to the target key points of each video frame.
Specifically, the target user terminal may obtain the position information of the target key points of each video frame in the target video sequence, determine, according to the position information of the target key points of each video frame, the distance difference corresponding to the target key points of each video frame, and determine the determined distance difference as the dynamic feature value corresponding to the target key points of the corresponding video frame.
The dynamic feature value corresponding to the target key points of each video frame may include a dynamic feature value under a first sign state and a dynamic feature value under a second sign state, where the first sign state and the second sign state may be collectively referred to as the target state. Optionally, the dynamic feature value of each video frame may be only the dynamic feature value under the first sign state, or only the dynamic feature value under the second sign state. The first sign state may be the mouth-open state, and the second sign state may be the eye-closed state; optionally, the first sign state may be the eye-open state, and the second sign state may be the mouth-closed state.
For ease of understanding, this embodiment of the present invention takes only the case where the first sign state is the mouth-open state as an example, to describe the positional changes, in each video frame, of the target key points in the mouth region in the embodiment corresponding to Fig. 2. Further, refer to Fig. 7, which is a schematic diagram of obtaining the dynamic feature value of the target key points according to an embodiment of the present invention. The target key points are the two key points at specific positions determined after key point positioning is performed on the target object in each video frame based on the first network model; that is, the target key points in each video frame are the key point pair determined among the key points representing the face and the facial contour identified from the corresponding video frame. Therefore, the process in which the target user terminal counts the positional changes of the target key points in each video frame of the target video sequence under the mouth-open state is equivalent to counting the dynamic feature values of the target key points in the mouth region shown in Fig. 7. The video frames under the target state can then be found based on the dynamic feature value corresponding to the target key points in each video frame; that is, video frames whose motion is coherent and that are in the mouth-open state can be filtered out from the target video sequence. The dynamic feature value of the target key points in each video frame may be represented by the mouth-open depth (that is, the distance difference between the two key points). For ease of understanding, the mouth-open depths in the video frames in the embodiment corresponding to Fig. 2 may be represented respectively by a first distance difference, a second distance difference, ..., and an Nth distance difference as shown in Fig. 7, so that the dynamic feature value corresponding to the target key points of each video frame can be obtained.
In other words, in the target video sequence, the mouth-opening depth in the video frame at the 1st moment may be the first distance difference corresponding to the target key points, the mouth-opening depth in the video frame at the 2nd moment may be the second distance difference corresponding to the target key points, ..., and the mouth-opening depth in the video frame at the Nth moment may be the Nth distance difference corresponding to the target key points. Since the video frames in the target video sequence are serialized in the time order of the embodiment corresponding to Fig. 2 above, the mouth-opening depth of each video frame can be obtained by counting the location information of the two key points among the target key points (i.e., the first key point and the second key point) in each video frame, thereby obtaining the dynamic feature value corresponding to the target key points of the corresponding video frame shown in Fig. 7.
For ease of understanding, the dynamic feature values of the target key points in two adjacent video frames can be described with reference to key point A and key point B in the first video frame, and key point A' and key point B' in the second video frame, of the embodiment corresponding to Fig. 6 above. Key point A and key point B can be understood as two key points at preset positions under a specific action; for example, key point A is a key point on the lower lip, and key point B is the key point on the upper lip corresponding to key point A. Key point A' is the mapped key point obtained by mapping key point A into the second video frame, and key point B' is the mapped key point obtained by mapping key point B into the second video frame. The target user terminal can then obtain the location information of key point A and key point B in the first video frame in the display interface 100b above, and obtain, from the distance difference between these two key points, the mouth-opening depth, in the first video frame, of the target key points constituted by key point A and key point B (i.e., the first distance value shown in Fig. 7). Similarly, the target user terminal can obtain the location information of key point A' and key point B' in the second video frame in the display interface 200b above, and then calculate the mouth-opening depth of the target key points in the second video frame (i.e., the second distance value shown in Fig. 7). Likewise, based on all the key points tracked in the second video frame, these key points can be tracked in each subsequent video frame, and the target key points at the corresponding positions can be captured among the tracked key points; therefore, based on the location information of the target key points in each subsequent video frame, the mouth-opening depth of the target key points in each subsequent video frame can be calculated (for example, the Nth distance difference in the Nth video frame shown in Fig. 7 can be obtained, where N can be a positive integer greater than or equal to 3). Once the mouth-opening depth in each video frame is obtained, the dynamic feature value of the target key points in the corresponding video frame can be determined. The dynamic feature value can be used to depict the mouth-opening depth and/or eye-opening depth of the target key points under the target state, where the target state can be understood as the state at the time of a mouth-opening action and/or an eye-closing action.
Here, the mouth-opening depth is the distance difference corresponding to the target key points calculated in the embodiment corresponding to Fig. 7 above. Since the location information of the target key points differs from frame to frame, the distance difference corresponding to the target key points in each video frame of the target video sequence can be referred to as the dynamic feature value corresponding to those target key points. According to the sizes of these distance differences, the target user terminal can further determine by statistics that, in the embodiment corresponding to Fig. 7 above, the second distance difference is the maximum distance difference (for example, 5 cm); that is, the video frame at the 2nd moment, corresponding to the second distance difference, can be regarded as the video frame in the fully open state. Optionally, the target user terminal can also determine from these distance differences that the Nth distance difference is the minimum distance difference (for example, 0 cm); the video frame at the Nth moment, corresponding to the Nth distance difference, can then be regarded as the video frame in the fully closed state. Similarly, the eye-opening depth can be understood as the distance difference, in the eye region, of the target key points in the corresponding video frames, counted by the target user terminal based on the position changes of the target key points in the eye region in the embodiment corresponding to Fig. 6 above; through the counted distance differences in the eye region, the video frames under the closed-eye state, or the video frames under the open-eye state, can accordingly be found among these video frames.
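The search for the fully open and fully closed frames from the per-frame depth series could be sketched as follows (the depth values are illustrative, matching the worked example in Table 1 below, not measured data):

```python
# Per-frame mouth-opening depths (distance differences), in cm,
# one per moment of the target video sequence.
depths = [3.0, 5.0, 4.0, 3.0, 1.0, 0.0]

# Index of the maximum depth -> fully open state (2nd moment here);
# index of the minimum depth -> fully closed state (6th moment here).
fully_open = max(range(len(depths)), key=lambda i: depths[i])
fully_closed = min(range(len(depths)), key=lambda i: depths[i])
```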
Step S104, selecting, from the target video sequence, the video frames whose dynamic feature values satisfy the target state;
Specifically, if the dynamic feature value corresponding to the target key points includes both the dynamic feature value under the first sign state and the dynamic feature value under the second sign state, the target user terminal can, over the video frames of the target video sequence, obtain the dynamic feature values under the first sign state, take the first maximum dynamic feature value among them, and determine a first target threshold based on that first maximum dynamic feature value; at the same time, over the video frames, the target user terminal can also obtain the dynamic feature values under the second sign state, take the second maximum dynamic feature value among them, and determine a second target threshold based on that second maximum dynamic feature value. Further, the target user terminal can compare the dynamic feature values under the first sign state against the first target threshold, and compare the dynamic feature values under the second sign state against the second target threshold; the target user terminal can then, according to the comparison results, screen out of the target video sequence the video frames whose dynamic feature values satisfy the target state, i.e., the video frames under the target state can be characterized as the video frames, screened out of the target video sequence, in which the motion is coherent and the target object is in the target state.
The comparison results may include a first comparison result and/or a second comparison result. Based on the first comparison result, the video frames whose dynamic feature values under the first sign state are greater than the first target threshold can be screened out; based on the second comparison result, the video frames whose dynamic feature values under the second sign state are less than the second target threshold can be screened out. There may be m video frames screened out based on the first comparison result, and n video frames screened out based on the second comparison result. It should be understood that when the dynamic feature value of each video frame in the target video sequence simultaneously includes the dynamic feature value under the first sign state and the dynamic feature value under the second sign state, the dynamic feature value under the first sign state can be compared against the first target threshold and the dynamic feature value under the second sign state against the second target threshold, to obtain the comparison result for the corresponding video frame; in this case the comparison result may include both the first comparison result and the second comparison result, i.e., the k video frames whose dynamic feature values under the first sign state are greater than the first target threshold and whose dynamic feature values under the second sign state are less than the second target threshold can be screened out of the target video sequence. Here m, n, and k can all be positive integers, and the k screened-out video frames can be understood as a subset of the m screened-out video frames that is also a subset of the n screened-out video frames; in other words, the k video frames are the video frames common to the m screened-out video frames and the n screened-out video frames.
For ease of understanding, the embodiment of the present invention takes only the target key points under the first sign state (for example, the mouth-open state) as an example, to illustrate the detailed process of screening, from the target video sequence, the video frames whose dynamic feature values satisfy the target state. Further, refer to Table 1, which shows the distribution, over part of the consecutive video frames, of the dynamic feature values of the target key points (the target key points under the first sign state) counted by the embodiment of the present invention.
Table 1

Video frame | 1st moment | 2nd moment | 3rd moment | 4th moment | 5th moment | 6th moment
Dynamic feature value | 3 cm | 5 cm | 4 cm | 3 cm | 1 cm | 0 cm
Based on steps S101-S103 above, the target user terminal can obtain, by key point location, the location information of the target key points of each video frame in the target video sequence, and can then calculate, from that location information, the dynamic feature value of the target key points of the corresponding video frame, i.e., the dynamic feature value of the target key points in each video frame. As shown in Table 1 above, the target user terminal can count that the dynamic feature value of the target key points (for example, the target key points in the mouth region above) is 3 cm in the video frame at the 1st moment, 5 cm at the 2nd moment, 4 cm at the 3rd moment, 3 cm at the 4th moment, 1 cm at the 5th moment, and 0 cm at the 6th moment. The target user terminal can then find, among these dynamic feature values, that the maximum dynamic feature value under the mouth-open state is 5 cm; that is, the maximum dynamic feature value under the mouth-open state can be referred to as the first maximum dynamic feature value (i.e., 5 cm) obtained from the dynamic feature values under the first sign state. The target user terminal can then determine the first target threshold based on the first maximum dynamic feature value and a threshold parameter (for example, the first target threshold can be: 5 cm * 0.4 = 2 cm). Here the threshold parameter is 0.4; it is used to judge the open/closed state of the first sign information region (i.e., the mouth region), in other words, the determined first target threshold can be used to judge the open/closed state of the mouth. Therefore, the video frames in the target video sequence whose dynamic feature values of the target key points in the mouth region are greater than the first target threshold can be referred to as the video frames under the mouth-open state, and the target user terminal can determine the video frames with the found dynamic feature values as the video frames satisfying the target state. For example, as shown in Table 1 above, the four (m = 4) video frames whose dynamic feature values are greater than 2 cm (i.e., the video frames at the 1st, 2nd, 3rd, and 4th moments) can be determined as the video frames whose dynamic feature values satisfy the target state. In other words, the target user terminal can screen out, from the target video sequence, these four video frames under the mouth-open state, so that step S105 can subsequently be performed, i.e., the key video frame can be determined from the four screened-out video frames under the target state.
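The Table 1 walkthrough above (threshold parameter 0.4, first target threshold 5 cm * 0.4 = 2 cm) can be sketched in a few lines; the function name is illustrative:

```python
def screen_open_frames(depths, ratio=0.4):
    """Screen frames whose mouth-opening depth exceeds
    ratio * (maximum depth), as in the Table 1 walkthrough."""
    threshold = max(depths) * ratio
    return [i for i, d in enumerate(depths) if d > threshold]

# Table 1 values (cm): threshold = 5 * 0.4 = 2 cm -> moments 1-4.
selected = screen_open_frames([3.0, 5.0, 4.0, 3.0, 1.0, 0.0])
```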
Step S105, determining the video frames under the target state as key video frames, and cropping, according to the body part to which the target key points in the key video frame belong, the local image region where that body part is located in the key video frame.
As shown in Table 1 above, the target user terminal can screen out, from the target video sequence, these four video frames in which the motion is coherent and the target object is in the target state, as the screened-out video frames under the target state, and can then find the key video frame among these four video frames through a quality assessment model (for example, the key video frame can be the video frame at the 2nd moment shown in Table 1, i.e., the video frame with the highest definition). The body part to which the target key points belong can then be found in the key video frame, and the region where that body part is located can be cropped out as the local image region (that is, the target key points in the video frame at the 2nd moment of the embodiment corresponding to Fig. 2 above are in the region where the mouth is located, so the region where the mouth is located can be taken as the local image region), so as to further perform step S106. In this case, the target user terminal does not consider the open/closed state of the eyes under the second sign state.
Optionally, for ease of understanding, the embodiment of the present invention can also take only the target key points under the second sign state (for example, the closed-eye state) as an example, to illustrate the detailed process of screening, from the target video sequence, the video frames whose dynamic feature values satisfy the target state. Further, refer to Table 2, which shows the distribution, over part of the consecutive video frames, of the dynamic feature values of the target key points (the target key points under the second sign state) counted by the embodiment of the present invention.
Table 2

Video frame | 1st moment | 2nd moment | 3rd moment | 4th moment | 5th moment | 6th moment
Dynamic feature value | 1.5 cm | 1.2 cm | 0.6 cm | 0.5 cm | 0.4 cm | 0 cm
When the target key points are two key points at specific positions in the eye region, the target user terminal can calculate the distance difference between the locations of these two key points in each video frame, i.e., obtain the dynamic feature value of the target key points in each video frame. As shown in Table 2 above, the target user terminal can count that the dynamic feature value of the target key points (i.e., the target key points constituted by the two key points in the eye region) is 1.5 cm in the video frame at the 1st moment, 1.2 cm at the 2nd moment, 0.6 cm at the 3rd moment, 0.5 cm at the 4th moment, 0.4 cm at the 5th moment, and 0 cm at the 6th moment. The target user terminal can then find, among these dynamic feature values, that the maximum dynamic feature value under the open-eye state is 1.5 cm; that is, the maximum dynamic feature value under the open-eye state can be referred to as the second maximum dynamic feature value (i.e., 1.5 cm) obtained from the dynamic feature values under the second sign state. The target user terminal can then determine the second target threshold based on the second maximum dynamic feature value and the threshold parameter (for example, the second target threshold can be: 1.5 cm * 0.4 = 0.6 cm). Here the threshold parameter is 0.4; it is used to judge the open/closed state of the second sign information region (i.e., the eye region), in other words, the determined second target threshold can be used to judge the open/closed state of the eyes. Therefore, the video frames in the target video sequence whose dynamic feature values of the target key points in the eye region are less than the second target threshold can be referred to as the video frames under the closed-eye state, and determined as the video frames, found by the target user terminal, whose dynamic feature values satisfy the target state. For example, as shown in Table 2 above, the three (n = 3) video frames whose dynamic feature values are less than 0.6 cm (i.e., the video frames at the 4th, 5th, and 6th moments) can be determined as the video frames whose dynamic feature values satisfy the target state. The video frames with coherent motion in which the target object is under the target state (i.e., the video frames at the 4th, 5th, and 6th moments) can thus be screened out of the target video sequence as the selected video frames whose dynamic feature values satisfy the target state. The key video frame can then be further found among these three video frames through the quality assessment model (for example, the key video frame can be the video frame at the 6th moment shown in Table 2), the body part to which the target key points belong (i.e., the eyes) can be further found in the key video frame, and the region where that body part is located can be cropped out of the key video frame as the local image region (that is, the target key points in the video frame at the 6th moment of the embodiment corresponding to Fig. 2 above are in the region where the eyes are located, so the region where the eyes are located can be taken as the local image region), so as to further perform step S106. In this case, the target user terminal does not consider the open/closed state of the mouth under the first sign state.
Optionally, when selecting, from the target video sequence, the video frames whose dynamic feature values satisfy the target state, the target user terminal can also consider the dynamic feature value under the first sign state and the dynamic feature value under the second sign state synchronously; that is, the video frames whose dynamic feature values under the first sign state are greater than the first target threshold (i.e., > 2 cm) and whose dynamic feature values under the second sign state are less than the second target threshold (i.e., < 0.6 cm) are determined as the screened-out video frames whose dynamic feature values satisfy the target state, so as to obtain the k screened-out video frames. Combining the first comparison result corresponding to the dynamic feature values, in the six video frames above, of the target key points in the mouth region (which can here be referred to as the first target key points) under the first sign state (i.e., based on Table 1 above, four video frames whose dynamic feature values are greater than the first target threshold can be found in the target video sequence) with the second comparison result corresponding to the dynamic feature values, in the six video frames above, of the target key points in the eye region (which can here be referred to as the second target key points) under the second sign state (i.e., based on Table 2 above, three video frames whose dynamic feature values are less than the second target threshold can be found in the target video sequence), the video frames whose dynamic feature values satisfy the target state (i.e., both the mouth-open and the closed-eye state) can be screened out based on the first comparison result and the second comparison result. In this case, the target key points in the target video sequence simultaneously include the first target key points and the second target key points. Accordingly, for the six video frames shown in Tables 1 and 2 above, the target user terminal can screen out one (k = 1) video frame (i.e., the video frame at the 4th moment) from these six video frames as the video frame, selected from the target video sequence, whose dynamic feature values satisfy the target state. It can be understood that the video frame at the 4th moment is the video frame common to the four screened-out video frames and the three screened-out video frames. Accordingly, the target user terminal can further determine the screened-out video frame at the 4th moment as the key video frame, determine that the body parts to which the target key points in the key video frame belong include the eyes and the mouth, then crop the eye region and the mouth region out of the key video frame, and determine the cropped-out eye region and mouth region as the local image regions corresponding to the target key points, so as to further perform step S106.
Step S106, recognizing the body part in the local image region to obtain a target recognition result, and determining the attribute of the target object based on the target recognition result.
Specifically, the target user terminal can determine the divided local image region as a region to be processed, perform feature extraction on the region to be processed based on a second network model, and obtain the image features corresponding to the body part in the region to be processed; obtain, in the second network model, the matching degrees between the image features and the multiple attribute type features in the second network model; associate the matching degrees obtained by the second network model with the label information corresponding to the multiple attribute type features in the second network model, obtain the target recognition result corresponding to the second network model, and determine, based on the target recognition result, that the attribute corresponding to the target object is a living-body attribute. Optionally, the target user terminal can also determine, based on the target recognition result, that the attribute corresponding to the target object is a non-living-body attribute.

In the process of making the liveness judgment through the local image region, the second network model needs to be trained in advance, so that the trained second network model can improve the recognition rate of image recognition. It should be understood that the local image features in the local image region cropped out of the key video frame may include the image features corresponding to body parts such as eye features, mouth features, nose features, ear features, and eyebrow features, from which the matching degrees between the image features corresponding to the body part and the multiple attribute type features in the second network model can be obtained.
Further, refer to Fig. 8, which is a schematic diagram of cropping a local image region according to an embodiment of the present invention. If the body parts to which the target key points belong include the first sign information and the second sign information, where the first sign information is, for example, the eyes and the second sign information is, for example, the mouth, the cropped-out local image regions can be the eye image region and the mouth image region shown in Fig. 8. Optionally, if the body part to which the target key points belong is only the second sign information, the cropped-out local image region can be the region where the mouth under the mouth-open state is located. Optionally, if the body part to which the target key points belong is only the first sign information, the cropped-out local image region can be the region where the eyes under the closed-eye state are located.
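The patent does not specify how the local image region is cropped; one plausible sketch, assuming the region is the padded bounding box of the relevant key points (the margin, names, and sizes are made up for illustration), is:

```python
import numpy as np

def crop_local_region(frame, keypoints, margin=8):
    """Crop the bounding box of the given (x, y) key points,
    padded by `margin` pixels and clipped to the frame bounds."""
    xs = [int(x) for x, _ in keypoints]
    ys = [int(y) for _, y in keypoints]
    h, w = frame.shape[:2]
    x0, x1 = max(min(xs) - margin, 0), min(max(xs) + margin, w)
    y0, y1 = max(min(ys) - margin, 0), min(max(ys) + margin, h)
    return frame[y0:y1, x0:x1]

frame = np.zeros((120, 160), dtype=np.uint8)   # stand-in key video frame
mouth = crop_local_region(frame, [(70, 90), (90, 100)])
```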
Further, as shown in Fig. 8, the target user terminal can input the cropped-out local image region into the second network model shown in Fig. 8, which can be a model obtained after training on a large sample set (real-person samples and attack samples). The second network model can be a convolutional neural network model, and the region to be processed can be the local target region corresponding to the target object (the face) (for example, the region where the mouth is located in the embodiment corresponding to Fig. 8 above). To improve the accuracy of the subsequent image data recognition in the region to be processed, the region to be processed corresponding to the target object can first be adjusted to a fixed size, and the image data in the resized region to be processed can then be fed into the input layer of the convolutional neural network model. The convolutional neural network model may include an input layer, a convolutional layer, a pooling layer, a fully connected layer, and an output layer, where the parameter size of the input layer equals the size of the resized region to be processed. After the image data in the region to be processed is input to the input layer of the convolutional neural network, it enters the convolutional layer: first, a small block of the image data in the region to be processed is randomly selected as a sample, and some feature information is learned from this small sample; then this sample is slid, as a window, over all the pixel regions of the region to be processed. That is to say, the feature information learned from the sample is convolved with the image data in the region to be processed, so as to obtain the most significant image features of the image data at different locations of the region to be processed (for example, when the target object is the face of an animal or a person, the local image features corresponding to each facial position of the face in the region to be processed can be obtained). After the convolution operation is completed, the image features of the image data in the region to be processed have been extracted, but the number of features extracted by the convolution operation alone is large; to reduce the amount of computation, a pooling operation is also needed, i.e., the image features extracted from the region to be processed by the convolution operation are transmitted to the pooling layer, and aggregate statistics are computed over the extracted image features. The order of magnitude of these statistical image features is far lower than that of the image features extracted by the convolution operation, and the classification effect can also be improved. Common pooling methods mainly include the average pooling method and the maximum pooling method: average pooling computes, within an image feature set, an average image feature to represent that feature set; maximum pooling extracts, within an image feature set, the maximum image feature to represent that feature set. Through the convolution processing of the convolutional layer and the pooling processing of the pooling layer, the static structural feature information of the image data in the region to be processed can be extracted, i.e., the image features corresponding to the region to be processed can be obtained.
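The two pooling methods described above can be illustrated with a small NumPy sketch (2x2 non-overlapping windows; the feature map values are made up):

```python
import numpy as np

def pool2d(features, size=2, mode="max"):
    """Non-overlapping `size` x `size` pooling over a square feature map."""
    h, w = features.shape
    blocks = features.reshape(h // size, size, w // size, size)
    if mode == "max":
        return blocks.max(axis=(1, 3))       # maximum pooling
    return blocks.mean(axis=(1, 3))          # average pooling

fmap = np.array([[1., 2., 0., 1.],
                 [3., 4., 2., 2.],
                 [0., 1., 5., 6.],
                 [1., 0., 7., 8.]])
pooled_max = pool2d(fmap, mode="max")
pooled_avg = pool2d(fmap, mode="avg")
```

Either variant reduces a 4x4 feature map to 2x2 statistics, which is exactly the order-of-magnitude reduction the text attributes to the pooling layer.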
The fully connected classification layer (i.e., the classifier) in the convolutional neural network model is then used to identify the matching degrees between the image features corresponding to the to-be-processed region and the multiple attribute-type features in the convolutional neural network model. The classifier in the convolutional neural network model is trained in advance; its input is the image features corresponding to the to-be-processed region, and its output is the matching degrees between those image features and the various attribute-type features. A higher matching degree indicates a higher probability that the local image features extracted from the to-be-processed region match the label information corresponding to the respective attribute-type feature. Therefore, the target user terminal can further determine the maximum matching degree among the matching degrees output by the classifier of the convolutional neural network model, and, according to the maximum matching degree and the label information associated with the attribute-type feature having that maximum matching degree, obtain the target recognition result corresponding to the second network model.
The attribute corresponding to the target object can then be determined based on the target recognition result; for example, the attribute may include a living-body attribute corresponding to a real face and a non-living-body attribute corresponding to a fake face. The number and type of attribute-type features included in the second network model are determined by the number and type of label information contained in the large training sample set (i.e., a large number of positive-sample video clips and a large number of negative-sample video clips) used when training the second network model. The positive samples are sample data in which the attribute of the target object is the living-body attribute, i.e., video clips containing a real person; the negative samples are sample data in which the attribute of the target object is the non-living-body attribute, i.e., sample data containing attacks such as hole-punched photos, photo cut-outs, or scribbled photos.
Therefore, if the classifier in the second network model shown in Fig. 8 includes both a classifier for recognizing the eyes and a classifier for recognizing the mouth, then in the embodiment corresponding to Fig. 8 above, as long as either the maximum matching degree obtained by the mouth classifier or the maximum matching degree obtained by the eye classifier fails to satisfy the local liveness-detection threshold, the attribute corresponding to the target object is considered to be the non-living-body attribute. Optionally, if the classifier in the second network model shown in Fig. 8 includes only the classifier for recognizing the eyes, then when the maximum matching degree obtained by the eye classifier fails to satisfy the local liveness-detection threshold, the attribute corresponding to the target object is considered to be the non-living-body attribute; otherwise, the attribute corresponding to the target object may be the living-body attribute. Optionally, if the classifier in the second network model shown in Fig. 8 includes only the classifier for recognizing the mouth, then when the maximum matching degree obtained by the mouth classifier fails to satisfy the local liveness-detection threshold, the attribute corresponding to the target object is considered to be the non-living-body attribute; otherwise, the attribute corresponding to the target object may be the living-body attribute.
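The decision rule above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the threshold value, and the matching degrees are hypothetical, and "every part classifier must pass" is simply the rule described for the two-classifier case.

```python
def classify_attribute(max_matching_degrees, local_liveness_threshold):
    """Return 'living' only if the maximum matching degree of every part
    classifier (e.g. eye, mouth) satisfies the local liveness-detection
    threshold; if any one falls short, the attribute is 'non-living'."""
    if all(m >= local_liveness_threshold
           for m in max_matching_degrees.values()):
        return "living"
    return "non-living"

# The eye classifier passes but the mouth classifier does not,
# so the target object is classified as non-living.
degrees = {"eye": 0.93, "mouth": 0.41}
print(classify_attribute(degrees, 0.80))  # non-living
```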
It can be understood that, by collecting data streams containing a specific action, a set of action language associated with the target object can be obtained. The action language can be understood as the position-change rule of the target key points across the video frames (i.e., the variation of the location information of the two key points in each video frame). Based on this position-change rule, the video frames under a particular state can be screened out, and the attribute of the target object (which may include the living-body attribute and the non-living-body attribute) can then be further determined based on the video frames under that particular state. For a real person, the key video frame under the particular state can be found based on the set of action language; whereas for a reproduced video synthesized by cropping photos or videos, each video frame in the obtained target video sequence will exhibit cropping and editing traces, so the second network model can identify the part in the specified region as having the non-living-body attribute. Malicious attacks using photos, videos, or static 3D models can thereby be effectively resisted. In addition, by screening the video frames under the particular state out of the target video sequence and handing them to the second network model for liveness recognition, the efficiency of liveness recognition can be improved, the performance of the system can be enhanced, and the recognition rate of the system can be increased.
In the embodiment of the present invention, when the target video sequence corresponding to the target object is obtained, the target object in the target video sequence can first be detected, so that the target key points of each video frame in the target video sequence can subsequently be found, and the location information at which the target key points appear in each video frame can be captured. The dynamic feature value corresponding to the target key points in each video frame can then be calculated from the location information of the target key points in each video frame. For example, taking the target key points as key point A and key point B, the distance difference between key point A and key point B in each video frame can be calculated, so as to obtain the dynamic feature value corresponding to the target key points in the corresponding video frame. Then, by means of the dynamic feature values corresponding to the target key points in each video frame, the video frames under the particular state can be screened out, i.e., the video frames whose dynamic feature values satisfy the target state can be screened out, and the key video frame can then be determined from the screened-out video frames, which improves the efficiency of liveness recognition while ensuring its accuracy. Next, the part to which the target key points belong can be determined in the key video frame, and the local image region where that part is located can be extracted from the key video frame, which improves the efficiency of image recognition. Finally, the part in the local image region under the particular state can be recognized through the trained liveness-detection model, which improves the precision of liveness recognition under the particular state and thereby strengthens the authentication capability of the system.
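The dynamic feature value computed from key point A and key point B can be sketched as the distance between the two points in each frame. This is an illustrative sketch: the coordinates are hypothetical, and the patent does not fix Euclidean distance as the only distance measure.

```python
import math

def dynamic_feature_value(point_a, point_b):
    """Dynamic feature value of the target key-point pair in one video
    frame: the distance between key point A and key point B."""
    (xa, ya), (xb, yb) = point_a, point_b
    return math.hypot(xb - xa, yb - ya)

# Positions of the key-point pair (A, B) in three consecutive frames
# (illustrative coordinates); one feature value is obtained per frame.
frames = [((10, 20), (10, 24)),
          ((10, 20), (10, 29)),
          ((10, 20), (10, 22))]
values = [dynamic_feature_value(a, b) for a, b in frames]
print(values)  # [4.0, 9.0, 2.0]
```

The per-frame values form the sequence that is later compared against the target threshold to screen out frames in the target state.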
Further, referring to Fig. 9, which is a schematic flowchart of another video data processing method provided by an embodiment of the present invention. As shown in Fig. 9, the method provided by the embodiment of the present invention may include:
Step S201: collect video data containing the target object, parse the video data into the target video sequence corresponding to the target object, and obtain a first video frame and a second video frame from the target video sequence;
Step S202: obtain the image region where the target object is located in the first video frame as the target object region in the first video frame, and obtain the image region where the target object is located in the second video frame as the target object region in the second video frame.
The first video frame may be the first frame of the target video sequence, or it may be a non-first frame of the target video sequence; that is, the target user terminal can obtain from the target video sequence the target object region corresponding to the target object in each video frame, so that step S203 can then be performed, i.e., key-point localization can be carried out on the target object within the target object region of each video frame.
Step S203: perform key-point localization on the target object in the target object region of the first video frame to obtain all key points of the target object in the first video frame and the location information of those key points in the first video frame, and determine, from all the obtained key points, the two key points at a first position as the target key points of the first video frame.
It can be understood that, when obtaining all key points in the first video frame, the target user terminal can further add the obtained key points and their location information to a first key-point set of the first video frame, and then, based on the location information of each key point in the first key-point set, determine the two key points at the first position as the target key points of the first video frame.
The specific process by which the target user terminal localizes all key points in the first video frame may be as follows: if the first video frame is the first frame of the target video sequence, the background region in the first video frame is filtered out based on the first network model, the image region of the target object in the first video frame with the background region removed is identified, and the identified image region is taken as the target object region in the first video frame. Further, the target user terminal can extract all key points in the target object region and add the extracted key points and their location information to the first key-point set of the first video frame.
Further, referring to Figure 10, which is a schematic diagram of obtaining a target object region provided by an embodiment of the present invention. As shown in Figure 10, assume that a target user is performing face recognition on an identity-verification platform (for example, a banking/finance platform) through the target user terminal shown in Figure 10, which collects image data containing the target user's face, so that it can subsequently be identified whether the target user holding the target user terminal is a real person. Before recognizing the face, the target user terminal shown in Figure 10 needs to first invoke the camera application in the terminal, collect video data under a specific action through the camera corresponding to the camera application (for example, a front camera built into the target user terminal), and parse the video data in the target user terminal into the target video sequence corresponding to the face. To improve the efficiency of face recognition and the processing efficiency of the terminal, face detection may be performed only on the first frame of the target video sequence during face recognition, so as to obtain the face frame corresponding to the face (i.e., the target object region shown in Figure 10). It can be understood that this first frame of the sequence may be referred to as the first video frame. The target user terminal can then perform image processing on the obtained first video frame in the background; for example, the foreground and background regions in the first video frame can be segmented, so as to extract, from the first image frame shown in Figure 10, the object contour region corresponding to the overall contour of the target user shown in Figure 10. The foreground region is the object contour region corresponding to the overall contour of the target user, and the background region is the image region of the first video frame that remains after the object contour region of the target user is extracted.
The above identity-verification platform may also include any platform that needs to perform face recognition, such as access control, attendance, transportation, community services, or pension-eligibility verification.
It should be understood that, by filtering out the background region in the first image frame, interference from the pixels in the background region can be prevented, thereby improving the accuracy of subsequent face recognition. The target user terminal can then further identify the face (i.e., the target object) within the object contour region shown in Figure 10, so as to obtain the region where the face shown in Figure 10 is located (i.e., the target object region). Next, the target user terminal shown in Figure 10 can determine, based on the first network model (for example, the first network model may be a multitask convolutional neural network model), the key points associated with the face from the target object region (the face frame) shown in Figure 10 (all key points in the face frame are the key points added to the key-point set shown in Figure 10). Further, the target user terminal can obtain the location information of each key point in the key-point set, track and map each key point into the second video frame (i.e., the next video frame adjacent to the first video frame), and obtain all key points in the second video frame based on the key points mapped into it. It can be seen that the target user terminal can, according to a key-point tracking algorithm, track in subsequent video frames all key points that appeared in the first key-point set, so that stable corresponding key points are obtained in subsequent video frames while the face-detection time is saved, greatly improving the processing speed. Each key point is a feature point that can characterize a facial part of the target user. All key points obtained by mapping into the second video frame can be added to a second key-point set corresponding to the second video frame, and each key point in the first key-point set has a one-to-one mapping relationship with a key point in the second key-point set.
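The one-to-one track-and-map step can be sketched as follows. This is an illustrative sketch: the `tracker` callable stands in for a real key-point tracking algorithm (e.g. sparse optical flow), and the point names and coordinates are hypothetical.

```python
def track_keypoints(first_keypoint_set, tracker):
    """Map every key point of the first key-point set into the second
    video frame, producing the second key-point set; the one-to-one
    correspondence is kept by naming each mapped point with a prime."""
    second_keypoint_set = {}
    for name, position in first_keypoint_set.items():
        second_keypoint_set[name + "'"] = tracker(position)
    return second_keypoint_set

# Stand-in tracker: between the two frames every point drifts by (1, -2).
drift = lambda p: (p[0] + 1, p[1] - 2)
first = {"A": (100, 120), "B": (140, 120)}
print(track_keypoints(first, drift))
# {"A'": (101, 118), "B'": (141, 118)}
```

Because the second set is derived entirely from the first, no face detection is rerun on the second frame, which is the time saving the text describes.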
To improve the accuracy of recognizing the key points of the facial parts in the target object region, the target object region can first be taken as the to-be-processed region and adjusted to a fixed size, and the image data of the resized to-be-processed region can then be fed into the input layer of the multitask convolutional neural network. The multitask convolutional neural network model may include an input layer, convolutional layers, pooling layers, fully connected layers, and an output layer, where the parameter size of the input layer equals the size of the resized to-be-processed region. After the image data of the to-be-processed region is fed into the input layer of the multitask convolutional neural network model, it enters the convolutional layers. First, a small block of the image data in the to-be-processed region is randomly selected as a sample, and some feature information is learned from this small sample; the sample is then slid as a window over all pixel regions of the to-be-processed region. That is, the feature information learned from the sample is convolved with the image data of the to-be-processed region, thereby obtaining the most significant feature information of the image data at different positions in the to-be-processed region; i.e., the multitask convolutional neural network can locate the feature points corresponding to each facial part of the target user in the to-be-processed region. After the convolution operation is completed, the feature information of the image data in the to-be-processed region has been extracted, but the number of features extracted by convolution alone is large. To reduce the computational cost, a pooling operation is also needed: the feature information extracted from the to-be-processed region by the convolution operation is transmitted to the pooling layer, and aggregate statistics are computed over the extracted feature information. The order of magnitude of these pooled statistics is far lower than that of the feature information extracted by the convolution operation, and pooling can also improve the classification effect. Common pooling methods mainly include average pooling and maximum pooling. Average pooling computes one average feature from a feature-information set to represent that set; maximum pooling extracts the largest feature from a feature-information set to represent that set. Through the convolution processing of the convolutional layers and the pooling processing of the pooling layers, the static structural feature information of the image data in the to-be-processed region can be extracted, i.e., the feature information corresponding to the facial parts in the to-be-processed region can be obtained.
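The sliding-window convolution described above can be sketched as follows. This is an illustrative pure-Python sketch under stated assumptions: the "learned" 2x2 window and the region values are hypothetical, and a real model would learn many such windows and operate on full-resolution image data.

```python
def convolve2d(image, kernel):
    """Slide the learned feature window (kernel) over all pixel positions
    of the region, producing one response per position (valid convolution,
    no padding); strong responses mark where the learned feature appears."""
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for r in range(len(image) - kh + 1):
        row = []
        for c in range(len(image[0]) - kw + 1):
            row.append(sum(image[r + i][c + j] * kernel[i][j]
                           for i in range(kh) for j in range(kw)))
        out.append(row)
    return out

# A 2x2 window learned from a small sample, slid over a 3x3 region.
region = [[1, 2, 0],
          [0, 1, 3],
          [4, 0, 1]]
window = [[1, 0],
          [0, 1]]
print(convolve2d(region, window))  # [[2, 5], [0, 2]]
```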
Then, the target user terminal can further use the classifier in the multitask convolutional neural network model to identify the matching degrees between the static structural feature information of the image data in the to-be-processed region and the multiple attribute-type features in the multitask convolutional neural network model, and associate the maximum matching degree among the matching degrees output by the classifier with the label information corresponding to the respective attribute-type feature, so that the region where each facial part is located under the specific action can be found, and the feature points of each part of the face can subsequently be localized.
The key points in the key-point sets (i.e., the first key-point set and the second key-point set) can be the localized feature points that characterize each facial part; that is, the above key points can be the feature points corresponding to salient facial parts such as the mouth and the eyes. For example, for the mouth, all key points located in the mouth region can be found in the key-point set; similarly, for the eyes, all key points located in the eye region can be found in the key-point set.
The number and type of attribute-type features included in the multitask convolutional neural network model (i.e., the first network model) are determined by the number and type of label information contained in the large training data set (i.e., the standard image set) used when training the multitask convolutional neural network.
The multiple attribute-type features included in the multitask neural network model may be an eye-type feature, a nose-type feature, a mouth-type feature, and a face-contour-type feature, and each attribute-type feature in the multitask neural network model corresponds to one piece of label information, so that the matching degrees between the feature information corresponding to the facial parts and the multiple attribute-type features can be obtained in the multitask neural network. The target user terminal can then further associate the maximum matching degree obtained by the multitask neural network model with the label information corresponding to the respective attribute-type feature among the multiple attribute-type features in the multitask neural network, so as to classify the facial parts in the face region, and thereby localize, in the first video frame, the key points that characterize the eye and mouth regions of the target user. The two key points at the first position can then be determined as the target key points of the first video frame from the key-point set shown in Figure 10, so that step S204 can be performed.
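Grouping the localized key points by facial part can be sketched as follows. This is an illustrative sketch: the point names and labels are hypothetical stand-ins for the label information associated with each point's best-matching attribute-type feature.

```python
def keypoints_for_part(keypoint_labels, part):
    """Collect all key points whose attribute-type label matches a given
    facial part (e.g. 'eye' or 'mouth'), as located by the classifier of
    the multitask model."""
    return [name for name, label in keypoint_labels.items() if label == part]

# Illustrative label information attached to each localized key point.
labels = {"p1": "eye", "p2": "eye", "p3": "mouth",
          "p4": "nose", "p5": "mouth"}
print(keypoints_for_part(labels, "mouth"))  # ['p3', 'p5']
print(keypoints_for_part(labels, "eye"))    # ['p1', 'p2']
```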
Step S204: track, in the target object region of the second video frame, the key points related to those in the first video frame, and obtain all key points in the second video frame and the location information of all key points in the second video frame;
Step S205: according to the target key points in the first video frame, determine, among all key points contained in the target object region of the second video frame, the two key points at a second position as the target key points of the second video frame.
For the specific implementation of steps S204-S205, reference may be made to the detailed process of obtaining the target key points in each video frame in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
For the first key-point pair determined from the first video frame in the embodiment corresponding to Fig. 6 above, the first key-point pair includes key point A and key point B. In the embodiment corresponding to Fig. 6, all key points in the second video frame are obtained by tracking based on the location information of all key points in the first video frame; therefore, a tracking mapping relationship necessarily exists between the key point A' and key point B' constituting the second key-point pair and the key point A and key point B constituting the first key-point pair. Further, refer to Table 3, which is a tracking mapping relation table provided by an embodiment of the present invention.
Table 3
First video frame | Key point A | Key point B |
Location information | (C1, B1) | (C2, B2) |
Second video frame | Key point A ' | Key point B ' |
Location information | (C1 ', B1 ') | (C2 ', B2 ') |
As shown in Table 3 above, the location information of key point A in the first video frame is the coordinate (C1, B1), and the location information of key point B is the coordinate (C2, B2); key point A and key point B are the first key-point pair determined in the first video frame, i.e., the target key points of the first video frame. Therefore, the location information of the target key points of the first video frame is the location information of key point A and key point B. It should be understood that, since the second key-point pair in the second video frame is determined by track-mapping the location information of key point A and key point B of the first key-point pair, the above tracking mapping relationship exists between key point A' in the second video frame and key point A in the first video frame, and the location information of key point A' in the second video frame may be the coordinate (C1', B1'); similarly, the above tracking mapping relationship also exists between key point B' in the second video frame and key point B in the first video frame, and the location information of key point B' in the second video frame may be the coordinate (C2', B2'). In view of this, the location information of the target key points of the second video frame is the location information of key point A' and key point B'. It should be understood that, using the location information of each remaining key point in the first video frame, each remaining key point can likewise be mapped into the second video frame, and the mapped key point obtained for each remaining key point can then be found in the second video frame (the mapped key points may be referred to as the key points in the second video frame), so that the tracking mapping relationships between all key points in the second video frame and all key points in the first video frame can be determined. For the tracking mapping relationship between each remaining key point in the first video frame and the corresponding mapped key point in the second video frame, reference may be made to the description of the tracking mapping relationship between key point A and key point A' cited in the embodiment of the present invention, which will not be repeated here. Similarly, for the specific process of determining a new second key-point pair from each subsequent video frame, reference may be made to the description in the embodiment of the present invention of determining the second key-point pair in the second video frame, which will also not be repeated here.
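The tracking mapping relation of Table 3 can be sketched as a small data structure. This is an illustrative sketch: the coordinates are hypothetical placeholders for (C1, B1), (C1', B1'), etc.

```python
def tracking_map(first_pair, tracked_positions):
    """Build the tracking mapping relation of Table 3: each key point of
    the first key-point pair maps to its primed counterpart in the second
    video frame together with that counterpart's location information."""
    return {name: (name + "'", tracked_positions[name])
            for name, _position in first_pair}

# First key-point pair with locations (C1, B1) and (C2, B2), and the
# tracked locations (C1', B1') and (C2', B2') in the second frame.
first_pair = [("Key point A", (3, 7)), ("Key point B", (3, 12))]
tracked = {"Key point A": (4, 6), "Key point B": (4, 11)}
table = tracking_map(first_pair, tracked)
print(table["Key point A"])  # ("Key point A'", (4, 6))
```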
Step S206: based on the location information of the target key points of each video frame, obtain the dynamic feature value corresponding to the target key points of each video frame.
The target state may include a first sign state and a second sign state, and the dynamic feature value corresponding to the target key points may include a dynamic feature value under the first sign state and a dynamic feature value under the second sign state.
Step S207: obtain the dynamic feature value under the first sign state in each video frame, obtain the first maximum dynamic feature value among the dynamic feature values under the first sign state, and determine a first target threshold based on the first maximum dynamic feature value;
Step S208: obtain the dynamic feature value under the second sign state in each video frame, obtain the second maximum dynamic feature value among the dynamic feature values under the second sign state, and determine a second target threshold based on the second maximum dynamic feature value;
Step S209: compare the dynamic feature value under the first sign state with the first target threshold, and compare the dynamic feature value under the second sign state with the second target threshold;
Step S210: determine the video frames corresponding to a continuous run of dynamic feature values under the first sign state that are greater than the first target threshold, and/or dynamic feature values under the second sign state that are less than the second target threshold, as the video frames screened out of the target video sequence in which the action is coherent and the target object is in the target state, thereby obtaining the video frames under the target state.
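The threshold derivation and screening of steps S207-S210 can be sketched as follows for the first sign state. This is an illustrative sketch under stated assumptions: the fraction used to derive the threshold from the maximum value is hypothetical (the patent only says the threshold is determined based on the maximum), the feature values are invented, and the check that the run of frames is continuous is omitted for brevity.

```python
def target_state_frames(feature_values, fraction=0.4):
    """Steps S207-S210 sketched for one sign state: derive the target
    threshold from the maximum dynamic feature value (here, a fixed
    fraction of it), then keep the frames whose value exceeds it."""
    threshold = max(feature_values.values()) * fraction
    kept = [f for f, v in sorted(feature_values.items()) if v > threshold]
    return kept, threshold

# Dynamic feature values (e.g. mouth opening) per frame index; the
# maximum is 5.0, so with fraction 0.4 the target threshold is 2.0,
# matching the 5 cm / 2 cm example below.
values = {1: 0.5, 2: 1.2, 3: 4.8, 4: 5.0, 5: 4.1, 6: 0.7}
frames, threshold = target_state_frames(values)
print(threshold)  # 2.0
print(frames)     # [3, 4, 5]
```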
Step S211: determine the video frame under the target state as the key video frame, and, according to the part to which the target key points in the key video frame belong, extract from the key video frame the local image region where that part is located.
Specifically, the target user terminal can take the video frames screened out in step S210, in which the action is coherent and the target object is in the target state, as candidate video frames, perform quality evaluation on the target object region in the candidate video frames, and filter out the blurry video frames among the candidate video frames according to the quality evaluation result. Among the candidate video frames remaining after the blurry frames are filtered out, the candidate video frame with the highest resolution is determined as the key video frame, and, based on the part to which the target key points in the key video frame belong, the region where that part is located is extracted from the key video frame as the local image region.
For ease of understanding, the embodiment of the present invention takes only the screening out of the video frames under the mouth-open state as candidate video frames as an example, to describe the specific process of performing quality evaluation on the candidate video frames. Further, referring to Figure 11, which is a schematic diagram of obtaining a key video frame provided by an embodiment of the present invention. Assume that the first maximum dynamic feature value under the mouth-open state obtained through step S209 above is 5 cm. In determining the first maximum dynamic feature value, the target user terminal can sort the dynamic feature values of the target key points of the mouth region appearing in each video frame (for example, in descending order), and thereby find the first maximum dynamic feature value among the dynamic feature values under the first sign state. Further, referring to the distribution of the dynamic feature values over the partial consecutive video frames in the embodiment corresponding to Table 1 above, the first target threshold (for example, 2 cm) can then be determined based on the first maximum dynamic feature value. The target user terminal can then compare the dynamic feature value in each video frame with the first target threshold, so as to screen out from these video frames the video frames corresponding to the dynamic feature values under the first sign state that are greater than the first target threshold, and determine the screened-out video frames as candidate video frames, i.e., the candidate video frames shown in Figure 11 can be obtained. The target user terminal can then perform quality evaluation on video frame 10, video frame 20, video frame 30, and video frame 40 among the candidate video frames through a quality evaluation model, so as to obtain a quality evaluation result. For example, among these four video frames, if the quality evaluation result is that the resolutions of video frame 10, video frame 20, and video frame 30 do not reach the resolution threshold of the quality evaluation model while the resolution of video frame 40 reaches the resolution threshold of the quality evaluation model, then the target user terminal can filter out the blurry video frames among the candidate video frames according to the quality evaluation result, i.e., filter out video frame 10, video frame 20, and video frame 30 among the four video frames. At the same time, video frame 40, which has the highest resolution, can be determined as the key video frame, the part to which the target key points belong can be determined from the key video frame, and the region where that part is located can be taken from the key video frame as the local image region, i.e., the local image region corresponding to the target key points (the region where the mouth is located) can be found from the key video frame.
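The quality-evaluation step can be sketched as follows. This is an illustrative sketch: the resolution numbers and threshold are hypothetical, and resolution stands in for whatever sharpness measure the quality evaluation model actually produces.

```python
def pick_key_frame(candidate_frames, resolution_threshold):
    """Quality-evaluation sketch: drop candidate frames whose resolution
    falls below the threshold (the blurry frames), then determine the
    remaining frame with the highest resolution as the key video frame."""
    sharp = {frame: res for frame, res in candidate_frames.items()
             if res >= resolution_threshold}
    if not sharp:
        return None  # every candidate was blurry
    return max(sharp, key=sharp.get)

# Resolutions of the four candidate frames of the example (illustrative):
# frames 10, 20 and 30 fall below the threshold, frame 40 passes.
candidates = {"frame 10": 210, "frame 20": 180,
              "frame 30": 200, "frame 40": 480}
print(pick_key_frame(candidates, 300))  # frame 40
```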
Optionally, among these four video frames, if the quality evaluation result is that the resolutions of video frame 20 and video frame 30 do not reach the resolution threshold of the quality evaluation model while the resolutions of video frame 10 and video frame 40 reach the resolution threshold of the quality evaluation model, then the target user terminal can filter out the blurry video frames among the candidate video frames according to the quality evaluation result, i.e., filter out video frame 20 and video frame 30 among the four video frames. At the same time, among the candidate video frames remaining after the blurry frames are filtered out, the resolutions of video frame 10 and video frame 40 are compared; if the resolution of video frame 40 is greater than that of video frame 10, video frame 40, which has the highest resolution, can be determined as the key video frame, and the local image region corresponding to the target key points (the region where the mouth is located) can be obtained from the key video frame.
Step S212 identifies the local image region, obtains target identification as a result, and knowing based on the target
Other result determines the attribute of the target object.
The local part in the local image region may include first sign information and second sign information. When executing step S212, the target user terminal can specifically further execute the following steps: determine, in the local image region, the region where the first sign information is located as a first image region and the region where the second sign information is located as a second image region, and input the first image region and the second image region into a cascade network model to extract a first image feature in the first image region and a second image feature in the second image region. Further, the first image feature is input into a first classifier in the cascade network model, and a first matching degree between the first image feature and a plurality of attribute type features of the first classifier in the cascade network model is output. Further, the second image feature is input into a second classifier in the cascade network model, and a second matching degree between the second image feature and a plurality of attribute type features of the second classifier in the cascade network model is output; the second classifier is a classifier cascaded with the first classifier. Further, based on the weight value of the first classifier and the weight value of the second classifier, the first matching degree is fused with the second matching degree to obtain a target recognition result corresponding to the cascade network model, and the attribute corresponding to the target object is determined based on the target recognition result.
For the specific implementation of step S212, refer to the description of step S104 in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
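The weighted fusion of the two classifiers' outputs described above can be sketched as follows. The per-attribute matching degrees and the weight values are assumptions for illustration; the attribute type with the highest fused score is taken as the target recognition result.

```python
def fuse_matching_degrees(first_degrees, second_degrees, w1, w2):
    """Fuse per-attribute matching degrees of two cascaded classifiers.

    first_degrees/second_degrees: {attribute type: matching degree}.
    w1/w2: weight values of the first and second classifier.
    """
    fused = {}
    for attr in first_degrees:
        # Weighted sum of the first and second matching degree for each
        # attribute type (e.g. "living" vs "non-living").
        fused[attr] = w1 * first_degrees[attr] + w2 * second_degrees[attr]
    # The target recognition result is the best-matching attribute type.
    return max(fused, key=fused.get)

result = fuse_matching_degrees(
    {"living": 0.8, "non-living": 0.2},   # first classifier (assumed)
    {"living": 0.6, "non-living": 0.4},   # second classifier (assumed)
    w1=0.7, w2=0.3,
)
```

Here the fused "living" score (0.7·0.8 + 0.3·0.6 = 0.74) exceeds the fused "non-living" score, so the living-body attribute would be reported.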
In the embodiment of the present invention, when the target video sequence corresponding to a target object is obtained, the target object in the target video sequence can first be detected, so that the target key point of each video frame in the target video sequence can subsequently be found, and the position information at which the target key point appears in each video frame can be captured. The dynamic feature value corresponding to the target key point in each video frame can then be calculated from the position information of the target key point in each video frame. For example, taking the target key points as key point A and key point B, the distance difference between key point A and key point B in each video frame can be further calculated, so that the dynamic feature value corresponding to the target key point in the corresponding video frame can be obtained. Then, by means of the dynamic feature value corresponding to the target key point in each video frame, the video frames in a specific state can be selected, that is, the video frames whose dynamic feature values satisfy the target state can be filtered out, and a key video frame can then be determined from the selected video frames, thereby improving the efficiency of liveness recognition and ensuring its accuracy. Next, the local part to which the target key point belongs can be determined in the key video frame, and the local image region where the local part is located can be cropped from the key video frame, thereby improving the efficiency of image recognition. Finally, the local part in the local image region in the specific state can be identified by means of a trained liveness detection model, thereby improving the precision of liveness recognition in the specific state and strengthening the identity verification of the system.
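The dynamic feature value described above can be sketched as the distance between the two target key points of each frame. The coordinates below are illustrative; key points A and B could be, for instance, points on the upper and lower lip.

```python
import math

def dynamic_feature_value(point_a, point_b):
    """Euclidean distance between the two target key points of a frame."""
    return math.dist(point_a, point_b)

# Position information of key points A and B captured in three frames
# (assumed coordinates for illustration).
frames = [
    {"A": (50.0, 80.0), "B": (50.0, 84.0)},   # mouth nearly closed
    {"A": (50.0, 78.0), "B": (50.0, 90.0)},   # mouth opening
    {"A": (50.0, 76.0), "B": (50.0, 96.0)},   # mouth open
]
values = [dynamic_feature_value(f["A"], f["B"]) for f in frames]
```

A growing sequence of distances, as here, would indicate a mouth-opening motion across the frames.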
Further, referring to Fig. 12, which is a schematic structural diagram of a video data processing apparatus provided in an embodiment of the present invention. As shown in Fig. 12, the video data processing apparatus 1 may be the target user terminal in the embodiment corresponding to Fig. 1 above. The video data processing apparatus 1 may include: a retrieval module 10, a key point locating module 20, a characteristic value acquisition module 30, a video frame selection module 40, a key frame determining module 50 and a local identification module 60;
The retrieval module 10 is configured to obtain a target video sequence and extract, from each video frame of the target video sequence, the target object region where a target object is located.
The retrieval module 10 includes a data parsing unit 101 and an area determination unit 102.
The data parsing unit 101 is configured to collect video data containing the target object, parse the video data into the target video sequence corresponding to the target object, and obtain a first video frame and a second video frame from the target video sequence.
The area determination unit 102 is configured to obtain the image region where the target object is located in the first video frame as the target object region in the first video frame, and obtain the image region where the target object is located in the second video frame as the target object region in the second video frame.
The area determination unit 102 is specifically configured to: if the first video frame is the first frame of the target video sequence, filter out the background region in the first video frame based on a first network model, identify, based on the first network model, the image region of the target object in the first video frame after the background region is removed, and take the identified image region as the target object region of the target object in the first video frame.
For the specific implementation of the data parsing unit 101 and the area determination unit 102, refer to the description of step S101 in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
The key point locating module 20 is configured to perform key point locating on the target object in the target object region, and obtain the target key point of the target object in each video frame and the position information of the target key point of each video frame.
The key point locating module 20 includes a key point positioning unit 201, a key point tracing unit 202 and a key point determination unit 203.
The key point positioning unit 201 is configured to perform key point locating on the target object in the target object region in the first video frame, obtain all key points of the target object in the first video frame and the position information of all key points in the first video frame, and determine, from the obtained key points, the two key points at a first position as the target key points of the first video frame.
The key point tracing unit 202 is configured to track, in the target object region in the second video frame, all the key points in the first video frame, and obtain all key points in the second video frame and the position information of all key points in the second video frame.
The key point tracing unit 202 is specifically configured to map, based on the position information of each key point in the first video frame, each tracked key point to the target object region in the second video frame, obtain all key points in the second video frame based on the key points mapped into the target object region in the second video frame, and determine, in the second video frame, the position information of each key point in the second video frame.
The key point determination unit 203 is configured to determine, according to the target key points in the first video frame, the two key points at a second position among all the key points contained in the target object region of the second video frame as the target key points of the second video frame.
For the specific implementation of the key point positioning unit 201, the key point tracing unit 202 and the key point determination unit 203, refer to the description of step S102 in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
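A minimal sketch of the tracking-and-mapping step performed by the key point tracing unit, under the simplifying assumption that the second frame's target object region is a translated copy of the first frame's region (a real implementation might instead track each point with optical flow, e.g. Lucas-Kanade):

```python
def map_key_points(key_points, region1, region2):
    """Map key points from frame 1 into frame 2's target object region.

    key_points: {name: (x, y)} positions in the first video frame.
    region1/region2: (x, y) top-left corner of the target object region
    in the first and second video frame respectively (assumed inputs).
    """
    dx = region2[0] - region1[0]
    dy = region2[1] - region1[1]
    # Each tracked key point is shifted by the region's displacement,
    # yielding its position information in the second video frame.
    return {name: (x + dx, y + dy) for name, (x, y) in key_points.items()}

pts_frame1 = {"A": (50, 80), "B": (50, 96)}
pts_frame2 = map_key_points(pts_frame1, region1=(40, 60), region2=(43, 62))
```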
The characteristic value acquisition module 30 is configured to obtain, based on the position information of the target key point of each video frame, the dynamic feature value corresponding to the target key point of each video frame.
The characteristic value acquisition module 30 is specifically configured to obtain the position information of the target key point of each video frame, determine, according to the position information of the target key point of each video frame, the distance difference corresponding to the target key point of each video frame, and determine the determined distance difference as the dynamic feature value corresponding to the target key point of the corresponding video frame.
The video frame selection module 40 is configured to select, from the target video sequence, the video frames whose dynamic feature values satisfy a target state; the video frames in the target state are used to characterize the video frames, selected from the target video sequence, in which the motion is coherent and the target object is in the target state.
The target state includes a first sign state and a second sign state, and the dynamic feature value corresponding to the target key point of each video frame includes a dynamic feature value in the first sign state and a dynamic feature value in the second sign state.
The video frame selection module 40 includes a first threshold determination unit 401, a second threshold determination unit 402, a threshold comparison unit 403 and a video frame selection unit 404.
The first threshold determination unit 401 is configured to obtain, in each video frame, the dynamic feature value in the first sign state, obtain a first maximum dynamic feature value among the dynamic feature values in the first sign state, and determine a first target threshold based on the first maximum dynamic feature value.
The second threshold determination unit 402 is configured to obtain, in each video frame, the dynamic feature value in the second sign state, obtain a second maximum dynamic feature value among the dynamic feature values in the second sign state, and determine a second target threshold based on the second maximum dynamic feature value.
The threshold comparison unit 403 is configured to compare the dynamic feature value in the first sign state with the first target threshold, and compare the dynamic feature value in the second sign state with the second target threshold.
The video frame selection unit 404 is configured to determine the video frames corresponding to a plurality of consecutive dynamic feature values in the first sign state greater than the first target threshold, and/or dynamic feature values in the second sign state less than the second target threshold, as the video frames, selected from the target video sequence, in which the motion is coherent and the target object is in the target state, so as to obtain the video frames in the target state.
For the specific execution of the first threshold determination unit 401, the second threshold determination unit 402, the threshold comparison unit 403 and the video frame selection unit 404, refer to the description of step S104 in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
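The threshold-and-run selection above can be sketched as follows. The factor used to derive the target threshold from the maximum dynamic feature value, and the minimum run length standing in for "coherent motion", are assumptions for illustration; the document does not specify them.

```python
def select_target_state_frames(values, ratio=0.6, min_run=2):
    """Select runs of consecutive frame indices whose dynamic feature
    value exceeds a target threshold derived from the maximum value.

    values: per-frame dynamic feature values (e.g. first sign state).
    ratio: assumed factor mapping the maximum value to the threshold.
    min_run: assumed minimum run length for coherent motion.
    """
    threshold = max(values) * ratio      # target threshold from the maximum
    selected, run = [], []
    for i, v in enumerate(values):
        if v > threshold:
            run.append(i)
        else:
            if len(run) >= min_run:      # keep only coherent (consecutive) runs
                selected.extend(run)
            run = []
    if len(run) >= min_run:
        selected.extend(run)
    return selected

frames = select_target_state_frames([4.0, 12.0, 20.0, 18.0, 3.0])
```

With these values the threshold is 12.0, so only the consecutive pair of frames at indices 2 and 3 qualifies.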
The key frame determining module 50 is configured to determine the video frames in the target state as key video frames, and, according to the local part to which the target key point belongs in the key video frame, crop the local image region where the local part is located from the key video frame.
The key frame determining module 50 includes a quality estimation unit 501 and a key frame determination unit 502.
The quality estimation unit 501 is configured to take the selected video frames in which the motion is coherent and the target object is in the target state as candidate video frames, perform quality evaluation on the target object region in the candidate video frames, and filter out the blurry video frames among the candidate video frames according to the quality evaluation result.
The key frame determination unit 502 is configured to determine, among the candidate video frames remaining after the blurry video frames are filtered out, the candidate video frame with the highest resolution as the key video frame, and, based on the local part to which the target key point belongs in the key video frame, crop the region where the local part is located from the key video frame as the local image region.
For the specific execution of the quality estimation unit 501 and the key frame determination unit 502, refer to the description of step S105 in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
The local identification module 60 is configured to identify the local part in the local image region to obtain a target recognition result, and determine the attribute of the target object based on the target recognition result.
The local identification module 60 includes a feature extraction unit 601, a characteristic matching unit 602 and an attribute determining unit 603.
The feature extraction unit 601 is configured to determine the local image region as a region to be processed, and perform feature extraction on the region to be processed based on a second network model to obtain an image feature corresponding to the region to be processed.
The characteristic matching unit 602 is configured to obtain, according to the second network model, the matching degree between the image feature and a plurality of attribute type features in the second network model.
The attribute determining unit 603 is configured to associate the matching degree obtained by the second network model with the label information corresponding to the plurality of attribute type features in the second network model, obtain the target recognition result corresponding to the second network model, and determine the attribute corresponding to the target object based on the target recognition result.
For the specific execution of the feature extraction unit 601, the characteristic matching unit 602 and the attribute determining unit 603, refer to the description of step S106 in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
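The association between matching degrees and label information performed by the attribute determining unit can be sketched as follows. The attribute type names and label values are hypothetical; the unit simply reports the label of the best-matching attribute type feature.

```python
def recognize_attribute(matching_degrees, label_info):
    """Associate the matching degrees output by the second network model
    with the label information of its attribute type features, taking
    the best-matching label as the target recognition result.

    matching_degrees: {attribute type: matching degree} (assumed).
    label_info: {attribute type: label information} (assumed).
    """
    # Pick the attribute type feature with the highest matching degree,
    # then look up its associated label information.
    best_attr = max(matching_degrees, key=matching_degrees.get)
    return label_info[best_attr]

result = recognize_attribute(
    {"type_1": 0.15, "type_2": 0.85},
    {"type_1": "non-living", "type_2": "living"},
)
```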
Optionally, the local identification module 60 further includes a sample determination unit 604 and a model training unit 605.
The sample determination unit 604 is configured to obtain a sample set associated with the target object, determine the sample data carrying first label information in the sample set as positive samples, and determine the sample data carrying second label information in the sample set as negative samples; the positive samples are sample data in which the attribute of the target object is a living-body attribute, and the negative samples are sample data in which the attribute of the target object is a non-living-body attribute.
The model training unit 605 is configured to scale, in the sample set, the image data corresponding to the positive samples to the same size, and train the second network model based on the first label information corresponding to the scaled positive samples and the second label information corresponding to the negative samples.
For the specific execution of the sample determination unit 604 and the model training unit 605, refer to the description of the second network model in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
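The positive/negative sample preparation described above can be sketched as follows. The label values and the common target size are assumptions for illustration; only the bookkeeping (label-based split and uniform sizing) mirrors the description.

```python
def prepare_samples(sample_set, target_size=(64, 64)):
    """Split a sample set into positive (living-body) and negative
    (non-living-body) samples by label, recording the common size each
    image would be scaled to before training.

    sample_set: list of dicts with an assumed "label" key, where
    "living" stands for the first label information and "non-living"
    for the second.
    """
    positives, negatives = [], []
    for sample in sample_set:
        # Image data is scaled to the same size before training.
        scaled = {"id": sample["id"], "size": target_size, "label": sample["label"]}
        if sample["label"] == "living":
            positives.append(scaled)
        else:
            negatives.append(scaled)
    return positives, negatives

pos, neg = prepare_samples([
    {"id": 1, "label": "living"},
    {"id": 2, "label": "non-living"},
    {"id": 3, "label": "living"},
])
```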
Optionally, the local part includes first sign information and second sign information.
The local identification module 60 may further specifically include an image region determination unit 606, a first matching unit 607, a second matching unit 608 and a matching fusion unit 609.
The image region determination unit 606 is configured to determine, in the local image region, the region where the first sign information is located as a first image region and the region where the second sign information is located as a second image region, and input the first image region and the second image region into a cascade network model to extract a first image feature in the first image region and a second image feature in the second image region.
The first matching unit 607 is configured to input the first image feature into a first classifier in the cascade network model, and output a first matching degree between the first image feature and a plurality of attribute type features of the first classifier in the cascade network model.
The second matching unit 608 is configured to input the second image feature into a second classifier in the cascade network model, and output a second matching degree between the second image feature and a plurality of attribute type features of the second classifier in the cascade network model; the second classifier is a classifier cascaded with the first classifier.
The matching fusion unit 609 is configured to fuse, based on the weight value of the first classifier and the weight value of the second classifier, the first matching degree with the second matching degree, obtain the target recognition result corresponding to the cascade network model, and determine the attribute corresponding to the target object based on the target recognition result.
For the specific execution of the image region determination unit 606, the first matching unit 607, the second matching unit 608 and the matching fusion unit 609, refer to the description of step S212 in the embodiment corresponding to Fig. 9 above, which will not be repeated here.
For the specific execution of the retrieval module 10, the key point locating module 20, the characteristic value acquisition module 30, the video frame selection module 40, the key frame determining module 50 and the local identification module 60, refer to the description of steps S101 to S106 in the embodiment corresponding to Fig. 3 above, which will not be repeated here.
In the embodiment of the present invention, when the target video sequence corresponding to a target object is obtained, the target object in the target video sequence can first be detected, so that the target key point of each video frame in the target video sequence can subsequently be found, and the position information at which the target key point appears in each video frame can be captured. The dynamic feature value corresponding to the target key point in each video frame can then be calculated from the position information of the target key point in each video frame. For example, taking the target key points as key point A and key point B, the distance difference between key point A and key point B in each video frame can be further calculated, so that the dynamic feature value corresponding to the target key point in the corresponding video frame can be obtained. Then, by means of the dynamic feature value corresponding to the target key point in each video frame, the video frames in a specific state can be selected, that is, the video frames whose dynamic feature values satisfy the target state can be filtered out, and a key video frame can then be determined from the selected video frames, thereby improving the efficiency of liveness recognition and ensuring its accuracy. Next, the local part to which the target key point belongs can be determined in the key video frame, and the local image region where the local part is located can be cropped from the key video frame, thereby improving the efficiency of image recognition. Finally, the local part in the local image region in the specific state can be identified by means of a trained liveness detection model, thereby improving the precision of liveness recognition in the specific state and strengthening the identity verification of the system.
Further, referring to Fig. 13, which is a schematic structural diagram of another video data processing apparatus provided in an embodiment of the present invention. As shown in Fig. 13, the video data processing apparatus 1000 can be applied to the target user terminal in the embodiment corresponding to Fig. 1 above. The video data processing apparatus 1000 may include a processor 1001, a network interface 1004 and a memory 1005; in addition, the video data processing apparatus 1000 may further include a user interface 1003 and at least one communication bus 1002. The communication bus 1002 is used to realize connection and communication between these components. The user interface 1003 may include a display (Display) and a keyboard (Keyboard), and the optional user interface 1003 may further include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface). The memory 1005 may be a high-speed RAM memory, or may be a non-volatile memory, such as at least one magnetic disk memory. The memory 1005 may optionally be at least one storage device located remotely from the aforementioned processor 1001. As shown in Fig. 13, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module and a device control application program.
In the video data processing apparatus 1000 shown in Fig. 13, the network interface 1004 can provide a network communication function, the user interface 1003 is mainly used to provide an input interface for the user, and the processor 1001 can be used to call the device control application program stored in the memory 1005, so as to realize:
obtaining a target video sequence, and extracting, from each video frame of the target video sequence, the target object region where a target object is located;
performing key point locating on the target object in the target object region, and obtaining the target key point of the target object in each video frame and the position information of the target key point of each video frame;
obtaining, based on the position information of the target key point of each video frame, the dynamic feature value corresponding to the target key point of each video frame;
selecting, from the target video sequence, the video frames whose dynamic feature values satisfy a target state, the video frames in the target state being used to characterize the video frames, selected from the target video sequence, in which the motion is coherent and the target object is in the target state;
determining the video frames in the target state as key video frames, and, according to the local part to which the target key point belongs in the key video frame, cropping the local image region where the local part is located from the key video frame;
identifying the local image region to obtain a target recognition result, and determining the attribute of the target object based on the target recognition result.
It should be understood that the video data processing apparatus 1000 described in the embodiment of the present invention can execute the description of the above video data processing method in the embodiments corresponding to Fig. 3 or Fig. 9 above, and can also execute the description of the above video data processing apparatus 1 in the embodiment corresponding to Fig. 12 above, which will not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated either.
In addition, it should be pointed out that an embodiment of the present invention further provides a computer storage medium, in which the computer program executed by the aforementioned video data processing apparatus 1 is stored, the computer program including program instructions. When the processor executes the program instructions, it can execute the description of the above video data processing method in the embodiments corresponding to Fig. 3 or Fig. 9 above, which will therefore not be repeated here. In addition, the description of the beneficial effects of using the same method will not be repeated either. For the technical details not disclosed in the computer storage medium embodiment of the present invention, please refer to the description of the method embodiments of the present invention.
Those of ordinary skill in the art can understand that all or part of the processes in the above embodiment methods can be completed by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium, and when executed may include the processes of the embodiments of the above methods. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The above disclosure is merely a preferred embodiment of the present invention, which certainly cannot be used to limit the scope of rights of the present invention. Therefore, equivalent changes made in accordance with the claims of the present invention still fall within the scope covered by the present invention.
Claims (15)
1. A video data processing method, characterized by comprising:
obtaining a target video sequence, and extracting, from each video frame of the target video sequence, the target object region where a target object is located;
performing key point locating on the target object in the target object region, and obtaining the target key point of the target object in each video frame and the position information of the target key point of each video frame;
obtaining, based on the position information of the target key point of each video frame, the dynamic feature value corresponding to the target key point of each video frame;
selecting, from the target video sequence, the video frames whose dynamic feature values satisfy a target state, the video frames in the target state being used to characterize the video frames, selected from the target video sequence, in which the motion is coherent and the target object is in the target state;
determining the video frames in the target state as key video frames, and, according to the local part to which the target key point belongs in the key video frame, cropping the local image region where the local part is located from the key video frame;
identifying the local image region to obtain a target recognition result, and determining the attribute of the target object based on the target recognition result.
2. The method according to claim 1, characterized in that the obtaining a target video sequence, and extracting, from each video frame of the target video sequence, the target object region where a target object is located, comprises:
collecting video data containing the target object, parsing the video data into the target video sequence corresponding to the target object, and obtaining a first video frame and a second video frame from the target video sequence;
obtaining the image region where the target object is located in the first video frame as the target object region in the first video frame, and obtaining the image region where the target object is located in the second video frame as the target object region in the second video frame.
3. The method according to claim 2, characterized in that the performing key point locating on the target object in the target object region, and obtaining the target key point of the target object in each video frame and the position information of the target key point, comprises:
performing key point locating on the target object in the target object region in the first video frame, obtaining all key points of the target object in the first video frame and the position information of all key points in the first video frame, and determining, from the obtained key points, the two key points at a first position as the target key points of the first video frame;
tracking, in the target object region in the second video frame, all the key points in the first video frame, and obtaining all key points in the second video frame and the position information of all key points in the second video frame;
determining, according to the target key points in the first video frame, the two key points at a second position among all the key points contained in the target object region of the second video frame as the target key points of the second video frame.
4. The method according to claim 2, characterized in that the obtaining the image region where the target object is located in the first video frame as the target object region in the first video frame comprises:
if the first video frame is the first frame of the target video sequence, filtering out the background region in the first video frame based on a first network model, identifying, based on the first network model, the image region of the target object in the first video frame after the background region is removed, and taking the identified image region as the target object region of the target object in the first video frame.
5. The method according to claim 3, characterized in that the tracking, in the target object region in the second video frame, all the key points in the first video frame, and obtaining all key points in the second video frame and the position information of all key points in the second video frame, comprises:
mapping, based on the position information of each key point in the first video frame, each tracked key point to the target object region in the second video frame, obtaining all key points in the second video frame based on the key points mapped into the target object region in the second video frame, and determining, in the second video frame, the position information of each key point in the second video frame.
6. The method according to any one of claims 1-5, characterized in that the obtaining, based on the position information of the target key point of each video frame, the dynamic feature value corresponding to the target key point of each video frame, comprises:
obtaining the position information of the target key point of each video frame, determining, according to the position information of the target key point of each video frame, the distance difference corresponding to the target key point of each video frame, and determining the determined distance difference as the dynamic feature value corresponding to the target key point of the corresponding video frame.
7. The method according to claim 1, wherein the target state includes a first sign state and a second sign state, and the dynamic feature value corresponding to the target key point of each video frame includes a dynamic feature value under the first sign state and a dynamic feature value under the second sign state;
the selecting, from the target video sequence, video frames whose dynamic feature values meet the target state comprises:
obtaining, in each video frame, the dynamic feature value under the first sign state, obtaining a first maximum dynamic feature value among the dynamic feature values under the first sign state, and determining a first target threshold based on the first maximum dynamic feature value;
obtaining, in each video frame, the dynamic feature value under the second sign state, obtaining a second maximum dynamic feature value among the dynamic feature values under the second sign state, and determining a second target threshold based on the second maximum dynamic feature value;
comparing the dynamic feature values under the first sign state with the first target threshold, and comparing the dynamic feature values under the second sign state with the second target threshold;
determining the video frames corresponding to dynamic feature values under the first sign state that are greater than the first target threshold for multiple consecutive frames, and/or dynamic feature values under the second sign state that are less than the second target threshold, as the video frames screened from the target video sequence in which the action is coherent and the target object is in the target state, so as to obtain the video frames under the target state.
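The max-derived threshold and consecutive-run selection of claim 7 can be sketched for one sign state as below. The `ratio` and `min_run` parameters are hypothetical knobs, not values taken from the patent.

```python
def select_target_state_frames(values, ratio=0.5, min_run=3):
    """Select indices of frames whose dynamic feature value exceeds a
    threshold derived from the maximum value (threshold = ratio * max),
    keeping only runs of at least `min_run` consecutive frames so the
    selected action is coherent."""
    threshold = ratio * max(values)
    selected, run = [], []
    for i, v in enumerate(values):
        if v > threshold:
            run.append(i)
        else:
            if len(run) >= min_run:
                selected.extend(run)
            run = []
    if len(run) >= min_run:
        selected.extend(run)
    return selected
```

For the second sign state the claim compares against the threshold with `<` rather than `>`; the same run-filtering logic applies.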
8. The method according to claim 1, wherein the determining a key video frame from the video frames under the target state, and extracting, according to the part to which the target key point belongs in the key video frame, the local image region where the part is located comprises:
taking the screened video frames in which the action is coherent and the target object is in the target state as candidate video frames, performing quality evaluation on the target object region in the candidate video frames, and filtering out blurry video frames from the candidate video frames according to the quality evaluation result;
among the candidate video frames remaining after the blurry video frames are filtered out, determining the candidate video frame with the highest resolution as the key video frame, and, based on the part to which the target key point belongs in the key video frame, extracting the region of that part in the key video frame as the local image region.
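The quality-evaluation and key-frame selection of claim 8 could look like the sketch below. The patent does not specify the blur metric; variance of the Laplacian is a common stand-in, and the `blur_threshold` value is an assumption.

```python
import numpy as np

LAPLACIAN = np.array([[0, 1, 0],
                      [1, -4, 1],
                      [0, 1, 0]], dtype=np.float64)

def sharpness(gray):
    """Variance of the Laplacian response over a grayscale image:
    a common blur metric -- lower values indicate blurrier frames."""
    h, w = gray.shape
    out = np.zeros((h - 2, w - 2))
    for dy in range(3):
        for dx in range(3):
            out += LAPLACIAN[dy, dx] * gray[dy:dy + h - 2, dx:dx + w - 2]
    return out.var()

def pick_key_frame(frames, blur_threshold=10.0):
    """Drop candidate frames whose sharpness falls below the threshold,
    then return the index of the highest-resolution remaining frame
    (here: largest pixel count), per claim 8."""
    kept = [i for i, f in enumerate(frames) if sharpness(f) >= blur_threshold]
    return max(kept, key=lambda i: frames[i].size)
```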
9. The method according to claim 1, wherein the identifying the local image region to obtain a target recognition result, and determining the attribute of the target object based on the target recognition result comprises:
determining the local image region as a region to be processed, and performing feature extraction on the region to be processed based on a second network model to obtain an image feature corresponding to the region to be processed;
obtaining, in the second network model, the matching degrees between the image feature and multiple attribute type features in the second network model;
associating the matching degrees obtained by the second network model with the label information corresponding to the multiple attribute type features in the second network model to obtain the target recognition result corresponding to the second network model, and determining the attribute corresponding to the target object based on the target recognition result.
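The matching-degree step of claim 9 might be realized as below. The patent leaves the similarity measure open; cosine similarity followed by a softmax is one plausible choice, and the function and label names are hypothetical.

```python
import numpy as np

def match_attributes(image_feature, attribute_type_features, labels):
    """Compute a matching degree between the extracted image feature and
    each attribute type feature (cosine similarity here), normalize the
    degrees into a distribution, and associate them with label info to
    form a recognition result."""
    feats = np.asarray(attribute_type_features, dtype=np.float64)
    x = np.asarray(image_feature, dtype=np.float64)
    sims = feats @ x / (np.linalg.norm(feats, axis=1) * np.linalg.norm(x))
    probs = np.exp(sims) / np.exp(sims).sum()  # softmax over matching degrees
    best = int(np.argmax(probs))
    return labels[best], dict(zip(labels, probs.round(3)))
```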
10. The method according to claim 9, further comprising:
obtaining a sample set associated with the target object, determining the sample data carrying first label information in the sample set as positive samples, and determining the sample data carrying second label information in the sample set as negative samples; wherein a positive sample is sample data in which the attribute of the target object is a living-body attribute, and a negative sample is sample data in which the attribute of the target object is a non-living-body attribute;
in the sample set, scaling the image data corresponding to the positive samples to the same size, and training the second network model based on the first label information corresponding to the scaled positive samples and the second label information corresponding to the negative samples.
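The sample-preparation step of claim 10 (resize to a common size, label positives and negatives) can be sketched as follows. The nearest-neighbour resize and the 64x64 target size are illustrative assumptions, not taken from the patent.

```python
import numpy as np

def resize_nearest(img, size):
    """Nearest-neighbour resize to a common (h, w), standing in for the
    claim's 'scale image data to the same size' step."""
    h, w = size
    ys = (np.arange(h) * img.shape[0] / h).astype(int)
    xs = (np.arange(w) * img.shape[1] / w).astype(int)
    return img[ys][:, xs]

def build_training_set(samples, size=(64, 64)):
    """samples: list of (image, label_info) where label_info is
    'living' (first label information) or 'non-living' (second).
    Returns stacked images and 1/0 targets for training the model."""
    images = np.stack([resize_nearest(img, size) for img, _ in samples])
    targets = np.array([1 if lbl == 'living' else 0 for _, lbl in samples])
    return images, targets
```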
11. The method according to claim 1, wherein the part includes first sign information and second sign information;
the identifying the local image region to obtain a target recognition result, and determining the attribute of the target object based on the target recognition result comprises:
determining, in the local image region, the region where the first sign information is located as a first image region and the region where the second sign information is located as a second image region, and inputting the first image region and the second image region into a cascade network model to extract a first image feature of the first image region and a second image feature of the second image region;
inputting the first image feature into a first classifier in the cascade network model, and outputting first matching degrees between the first image feature and multiple attribute type features of the first classifier in the cascade network model;
inputting the second image feature into a second classifier in the cascade network model, and outputting second matching degrees between the second image feature and multiple attribute type features of the second classifier in the cascade network model, the second classifier being a classifier cascaded with the first classifier;
fusing the first matching degrees with the second matching degrees based on a weight value of the first classifier and a weight value of the second classifier, obtaining the target recognition result corresponding to the cascade network model, and determining the attribute corresponding to the target object based on the target recognition result.
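The weighted fusion of the two classifiers' matching degrees in claim 11 reduces to a weighted sum followed by an argmax. A minimal sketch, with hypothetical weight values (the patent does not state how the weights are chosen):

```python
import numpy as np

def fuse_cascade(first_degrees, second_degrees, w1=0.6, w2=0.4,
                 labels=('living', 'non-living')):
    """Weighted fusion of the matching degrees produced by two cascaded
    classifiers (e.g. one per facial region); the fused distribution
    yields the target recognition result."""
    fused = w1 * np.asarray(first_degrees) + w2 * np.asarray(second_degrees)
    return labels[int(np.argmax(fused))], fused
```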
12. A video data processing apparatus, comprising:
a retrieval module, configured to obtain a target video sequence, and extract, from each video frame of the target video sequence, the target object region where the target object is located;
a key point locating module, configured to perform key point location on the target object in the target object region to obtain the target key points of the target object in each video frame and the position information of the target key points of each video frame;
a feature value acquisition module, configured to obtain, based on the position information of the target key points of each video frame, the dynamic feature values corresponding to the target key points of each video frame;
a video frame selection module, configured to select, from the target video sequence, video frames whose dynamic feature values meet the target state; the video frames under the target state being used to characterize the video frames screened from the target video sequence in which the action is coherent and the target object is in the target state;
a key frame determining module, configured to determine the video frames under the target state as key video frames, and extract, according to the part to which the target key point belongs in the key video frame, the local image region where the part is located in the key video frame;
a local identification module, configured to identify the local image region to obtain a target recognition result, and determine the attribute of the target object based on the target recognition result.
13. The apparatus according to claim 12, wherein the retrieval module comprises:
a data parsing unit, configured to collect video data containing the target object, parse the video data into the target video sequence corresponding to the target object, and obtain a first video frame and a second video frame from the target video sequence;
a region determining unit, configured to obtain the image region where the target object is located in the first video frame as the target object region in the first video frame, and obtain the image region where the target object is located in the second video frame as the target object region in the second video frame.
14. A video data processing apparatus, comprising a processor and a memory;
the processor is connected to the memory, wherein the memory is configured to store program code, and the processor is configured to call the program code to perform the method according to any one of claims 1-11.
15. A computer storage medium, wherein the computer storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, perform the method according to any one of claims 1-11.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811532116.4A CN109697416B (en) | 2018-12-14 | 2018-12-14 | Video data processing method and related device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811532116.4A CN109697416B (en) | 2018-12-14 | 2018-12-14 | Video data processing method and related device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109697416A true CN109697416A (en) | 2019-04-30 |
CN109697416B CN109697416B (en) | 2022-11-18 |
Family
ID=66231658
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811532116.4A Active CN109697416B (en) | 2018-12-14 | 2018-12-14 | Video data processing method and related device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109697416B (en) |
Cited By (21)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390033A (en) * | 2019-07-25 | 2019-10-29 | 腾讯科技(深圳)有限公司 | Training method, device, electronic equipment and the storage medium of image classification model |
CN111242178A (en) * | 2020-01-02 | 2020-06-05 | 杭州睿琪软件有限公司 | Object identification method, device and equipment |
CN111291736A (en) * | 2020-05-07 | 2020-06-16 | 南京景三医疗科技有限公司 | Image correction method and device and medical equipment |
CN111460419A (en) * | 2020-03-31 | 2020-07-28 | 周亚琴 | Internet of things artificial intelligence face verification method and Internet of things cloud server |
CN111507301A (en) * | 2020-04-26 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Video processing method, video processing device, computer equipment and storage medium |
CN111836072A (en) * | 2020-05-21 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Video processing method, device, equipment and storage medium |
CN111860107A (en) * | 2020-05-28 | 2020-10-30 | 四川中科凯泽科技有限公司 | Standing long jump evaluation method based on deep learning attitude estimation |
CN111881726A (en) * | 2020-06-15 | 2020-11-03 | 马上消费金融股份有限公司 | Living body detection method and device and storage medium |
CN111932604A (en) * | 2020-08-24 | 2020-11-13 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for measuring human ear characteristic distance |
CN112016437A (en) * | 2020-08-26 | 2020-12-01 | 中国科学院重庆绿色智能技术研究院 | Living body detection method based on face video key frame |
CN112055247A (en) * | 2020-09-11 | 2020-12-08 | 北京爱奇艺科技有限公司 | Video playing method, device, system and storage medium |
CN112182256A (en) * | 2020-09-28 | 2021-01-05 | 长城汽车股份有限公司 | Object identification method and device and vehicle |
CN112287850A (en) * | 2020-10-30 | 2021-01-29 | 维沃移动通信有限公司 | Article information identification method and device, electronic equipment and readable storage medium |
CN113158918A (en) * | 2021-04-26 | 2021-07-23 | 深圳市商汤科技有限公司 | Video processing method and device, electronic equipment and storage medium |
CN113178206A (en) * | 2021-04-22 | 2021-07-27 | 内蒙古大学 | AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium |
WO2021203667A1 (en) * | 2020-04-06 | 2021-10-14 | Huawei Technologies Co., Ltd. | Method, system and medium for identifying human behavior in a digital video using convolutional neural networks |
CN113657251A (en) * | 2021-08-16 | 2021-11-16 | 联想(北京)有限公司 | Detection method and device |
CN114494954A (en) * | 2022-01-18 | 2022-05-13 | 北京达佳互联信息技术有限公司 | Video identification method and device, electronic equipment and storage medium |
CN115272923A (en) * | 2022-07-22 | 2022-11-01 | 华中科技大学同济医学院附属协和医院 | Intelligent identification method and system based on big data platform |
CN115761598A (en) * | 2022-12-20 | 2023-03-07 | 昆明思碓网络科技有限公司 | Big data analysis method and system based on cloud service platform |
CN116389761A (en) * | 2023-05-15 | 2023-07-04 | 南京邮电大学 | Clinical simulation teaching data management system of nursing |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR100851981B1 (en) * | 2007-02-14 | 2008-08-12 | 삼성전자주식회사 | Liveness detection method and apparatus in video image |
CN104794464A (en) * | 2015-05-13 | 2015-07-22 | 上海依图网络科技有限公司 | In vivo detection method based on relative attributes |
CN105426815A (en) * | 2015-10-29 | 2016-03-23 | 北京汉王智远科技有限公司 | Living body detection method and device |
CN106557723A (en) * | 2015-09-25 | 2017-04-05 | 北京市商汤科技开发有限公司 | A kind of system for face identity authentication with interactive In vivo detection and its method |
CN106897658A (en) * | 2015-12-18 | 2017-06-27 | 腾讯科技(深圳)有限公司 | The discrimination method and device of face live body |
CN107346422A (en) * | 2017-06-30 | 2017-11-14 | 成都大学 | A kind of living body faces recognition methods based on blink detection |
CN107358153A (en) * | 2017-06-02 | 2017-11-17 | 广州视源电子科技股份有限公司 | Mouth movement detection method and device and living body identification method and system |
CN107358155A (en) * | 2017-06-02 | 2017-11-17 | 广州视源电子科技股份有限公司 | Method and device for detecting ghost face action and method and system for recognizing living body |
CN107392089A (en) * | 2017-06-02 | 2017-11-24 | 广州视源电子科技股份有限公司 | Eyebrow movement detection method and device and living body identification method and system |
US20170345181A1 (en) * | 2016-05-27 | 2017-11-30 | Beijing Kuangshi Technology Co., Ltd. | Video monitoring method and video monitoring system |
US20180308107A1 (en) * | 2017-04-24 | 2018-10-25 | Guangdong Matview Intelligent Science & Technology Co., Ltd. | Living-body detection based anti-cheating online research method, device and system |
WO2018202089A1 (en) * | 2017-05-05 | 2018-11-08 | 商汤集团有限公司 | Key point detection method and device, storage medium and electronic device |
2018-12-14: CN CN201811532116.4A patent/CN109697416B/en active Active
Non-Patent Citations (6)
Title |
---|
TAO WANG 等: "Face Liveness Detection Using 3D Structure Recovered from a Single Camera", 《2013 INTERNATIONAL CONFERENCE ON BIOMETRICS (ICB)》 * |
XIAO-NAN HOU 等: "Similarity metric learning for face verification using sigmoid decision function", 《VIS COMPUT》 * |
XUAN QI 等: "CNN Based Key Frame Extraction for Face in Video Recognition", 《2018 IEEE 4TH INTERNATIONAL CONFERENCE ON IDENTITY, SECURITY, AND BEHAVIOR ANALYSIS (ISBA)》 * |
LI Wenbo et al.: "Fourier-based shape-context motion capture for cartoon animation", Computer and Modernization *
YANG Jianwei: "Research on face liveness detection methods for face recognition", China Master's Theses Full-text Database, Information Science and Technology *
WANG Dehui: "Application and research of video-based face recognition technology in a prison AB-door control system", China Master's Theses Full-text Database, Information Science and Technology *
Cited By (32)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110390033A (en) * | 2019-07-25 | 2019-10-29 | 腾讯科技(深圳)有限公司 | Training method, device, electronic equipment and the storage medium of image classification model |
CN110390033B (en) * | 2019-07-25 | 2023-04-21 | 腾讯科技(深圳)有限公司 | Training method and device for image classification model, electronic equipment and storage medium |
CN111242178A (en) * | 2020-01-02 | 2020-06-05 | 杭州睿琪软件有限公司 | Object identification method, device and equipment |
WO2021135828A1 (en) * | 2020-01-02 | 2021-07-08 | 杭州睿琪软件有限公司 | Object identification method, apparatus and device |
CN111460419A (en) * | 2020-03-31 | 2020-07-28 | 周亚琴 | Internet of things artificial intelligence face verification method and Internet of things cloud server |
CN111460419B (en) * | 2020-03-31 | 2020-11-27 | 深圳市微网力合信息技术有限公司 | Internet of things artificial intelligence face verification method and Internet of things cloud server |
US11625646B2 (en) | 2020-04-06 | 2023-04-11 | Huawei Cloud Computing Technologies Co., Ltd. | Method, system, and medium for identifying human behavior in a digital video using convolutional neural networks |
WO2021203667A1 (en) * | 2020-04-06 | 2021-10-14 | Huawei Technologies Co., Ltd. | Method, system and medium for identifying human behavior in a digital video using convolutional neural networks |
CN111507301A (en) * | 2020-04-26 | 2020-08-07 | 腾讯科技(深圳)有限公司 | Video processing method, video processing device, computer equipment and storage medium |
CN111507301B (en) * | 2020-04-26 | 2021-06-08 | 腾讯科技(深圳)有限公司 | Video processing method, video processing device, computer equipment and storage medium |
CN111291736B (en) * | 2020-05-07 | 2020-08-25 | 南京景三医疗科技有限公司 | Image correction method and device and medical equipment |
CN111291736A (en) * | 2020-05-07 | 2020-06-16 | 南京景三医疗科技有限公司 | Image correction method and device and medical equipment |
CN111836072A (en) * | 2020-05-21 | 2020-10-27 | 北京嘀嘀无限科技发展有限公司 | Video processing method, device, equipment and storage medium |
CN111860107A (en) * | 2020-05-28 | 2020-10-30 | 四川中科凯泽科技有限公司 | Standing long jump evaluation method based on deep learning attitude estimation |
CN111881726A (en) * | 2020-06-15 | 2020-11-03 | 马上消费金融股份有限公司 | Living body detection method and device and storage medium |
CN111932604A (en) * | 2020-08-24 | 2020-11-13 | 腾讯音乐娱乐科技(深圳)有限公司 | Method and device for measuring human ear characteristic distance |
CN112016437B (en) * | 2020-08-26 | 2023-02-10 | 中国科学院重庆绿色智能技术研究院 | Living body detection method based on face video key frame |
CN112016437A (en) * | 2020-08-26 | 2020-12-01 | 中国科学院重庆绿色智能技术研究院 | Living body detection method based on face video key frame |
CN112055247A (en) * | 2020-09-11 | 2020-12-08 | 北京爱奇艺科技有限公司 | Video playing method, device, system and storage medium |
CN112182256A (en) * | 2020-09-28 | 2021-01-05 | 长城汽车股份有限公司 | Object identification method and device and vehicle |
CN112287850A (en) * | 2020-10-30 | 2021-01-29 | 维沃移动通信有限公司 | Article information identification method and device, electronic equipment and readable storage medium |
CN113178206B (en) * | 2021-04-22 | 2022-05-31 | 内蒙古大学 | AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium |
CN113178206A (en) * | 2021-04-22 | 2021-07-27 | 内蒙古大学 | AI (Artificial intelligence) composite anchor generation method, electronic equipment and readable storage medium |
CN113158918A (en) * | 2021-04-26 | 2021-07-23 | 深圳市商汤科技有限公司 | Video processing method and device, electronic equipment and storage medium |
CN113657251A (en) * | 2021-08-16 | 2021-11-16 | 联想(北京)有限公司 | Detection method and device |
CN114494954A (en) * | 2022-01-18 | 2022-05-13 | 北京达佳互联信息技术有限公司 | Video identification method and device, electronic equipment and storage medium |
CN115272923B (en) * | 2022-07-22 | 2023-04-21 | 华中科技大学同济医学院附属协和医院 | Intelligent identification method and system based on big data platform |
CN115272923A (en) * | 2022-07-22 | 2022-11-01 | 华中科技大学同济医学院附属协和医院 | Intelligent identification method and system based on big data platform |
CN115761598A (en) * | 2022-12-20 | 2023-03-07 | 昆明思碓网络科技有限公司 | Big data analysis method and system based on cloud service platform |
CN115761598B (en) * | 2022-12-20 | 2023-09-08 | 易事软件(厦门)股份有限公司 | Big data analysis method and system based on cloud service platform |
CN116389761A (en) * | 2023-05-15 | 2023-07-04 | 南京邮电大学 | Clinical simulation teaching data management system of nursing |
CN116389761B (en) * | 2023-05-15 | 2023-08-08 | 南京邮电大学 | Clinical simulation teaching data management system of nursing |
Also Published As
Publication number | Publication date |
---|---|
CN109697416B (en) | 2022-11-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109697416A (en) | Video data processing method and related device | |
CN112215180B (en) | Living body detection method and device | |
CN105872477B (en) | video monitoring method and video monitoring system | |
CN110210276A (en) | Motion track acquisition method and device, storage medium, and terminal | |
CN112183353B (en) | Image data processing method and device and related equipment | |
CN110210535A (en) | Neural network training method and device and image processing method and device | |
CN107358146A (en) | Method for processing video frequency, device and storage medium | |
CN112312087B (en) | Method and system for quickly positioning event occurrence time in long-term monitoring video | |
CN109299658B (en) | Face detection method, face image rendering device and storage medium | |
CN106156702A (en) | Identity identifying method and equipment | |
CN113269091A (en) | Personnel trajectory analysis method, equipment and medium for intelligent park | |
CN109472193A (en) | Method for detecting human face and device | |
CN110442742A (en) | Retrieve method and device, processor, electronic equipment and the storage medium of image | |
CN106372603A (en) | Shielding face identification method and shielding face identification device | |
CN110069983A (en) | Vivo identification method, device, terminal and readable medium based on display medium | |
CN109948727A (en) | The training and classification method of image classification model, computer equipment and storage medium | |
CN108960145A (en) | Facial image detection method, device, storage medium and electronic equipment | |
CN112836625A (en) | Face living body detection method and device and electronic equipment | |
CN106407908A (en) | Training model generation method and human face detection method and device | |
KR20200060942A (en) | Method for face classifying based on trajectory in continuously photographed image | |
CN108171135A (en) | Method for detecting human face, device and computer readable storage medium | |
CN111259757B (en) | Living body identification method, device and equipment based on image | |
CN111881740A (en) | Face recognition method, face recognition device, electronic equipment and medium | |
CN115082992A (en) | Face living body detection method and device, electronic equipment and readable storage medium | |
CN114299583A (en) | Face authentication identification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |