CN111666908B - Method, device, equipment and storage medium for generating interest portraits of video users - Google Patents

Method, device, equipment and storage medium for generating interest portraits of video users

Info

Publication number
CN111666908B
CN111666908B (application CN202010526685.9A)
Authority
CN
China
Prior art keywords
video
face
user
target
real
Prior art date
Legal status (assumed, not a legal conclusion)
Active
Application number
CN202010526685.9A
Other languages
Chinese (zh)
Other versions
CN111666908A (en)
Inventor
眭哲豪
卢江虎
项伟
Current Assignee (the listed assignees may be inaccurate)
Bigo Technology Singapore Pte Ltd
Original Assignee
Guangzhou Baiguoyuan Information Technology Co Ltd
Priority date (assumed, not a legal conclusion)
Filing date
Publication date
Application filed by Guangzhou Baiguoyuan Information Technology Co Ltd
Priority to CN202010526685.9A
Publication of CN111666908A
Application granted
Publication of CN111666908B
Legal status: Active


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/40: Scenes; Scene-specific elements in video content
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783: Retrieval using metadata automatically derived from the content
    • G06F16/7837: Retrieval using objects detected or recognised in the video content
    • G06F16/784: Retrieval where the detected or recognised objects are people
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00: Image analysis
    • G06T7/10: Segmentation; Edge detection
    • G06T7/13: Edge detection
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/20: Image preprocessing
    • G06V10/25: Determination of region of interest [ROI] or a volume of interest [VOI]
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172: Classification, e.g. identification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/20: Special algorithmic details
    • G06T2207/20084: Artificial neural networks [ANN]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The embodiments of the invention disclose a method, a device, equipment and a storage medium for generating an interest portrait of a video user. The method comprises the following steps: intercepting, from each video frame of a target video, the real face areas actually shot by the video user; clustering the real face areas in the target video using the face representative features of the real face areas to obtain a representative face set of the video user in the target video; and generating portrait information of each real face area in the representative face set, and fusing the portrait information of each real face area to generate interest portrait information of the video user within the target video. The technical solution provided by the embodiments of the invention realizes user interest portrait generation at the video level, eliminates the interference of non-face pictures in the target video, and improves the comprehensiveness and accuracy of the interest portrait information generated for the video user within the target video.

Description

Method, device, equipment and storage medium for generating interest portraits of video users
Technical Field
The embodiments of the invention relate to the technical field of video processing, in particular to a method, a device, equipment and a storage medium for generating interest portraits of video users.
Background
In the current age of information explosion, short video has become an important channel through which people acquire information. Recommending suitable short video works to each user improves the user experience and helps short video products raise daily activity and user retention. To make accurate recommendations for a user, however, the short video fields the user is interested in must be inferred from basic attribute information of the video users appearing in the videos the user has historically uploaded, such as their age and gender, so that other short videos in those fields can be recommended. Accurately generating interest portrait information of a short video user, such as age and gender, is therefore important in the information recommendation field.
At present, a video frame that can represent the user information contained in a video is first screened from the videos historically uploaded by the user, and a convolutional neural network based on deep learning is then used to directly identify the face region in that video frame, so as to determine the portrait information, such as age and gender, of the user contained in the video frame and use it as the interest portrait information of the video user represented by the video.
In the prior art, user portrait identification is thus performed only on one specific video frame of a video, whose identified portrait information is taken as the interest portrait information of the video user represented by the whole video, while the other video frames are never analyzed. The resulting interest portrait information is therefore one-sided, and the interference present between different video frames further degrades the accuracy with which the video user's interest portrait is identified.
Disclosure of Invention
The embodiment of the invention provides a method, a device, equipment and a storage medium for generating interest portraits of a video user, which are used for improving the comprehensiveness and the accuracy of interest portraits of the video user.
In a first aspect, an embodiment of the present invention provides a method for generating an interest portrait of a video user, where the method includes:
intercepting real face areas actually shot by a video user in each video frame of a target video respectively;
clustering the real face areas in the target video by adopting the face representative features of the real face areas to obtain a representative face set of the video user in the target video;
and generating portrait information of each real face area in the representative face set, and fusing the portrait information of each real face area to generate interest portrait information of the video user within the target video.
In a second aspect, an embodiment of the present invention provides an interest image generating apparatus for a video user, including:
the real face intercepting module is used for intercepting real face areas actually photographed by a video user in each video frame of the target video respectively;
the representative face clustering module is used for clustering the real face areas in the target video by adopting the face representative characteristics of the real face areas to obtain a representative face set of the video user in the target video;
and the portrait generation module is used for generating portrait information of each real face area in the representative face set, fusing portrait information of each real face area and generating interest portrait information of the video user in the target video.
In a third aspect, an embodiment of the present invention provides an apparatus, including:
one or more processors;
a storage means for storing one or more programs;
The one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for generating an interest portrait of a video user according to any embodiment of the present invention.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, where the program, when executed by a processor, implements the method for generating an interest portrait of a video user according to any embodiment of the present invention.
According to the method, device, equipment and storage medium for generating an interest portrait of a video user, the real face areas actually shot by the video user are first cropped from each video frame of the target video, eliminating interference from non-face pictures in the target video. The real face areas in the target video are then clustered using their face representative features to obtain a representative face set of the video user in the target video, whose real face areas can comprehensively represent the interest range of the video user within that video. Portrait information is then generated directly for each real face area in the representative face set, and the portrait information of the real face areas is fused to generate the interest portrait information of the video user within the target video. User interest portrait generation at the video level is thereby realized, the one-sidedness of taking the user portrait information in one specific video frame as the interest portrait information of the video is avoided, the interference of non-face pictures in the target video is eliminated, and the comprehensiveness and accuracy of the interest portrait information generated for the video user within the target video are improved.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1A is a flowchart of a method for generating an interest portrait of a video user according to the first embodiment of the present invention;
FIG. 1B is a schematic diagram of an interest portrait generation process of a video user according to the first embodiment of the present invention;
FIG. 2A is a flowchart of a method for generating an interest portrait of a video user according to the second embodiment of the present invention;
FIG. 2B is a schematic diagram of the real face region capturing process in the method according to the second embodiment of the present invention;
FIG. 3A is a flowchart of a method for generating an interest portrait of a video user according to the third embodiment of the present invention;
FIG. 3B is a schematic diagram of the user-level interest portrait generation process in the method according to the third embodiment of the present invention;
FIG. 4 is a schematic structural diagram of a device for generating an interest portrait of a video user according to the fourth embodiment of the present invention;
FIG. 5 is a schematic structural diagram of an apparatus according to the fifth embodiment of the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not limiting thereof. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present invention are shown in the drawings. Furthermore, embodiments of the invention and features of the embodiments may be combined with each other without conflict.
Example 1
FIG. 1A is a flowchart of a method for generating an interest portrait of a video user according to the first embodiment of the present invention. The embodiment is applicable to any case of generating interest portrait information of a video user at the video level. The method can be executed by the interest portrait generating device for a video user provided by the embodiment of the present invention, which can be implemented in software and/or hardware and integrated in the equipment executing the method; the equipment can be a background server responsible for storing uploaded video data.
Specifically, referring to fig. 1A, the method may include the steps of:
s110, capturing real face areas actually shot by the video user in each video frame of the target video.
Specifically, in the field of short video recommendation, it is generally required to determine basic attribute information of each video user, such as age and gender of the video user, and screen short videos of interest to the video user according to the basic attribute information of each video user for accurate recommendation.
Optionally, a video user may be a user who has registered a corresponding account in the short video software. Because a video user usually shoots pictures of people when actually shooting a video, that is, pictures of users exist in the videos the video user has historically shot, this embodiment can calculate a user interest portrait, including basic attribute information such as the age and gender of the users the video user is interested in, by analyzing the face pictures contained in the videos historically shot by the video user. The target video in this embodiment may thus be any video that the video user has actually shot and uploaded to the video server for publication on the Internet.
In this embodiment, generating the user interest portrait, which includes basic attribute information such as the age and gender of the users the video user is interested in, mainly relies on the face data appearing in the videos actually shot by the video user. Therefore, after the target video is screened out from the videos the video user has historically shot, it is first decoded and cut into frames at one-second intervals, yielding a series of decoded video frames. Picture analysis is then performed on each video frame of the target video, and the real face areas actually shot by the video user are cropped from each frame. For example, for a co-shot (duet) video of the video user, the corresponding real face areas are cropped only from the part of the picture actually shot by the video user, while the face areas shot by other users in each video frame do not need to be cropped. Interference from pictures shot by other users is thereby removed, so that the real face information of the video user can be analyzed accurately.
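As a purely illustrative sketch of this step (the embodiment specifies no code, and every function and parameter name below is a hypothetical stand-in), the following Python fragment decodes a target video at roughly one frame per second with OpenCV and collects the real face crops from each sampled frame:

```python
import cv2

def iter_video_frames(video_path, fps_sample=1.0):
    """Decode the target video and yield roughly fps_sample frames per second."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unknown
    step = max(int(round(native_fps / fps_sample)), 1)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            yield frame          # BGR ndarray handed to downstream analysis
        index += 1
    cap.release()

def collect_real_faces(video_path, detect_faces, is_real_face):
    """Crop the real face areas from every sampled frame of the target video."""
    real_faces = []
    for frame in iter_video_frames(video_path):
        for (x, y, w, h) in detect_faces(frame):     # candidate face boxes
            crop = frame[y:y + h, x:x + w]
            if is_real_face(crop):                   # drop cartoon/effect faces
                real_faces.append(crop)
    return real_faces
```

Here `detect_faces` and `is_real_face` stand in for the face detection and face classification models described in the second embodiment below.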
S120, clustering the real face areas in the target video by adopting the face representative features of the real face areas to obtain a representative face set of the video user in the target video.
The face representative feature in this embodiment refers to a feature that can represent the accurate form of key points such as the nose, mouth and eyes in a real face area, for example an embedding feature vector that represents the real face in a low-dimensional space.
Specifically, after the real face areas actually shot by the video user are cropped from each video frame of the target video, the cropped areas may contain the face pictures of several different users, since multiple people can appear when the video user actually shoots the target video, and the face representative features of different users differ. This embodiment therefore first uses an existing feature extraction network to extract features from each real face area cropped from the target video, obtaining the face representative feature of each real face area. The similarity between the face representative features of the real face areas is then analyzed with an existing clustering algorithm to cluster the real face areas in the target video; in this embodiment, a density-based clustering algorithm (Density-Based Spatial Clustering of Applications with Noise, DBSCAN) can be used to cluster the face representative features of the real face areas in the target video, yielding several different face clustering results, where the real face areas contained in each clustering result are similar faces, that is, each clustering result can represent the face information of one user. Because the target video is actually shot by the video user, the face appearing most often in the target video can be considered the main person of the target video, namely the video user of this embodiment; and even if that face is not the video user, the video user is evidently interested in videos containing such faces. After the different face clustering results are obtained, the clustering result containing the largest number of similar real face areas is therefore screened out; its real face areas are considered the face information representing the interest of the video user, and the screened clustering result is taken as the representative face set of the video user in the target video, containing the real face areas the video user is interested in and the face representative feature of each of them.
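A minimal sketch of this clustering step, assuming scikit-learn's DBSCAN over (N, D) embedding vectors with illustrative eps/min_samples values (the patent fixes none of these), might look as follows:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def representative_face_set(embeddings, eps=0.4, min_samples=2):
    """Return indices of the largest DBSCAN cluster of face embeddings.

    embeddings: (N, D) array of face representative features.
    """
    labels = DBSCAN(eps=eps, min_samples=min_samples,
                    metric="cosine").fit_predict(embeddings)
    clustered = labels[labels != -1]           # -1 marks DBSCAN noise points
    if clustered.size == 0:
        return np.array([], dtype=int)         # no cluster found at all
    largest = np.bincount(clustered).argmax()  # the face that appears most often
    return np.flatnonzero(labels == largest)
```

The indices returned select the real face areas forming the representative face set, i.e. the largest cluster of mutually similar faces.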
S130, generating portrait information of each real face area in the representative face set, fusing the portrait information of each real face area, and generating the interest portrait information of the video user within the target video.
Specifically, after the representative face set of the video user in the target video is obtained, since the real face areas it contains can represent the real face information the video user is interested in, this embodiment can determine the portrait information, such as age and gender, represented by each real face area by analyzing its face representative feature, and then fuse the portrait information represented by the real face areas in the representative face set to obtain the interest portrait information of the video user within the target video. Because the portrait information of every real face area in the representative face set is analyzed and fused, the one-sidedness of analyzing the video-level user interest portrait from one specific video frame is avoided, and the comprehensiveness and accuracy of the interest portrait information generated for the video user within the target video are improved.
For example, as shown in FIG. 1B, in order to accurately generate the portrait information of each real face area in the representative face set of the video user in the target video, this embodiment pre-constructs, by training a convolutional neural network model, a portrait generation model capable of accurately generating the portrait information of a real face area. Generating the portrait information of each real face area in the representative face set may then specifically include: for each real face area in the representative face set, inputting the face representative feature of the real face area into the pre-constructed portrait generation model to obtain the sub-portrait information of the real face area in each portrait dimension, and combining it into the portrait information of the real face area.
Specifically, after the representative face set of the video user in the target video is obtained, the face representative features of the real face areas in the representative face set can be input in turn into the pre-constructed portrait generation model, whose trained network parameters and network structure perform feature analysis on each face representative feature in the different portrait dimensions, yielding the sub-portrait information of each real face area in each portrait dimension; a portrait dimension can be any attribute contained in a user portrait, such as age or gender. The sub-portrait information of each real face area in every portrait dimension is subsequently combined to obtain the portrait information of that real face area.
For example, to ensure the accuracy of the sub-portrait information generated for the real face region in each portrait dimension, this embodiment may set a corresponding portrait sub-model for each portrait dimension in the pre-built portrait generation model, such as an age sub-model for the user's age and a gender sub-model for the user's gender. The face representative feature of each real face region in the representative face set is then input in turn into each portrait sub-model of the portrait generation model, each sub-model analyzes the sub-portrait information of the real face region in its portrait dimension, and each sub-model outputs the sub-portrait information of the real face region in the corresponding portrait dimension.
In this embodiment, the type of each portrait sub-model may be set according to the characteristics of its portrait dimension; for example, the age sub-model is a regression model and the gender sub-model is a classification model.
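A sketch of such a multi-head portrait generation model, here in PyTorch with an age regression head and a gender classification head (the layer sizes and structure are assumptions, not taken from the patent):

```python
import torch
import torch.nn as nn

class PortraitModel(nn.Module):
    """One sub-model per portrait dimension over a face representative feature."""

    def __init__(self, feat_dim=512):
        super().__init__()
        # Age: regression head producing a single scalar.
        self.age_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))
        # Gender: classification head producing two class logits.
        self.gender_head = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, feat):
        age = self.age_head(feat).squeeze(-1)      # predicted age per face
        gender_logits = self.gender_head(feat)     # per-class gender scores
        return {"age": age, "gender_logits": gender_logits}
```

In use, the same face representative feature vector is passed once through the model and each sub-model emits the sub-portrait information of its own portrait dimension.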
In addition, after the sub-portrait information of each real face region in each portrait dimension is obtained, the fusion of the portrait information of the real face areas can proceed as follows: for each portrait dimension, a mean analysis is performed on the sub-portrait information of the real face areas in the representative face set in that portrait dimension to determine the interest sub-portrait information of the video user in that dimension within the target video; the interest sub-portrait information of the video user in every portrait dimension is then combined to obtain the interest portrait information of the video user within the target video.
Specifically, for each portrait dimension, the portrait sub-model of that dimension outputs the sub-portrait information of every real face area in the dimension, for example the age represented by each real face area the video user is interested in. A mean analysis is performed on the sub-portrait information of the real face areas in the representative face set in that dimension, and the resulting sub-portrait mean, for example the mean age over the real face areas, is taken as the interest sub-portrait information of the video user in that dimension within the target video. After the interest sub-portrait information in each portrait dimension has been determined by mean analysis, the interest sub-portrait information of the video user in every dimension is combined, yielding the interest portrait information of the video user within the target video.
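The fusion step thus reduces, for each portrait dimension, the per-face sub-portrait values to their mean; a minimal sketch (the dictionary keys are illustrative):

```python
def fuse_portraits(sub_portraits):
    """Average per-face sub-portrait values dimension by dimension.

    sub_portraits: one dict per real face area in the representative face
    set, e.g. {"age": 24.0, "gender": 1.0} (keys are illustrative).
    """
    dims = sub_portraits[0].keys()
    return {dim: sum(p[dim] for p in sub_portraits) / len(sub_portraits)
            for dim in dims}
```

For example, `fuse_portraits([{"age": 22.0, "gender": 1.0}, {"age": 26.0, "gender": 1.0}])` would yield `{"age": 24.0, "gender": 1.0}` as the interest portrait of the video user within the target video.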
In the technical solution provided by this embodiment, the real face areas actually shot by the video user are first cropped from each video frame of the target video, eliminating interference from non-face pictures in the target video; the real face areas in the target video are then clustered using their face representative features to obtain a representative face set of the video user in the target video, whose real face areas can comprehensively represent the interest range of the video user within the video; the portrait information of each real face area in the representative face set is subsequently generated directly and fused to generate the interest portrait information of the video user within the target video. User interest portrait generation at the video level is thereby realized, the one-sidedness of taking the user portrait information in one specific video frame as the interest portrait information of the video is avoided, the interference of non-face pictures in the target video is eliminated, and the comprehensiveness and accuracy of the interest portrait information of the video user within the target video are improved.
Example two
FIG. 2A is a flowchart of a method for generating an interest portrait of a video user according to the second embodiment of the present invention, and FIG. 2B is a schematic diagram of the real face region capturing process in the method according to the second embodiment of the present invention. This embodiment is optimized on the basis of the above embodiment. Specifically, as shown in FIG. 2A, this embodiment mainly explains in detail the specific cropping process of the real face areas in each video frame of the target video.
Optionally, as shown in fig. 2A, the present embodiment may include the following steps:
s210, each video frame of the target video is used for respectively cutting out a frame area actually shot by a video user as a corresponding new video frame.
Optionally, when a video user actually shoots a video, the shot video may be combined with another video and uploaded together as a single co-shot video, so pictures shot by other users may well exist in the video frames of the target video. To ensure the correlation between the cropped real face areas and the video user, this embodiment first determines the picture area actually shot by the video user in each video frame of the target video, then captures the frame area actually shot by the video user from each video frame as a corresponding new video frame, and subsequently crops the real face areas from the new video frames, thereby avoiding interference from video pictures shot by other users.
For example, capturing the new video frames from the target video may specifically include: if the target video is a co-shot (duet) video, determining the actual shooting area of the video user in the target video and capturing the frame picture within the actual shooting area from each video frame as the corresponding new video frame; and if the target video is a non-co-shot video, taking each video frame of the target video as the corresponding new video frame.
Specifically, since pictures shot by other users exist only in co-shot videos, after the target video is acquired this embodiment first judges whether it is a co-shot video. If it is, pictures shot by other users exist in its video frames, so the actual shooting area of the video user is determined in each video frame and the frame picture within that area is captured from each frame as the corresponding new video frame. If the target video is a non-co-shot video, no pictures shot by other users exist in its frames, so the frames are not processed in any way and each original video frame of the target video is directly used as the corresponding new video frame.
The step of judging whether the target video is a co-shot video may specifically include: if the size ratio of the video frames in the target video conforms to a preset co-shot video ratio, determining that the target video is a co-shot video; otherwise, using an edge detection operator to perform straight-line detection on the middle area of each video frame of the target video, and if a straight line exceeding a preset length is detected in the video frames, determining that the target video is a co-shot video.
Specifically, a co-shot video is obtained by combining several different videos actually shot by different users, while a video actually shot by one user matches the size of the terminal display screen, so the size ratios of co-shot and non-co-shot videos differ considerably: the length-width ratio of a non-co-shot video matches the display screen and is closer to an elongated rectangle, whereas after several non-co-shot videos are spliced, the length-width ratio of most co-shot videos is closer to a square. The size ratio of most co-shot videos therefore conforms to the preset co-shot video ratio, in which case the target video is determined to be a co-shot video. Otherwise, each video frame of the target video is converted into a grayscale image and an edge detection operator (the Sobel operator) is applied for straight-line detection; because a vertical dividing line usually runs through the middle of a co-shot video, the middle area of each video frame is selected for detection, the detected edges are binarized and expanded by a dilation operation, and the straight lines in the middle area are extracted. If a straight line is detected in the middle area of the video frames and exceeds the preset length, it is taken to be the dividing line of a co-shot video and the target video is determined to be a co-shot video. Conversely, if the video frame size ratio does not conform to the preset co-shot ratio and either no straight line is detected in the video frames or the detected straight line does not exceed the preset length, the target video is treated as a non-co-shot video.
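A hedged sketch of this co-shot check in Python with OpenCV (the size-ratio tolerance, centre-strip width, binarization threshold and line-length ratio are all illustrative assumptions):

```python
import cv2
import numpy as np

def is_coshot_frame(frame, length_ratio=0.6):
    """Look for a long near-vertical dividing line in the frame's middle area."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    h, w = gray.shape
    strip = gray[:, int(w * 0.45):int(w * 0.55)]           # narrow centre strip
    edges = cv2.convertScaleAbs(cv2.Sobel(strip, cv2.CV_16S, 1, 0))
    edges = cv2.dilate(edges, np.ones((3, 3), np.uint8))   # expansion step
    binary = cv2.threshold(edges, 80, 255, cv2.THRESH_BINARY)[1]
    # Any line this long inside the narrow strip is effectively vertical.
    lines = cv2.HoughLinesP(binary, 1, np.pi / 180, 100,
                            minLineLength=int(h * length_ratio), maxLineGap=10)
    return lines is not None

def is_coshot_video(width, height, sampled_frames):
    """Size-ratio test first; otherwise fall back to the dividing-line test."""
    if abs(width / height - 1.0) < 0.15:   # near-square ratio suggests a co-shot
        return True
    return any(is_coshot_frame(f) for f in sampled_frames)
```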
S220, respectively cutting out corresponding face target areas in each new video frame, and screening out corresponding real face areas from the face target areas.
Optionally, after the frame area actually shot by the video user has been captured from each video frame of the target video as a corresponding new video frame, face picture analysis is performed on each new video frame and the face target areas actually shot by the video user are cropped from each new video frame. Because a video user may add cartoon-character or facial-expression special effects after actually shooting a video, some non-real face areas carrying such effects may exist among the face target areas cropped from the new video frames. The corresponding real face areas therefore need to be screened out of the face target areas, and only the face representative features of the real face areas are extracted subsequently, thereby eliminating the influence of non-real faces on the generation of the video user's interest portrait and ensuring its accuracy.
For example, for recognizing the face target areas in the new video frames, this embodiment may pre-construct a face detection model that can accurately recognize the face areas in an image. As shown in FIG. 2B, the recognition may specifically include: inputting each new video frame into the pre-constructed face detection model, which marks the initial face areas in the new video frame together with the face confidence of each initial face area; then taking the initial face areas whose face confidence exceeds a preset confidence threshold in each new video frame as the corresponding face target areas and cropping them according to their marked positions.
Specifically, each new video frame is input into the pre-constructed face detection model, which identifies whether corresponding face areas exist in the frame, marks each initial face area with a rectangular box, and records the coordinate position and face confidence of each initial face area. If the face confidence of an initial face area exceeds the preset confidence threshold, the area reliably contains a face picture, so the initial face areas whose confidence exceeds the threshold are taken as the corresponding face target areas and cropped according to the rectangular-box positions marked in each new video frame, yielding the face target areas of every new video frame.
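A minimal sketch of this confidence filtering (the `face_detector` callable stands in for the pre-constructed face detection model, and its output format is an assumption):

```python
def face_target_areas(frame, face_detector, conf_threshold=0.8):
    """Keep detections whose face confidence clears the preset threshold."""
    areas = []
    for (x, y, w, h, conf) in face_detector(frame):   # box plus face confidence
        if conf > conf_threshold:
            areas.append(frame[y:y + h, x:x + w])     # crop at the marked box
    return areas
```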
Further, after the corresponding face target areas have been cropped from each new video frame, this embodiment uses a pre-built face classification model to judge whether each face target area is a real face area. For the screening, each face target area is input into the pre-built face classification model to obtain its real face score, and the face target areas whose real face score exceeds a preset real threshold are taken as the corresponding real face areas.
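And a matching sketch of the real-face screening, again with a stand-in `face_classifier` for the pre-built face classification model:

```python
def screen_real_faces(face_areas, face_classifier, real_threshold=0.5):
    """Keep face target areas whose real face score exceeds the real threshold."""
    return [area for area in face_areas
            if face_classifier(area) > real_threshold]
```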
S230, clustering the real face areas in the target video by adopting the face representative features of the real face areas to obtain a representative face set of the video user in the target video.
S240, generating portrait information of each real face area in the representative face set, fusing the portrait information of each real face area, and generating the interest portrait information of the video user within the target video.
In the technical solution provided by this embodiment, the frame areas actually shot by the video user are first captured from each video frame of the target video as corresponding new video frames, the corresponding face target areas are cropped from each new video frame, and the corresponding real face areas are screened out of the face target areas, eliminating the influence of pictures shot by other users and of non-real faces on the portrait generation of the video user. The face representative features of the real face areas are subsequently used to cluster the real face areas in the target video into a representative face set of the video user, whose real face areas can comprehensively represent the interest range of the video user within the target video; the portrait information of each real face area in the representative face set is then generated directly and fused to generate the interest portrait information of the video user within the target video. Interest portrait generation for the video user at the video level is thereby realized, and the comprehensiveness and accuracy of the interest portrait information generated for the video user within the target video are improved.
Example III
FIG. 3A is a flowchart of a method for generating an interest portrait of a video user according to the third embodiment of the present invention, and FIG. 3B is a schematic diagram of the user-level interest portrait generation process in the method according to the third embodiment of the present invention. This embodiment is optimized on the basis of the above embodiments. Specifically, as shown in FIG. 3A, this embodiment mainly explains in detail the specific generation process of the interest portrait information of a video user at the user level.
Optionally, as shown in fig. 3A, the present embodiment may include the following steps:
s310, capturing real face areas actually shot by the video user in each video frame of the target video.
S320, clustering the real face areas in the target video by adopting the face representative features of the real face areas to obtain a representative face set of the video user in the target video.
S330, taking each video in the history video library which is actually shot and uploaded by the video user as a target video in sequence, and determining a representative face set of the video user in each video.
Optionally, to analyze the interest portrait information of a video user at the user level, the portrait information of the video user across the many videos in the historical video library the user has actually shot and uploaded must first be analyzed, so that the user-level interest portrait information can be derived comprehensively. Each video in the historical video library actually shot and uploaded by the video user is therefore taken in turn as the target video of this embodiment, the video-level interest portrait generation method provided in the above embodiments is used to determine the representative face set of the video user in each video of the library, and the representative face sets of the videos are then integrated and analyzed again, ensuring the comprehensiveness of the user-level interest portrait information of the video user.
S340, re-clustering the real face areas of the video user after merging the representative face sets of the videos to obtain a comprehensive representative face set of the video user.
Optionally, after the representative face set of the video user in each video has been determined, the representative face sets of the videos are merged. Different videos shot by the video user may feature different users, so dissimilar real face areas may exist in the merged overall representative face set. The similarity between the face representative features of the real face areas in the merged set is therefore analyzed with an existing clustering algorithm to re-cluster the real face areas, where the real face areas contained in each face clustering result are similar faces, that is, each clustering result can represent the face information of one user. Since the videos are actually shot by the video user, the face appearing most often in the merged overall representative face set can be considered to represent the video images the video user is interested in. After the different face clustering results are obtained, the clustering result with the largest number of similar real face areas is therefore screened out; its real face areas are considered the face information of interest to the video user, and the screened clustering result is taken as the comprehensive representative face set of the video user, containing the real face areas the video user is interested in and the face representative feature of each of them.
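Reusing the video-level clustering sketch above, the user-level step can be illustrated as follows (the array shapes and helper name are assumptions):

```python
import numpy as np

def comprehensive_face_set(per_video_embeddings):
    """Merge per-video representative sets and re-cluster at the user level.

    per_video_embeddings: list of (N_i, D) arrays, one per historical video.
    """
    merged = np.vstack(per_video_embeddings)     # union of representative faces
    keep = representative_face_set(merged)       # largest cluster across videos
    return merged[keep]
```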
S350, generating portrait information of each real face area in the comprehensive representative face set, and fusing the portrait information of each real face area to generate the interest portrait information of the video user.
Optionally, after the comprehensive representative face set of the video user has been obtained, since the real face areas it contains can represent the real face information the video user is interested in, this embodiment can determine the portrait information, such as age and gender, represented by each real face area by analyzing its face representative feature, and then fuse the portrait information represented by the real face areas in the comprehensive representative face set to obtain the interest portrait information of the video user. Because the portrait information of every real face area in the comprehensive representative face set is analyzed and fused, the one-sidedness of analyzing the video-level user interest portrait from one specific video frame is avoided, and the comprehensiveness and accuracy of the generated interest portrait information of the video user are improved.
In the technical solution provided by this embodiment, each video in the historical video library actually shot and uploaded by the video user is taken in turn as the target video, the representative face set of the video user in each video is determined, the real face areas of the merged representative face sets are re-clustered to obtain the comprehensive representative face set of the video user, and the portrait information of each real face area in the comprehensive representative face set is generated and fused to generate the interest portrait information of the video user. Interest portrait information is thereby further generated at the user level on the basis of the video-level interest portrait generation, improving the comprehensiveness and accuracy of the video user's interest portrait.
Example IV
FIG. 4 is a schematic structural diagram of an interest portrait generating device for a video user according to the fourth embodiment of the present invention. Specifically, as shown in FIG. 4, the device may include:
the real face intercepting module 410 is configured to intercept real face regions actually photographed by the video user in each video frame of the target video respectively;
The representative face clustering module 420 is configured to cluster the real face regions in the target video by using the face representative features of the real face regions to obtain a representative face set of the video user in the target video;
the portrait generation module 430 is configured to generate portrait information of each real face area in the representative face set, and fuse the portrait information of each real face area to generate interest portrait information of the video user within the target video.
In the technical solution provided by this embodiment, the real face areas actually shot by the video user are first cropped from each video frame of the target video, eliminating interference from non-face pictures in the target video; the real face areas in the target video are then clustered using their face representative features to obtain a representative face set of the video user in the target video, whose real face areas can comprehensively represent the interest range of the video user within the video; the portrait information of each real face area in the representative face set is subsequently generated directly and fused to generate the interest portrait information of the video user within the target video. User interest portrait generation at the video level is thereby realized, the one-sidedness of taking the user portrait information in one specific video frame as the interest portrait information of the video is avoided, the interference of non-face pictures in the target video is eliminated, and the comprehensiveness and accuracy of the interest portrait information of the video user within the target video are improved.
The interest portrait generating device for a video user provided by this embodiment can execute the method for generating an interest portrait of a video user provided by any embodiment of the present invention, and has the corresponding functions and beneficial effects.
Example five
FIG. 5 is a schematic structural diagram of an apparatus according to the fifth embodiment of the present invention. As shown in FIG. 5, the apparatus includes a processor 50, a storage device 51 and a communication device 52; the number of processors 50 in the apparatus may be one or more, with one processor 50 taken as an example in FIG. 5; the processor 50, the storage device 51 and the communication device 52 in the apparatus may be connected by a bus or in other ways, connection by a bus being taken as an example in FIG. 5.
The device provided by the embodiment can be used for executing the interest portrait generation method of the video user provided by any embodiment, and has corresponding functions and beneficial effects.
Example six
The sixth embodiment of the present invention also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method for generating an interest portrait of a video user in any of the above embodiments. The method specifically comprises the following steps:
intercepting real face areas actually shot by a video user in each video frame of a target video respectively;
Clustering the real face areas in the target video by adopting the face representative features of the real face areas to obtain a representative face set of the video user in the target video;
and generating portrait information of each real face area in the representative face set, and fusing the portrait information of each real face area to generate interest portrait information of the video user within the target video.
Of course, the storage medium containing the computer executable instructions provided in the embodiments of the present invention is not limited to the above-described method operations, and may also perform the related operations in the method for generating an interest image of a video user provided in any embodiment of the present invention.
From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.
It should be noted that, in the embodiment of the interest image generating apparatus for a video user, each unit and module included are only divided according to the functional logic, but not limited to the above-mentioned division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, and various modifications and variations may be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (11)

1. A method for generating an interest portrait of a video user, comprising:
intercepting real face areas actually shot by a video user in each video frame of a target video respectively;
clustering the real face areas in the target video by adopting the face representative features of the real face areas to obtain a representative face set of the video user in the target video;
generating portrait information of each real face area in the representative face set, and fusing the portrait information of each real face area to generate interest portrait information of the video user within the target video;
wherein generating the portrait information of each real face area in the representative face set comprises:
inputting, for each real face area in the representative face set, the face representative feature of the real face area into a pre-constructed portrait generation model to obtain sub-portrait information of the real face area in each portrait dimension, and combining the sub-portrait information into the portrait information of the real face area.
2. The method of claim 1, wherein fusing the portrait information of each real face area to generate the interest portrait information of the video user within the target video comprises:
for each portrait dimension, carrying out mean value analysis on sub-portrait information of each real face area in the representative face set in the portrait dimension, and determining interest sub-portrait information of the video user in the target video in the portrait dimension;
and combining the interest sub-portrait information of the video user in each portrait dimension within the target video to obtain the interest portrait information of the video user within the target video.
3. The method of claim 1, wherein capturing real face regions actually captured by the video user within each video frame of the target video, respectively, comprises:
capturing, from each video frame of the target video, the frame area actually shot by the video user as a corresponding new video frame;
and respectively cutting out corresponding face target areas in each new video frame, and screening out corresponding real face areas from the face target areas.
4. A method according to claim 3, wherein capturing the frame area actually photographed by the video user as the corresponding new video frame in each video frame of the target video includes:
if the target video is a co-shot (duet) video, determining an actual shooting area of the video user in the target video, and capturing the frame picture within the actual shooting area from each video frame as a corresponding new video frame;
and if the target video is a non-co-shot video, taking each video frame in the target video as a corresponding new video frame.
5. The method of claim 4, further comprising, before capturing from each video frame of the target video the frame area actually shot by the video user:
if the video frame size ratio of the target video conforms to a preset co-shot video ratio, determining that the target video is a co-shot video; otherwise, using an edge detection operator to perform straight-line detection on a middle area of each video frame in the target video;
and if a straight line is detected in the video frames of the target video and the straight line exceeds a preset length, determining that the target video is a co-shot video.
6. The method according to claim 3, wherein capturing corresponding face target areas respectively in each new video frame comprises:
inputting each new video frame into a pre-constructed face detection model respectively, and marking an initial face area in the new video frame and the face confidence of the initial face area;
and taking the initial face region with the face confidence exceeding the preset confidence threshold value in each new video frame as a corresponding face target region, and intercepting according to the marking position of the face target region.
7. A method according to claim 3, wherein screening out the corresponding real face region from the face target region comprises:
respectively inputting each face target area into a pre-constructed face classification model to obtain a real face score of the face target area;
and taking the face target area with the real face score exceeding a preset real threshold value as a corresponding real face area.
8. The method of any one of claims 1-7, further comprising:
Taking each video in the history video library which is actually shot and uploaded by the video user as the target video in sequence, and determining a representative face set of the video user in each video;
re-clustering the real face areas of the video user after merging the representative face sets of the videos to obtain a comprehensive representative face set of the video user;
and generating portrait information of each real face area in the comprehensive representative face set, and fusing the portrait information of each real face area to generate interest portrait information of the video user.
9. An interest portrait generating device for a video user, comprising:
the real face intercepting module is used for intercepting real face areas actually photographed by a video user in each video frame of the target video respectively;
the representative face clustering module is used for clustering the real face areas in the target video by adopting the face representative characteristics of the real face areas to obtain a representative face set of the video user in the target video;
the portrait generation module is used for generating portrait information of each real face area in the representative face set, fusing portrait information of each real face area and generating interest portrait information of the video user in the target video;
wherein the portrait generation module is further configured to:
input, for each real face area in the representative face set, the face representative feature of the real face area into a pre-constructed portrait generation model to obtain sub-portrait information of the real face area in each portrait dimension, and combine the sub-portrait information into the portrait information of the real face area.
10. An interest portrait generating apparatus for a video user, the apparatus comprising:
one or more processors;
a storage means for storing one or more programs;
the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method for generating an interest portrait of a video user as claimed in any one of claims 1 to 8.
11. A computer-readable storage medium on which a computer program is stored, wherein the program, when executed by a processor, implements the method for generating an interest portrait of a video user as claimed in any one of claims 1 to 8.
CN202010526685.9A 2020-06-09 2020-06-09 Method, device, equipment and storage medium for generating interest portraits of video users Active CN111666908B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010526685.9A CN111666908B (en) 2020-06-09 2020-06-09 Method, device, equipment and storage medium for generating interest portraits of video users

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010526685.9A CN111666908B (en) 2020-06-09 2020-06-09 Method, device, equipment and storage medium for generating interest portraits of video users

Publications (2)

Publication Number Publication Date
CN111666908A CN111666908A (en) 2020-09-15
CN111666908B (en) 2023-05-16

Family

ID=72386670

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010526685.9A Active CN111666908B (en) 2020-06-09 2020-06-09 Method, device, equipment and storage medium for generating interest portraits of video users

Country Status (1)

Country Link
CN (1) CN111666908B (en)

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407418A (en) * 2016-09-23 2017-02-15 Tcl集团股份有限公司 A face identification-based personalized video recommendation method and recommendation system
CN108446385A (en) * 2018-03-21 2018-08-24 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN108471544B (en) * 2018-03-28 2020-09-15 北京奇艺世纪科技有限公司 Method and device for constructing video user portrait
CN109190449A (en) * 2018-07-09 2019-01-11 北京达佳互联信息技术有限公司 Age recognition methods, device, electronic equipment and storage medium
CN110135257A (en) * 2019-04-12 2019-08-16 深圳壹账通智能科技有限公司 Business recommended data generation, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN111666908A (en) 2020-09-15


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231008

Address after: 31a, 15 / F, building 30, maple mall, bangrang Road, Brazil, Singapore

Patentee after: Baiguoyuan Technology (Singapore) Co.,Ltd.

Address before: 5-13 / F, West Tower, building C, 274 Xingtai Road, Shiqiao street, Panyu District, Guangzhou, Guangdong 510000

Patentee before: GUANGZHOU BAIGUOYUAN INFORMATION TECHNOLOGY Co.,Ltd.
