CN115934988A - Live cover image extraction method, device, equipment and medium


Info

Publication number: CN115934988A
Authority: CN (China)
Prior art keywords: image, face, live, evaluation, cover
Legal status: Pending
Application number: CN202211701647.8A
Other languages: Chinese (zh)
Inventors: 郑康元, 宫凯程, 陈增海, 王璞
Current Assignee: Guangzhou Cubesili Information Technology Co Ltd
Original Assignee: Guangzhou Cubesili Information Technology Co Ltd
Application filed by Guangzhou Cubesili Information Technology Co Ltd
Priority to CN202211701647.8A
Publication of CN115934988A

Landscapes

  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The application discloses a live cover image extraction method, and a corresponding device, equipment and medium. The method comprises the following steps: acquiring a plurality of face region images from a live video stream; synchronously determining a plurality of evaluation scores of each face region image corresponding to a plurality of evaluation dimensions, and summarizing the plurality of evaluation scores into a comprehensive score of the corresponding face region image; and screening out a face region image according to the comprehensive score for determining a cover image of the live video stream. Because the quality of the face region images is measured comprehensively and systematically across a plurality of evaluation dimensions, high-quality face region images can be determined effectively, the cover image produced from them is more representative and carries the image information of the anchor user of the live video stream, and the live cover constructed in this way can attract larger user traffic, achieving effective promotion and activating interaction on the webcast platform.

Description

Live cover image extraction method, device, equipment and medium
Technical Field
The present application relates to the field of webcast technologies, and in particular, to a live cover image extraction method, and a corresponding device, electronic device, and computer-readable storage medium.
Background
In a webcast scenario, an anchor user pushes a video stream to a live broadcast room to achieve application purposes such as talent and skill display, information sharing and knowledge education; through these activities the anchor user participates in social labor, earns income, and promotes overall social benefit.
Generating a promotion entrance for a live broadcast room lets users watch the room's live video stream through that entrance. Attracting users into the live broadcast room not only provides convenience for the users, but also promotes the development of webcast activities and improves social benefit as a whole.
The promotion entrance of a live broadcast room is usually presented in the form of a live cover, whose main part is the cover image; a high-quality cover image attracts more user attention and increases the user traffic of the corresponding live broadcast room.
Cover images produced by conventional techniques generally do not consider the overall quality of the cover image: related image frames are intercepted from the live video stream of the live broadcast room under certain conditions and used directly as the cover image, so the quality of cover images is uneven, they can hardly serve as an information carrier for the live broadcast room, and naturally they can hardly attract user traffic to it.
Therefore, the technology for producing cover images for live covers still has room for improvement, and other feasible approaches need to be explored to obtain high-quality cover images for the live covers of live broadcast rooms.
Disclosure of Invention
A primary object of the present application is to solve at least one of the above problems and provide a live cover image extraction method, and a corresponding apparatus, electronic device, and computer-readable storage medium.
In order to meet various purposes of the application, the following technical scheme is adopted in the application:
one of the purposes of the present application is to provide a live-broadcast cover image extraction method, which includes the following steps:
acquiring a plurality of face area images from a live video stream;
synchronously determining a plurality of evaluation scores of each face region image corresponding to a plurality of evaluation dimensions, and summarizing the plurality of evaluation scores into a comprehensive score of the corresponding face region image;
and screening out a face area image according to the comprehensive score to be used for determining a cover image of the live video stream.
Optionally, obtaining a plurality of face region images from a live video stream includes:
performing face detection based on the image frames of the live video stream, and determining a plurality of target image frames carrying face content and selection frames corresponding to face areas;
extracting a region image from the corresponding target image frame according to the selection frame;
clustering the face feature vectors of the plurality of region images, determining a maximum cluster from a plurality of clusters obtained by clustering, and taking the region image in the maximum cluster as a face region image.
Optionally, extracting a region image from the corresponding target image frame according to the selection frame includes:
extracting a corresponding region image from the corresponding target image frame according to the selection frame, and performing super-resolution enhancement processing on the region image;
and performing posture correction processing on the face content in the area image subjected to the super-resolution enhancement processing.
Optionally, the step of synchronously determining a plurality of evaluation scores of each face region image corresponding to a plurality of evaluation dimensions, and summarizing the plurality of evaluation scores into a comprehensive score of the corresponding face region image includes:
inputting each face region image into an image encoder of a face scoring model to extract image characteristic information of the face region image;
synchronously passing the image characteristic information through a plurality of classifiers which are arranged in the portrait scoring model and correspond to a plurality of evaluation dimensions, and determining an evaluation score corresponding to each evaluation dimension;
and fusing the plurality of evaluation scores corresponding to each face region image into a comprehensive score of the corresponding face region image.
Optionally, before acquiring a plurality of face region images from a live video stream, the method includes:
calling training samples in a training data set to input the portrait scoring model, predicting classification results of the portrait scoring model mapped to classifiers corresponding to the multiple evaluation dimensions, wherein part of the training samples are face region images;
adopting a plurality of supervision labels provided by a plurality of corresponding evaluation dimensions mapped with the training samples to correspondingly calculate the classification loss value of each classification result; the plurality of evaluation dimensions comprises any number of a plurality of dimensions as follows: the method comprises the steps of representing whether a face image in a training sample belongs to a first dimension of a real person, representing whether the face image in the training sample is a second dimension of a large head photo, representing whether the face image in the training sample is a third dimension of a high-quality image, representing whether the face image in the training sample is a complete fourth dimension, representing whether the face image in the training sample is a covered fifth dimension, representing whether the face image in the training sample belongs to a sixth dimension of the front face of the face, and representing whether the face image in the training sample contains a seventh dimension of a smile expression;
and fusing the classification loss values into a total loss value, and performing gradient updating on the portrait scoring model according to the total loss value until the portrait scoring model is iteratively trained to a convergence state.
Optionally, screening out a face region image according to the comprehensive score for determining a cover image of the live video stream, including:
screening out a plurality of face region images with the comprehensive scores higher than a preset threshold value as target face images;
cutting, from the source image frame of the target face image, a screenshot that contains the target face image and corresponds to the size specification of the live broadcast cover;
pushing each screenshot to terminal equipment of a main broadcasting user to which the live video stream belongs;
and acquiring the screenshot appointed by the anchor user, and setting the screenshot as a cover image of the live video stream.
Optionally, screening out a face region image according to the comprehensive score for determining a cover image of the live video stream, including:
screening out the face region image with the highest comprehensive score as a target face image;
performing super-resolution enhancement on the target face image by preset times to obtain an enhanced face image;
performing super-resolution enhancement of the preset multiple on the source image frame of the target face image to obtain an enhanced source image frame;
overlaying and synthesizing the enhanced face image to the enhanced source image frame according to the position corresponding relation to obtain a high-quality image frame;
and cutting, from the high-quality image frame, a screenshot that contains the target face image and corresponds to the size specification of the live broadcast cover, to be set as the cover image of the live video stream.
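As an illustrative sketch of this enhance-overlay-crop flow (the super_resolve function, the 2x factor and the cover size are assumptions for illustration; frames are numpy image arrays, not values fixed by this application):

    def build_cover_image(source_frame, face_box, super_resolve, scale=2, cover_size=(720, 960)):
        """face_box is (x1, y1, x2, y2) of the target face image inside its source image frame."""
        x1, y1, x2, y2 = face_box
        enhanced_face = super_resolve(source_frame[y1:y2, x1:x2], scale)   # enhanced face image
        enhanced_frame = super_resolve(source_frame, scale)                # enhanced source image frame
        # overlay the enhanced face at the scaled position to obtain the high-quality frame
        enhanced_frame[y1 * scale:y2 * scale, x1 * scale:x2 * scale] = enhanced_face
        # crop a screenshot containing the face, matching the live cover size specification
        cx, cy = (x1 + x2) * scale // 2, (y1 + y2) * scale // 2
        w, h = cover_size
        left, top = max(cx - w // 2, 0), max(cy - h // 2, 0)
        return enhanced_frame[top:top + h, left:left + w]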
A live cover image extraction apparatus provided in accordance with one of the objects of the present application includes:
the face image acquisition module is used for acquiring a plurality of face area images from a live video stream;
the multi-dimensional synchronous scoring module is set to synchronously determine a plurality of evaluation scores of a plurality of evaluation dimensions corresponding to each face region image, and the plurality of evaluation scores are collected into a comprehensive score of the corresponding face region image;
and the live broadcast cover screening module is used for screening the face area image according to the comprehensive score to determine the cover image of the live broadcast video stream.
The electronic device comprises a central processing unit and a memory, wherein the central processing unit is used for calling and running a computer program stored in the memory to execute the steps of the live cover image extraction method.
A computer-readable storage medium, which stores a computer program implemented according to the live cover image extraction method in the form of computer-readable instructions, and when the computer program is called by a computer, executes the steps included in the method.
A computer program product, provided to adapt to another object of the present application, comprises computer programs/instructions which, when executed by a processor, implement the steps of the method described in any of the embodiments of the present application.
Compared with the prior art, the present application extracts a plurality of face region images from the live video stream, synchronously determines for each face region image a plurality of evaluation scores corresponding to a plurality of evaluation dimensions, summarizes the evaluation scores of each face region image into a comprehensive score, and screens the face region images according to the comprehensive scores, so that high-quality face region images are determined for producing the cover image of the live video stream. Because the quality of the face region images is measured comprehensively across a plurality of evaluation dimensions, high-quality face region images can be determined effectively, the resulting cover image is more representative and carries the image information of the anchor user of the live video stream, and the live cover constructed with it can attract larger user traffic, achieving effective promotion and activating interaction on the webcast platform.
Drawings
The above and/or additional aspects and advantages of the present application will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is an exemplary network architecture adopted for a webcast service in a webcast scenario of the present application;
FIG. 2 is a schematic flow chart diagram of an embodiment of a live cover image extraction method of the present application;
fig. 3 is a schematic flowchart of a process of extracting a face region image in an embodiment of the present application;
FIG. 4 is a schematic flow chart illustrating a process of determining a composite score using a portrait scoring model according to an embodiment of the present application;
FIG. 5 is an exemplary network architecture for a portrait scoring model of the present application;
FIG. 6 is a schematic flow chart of training a face scoring model in an embodiment of the present application;
FIG. 7 is a schematic illustration of a process for setting a cover image by a host user in an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating a process for automatically setting a cover image according to an embodiment of the present application;
fig. 9 is a schematic block diagram of a live cover image extraction apparatus of the present application;
fig. 10 is a schematic structural diagram of an electronic device used in the present application.
Detailed Description
Referring to fig. 1, an exemplary network architecture adopted by the application scenario in the present application includes a terminal device 80, a media server 81, and an application server 82. The terminal device 80 may be configured to run a live broadcast room terminal program, so that a main broadcast user or a viewer user can use a live broadcast room function, for example, the main broadcast user uploads a live broadcast video stream to the media server 81 through the terminal device 80 of the main broadcast user, or pushes the live broadcast video stream of a target user to the terminal device 80 of the viewer user through the media server 81 for playing, and the like. The media server 81 is mainly responsible for pushing the live video stream of each anchor user to the live room of each anchor user. The application server 82 may be used to deploy webcast services to maintain live-room-based interactions between anchor users and audience users.
The live cover image extraction method of the present application can be implemented as a computer program product running in the media server 81, the application server 82, the terminal device 80 or any other device; by running the computer program product, all steps of the method are executed and its technical scheme is realized, so that, from the face region images extracted from the live video stream of the anchor user, high-quality face region images suitable for producing the cover image of the live video stream are selected, and the live cover packaged with the cover image can serve as an effective promotion entrance of the anchor user's live broadcast room.
Referring to fig. 2, based on the above exemplary scenario and related principle description, the live cover image extraction method of the present application, in one embodiment thereof, includes the following steps:
step S1100, acquiring a plurality of face area images from a live video stream;
in an exemplary webcast application scenario of the present application, the live video stream may be a live video stream generated in real time by a anchor user performing a live activity in a live broadcast room, or may be a historical live video stream stored in a related database after the anchor user has completed the live activity.
Depending on the device on which the computer program product implementing this method is deployed, the live video stream may be acquired from the terminal device of the anchor user, acquired from the media server, or acquired, at the terminal device of an audience user, from the stream pushed by the media server.
The live video stream is composed of a plurality of image frames, the image frames are stored in an image space of corresponding equipment, each image frame in the image space can be obtained from the image space for detection and processing, an image frame carrying face content is determined, and a corresponding face area image is extracted from the image frame.
In one embodiment, a face detection model is used to perform face detection on each image frame: the image frame is input into a deep-learning-based face detection model, which predicts a selection frame for any face image in the frame; when a selection frame is obtained, the image frame is determined to contain a face, and the image content of the part corresponding to the selection frame can be intercepted from the image frame to serve as the face region image. The face detection model, in one embodiment, may be implemented with a mature YOLO-series model.
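As an illustrative sketch only (not the exact implementation of this application), the detection-and-interception step could look as follows, assuming a hypothetical detector wrapper that returns selection frames as (x1, y1, x2, y2) together with confidences:

    CONFIDENCE_THRESHOLD = 0.5  # assumed value; the application only requires "a preset threshold"

    def extract_face_regions(image_frame, detector):
        """Run a YOLO-family face detector on one frame and intercept trusted selection frames."""
        regions = []
        for (x1, y1, x2, y2), confidence in detector.detect(image_frame):  # hypothetical API
            if confidence < CONFIDENCE_THRESHOLD:
                continue  # selection frame is not trustworthy enough
            regions.append(image_frame[y1:y2, x1:x2].copy())
        return regions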
In one embodiment, the face region image extracted from the image frame generally contains original image information, including not only the portrait of the person therein, but also other background images. In other embodiments, it is also allowed to perform image segmentation on the face region image by means of an image segmentation model, and only the image content of the portrait of the person in the face region image is reserved.
In some embodiments, when a plurality of face region images need to be determined from a live video stream, a plurality of image frames in the live video stream may be detected in batch to determine the plurality of face region images, and then the processing of the subsequent steps may be performed. In other embodiments, a single image frame in the live video stream may be dynamically acquired each time to perform detection, and a face region image therein is determined to perform subsequent processing, and such loop iteration is performed to realize real-time detection of each image frame in the live video stream and gradually accumulate a plurality of face region images.
Step S1200, synchronously determining a plurality of evaluation scores of a plurality of evaluation dimensions corresponding to each face region image, and summarizing the plurality of evaluation scores into a comprehensive score of the corresponding face region image;
for any of the face region images, a plurality of evaluation scores corresponding to a plurality of evaluation dimensions may be determined. The plurality of evaluation dimensions can be planned and set in advance, and the evaluation can be realized by designing a corresponding human face image evaluation system and providing the plurality of evaluation dimensions in the evaluation system. And each evaluation score corresponding to all the evaluation dimensions is determined in batch at one time among the evaluation dimensions in a synchronous mode. The evaluation scores corresponding to the evaluation dimensions obtained by each face region image can be summarized into the same comprehensive score, so that the quality of each corresponding face region image can be conveniently measured on the whole.
In one embodiment, for each evaluation score obtained from the same face region image, all evaluation scores may be averaged to serve as a composite score corresponding to the face region image. In another embodiment, the evaluation scores of the same face region image can be weighted and fused to obtain a corresponding comprehensive score. The above embodiments can be flexibly selected by those skilled in the art.
In an embodiment, a separate deep learning model may be trained for each evaluation dimension to predict the corresponding evaluation score; when the evaluation scores of all evaluation dimensions need to be determined synchronously, a thread is activated for each evaluation dimension and each thread invokes the deep learning model of its dimension, so that the evaluation scores of all evaluation dimensions are determined synchronously by multithreading.
In another embodiment, a single deep learning model integrating the scoring functions of all evaluation dimensions may be trained to a convergence state and used as the portrait scoring model: image feature information is extracted from the input face region image by a shared backbone network and then passed through the classifiers corresponding to the evaluation dimensions to predict their respective evaluation scores, so that the evaluation scores of all evaluation dimensions of the same face region image are determined synchronously.
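A minimal sketch of such a shared-backbone portrait scoring model in PyTorch, assuming a ResNet-18 image encoder and one Sigmoid binary head per evaluation dimension (the backbone choice, feature size and number of dimensions are illustrative assumptions, not fixed by this application):

    import torch
    import torch.nn as nn
    from torchvision import models

    class PortraitScoringModel(nn.Module):
        def __init__(self, num_dimensions=7, feature_dim=512):
            super().__init__()
            backbone = models.resnet18(weights=None)   # shared image encoder
            backbone.fc = nn.Identity()                # keep the 512-d image feature information
            self.encoder = backbone
            # one independent binary classifier per evaluation dimension
            self.heads = nn.ModuleList(
                [nn.Linear(feature_dim, 1) for _ in range(num_dimensions)]
            )

        def forward(self, face_region_image):
            features = self.encoder(face_region_image)
            # each head outputs the "yes" probability as the score of its evaluation dimension
            return torch.cat(
                [torch.sigmoid(head(features)) for head in self.heads], dim=1
            )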
The face image evaluation system can set corresponding evaluation classification according to the imaging quality characteristics, the face content characteristics and the like of the face region image, namely, set the classification of the imaging quality characteristics for evaluating the face region image and the classification of the face content characteristics for evaluating the face region image, and set corresponding evaluation dimensionality for each evaluation classification. For example:
the classification corresponding to the imaging image quality characteristics can comprise a first dimension for representing whether the face image belongs to a real person, a second dimension for representing whether the face image is a big photo, a third dimension for representing whether the face image is a high-quality image and the like.
The classification corresponding to the face content features may include a fourth dimension representing whether the face image is complete, a fifth dimension representing whether the face image is covered, a sixth dimension representing whether the face image belongs to the front face of the face, a seventh dimension representing whether the face image includes a smile expression, and the like.
The above is only one planning example for illustrating a facial image evaluation system. In practice, any number of multiple evaluation dimensions can be set according to the two categories of the imaging image quality features and the human face content features, and meanwhile, each part of the evaluation dimensions in the categories corresponding to the imaging image quality features and the human face content features is selected to form multiple evaluation dimensions required by the application for evaluating the human face area image.
According to the planning example of the facial image evaluation system, the evaluation dimensionality is jointly set from the two aspects of imaging image quality and facial content, so that whether the quality of the facial area image meets the composition condition of the cover image can be more completely measured.
The face image evaluation system in fact constitutes the algorithmic principle of an aesthetic-degree recognition algorithm: by implementing this algorithm, a comprehensive score is determined for the face region image, which effectively measures both the image quality of the face region image and the attention it can draw from the audience, so the comprehensive score can essentially be understood as an aesthetic degree. The aesthetic degree defined here is determined jointly from the evaluation scores of a plurality of evaluation dimensions in the pre-planned face image evaluation system; compared with using a single evaluation dimension, it is more meaningful, provides a comprehensive and systematic evaluation of the face region image, and represents the real value of the corresponding face region image for producing a cover image.
And step S1300, screening out a face area image according to the comprehensive score for determining a cover image of the live video stream.
Through the process, the corresponding comprehensive score can be obtained for each face region image, and each face region image can be optimized according to the comprehensive score and needs, so that the face region image with the higher comprehensive score is screened out and used as a high-quality face region image for determining the cover image of the live video stream.
The plurality of face region images for which the comprehensive score has been determined may be screened in a variety of ways. In one embodiment, the face region images are sorted in descending order of their comprehensive scores, and a preset number of top-ranked face region images are selected as high-quality face region images. In another embodiment, the comprehensive score of each face region image is compared with a preset score threshold; when the comprehensive score is higher than the threshold, the corresponding face region image is taken as a high-quality face region image, otherwise it may be discarded.
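Both screening strategies reduce to a few lines; the sketch below assumes each face region image is paired with its comprehensive score, and the threshold and count are illustrative values only:

    def screen_face_images(scored_images, top_k=5, score_threshold=None):
        """scored_images: list of (comprehensive_score, face_region_image) pairs."""
        if score_threshold is not None:
            # threshold-based screening
            return [img for score, img in scored_images if score > score_threshold]
        # sort in descending order of comprehensive score and keep the top-k high-quality images
        scored_images = sorted(scored_images, key=lambda pair: pair[0], reverse=True)
        return [img for _, img in scored_images[:top_k]]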
In some embodiments, the high-quality face region image may also be selected to be directly used as a cover image of the live video stream for generating a live cover of the live video stream. In other embodiments, after selecting from the high-quality face area images, a screenshot adapted to the size specification of a live broadcast cover can be cut out from the source image frame by referring to the source image frame of the high-quality face area images, and then the screenshot is used as a cover image for generating the live broadcast cover of the live broadcast video stream. The cover image generated according to the above embodiments may be used as a promotion entry for a corresponding live room that produces the live video stream, for example, to be published to a promotion page of a live room application to attract more user traffic to the live room.
According to the above embodiment, a plurality of face region images are extracted from the live video stream, a plurality of evaluation scores corresponding to a plurality of evaluation dimensions are synchronously determined for each face region image and summarized into a comprehensive score, and the face region images are screened according to the comprehensive scores so that high-quality face region images are determined for producing the cover image of the live video stream. Because the quality of the face region images is measured comprehensively and systematically across a plurality of evaluation dimensions, high-quality face region images can be determined effectively, the produced cover image is more representative and carries the image information of the anchor user of the live video stream, and the live cover constructed in this way can attract larger user traffic, achieving effective promotion and activating interaction on the webcast platform.
On the basis of any embodiment of the present application, please refer to fig. 3, which is a method for acquiring a plurality of face region images from a live video stream, including:
step S1110, performing face detection based on the image frames of the live video stream, and determining a plurality of target image frames carrying face content and selection frames corresponding to face areas;
When face detection is performed on the image frames of the live video stream, in one embodiment a face detection model is first adopted to detect faces in each image frame: the model identifies the face content in the image frame and outputs the coordinate information and confidence of the corresponding selection frame. Whether a selection frame is trustworthy is judged by whether its confidence is higher than a preset threshold; untrustworthy selection frames are ignored, and only for trustworthy selection frames is the image frame determined to contain face content and treated as a target image frame. After face detection is performed on each image frame of the live video stream, a plurality of target image frames carrying face content are obtained, and the face detection model gives the coordinate information of the corresponding selection frame according to the area where the face content is located. The coordinate information of a selection frame is represented, for example, as (x1, y1, x2, y2), so that the selection frame, that is, the position of the face region in the image frame, can be determined from the coordinate information.
Step S1120, extracting a region image from the corresponding target image frame according to the selection frame;
In order to make the comprehensive scoring of the present application more accurate, the corresponding region image can be extracted from each target image frame according to its selection frame. The region image is the image range where the face content is located; removing the other image content of the target image frame and keeping only the corresponding region image reduces the interference of stray information such as the background, ensuring a relatively accurate scoring effect.
Step S1130 clusters the face feature vectors of the plurality of region images, determines a maximum cluster class from the plurality of cluster classes obtained by clustering, and takes a region image in the maximum cluster class as a face region image.
For each regional image, the image features of each regional image can be extracted by an image encoder trained to a convergence state in advance, so as to obtain the corresponding face feature vector. The image encoder can adopt a deep learning network constructed based on a convolutional neural network, and perform classification task training by adopting sufficient training samples, so that the classification task training is carried out until convergence, and the feature representation capability of extracting deep semantic information from the region image representing the human face content to obtain the human face feature vector is obtained. The choice of the image encoder can be flexibly determined by those skilled in the art, including but not limited to Convolutional Neural Network (CNN), residual Network (Resnet), and the like.
Further, the face feature vectors of the region images are used as sample points, and any feasible clustering algorithm is applied to cluster all the face feature vectors of all the region images into a plurality of cluster classes, each containing a number of sample points. The largest cluster class, i.e. the one with the most sample points, is then determined as the cluster that best expresses the face information of the real-person anchor of the live video stream: the real-person anchor is usually the actual host at the live broadcast site, while the other clusters may correspond to the face images of bystanders or collaborators at the site. Accordingly, the region images belonging to the largest cluster are used as the face region images required for subsequent processing.
The clustering algorithm therefore plays a double role here: it determines all effective region images that most effectively represent the face information of the anchor user, and it effectively excludes the interfering face images of other people who are not the host. This has a very positive effect on subsequently using these region images to produce the cover image of the live video stream, ensuring that the cover image presents the face content of the real anchor user.
The clustering algorithm used in the present application, including but not limited to k-means clustering algorithm, hierarchical clustering algorithm, SOM clustering algorithm or FCM clustering algorithm, can be flexibly selected by those skilled in the art.
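Taking k-means as one possible choice, a minimal sketch of this clustering step (the number of clusters is an assumed hyperparameter) is:

    import numpy as np
    from sklearn.cluster import KMeans

    def faces_of_largest_cluster(face_feature_vectors, region_images, num_clusters=3):
        """Cluster the face feature vectors and keep the region images of the largest cluster."""
        labels = KMeans(n_clusters=num_clusters, n_init=10).fit_predict(
            np.asarray(face_feature_vectors)
        )
        largest_cluster = np.bincount(labels).argmax()   # cluster class with the most sample points
        return [img for img, label in zip(region_images, labels) if label == largest_cluster]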
According to the above embodiment, face detection is performed on the image frames of the live video stream to extract region images containing face content, a clustering algorithm is then applied to the face feature vectors of the face content in these region images, and the region images of the largest cluster are determined to be face images of the real person of the anchor user. The real identity of the anchor user is thus located accurately and the interference of other people's face images is effectively removed; compared with simply taking face images from the live video stream without considering real identity, this obviously has the technical advantage that the constructed cover image better carries the identity information of the anchor user, which has a strongly positive effect on the promotion of the live broadcast room.
On the basis of any embodiment of the present application, extracting a region image from a corresponding target image frame according to the selection frame includes:
step S1121, extracting a corresponding region image from the corresponding target image frame according to the selection frame, and performing super-resolution enhancement processing on the region image;
a single face may exist in a target image frame, or a plurality of faces may exist in the target image frame, and a selection frame of a corresponding face can be generally obtained through prediction of a face detection model. In some cases, for example, when a plurality of persons appear in an image frame, a plurality of selection boxes may appear in the same target image frame, and for this case, corresponding region images may be extracted from the same target image respectively corresponding to the selection boxes.
After extracting the corresponding area image from the target image frame, the super-resolution enhancement processing may be further performed on each area image by using a super-resolution enhancement model, and the corresponding area image is enlarged to a larger scale, for example, 2 times of the original image, so that the area image is clearer, and a more accurate feature representation effect may be achieved when the face feature vector is subsequently extracted, or when the image feature information needs to be extracted when scoring of multiple evaluation dimensions is performed.
And step S1122, performing posture correction processing on the human face content in the area image subjected to the super-resolution enhancement processing.
Further, pose correction processing may be performed on the face content in each region image. To this end, in one embodiment, the key points of the face image are detected by a face key-point detection model, for example the coordinates of five facial points: the left-eye center, the right-eye center, the nose tip, the left mouth corner and the right mouth corner, giving a face key-point mesh. A similarity transformation matrix is then obtained by comparison with a standard face mesh, and this matrix is applied to the region image so that the face content in it is corrected to the standard upright pose. When the face feature vectors of the pose-corrected region images are subsequently extracted, or when image feature information needs to be extracted for scoring the multiple evaluation dimensions, a more accurate feature representation can likewise be achieved.
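A sketch of this pose correction, assuming the five key points have already been detected by the face key-point detection model; the standard face coordinates and the 112 × 112 output size are illustrative assumptions, and OpenCV's similarity-transform estimation is used:

    import cv2
    import numpy as np

    # assumed standard positions of the five facial key points in a 112x112 aligned face
    STANDARD_FACE_POINTS = np.float32([
        [38.3, 51.7],   # left-eye center
        [73.5, 51.5],   # right-eye center
        [56.0, 71.7],   # nose tip
        [41.5, 92.4],   # left mouth corner
        [70.7, 92.2],   # right mouth corner
    ])

    def correct_pose(region_image, detected_points, output_size=(112, 112)):
        """Estimate a similarity transform to the standard mesh and warp the region image."""
        matrix, _ = cv2.estimateAffinePartial2D(np.float32(detected_points),
                                                STANDARD_FACE_POINTS)
        return cv2.warpAffine(region_image, matrix, output_size)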
According to the embodiments, in the process of extracting the region image corresponding to the face content in the target image frame according to the selection frame, in consideration of the subsequent requirement for feature representation of the face content, the region image is optimized by means of super-resolution enhancement, posture correction and the like, so that the face feature vector and the image feature information obtained by feature representation can be more accurate, and various subsequent technical means to be realized, such as clustering processing, scoring processing and the like, can obtain reliable basic data and obtain accurate results.
On the basis of any embodiment of the application, the super-resolution enhancement model of the application can be selected and trained in advance so as to serve links needing to carry out super-resolution enhancement processing on images in the application. The super-resolution enhancement model can be implemented by an RRDBNet model disclosed in the industry or other similar models. Then, the training is carried out according to the following training process:
First, a high-definition face picture data set is collected; the industry-public data set FFHQ may be used, which contains 70,000 high-definition face images, each with a resolution of 1024 × 1024. All pictures can be down-sampled to 512 × 512 for subsequent use in training.
Then, training sample pairs are generated. Gaussian noise of random intensity and Gaussian blur are added to the face images obtained in the previous step, followed by 2x down-sampling, to obtain low-resolution face images; each low-resolution face image and its high-resolution original form a training sample pair.
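As an illustrative sketch of generating one training sample pair under this description (the noise intensity range and blur parameters are assumptions; the 2x down-sampling follows the text):

    import cv2
    import numpy as np

    def make_training_pair(hr_image):
        """Build a (low-resolution input, high-resolution target) pair from one 512x512 face."""
        noise_sigma = np.random.uniform(1.0, 10.0)   # Gaussian noise of random intensity
        noisy = hr_image.astype(np.float32) + np.random.normal(0, noise_sigma, hr_image.shape)
        blurred = cv2.GaussianBlur(noisy, (5, 5), sigmaX=np.random.uniform(0.5, 2.0))
        h, w = hr_image.shape[:2]
        lr_image = cv2.resize(blurred, (w // 2, h // 2), interpolation=cv2.INTER_AREA)  # 2x down-sampling
        return np.clip(lr_image, 0, 255).astype(np.uint8), hr_image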
Finally, the training sample pairs are used to train the super-resolution enhancement model: during training, the low-resolution face image of a sample pair is taken as input, the corresponding high-resolution original is used as supervision to calculate the loss value of the model's prediction, and several loss functions may be used, the selectable types including:
1. L1-norm loss: Loss_L1 = |pred - gt|, where pred is the model prediction result and gt is the high-resolution original image.
2. Perceptual loss:
Loss_Perceptual = Σ_{i=1..N} ||φ_i(gt) - φ_i(pred)||
wherein φ is a pre-training model disclosed in the industry, φ_i(gt) denotes the feature map of the i-th layer of the model network obtained when gt is taken as the input of φ, and φ_i(pred) denotes the feature map of the i-th layer obtained when pred is taken as the input of φ; the loss function is calculated over the feature maps of layers 1 through N. Loss_Perceptual is used to guide semantic consistency between the super-resolved image and the original image, so that the super-resolution effect is real and natural.
3. Generative adversarial loss: denoted Loss_GAN, comprising Loss_D and Loss_G, where Loss_D = (D(gt) - 1)^2 + (D(pred) - 0)^2 and Loss_G = (D(pred) - 1)^2. Loss_GAN follows the generative adversarial training method, in which D is a discriminator; this embodiment does not limit the model structure of D. Its input is an image (gt or pred) and its output is a single-channel feature map, each value of which represents the prediction result for the corresponding region of the original image; ideally, if the input is gt, the values of the feature map should all be 1 (representing a real image), otherwise 0 (representing an algorithm-generated image). Loss_D serves to train the discriminator D to distinguish between real images and algorithm-generated images, while Loss_G serves to train the image super-resolution enhancement model so that the predicted image is as close as possible to the real image, thereby fooling the discriminator D so that it cannot tell them apart.
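Combining the three terms, the generator-side training loss could be assembled as in the following PyTorch-style sketch; the perceptual network interface (assumed to return a list of feature maps for layers 1..N) and the weighting factors are assumptions, not values fixed by this application:

    import torch
    import torch.nn.functional as F

    def generator_loss(pred, gt, perceptual_net, discriminator,
                       w_l1=1.0, w_percep=1.0, w_gan=0.1):        # assumed weights
        loss_l1 = F.l1_loss(pred, gt)                              # Loss_L1 = |pred - gt|
        # perceptual loss summed over the feature maps of layers 1..N
        loss_percep = sum(F.l1_loss(fp, fg)
                          for fp, fg in zip(perceptual_net(pred), perceptual_net(gt)))
        loss_gan = torch.mean((discriminator(pred) - 1) ** 2)      # Loss_G = (D(pred) - 1)^2
        return w_l1 * loss_l1 + w_percep * loss_percep + w_gan * loss_gan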
The super-resolution enhancement model is trained according to the process, and after the super-resolution enhancement model is trained to a convergence state, the super-resolution enhancement model can be used in an online reasoning stage, so that the super-resolution enhancement model can perform super-resolution enhancement processing on images of all corresponding links in the application, such as the area image, the face area image, the image frame and the like, as required, and the original image is amplified and enhanced.
On the basis of any embodiment of the present application, please refer to fig. 4, determining a plurality of evaluation scores corresponding to a plurality of evaluation dimensions for each face region image synchronously, and summarizing the plurality of evaluation scores into a comprehensive score for the corresponding face region image, includes:
step S1210, inputting each face region image into an image encoder of a face scoring model to extract image characteristic information of the face region image;
in this embodiment, based on the deep learning principle, a face scoring model is implemented in advance, and is used to perform a scoring operation on a face region image. As shown in fig. 5, the portrait scoring model includes an image encoder and a plurality of classifiers set corresponding to evaluation dimensions of the evaluation to be evaluated, and each classifier is connected to the image encoder to form a shunting set. The image encoder is used for extracting deep semantic features of the face region image input into the image encoder to obtain corresponding image feature information. Each classifier is used for mapping the image characteristic information to a classification space thereof, and obtaining a classification probability corresponding to each category in the classification space as a classification result. And after carrying out iterative training to a convergence state by adopting a corresponding training sample in advance, the portrait scoring model is used in an online reasoning stage of the application to obtain a face region image according to input, and determine the evaluation score capability of the face region image in the corresponding evaluation dimension of each classifier.
Similarly, the image encoder may be a deep learning Network constructed based on a Convolutional Neural Network, and the selection type of the deep learning Network can be flexibly determined by a person skilled in the art, including but not limited to a Convolutional Neural Network (CNN), a Residual Network (Resnet), and the like.
In an embodiment, the image encoder obtained through the face score model training may also be configured to generate a corresponding face feature vector for a face region image of the present application, where the face feature vector may be a high-dimensional vector obtained by mapping the image feature information to a high-dimensional space after the image encoder extracts corresponding image feature information from the face region image.
Therefore, after a face region image is input into the portrait scoring model, an image encoder in the portrait scoring model performs convolution operation on the portrait region image, deep semantic features in the portrait region image are extracted, and corresponding image feature information is obtained.
Step S1220, synchronously passing the image characteristic information through a plurality of classifiers which are arranged in the portrait scoring model and correspond to a plurality of evaluation dimensions, and determining an evaluation score corresponding to each evaluation dimension;
and after the image characteristic information of the face region image is obtained by the image encoder, synchronously inputting the image characteristic information into each classifier, performing classification mapping on each classifier through one or more fully-connected layers, mapping the image characteristic information into a classification space of a corresponding output layer, and obtaining the classification probability corresponding to each category in the classification space.
In this embodiment, it is considered that each evaluation dimension is based on boolean logic to set an evaluation criterion, and therefore, each classifier may adopt a two-classifier or a Sigmoid function structure, and thus, the classification probability corresponding to the yes or no judgment state can be calculated by the Sigmoid function. The classification probability corresponding to the representation 'yes' can be directly used as the evaluation score of the corresponding evaluation dimension to represent the degree that the corresponding face region image is the content represented by the corresponding evaluation dimension. It is understood that the classification probabilities output by the classifiers are normalized to the same value interval, and therefore the corresponding classification probabilities are unified to the same dimension.
As can be understood from the above description, after a face region image is input into the face evaluation model, the corresponding classification probability of each face region image can be obtained as the evaluation score of the corresponding evaluation dimension by the assigned class of the classifier corresponding to each evaluation dimension. And determining the corresponding evaluation score of each face region image under each evaluation dimension.
Step S1230, fusing the plurality of evaluation scores corresponding to each face region image into a comprehensive score of the corresponding face region image;
and for each face region image, fusing all evaluation scores of the face region images to obtain a corresponding comprehensive score. In this embodiment, the comprehensive Score of each face region image may be determined according to the following formula i
Score i =α 1 *Value 12 *Value 2 +…+α n *Value n
Wherein i represents a specific face region image, n represents a specific evaluation dimension, value n Representing the corresponding rating score, α, of the rating dimension n n The preset weight representing the evaluation dimension n can be flexibly set by a person skilled in the art, for example, the weight of the evaluation dimension corresponding to the imaging image quality feature is set to be higher than the weights of other evaluation dimensions.
In one embodiment, α 12 +…+α n =1, whereby the composite score can be normalized to [0,1]The numerical value interval of (2) shows the degree of the comprehensive score more intuitively.
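As a minimal illustration of this weighted fusion (the seven scores and weights below are made-up values; the weights sum to 1 as in the normalized embodiment):

    def comprehensive_score(evaluation_scores, weights):
        """evaluation_scores and weights are equal-length sequences; weights sum to 1."""
        return sum(a * v for a, v in zip(weights, evaluation_scores))

    # example: seven dimensions, imaging-quality dimensions weighted slightly higher (assumed values)
    score = comprehensive_score([0.9, 0.8, 0.7, 1.0, 1.0, 0.6, 0.5],
                                [0.2, 0.2, 0.2, 0.1, 0.1, 0.1, 0.1])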
According to the above embodiment, the evaluation scores of all evaluation dimensions are predicted synchronously for the same face region image by the portrait scoring model and then fused into the corresponding comprehensive score, which is very efficient; and because the portrait scoring model predicts every evaluation dimension from the deep semantic features of the face region image, it can reason deeply about the image content, making the predicted evaluation scores more accurate. In addition, in determining the comprehensive score of each face region image, a technical means of flexibly setting the weight of each evaluation dimension's score is provided, so that the scoring of face region images can adapt, to a certain degree, to changes in the evaluation criteria and provide a comprehensive score under the corresponding criteria, which facilitates the subsequent accurate screening of face region images of expected quality.
On the basis of any embodiment of the present application, referring to fig. 6, before acquiring a plurality of face region images from a live video stream, the method includes:
step S2100, calling training samples in a training data set to input the portrait scoring model, predicting classification results of the portrait scoring model mapped to classifiers corresponding to the multiple evaluation dimensions, wherein part of the training samples are face region images;
the face scoring model of the present application may be pre-trained to converge so that it is suitable for determining the score of each evaluation dimension for the face region images. To this end, a training data set is prepared, which contains a large number of training samples sufficient to train the face scoring model to convergence, and a plurality of supervised labels are associated with each training sample.
The content of the training sample may be a face region image prepared according to the method adopted in the foregoing description of the present application, and each of the supervised labels thereof is provided corresponding to a corresponding evaluation dimension of each classifier of the face scoring model, so that each supervised label may be used for corresponding to a classification result of one classifier to calculate a classification loss value.
In order to facilitate the training of the negative case supervision model, part of the training samples may not belong to the face region image, and accordingly, a supervision label representing a negative case is provided for the training samples.
When iterative training is carried out on the portrait evaluation model, a training sample is called for each iteration and is input into the portrait evaluation model, image characteristic information is extracted by an image encoder in the portrait evaluation model and is output to each classifier in a routing mode, and each classifier carries out classification mapping according to the image characteristic information to obtain a corresponding classification result.
Step S2200, adopting a plurality of supervision labels provided by a plurality of corresponding evaluation dimensions mapped with the training sample, and correspondingly calculating the classification loss value of each classification result;
as an example, the classifier in the portrait evaluation model of this embodiment may be set to seven correspondingly according to a preset evaluation system, and the seven corresponding evaluation dimensions are respectively set to: the method comprises the steps of representing whether a face image in a training sample belongs to a first dimension of a real person, representing whether the face image in the training sample is a second dimension of a large head photo, representing whether the face image in the training sample is a third dimension of a high-quality image, representing whether the face image in the training sample is a complete fourth dimension, representing whether the face image in the training sample is a covered fifth dimension, representing whether the face image in the training sample belongs to a sixth dimension of the front face of the face, and representing whether the face image in the training sample contains a seventh dimension of a smile expression. Accordingly, it is understood that there are seven supervision labels provided for the training samples, and whether the corresponding training sample belongs to the content represented by the corresponding evaluation dimension is represented in the form of a binarization label. For example, for the surveillance label corresponding to the first dimension, when the corresponding training sample belongs to a face image of a real person, it may be represented as [1,0], and when the corresponding training sample does not belong to a face image of a real person or does not belong to a face image, it may be represented as [0,1], so as to play a role of the surveillance label.
In practical use, the portrait scoring model can set a corresponding number of classifiers according to the number of evaluation dimensions required to be adopted in an evaluation system, so that when supervision labels are provided for training samples, the corresponding number of supervision labels can be set corresponding to the evaluation dimensions. The generation of the supervision label can be a result of manual labeling, and can also be a result of identification and determination in the raw material for preparing the training sample. The method can be specifically manufactured as follows:
for the first dimension supervision label, whether a corresponding training sample belongs to a real person or not can be represented, a real person image and a non-real person image can be collected as the training sample, automatic marking is carried out according to the collection type to determine that the supervision label is used for training, for example, the real person label is 1, the non-real person label is 0, during training, the label is predicted by a corresponding classifier, and then the supervision label is used for calculating a classification loss value, wherein the non-real person image mainly comprises images obtained by scenes such as an unmanned background, a quadratic element, a movie, a synthesis, a game, sports class and animation.
For the supervision label of the second dimension, which characterizes whether the corresponding training sample is a large head photo, the label is predicted by the corresponding classifier during training, for example the non-large-head-photo label being 1 and the large-head-photo label being 0. When producing the training samples, the detection frame of the face region in the training sample can be output by the face detection model and compared with the full picture to obtain the face area ratio; if the ratio is larger than a preset threshold, for example 0.5, the supervision label is set to the label corresponding to a large head photo, otherwise it is set to the non-large-head-photo label, thus completing the automatic marking.
For the supervision label of the third dimension, which characterizes whether the corresponding training sample belongs to a high-quality image, the judgment is made according to image sharpness: when the training sample image is sharp, the supervision label is set to the clear class 1, otherwise to the blurry class 0. When producing a training sample, its image is converted to gray scale and convolved with a specified Laplacian kernel, and the variance of the response is calculated; if the variance is larger than a specified threshold of 5000, the clear label is output, otherwise the blurry label is output, thus completing the automatic marking.
For the supervision label of the fourth dimension, which characterizes whether the corresponding training sample contains a complete face, the label can be set accordingly, for example the complete label 1 when the face is complete and the missing label 0 otherwise. By calculating the aspect ratio of the detection frame obtained from the face detection described above, the supervision label may be set to the missing label when the aspect ratio is smaller than a threshold, for example 0.50, and to the complete label otherwise, thereby completing the automatic labeling.
For the supervision label of the fifth dimension, which characterizes whether the face image in the corresponding training sample is covered by other images such as special effects, the label may be set to the normal label 1 when the face is substantially uncovered and to the abnormal label 0 when it is covered over a large area. To set this supervision label, the area of the special-effect region added during broadcasting is calculated by a preset special-effect detection algorithm and divided by the area of the original sample image to obtain the special-effect ratio; if this ratio is larger than a preset threshold, for example 0.3, the label is set to the abnormal label, otherwise to the normal label, thereby achieving automatic labeling of the training sample.
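A sketch of that rule is given below. The special-effect detector itself is not specified in the text, so detect_effect_mask is a hypothetical callable standing in for the preset special-effect detection algorithm; only the ratio-versus-threshold logic is taken from the description above.

import numpy as np

def coverage_label(frame, detect_effect_mask, ratio_threshold=0.3):
    mask = detect_effect_mask(frame)                       # boolean mask of special-effect pixels
    effect_ratio = float(np.count_nonzero(mask)) / mask.size
    return 0 if effect_ratio > ratio_threshold else 1      # 0 = abnormal (covered), 1 = normal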
For the supervision label of the sixth dimension, which characterizes whether the face pose in the corresponding training sample is frontal, that is, whether the face in the training sample is facing forward, recognition of the face angle can be taken into account. The training samples may therefore be collected according to whether the face is frontal, and the supervision label set according to the collection type or according to the output of a face pose detection model on the training sample: samples showing a frontal face receive the frontal label 1, and samples in a large-angle tilted pose receive the abnormal-angle label 0, thereby achieving automatic labeling.
For the supervision label of the seventh dimension, which characterizes whether the training sample contains a smiling expression, a facial expression recognition model can be used for the recognition: when a smiling expression is present the supervision label is set to the natural-state label 1, otherwise to the unnatural-state label 0, thereby achieving automatic labeling.
As the above examples of supervision-label generation show, the supervision labels corresponding to the different evaluation dimensions provide the training criteria, can be generated automatically and are inexpensive to produce, which greatly benefits the training of the portrait scoring model; under their supervision the model acquires the corresponding scoring capability.
After the portrait scoring model performs the classification mapping of a training sample in each evaluation dimension and obtains the classification probabilities over the classification space of each dimension, the class with the maximum classification probability is the class predicted by the corresponding classifier; the classification loss value of that classification result can then be calculated using the supervision label of the same evaluation dimension, for example with a cross-entropy loss function. Accordingly, each training sample yields one classification loss value per evaluation dimension.
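A minimal sketch of that per-dimension loss follows: each classifier's logits are compared with the binary supervision label of its dimension using cross entropy. For brevity the labels are stored here as class indices (0/1) rather than in the one-hot form [1,0]/[0,1] described above; the tensor shapes are illustrative assumptions.

import torch
import torch.nn.functional as F

def per_dimension_losses(logits_per_dim, labels_per_dim):
    # logits_per_dim: list of (B, 2) tensors, one per evaluation dimension
    # labels_per_dim: (B, num_dims) integer tensor of supervision labels
    return [
        F.cross_entropy(logits, labels_per_dim[:, d])
        for d, logits in enumerate(logits_per_dim)
    ]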
Step S2300, fusing the classification loss values into a total loss value, and performing gradient updates on the portrait scoring model according to the total loss value until the model is iteratively trained to a convergence state.
In order to control the iterative process of the portrait scoring model and implement its weight updates, the classification loss values of each training sample under the various evaluation dimensions may further be fused into a total loss value, the fusion being any of addition, averaging, weighted summation and the like. The total loss value is then compared with a target threshold that indicates whether the model has reached the convergence state: when the total loss value reaches the target threshold, the model has converged, its training can be terminated and it can enter the online inference phase; when it has not, the model has not converged, so a gradient update is performed according to the classification loss, the weight parameters of every part of the model are corrected through back-propagation so as to approach convergence, and the process repeats until the model is trained to the convergence state.
In other embodiments, a batch update mode may be applied instead of the per-sample update described above: the total loss values of the training samples in the same batch are aggregated into a comprehensive loss value, which replaces the total loss value in the comparison with the target threshold to decide whether the model has converged and whether its weights need updating, improving training efficiency and making the model more robust.
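The sketch below illustrates one such update step under assumptions: the per-dimension losses are fused by summation (one of the fusion options mentioned above) into a total loss that drives back-propagation, and training is considered converged once the fused loss reaches a target threshold. The optimizer, the threshold value and the loss_fn argument are assumptions made for the example.

import torch

def train_step(model, optimizer, images, labels, loss_fn, target_threshold=0.05):
    optimizer.zero_grad()
    logits_per_dim = model(images)
    losses = loss_fn(logits_per_dim, labels)     # e.g. per_dimension_losses from the sketch above
    total_loss = torch.stack(losses).sum()       # fusion by addition
    converged = total_loss.item() <= target_threshold
    if not converged:
        total_loss.backward()                    # back-propagate to correct the weight parameters
        optimizer.step()
    return total_loss.item(), converged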
With this embodiment, the portrait scoring model acquires, through a simple structure and an efficient training process, the ability to synchronously predict the evaluation scores of all evaluation dimensions for a face region image. It is well suited to batch processing of large numbers of face region images and can quickly determine the evaluation score of each image under each dimension, so it serves the scenario in which a webcast platform must process a large volume of live video streams: the quality of the face region images extracted from the streams can be expressed quantitatively and effectively, and the production efficiency of cover images for live covers that depend on those images is improved.
On the basis of any embodiment of the present application, please refer to fig. 7, screening out a face region image according to the comprehensive score for determining a cover image of the live video stream includes:
Step S3100, screening out a plurality of face region images with comprehensive scores higher than a preset threshold as target face images;
When a plurality of face region images have been determined from the live video stream of an anchor user, together with the comprehensive score of each, the face region images can be screened according to the comprehensive score so that the high-quality ones are selected as target face images for subsequent processing.
In this embodiment, each face region image whose comprehensive score is higher than the preset threshold is determined, by comparison against that threshold, to be a high-quality face region image and is used as a target face image for subsequent processing.
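As a small illustration, the screening step can be expressed as the filter below; the data layout and the 0.8 threshold are assumptions for the example.

def select_target_faces(face_images, composite_scores, score_threshold=0.8):
    # keep every face region image whose comprehensive score exceeds the preset threshold
    return [
        img for img, score in zip(face_images, composite_scores)
        if score > score_threshold
    ]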
Step S3200, cutting a screenshot containing the target face image from the source image frame of the target face image according to the size specification of a live cover;
The live cover to be produced by the webcast platform usually has a corresponding size specification, and on the basis of that specification a corresponding cover image can be determined from the target face image. For each target face image, its source image frame is determined, that is, the image frame from which that face region image was generated; the size specification is then positioned within the source image frame so that the corresponding face region is contained in the range framed by the specification, and a screenshot of that range is cut from the source image frame, the screenshot containing the corresponding target face image.
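A minimal sketch of such a crop is given below: the crop window of the cover size is centred on the face box and clamped to the frame boundary. Centring on the face and the 540x720 cover size are assumptions; the text only requires that the crop contain the face region, and the frame is assumed to be at least as large as the cover (or scaled beforehand, as discussed next).

def crop_cover(frame, face_box, cover_w=540, cover_h=720):
    # frame is an H x W x C image array; face_box = (x1, y1, x2, y2) in pixels
    frame_h, frame_w = frame.shape[:2]
    x1, y1, x2, y2 = face_box
    cx, cy = (x1 + x2) // 2, (y1 + y2) // 2                        # centre the crop on the face
    left = min(max(cx - cover_w // 2, 0), max(frame_w - cover_w, 0))
    top = min(max(cy - cover_h // 2, 0), max(frame_h - cover_h, 0))
    return frame[top:top + cover_h, left:left + cover_w]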
In one embodiment, the face image may exceed, or be much smaller than, the size specification of the live cover; in that case the source image frame may be scaled so that the face content appears within the range framed by the size specification at a suitable preset ratio, highlighting the face content.
In another embodiment, super-resolution enhancement processing may be performed on the source image frame to improve its image quality, and the size specification is then applied to cut the screenshot containing the target face image from the source image frame with improved quality.
Thus, for the plurality of target face images determined in the same live video stream, corresponding screenshots can be determined from the corresponding source image frames, yielding a plurality of screenshots whose face content has in every case been preferred on the basis of the scoring of the face region images.
Step S3300, pushing each screenshot to the terminal device of the anchor user to whom the live video stream belongs;
In order to achieve a better human-computer interaction effect, the screenshots determined in the above process can further be assembled into list data and pushed to the terminal device of the anchor user to whom the live video stream belongs, so that the anchor user can browse them, designate one of the screenshots and submit a cover setting request.
Step S3400, acquiring the screenshot designated by the anchor user and setting it as the cover image of the live video stream.
In response to the cover setting request, the screenshot designated by the anchor user can be obtained; this screenshot is the content that the anchor user wants to set as the cover image of his or her live video stream. The designated screenshot is therefore set as the cover image of the live video stream to form a live cover, which can serve as a promotion entrance to the anchor user's live room and be promoted according to the live-room recommendation logic of the webcast platform.
With this embodiment, once the high-quality face region images are determined, screenshots are cut intelligently according to the size specification of the live cover, the anchor user is offered the choice of one of them, and the screenshot selected by the anchor user is set as the cover image of the live video stream, completing the construction of the live cover. The human-computer interaction is more efficient, the anchor user is spared the trouble of producing a cover image manually, the generation efficiency of cover images and hence the production efficiency of live covers is improved, and the user experience of the webcast platform is markedly improved.
On the basis of any embodiment of the present application, please refer to fig. 8, screening out a face region image according to the comprehensive score for determining a cover image of the live video stream includes:
Step S4100, screening out the face region image with the highest comprehensive score as a target face image;
When a plurality of face region images have been determined from the live video stream of an anchor user, together with the comprehensive score of each, the face region images can be screened according to the comprehensive score so that the highest-quality one among them is selected as the target face image for subsequent processing.
In this embodiment, the face region images are sorted in descending order of comprehensive score, the face region image with the highest comprehensive score is determined to be the high-quality face region image, and it is used as the target face image for subsequent processing.
Step S4200, performing super-resolution enhancement on the target face image by a preset multiple to obtain an enhanced face image;
In order to improve the image quality of the target face image, this embodiment adopts a super-resolution enhancement model and subjects the target face image to super-resolution enhancement processing according to a preset multiple, for example 2 times, so as to obtain an enlarged, enhanced face image.
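The text does not name a particular super-resolution model, so the sketch below treats it as a hypothetical callable; the bicubic upscale is only a stand-in so the sketch runs without one, not a claim about the actual enhancement used.

import cv2

def enhance(image, sr_model=None, scale=2):
    if sr_model is not None:
        return sr_model(image, scale)            # e.g. an ESRGAN-style upscaler (assumed interface)
    h, w = image.shape[:2]
    # stand-in: plain bicubic upscaling by the preset multiple
    return cv2.resize(image, (w * scale, h * scale), interpolation=cv2.INTER_CUBIC)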
In one embodiment, before the super-resolution enhancement is performed on the target face image, a pose correction operation may be applied to it as described above if its face pose is not frontal.
Step S4300, performing the super-resolution enhancement of the preset multiple on the source image frame of the target face image to obtain an enhanced source image frame;
Considering that the image quality of the source image frame of the target face image should be kept consistent, the super-resolution enhancement model may also be applied to the source image frame in the same manner, so that according to the preset multiple the image quality of the source image frame is likewise improved and its scale enlarged.
Step S4400, overlaying and compositing the enhanced face image onto the enhanced source image frame according to the positional correspondence to obtain a high-quality image frame;
Because the region image was extracted from the source image frame of the target face image and then subjected to a series of processing steps, the enhanced face image may, after these various transformations, no longer correspond in size to the enhanced source image frame. In that case the enhanced face image can be inversely transformed according to the similarity transformation matrix determined when the face key-point detection model was applied to the source image frame, bringing it into correspondence with the face content in the enhanced source image frame, and then composited into the enhanced source image frame so that it becomes the foreground of that frame, covers the original face content in it, and turns the enhanced source image frame into a high-quality image frame.
In one embodiment, after the enhanced face image is obtained in step S4200, an image segmentation model is further used to segment it, obtaining a portrait mask and the person portrait determined by that mask. The portrait mask mainly indicates the regions corresponding to the face and the hair, so the resulting person portrait mainly comprises the person's face and hair. On this basis, the person portrait and the portrait mask are inversely transformed according to the similarity matrix so that the portrait corresponds in scale to the enhanced source image frame, and the portrait is overlaid directly into the enhanced source image frame according to the mask to obtain the corresponding high-quality image frame. The person portrait thereby becomes the foreground of the high-quality image frame, and because it is the result of dedicated super-resolution enhancement, the whole high-quality image frame appears high-definition, fine and natural.
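The compositing can be sketched as follows: the enhanced face (and, if available, its portrait mask) is warped back with the inverse of the alignment similarity transform and pasted over the enhanced source frame as the foreground. The 2x3 similarity matrix is assumed to come from the face key-point alignment step, and three-channel images are assumed; this is an illustration, not the exact procedure of the embodiment.

import cv2
import numpy as np

def composite_enhanced_face(enhanced_frame, enhanced_face, similarity_2x3, mask=None):
    h, w = enhanced_frame.shape[:2]
    inverse = cv2.invertAffineTransform(similarity_2x3)
    warped_face = cv2.warpAffine(enhanced_face, inverse, (w, h))
    if mask is None:
        region = np.any(warped_face > 0, axis=-1)                          # fallback: non-black pixels
    else:
        region = cv2.warpAffine(mask.astype(np.uint8), inverse, (w, h)) > 0
    out = enhanced_frame.copy()
    out[region] = warped_face[region]                                      # enhanced face becomes the foreground
    return out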
Step S4500, cutting a screenshot containing the target face image from the high-quality image frame according to the size specification of a live cover and setting it as the cover image of the live video stream.
The live cover to be produced by the webcast platform usually has a corresponding size specification, and on the basis of that specification the cover image can be cut directly from the high-quality image frame: the size specification is positioned within the frame so that the face content falls within the framed range, and a screenshot of that range, containing the corresponding face content, is taken from the high-quality image frame.
In one embodiment, the range of the face content may exceed, or be much smaller than, the size specification of the live cover; in that case the high-quality image frame may be scaled so that the face content appears within the range framed by the size specification at a suitable preset ratio, highlighting the face content.
After the screenshot is determined, it can be set directly as the cover image of the live video stream, constructing the corresponding live cover, which can serve as a promotion entrance to the anchor user's live room and be promoted according to the live-room recommendation logic of the webcast platform.
With this embodiment, the face region images determined in the live video stream are ranked by comprehensive score and the high-quality face region image with the highest score is selected as the target face image; the target face image is super-resolution enhanced and overlaid into the enhanced source image frame obtained by applying the same super-resolution enhancement to its source image frame; finally the region containing the face content is cut from the enhanced source image frame according to the size specification of the live cover and used as the cover image. The whole process requires no manual intervention and is very efficient when implemented automatically: the webcast platform can automatically and intelligently obtain the best-quality face image as the live video stream of a live room changes, automatically generate the cover image of the live video stream and dynamically update the live cover, reflect the activity of the people in the live video stream in real time through the cover, convey the activity information of the live room efficiently, and comprehensively improve the service experience of the platform.
Referring to fig. 9, an apparatus for extracting a live cover image adapted to one of the purposes of the present application includes a facial image acquisition module 1100, a multidimensional synchronization scoring module 1200, and a live cover screening module 1300, wherein the facial image acquisition module 1100 is configured to acquire a plurality of facial region images from a live video stream; the multi-dimensional synchronous scoring module 1200 is configured to synchronously determine a plurality of evaluation scores of each face region image corresponding to a plurality of evaluation dimensions, and summarize the plurality of evaluation scores into a comprehensive score of the corresponding face region image; the live broadcast cover screening module 1300 is configured to screen out face region images according to the comprehensive scores to determine cover images of the live broadcast video streams.
On the basis of any embodiment of the present application, the face image acquisition module 1100 includes: a face detection unit, configured to perform face detection on the image frames of the live video stream and determine a plurality of target image frames carrying face content together with the selection frames corresponding to the face regions; a region extraction unit, configured to extract a region image from the corresponding target image frame according to the selection frame; and a main-subject targeting unit, configured to cluster the face feature vectors of the plurality of region images, determine the largest cluster among the clusters obtained, and take the region images in the largest cluster as the face region images.
On the basis of any embodiment of the present application, the region extraction unit includes: a box cropping subunit, configured to extract the corresponding region image from the corresponding target image frame according to the selection frame and perform super-resolution enhancement processing on the region image; and a posture correction subunit, configured to perform posture correction processing on the face content in the region image after the super-resolution enhancement processing.
On the basis of any embodiment of the present application, the multidimensional synchronization scoring module 1200 includes: a feature extraction unit, configured to input each face region image into the image encoder of the portrait scoring model to extract its image feature information; a classification mapping unit, configured to synchronously pass the image feature information through the plurality of classifiers set in the portrait scoring model corresponding to the plurality of evaluation dimensions and determine the evaluation score of each evaluation dimension; and a score fusion unit, configured to fuse the plurality of evaluation scores of each face region image into the comprehensive score of the corresponding face region image.
On the basis of any embodiment of the present application, the live cover image extraction device of the present application includes: a sample prediction module, configured to call training samples from a training data set, input them into the portrait scoring model and predict the classification results mapped by the classifiers corresponding to the multiple evaluation dimensions, where part of the training samples are face region images; a loss calculation module, configured to calculate the classification loss value of each classification result correspondingly using the plurality of supervision labels provided for the corresponding evaluation dimensions mapped with the training sample, where the plurality of evaluation dimensions comprises any number of the following dimensions: a first dimension characterizing whether the face image in the training sample belongs to a real person, a second dimension characterizing whether it is a large head photo, a third dimension characterizing whether it is a high-quality image, a fourth dimension characterizing whether the face is complete, a fifth dimension characterizing whether the face is covered, a sixth dimension characterizing whether the face is a frontal face, and a seventh dimension characterizing whether it contains a smiling expression; and an iterative update module, configured to fuse the classification loss values into a total loss value and perform gradient updates on the portrait scoring model according to the total loss value until the model is iteratively trained to a convergence state.
On the basis of any embodiment of the present application, the live cover screening module 1300 includes: a face screening unit, configured to screen out a plurality of face region images whose comprehensive scores are higher than a preset threshold as target face images; a source image cutting unit, configured to cut, according to the size specification of a live cover, a screenshot containing the target face image from the source image frame of the target face image; a screenshot pushing unit, configured to push each screenshot to the terminal device of the anchor user to whom the live video stream belongs; and a cover setting unit, configured to acquire the screenshot designated by the anchor user and set it as the cover image of the live video stream.
On the basis of any embodiment of the present application, the live cover screening module 1300 includes: a face screening unit, configured to screen out the face region image with the highest comprehensive score as the target face image; a face enhancement unit, configured to perform super-resolution enhancement on the target face image by a preset multiple to obtain an enhanced face image; a source image enhancement unit, configured to perform the super-resolution enhancement of the preset multiple on the source image frame of the target face image to obtain an enhanced source image frame; an image compositing unit, configured to overlay and composite the enhanced face image onto the enhanced source image frame according to the positional correspondence to obtain a high-quality image frame; and a cover setting unit, configured to cut, according to the size specification of a live cover, a screenshot containing the target face image from the high-quality image frame and set it as the cover image of the live video stream.
In order to solve the technical problem, an embodiment of the application further provides an electronic device. As shown in fig. 10, the internal structure of the electronic device is schematically illustrated. The electronic device includes a processor, a computer-readable storage medium, a memory, and a network interface connected by a system bus. The computer readable storage medium of the electronic device stores an operating system, a database and computer readable instructions, the database can store control information sequences, and the computer readable instructions can enable a processor to realize a live cover image extraction method when being executed by the processor. The processor of the electronic device is used for providing calculation and control capability and supporting the operation of the whole electronic device. The memory of the electronic device may store computer readable instructions, and when executed by the processor, the computer readable instructions may cause the processor to execute the live cover image extraction method of the present application. The network interface of the electronic equipment is used for connecting and communicating with the terminal. Those skilled in the art will appreciate that the architecture shown in fig. 10 is merely a block diagram of some of the structures associated with the present solution and does not constitute a limitation on the electronic devices to which the present solution applies, and that a particular electronic device may include more or less components than those shown, or combine certain components, or have a different arrangement of components.
In this embodiment, the processor is configured to execute the specific functions of each module and its sub-modules in fig. 9, and the memory stores the program code and the various data required for executing those modules or sub-modules. The network interface is used for data transmission to and from a user terminal or a server. The memory in this embodiment stores the program code and data required for executing all the modules and sub-modules of the live cover image extraction device of the present application, and the electronic device can call this program code and data to execute the functions of all the sub-modules.
The present application also provides a storage medium having computer-readable instructions stored thereon, which, when executed by one or more processors, cause the one or more processors to perform the steps of the live-cover-image extraction method of any of the embodiments of the present application.
The present application also provides a computer program product comprising computer programs/instructions which, when executed by one or more processors, implement the steps of the method as described in any of the embodiments of the present application.
It will be understood by those skilled in the art that all or part of the processes of the methods of the embodiments of the present application can be implemented by a computer program, which can be stored in a computer-readable storage medium, and when the computer program is executed, the processes of the embodiments of the methods can be included. The storage medium may be a computer-readable storage medium such as a magnetic disk, an optical disk, a Read-Only Memory (ROM), or a Random Access Memory (RAM).
In conclusion, the present application measures the quality of face region images comprehensively across a plurality of evaluation dimensions. With this more comprehensive system the high-quality face region images can be determined effectively, the cover images produced from them are more representative and better carry the image of the anchor user of the live video stream, and the live covers constructed in this way can attract greater user traffic, achieve effective promotion and thereby activate interaction on the webcast platform.

Claims (10)

1. A live cover image extraction method is characterized by comprising the following steps:
acquiring a plurality of face area images from a live video stream;
synchronously determining a plurality of evaluation scores of each face region image corresponding to a plurality of evaluation dimensions, and summarizing the plurality of evaluation scores into a comprehensive score of the corresponding face region image;
and screening out a face area image according to the comprehensive score to be used for determining a cover image of the live video stream.
2. The live cover image extraction method of claim 1, wherein obtaining a plurality of face region images from a live video stream comprises:
performing face detection based on the image frames of the live video stream, and determining a plurality of target image frames carrying face content and selection frames corresponding to face areas;
extracting a region image from the corresponding target image frame according to the selection frame;
and clustering the face characteristic vectors of the plurality of region images, determining a maximum cluster class from a plurality of cluster classes obtained by clustering, and taking the region image in the maximum cluster class as a face region image.
3. The method for extracting the live cover image according to claim 2, wherein extracting the region image from the corresponding target image frame according to the selection frame comprises:
extracting a corresponding region image from the corresponding target image frame according to the selection frame, and performing super-resolution enhancement processing on the region image;
and performing posture correction processing on the face content in the area image subjected to the super-resolution enhancement processing.
4. The live cover image extraction method of claim 1, wherein a plurality of evaluation scores corresponding to a plurality of evaluation dimensions are synchronously determined for each face region image, and the plurality of evaluation scores are aggregated into a comprehensive score of the corresponding face region image, and the method comprises the following steps:
inputting each face region image into an image encoder of a face scoring model to extract image characteristic information of the face region image;
synchronously passing the image characteristic information through a plurality of classifiers which are arranged in the portrait scoring model and correspond to a plurality of evaluation dimensions, and determining an evaluation score corresponding to each evaluation dimension;
and fusing the plurality of evaluation scores corresponding to each face region image into a comprehensive score of the corresponding face region image.
5. The live cover image extraction method of claim 3, wherein before the plurality of face region images are obtained from the live video stream, the method comprises:
calling training samples in a training data set to input the portrait scoring model, predicting classification results of the portrait scoring model mapped to classifiers corresponding to the multiple evaluation dimensions, wherein part of the training samples are face region images;
adopting a plurality of supervision labels provided by a plurality of corresponding evaluation dimensions mapped with the training sample to correspondingly calculate the classification loss value of each classification result;
and fusing the classification loss values into a total loss value, and performing gradient updating on the portrait scoring model according to the total loss value until the portrait scoring model is iteratively trained to a convergence state.
6. The live cover image extraction method according to any one of claims 1 to 5, wherein screening out a face region image according to the composite score for determining a cover image of the live video stream includes:
screening out a plurality of face area images with the comprehensive scores higher than a preset threshold value as target face images;
cutting a screenshot containing the target face image from a source image frame of the target face image according to the size specification of a live broadcast cover;
pushing each screenshot to the terminal device of the anchor user to whom the live video stream belongs;
and acquiring the screenshot appointed by the anchor user, and setting the screenshot as a cover image of the live video stream.
7. The live cover image extraction method according to any one of claims 1 to 5, wherein screening out a face region image according to the composite score for determining a cover image of the live video stream includes:
screening out the face region image with the highest comprehensive score as a target face image;
performing super-resolution enhancement on the target face image by preset times to obtain an enhanced face image;
performing the super-resolution enhancement of the preset multiple on the source image frame of the target face image to obtain an enhanced source image frame;
overlaying and synthesizing the enhanced face image to the enhanced source image frame according to the position corresponding relation to obtain a high-quality image frame;
and cutting a screenshot containing the target face image from the high-quality image frame corresponding to the size specification of the live broadcast cover to be set as a cover image of the live broadcast video stream.
8. A live cover image extraction device, characterized by comprising:
the face image acquisition module is used for acquiring a plurality of face area images from a live video stream;
the multi-dimensional synchronous scoring module is set to synchronously determine a plurality of evaluation scores of a plurality of evaluation dimensions corresponding to each face region image, and the plurality of evaluation scores are collected into a comprehensive score of the corresponding face region image;
and the live broadcast cover screening module is used for screening a face area image according to the comprehensive score and determining a cover image of the live broadcast video stream.
9. An electronic device comprising a central processor and a memory, wherein the central processor is configured to invoke execution of a computer program stored in the memory to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores, in the form of computer-readable instructions, a computer program implemented according to the method of any one of claims 1 to 7, which, when invoked by a computer, performs the steps comprised by the corresponding method.