CN116708953A - Live cover generation method and device - Google Patents

Live cover generation method and device

Info

Publication number
CN116708953A
Authority
CN
China
Prior art keywords
target
live
cover
video
images
Prior art date
Legal status
Pending
Application number
CN202310738169.6A
Other languages
Chinese (zh)
Inventor
杜平杰
殷雅俊
Current Assignee
Beijing Huafang Technology Co ltd
Original Assignee
Beijing Huafang Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Beijing Huafang Technology Co., Ltd.
Priority claimed from application CN202310738169.6A
Publication of CN116708953A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85: Assembly of content; Generation of multimedia applications
    • H04N 21/854: Content authoring
    • H04N 21/8549: Creating video summaries, e.g. movie trailer
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44008: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs, involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44016: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs, involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

An embodiment of the invention provides a live cover generation method and device. The method includes: collecting the live video of a target live room in the current period and extracting a plurality of target images from the live video; determining feature information corresponding to the live video; and generating a target cover image according to the plurality of target images and the feature information, then replacing the current cover image of the target live room with the target cover image. Because the target cover image is generated from the target images together with the feature information of the live video, the quality of the generated live cover is improved and the cover better reflects the live content. By collecting live video periodically, the cover of the target live room can be switched in time to a target cover matching the current live content, so the live cover is adjusted dynamically, viewers can conveniently select a live room according to its cover, and the click rate and click conversion rate of the live room are improved.

Description

Live cover generation method and device
Technical Field
The present invention relates to the field of computer processing technologies, and in particular, to a method and apparatus for generating a live cover.
Background
In live-streaming scenarios, the quality of the live cover is critical to a live room's click rate and click conversion rate. A high-quality cover that fully reflects the live content is far more likely to attract the target audience to click and watch.
On a traditional video live-streaming platform, the live cover is usually uploaded by the host. However, a cover uploaded by the host is often weakly correlated with the host's actual live content, which degrades the browsing experience. Moreover, once the cover has been set before the broadcast it is not adjusted again, while the content of a live stream can differ greatly between time periods, so the cover may end up inconsistent with the current live content.
Disclosure of Invention
Embodiments of the invention provide a live cover generation method and device that improve the quality of the generated live cover, so that the cover better reflects the live content and can be generated dynamically as the live content changes.
In a first aspect, an embodiment of the present invention provides a live cover generation method, where the method includes:
collecting live video of a target live room in a current period;
extracting a plurality of target images from the live video;
determining feature information corresponding to the live video;
and generating a target cover image according to the plurality of target images and the feature information, and replacing the current cover image of the target live room with the target cover image.
In a second aspect, an embodiment of the present invention provides a live cover generation apparatus, including:
an acquisition module, configured to collect live video of the target live room in the current period;
an extraction module, configured to extract a plurality of target images from the live video;
a determining module, configured to determine the feature information corresponding to the live video;
and a generation module, configured to generate a target cover image according to the plurality of target images and the feature information, and to replace the current cover image of the target live room with the target cover image.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor, a communication interface; wherein the memory has executable code stored thereon, which when executed by the processor, causes the processor to at least implement the live cover generation method as described in the first aspect.
In a fourth aspect, embodiments of the present invention provide a non-transitory machine-readable storage medium having executable code stored thereon, which when executed by a processor of a management device, causes the processor to at least implement a live cover generation method as described in the first aspect.
According to the live cover generation scheme provided by the embodiments of the invention, live video of a target live room is collected at a preset period, a live cover matching the live content is generated from the video collected in the current period, and the live cover is adjusted dynamically at that fixed period. Specifically, the live video of the target live room in the current period is collected and a plurality of target images are extracted from it; the feature information corresponding to the live video is then determined, a target cover image is generated from the plurality of target images and the feature information, and the current cover image of the target live room is replaced with the target cover image.
In this scheme, the target images are extracted from the live video of the current period and the target cover image is generated from those images together with the feature information of the live video; that is, multi-dimensional information related to the current live content is used when generating the cover, which improves the quality of the generated live cover and makes it reflect the live content better. Moreover, by collecting live video periodically, a new target cover can be generated as the live content changes, and the current cover of the target live room can be switched in time to a target cover matching the current content. The live cover is thus adjusted dynamically, viewers can conveniently select a live room according to its cover, the click rate and click conversion rate of the live room are improved, and viewers are more likely to land on a live room that actually interests them.
Drawings
In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings in the following description show only some embodiments of the present invention; a person skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a method for generating a live cover according to an embodiment of the present invention;
fig. 2 is a flowchart for determining feature information corresponding to a live video according to an embodiment of the present invention;
FIG. 3 is a flowchart of generating a target cover image according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of a live cover generating device according to an embodiment of the present invention;
fig. 5 is a schematic structural diagram of an electronic device corresponding to the live cover generating apparatus provided in the embodiment shown in fig. 4.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The terminology used in the embodiments of the invention is for the purpose of describing particular embodiments only and is not intended to limit the invention. As used in this application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well unless the context clearly indicates otherwise; "plurality" generally means at least two, but does not exclude the case of at least one. It should be understood that the term "and/or" used herein merely describes an association between related objects and indicates that three relationships may exist; for example, A and/or B may mean that A exists alone, that A and B both exist, or that B exists alone. In addition, the character "/" herein generally indicates an "or" relationship between the objects before and after it. The word "if" as used herein may be interpreted as "at the time of" or "when", depending on the context.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a product or system that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such a product or system. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of other like elements in a product or system that comprises that element.
In addition, the sequence of steps in the method embodiments described below is only an example and is not strictly limited.
With the rapid development of computer and Internet technologies, the live-streaming field is also flourishing, and the live-streaming ecosystem now carries a huge and varied amount of live content. For a viewer, the first question is how to find, among a vast number of live rooms, the live content he or she is interested in, and the live cover, as the first piece of information a viewer sees about a live room, becomes key. On traditional live platforms, however, the cover is usually uploaded by the host, or a screenshot of the stream is used directly as the cover. A cover uploaded by the host is often only weakly related to the live content, which frequently produces "clickbait cover" situations and degrades the browsing experience. Using a live screenshot as the cover, on the other hand, produces uniform, unremarkable covers that cannot highlight the character of a live room, so viewers cannot intuitively choose according to their interests. Moreover, a live highlight usually lasts for a period of time; a single-frame screenshot can hardly express the richness and variability of the video content, and it is also hard to decide which frame should be used as the cover in the first place.
To solve the above technical problems, embodiments of the present invention provide a live cover generation method that extracts, from the live video of the current period, a plurality of target images capable of reflecting the highlight content together with the feature information corresponding to the live video, and generates a target cover image from them. Because multi-dimensional information related to the current live content is combined, the generated live cover is of higher quality and better reflects the current live content. In addition, by collecting live video periodically, the current cover of the target live room can be switched in time to a target cover matching the current live content, so the live cover is adjusted dynamically.
The live cover generation method of the scheme is described in detail through the following embodiments.
The live cover generation method provided by the embodiments of the invention may be executed by an electronic device. The electronic device may be a terminal device such as a PC, a laptop, or a smartphone, or it may be a server. The server may be a physical server with an independent host, a virtual server, a cloud server, or a server cluster. The live cover generation process is described below by way of example.
Fig. 1 is a flowchart of a live cover generation method according to an embodiment of the present invention. Referring to fig. 1, this embodiment provides a live cover generation method whose execution body may be a live cover generation device; it will be understood that the device may be implemented as software or as a combination of software and hardware. Specifically, as shown in fig. 1, the live cover generation method may include the following steps:
101. Collect the live video of the target live room in the current period.
102. Extract a plurality of target images from the live video.
103. Determine the feature information corresponding to the live video.
104. Generate a target cover image according to the plurality of target images and the feature information, and replace the current cover image of the target live room with the target cover image.
The live cover generation method provided by the embodiments of the invention can generate, for all kinds of live rooms, a target live cover that matches the current live content. Live content is highly time-sensitive, and the content broadcast in different periods may differ greatly. In the embodiments of the invention, to make the generated cover better reflect the current content, the target live cover is generated dynamically according to the live content: live video of the target live room is collected at a preset period, a target live cover is generated from the video of the current period, and the current cover of the target live room is switched to the new target cover.
Specifically, the live video of the target live room in the current period is collected first. The target live room may be any room that is currently broadcasting. It may be chosen by the live cover generation device (or by a server providing the live-streaming service); for example, the device may treat a room with low popularity, a room with a poor-quality cover, or the room of a new host as the target live room. The target live room may also be designated by the host actively triggering a cover generation request: after a host sends such a request to the live cover generation device, the room of that host is taken as the target live room and its live video is collected.
In practical applications, the live video of the target live room can be collected at a preset period, which may be chosen according to how frequently the live content or the live topic of the room changes; for example, the live video may be collected every 5 minutes or every 10 minutes, and the duration of each collected clip can be set according to actual requirements.
While a broadcast is running, the content of the live room can change at any moment. To make the generated cover match the current content or topic, the live video of the target live room in the current period is collected when the target cover is generated, and a plurality of target images matching the live content are determined from that video. Specifically, the target images may be extracted from the live video at a preset step, or a target highlight segment of the live video may be located first and the target images extracted from it.
Specifically, in an optional embodiment, extracting the plurality of target images from the live video may include: splitting the live video into a plurality of video segments; determining a first heat attribute of each segment according to the number of online viewers during that segment; determining a second heat attribute of each segment based on the audience interaction information for that segment; determining a target highlight segment from the segments based on the first and second heat attributes; and capturing a plurality of target images from the target highlight segment. The audience interaction information may include bullet-screen messages, viewing behaviour, and the like. In other words, the target highlight segment is chosen according to audience feedback before the target images are captured from it, so the resulting images both reflect the current live content and contain what the audience is interested in.
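By way of illustration only, the following Python sketch (not part of the original patent text) shows one way the two heat attributes could be combined to pick the target highlight segment; the segment fields, the normalisation, and the equal 0.5/0.5 weighting are assumptions.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class VideoSegment:
    start_s: float                 # segment start time within the live video
    end_s: float                   # segment end time
    online_counts: List[int]       # sampled online-viewer counts during the segment
    interactions: int              # bullet-screen messages, gifts, etc. in the segment

def first_heat(seg: VideoSegment) -> float:
    # First heat attribute: average number of online viewers in the segment.
    return sum(seg.online_counts) / max(len(seg.online_counts), 1)

def second_heat(seg: VideoSegment) -> float:
    # Second heat attribute: interaction events per second of the segment.
    duration = max(seg.end_s - seg.start_s, 1e-6)
    return seg.interactions / duration

def pick_target_highlight(segments: List[VideoSegment],
                          w1: float = 0.5, w2: float = 0.5) -> VideoSegment:
    # Normalise each attribute across segments, then combine them with assumed weights.
    f = [first_heat(s) for s in segments]
    g = [second_heat(s) for s in segments]
    f_max, g_max = max(f) or 1.0, max(g) or 1.0
    scores = [w1 * fi / f_max + w2 * gi / g_max for fi, gi in zip(f, g)]
    return segments[scores.index(max(scores))]
```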
A single target image can hardly express the richness and variability of the live video content, so after the plurality of target images are obtained, the feature information corresponding to the live video is determined. The feature information may include temporal features and spatial features of the live video, so that the temporal and spatial information in the video is captured.
A target cover image is then generated from the feature information together with the plurality of target images. The target images reflect the live content at the highlight moments of the current period, while the feature information reflects the temporal and spatial information of the live video in that period; in other words, multi-dimensional information is combined to generate the target cover image, so the generated cover is of higher quality, matches the current live content more closely, is more visually appealing, and is more likely to attract viewers.
As can be seen from the above, live video of the target live room is collected at a preset period, and a target cover image that matches the current live content and fits viewers' preferences is generated dynamically from the collected video, so a high-quality cover that matches the live content is produced every period.
Finally, the current cover image of the target live room is switched to the target cover image, so the cover of the target live room is adjusted in time to match the live content. In the embodiments of the invention the target cover is generated from the live video of the current period; in the next period the same method produces a new target cover image matching the live content of that period and the current cover of the target live room is adjusted again. That is, the live cover is adjusted dynamically as the live content changes, so that it always matches the current content.
In the embodiments of the invention, the target images are extracted from the live video of the current period and the target cover image is generated from those images together with the feature information of the live video; that is, multi-dimensional information related to the current live content is used when generating the cover, which improves its quality and makes it reflect the live content better. Moreover, by collecting live video periodically, a new target cover can be generated as the content changes and the current cover of the target live room can be switched in time to one that matches the current content. The live cover is thus adjusted dynamically, viewers can conveniently select a live room according to its cover, the click rate and click conversion rate of the live room are improved, and viewers are more likely to land on a live room that actually interests them.
The above embodiment describes the overall process of generating the live cover. When the target cover image is generated, the feature information of the live video is combined so that the generated cover better reflects the current live content. To make the determination of that feature information easier to follow, an exemplary implementation is described with reference to fig. 2.
Fig. 2 is a flowchart of determining the feature information corresponding to the live video according to an embodiment of the present invention. Referring to fig. 2, this embodiment provides a specific implementation of determining the feature information corresponding to the live video from the target highlight segment; it may include the following steps:
201. Determine a video feature vector and a text feature vector corresponding to the target highlight segment, where the video feature vector characterises the temporal and spatial information of the segment and the text feature vector characterises the text information contained in the segment.
202. Determine the feature information corresponding to the live video according to the video feature vector and the text feature vector.
A target image can only express the live content at one moment and cannot capture the richness and variability of the content over the current period. To capture them better, the feature information of the live video is combined when the target cover image is generated, which improves the quality of the generated cover.
The collected live video may actually contain several live segments; if each segment corresponds to a different topic or different content, determining the feature information of the whole video becomes complex. To reduce the processing complexity, the feature information of the live video can be determined from the target highlight segment, which both simplifies cover generation and keeps the generated cover matched to the current live content.
Specifically, the target highlight segment is determined from the live video and the feature information corresponding to that segment is then determined. The feature information may include a video feature vector characterising the temporal and spatial information of the segment and a text feature vector characterising the text information it contains. The temporal and spatial information of the segment can thus be obtained from its video feature vector, and its text information from its text feature vector, which improves the quality of the cover generated later.
In an optional embodiment, the video feature vector of the target highlight segment may be determined as follows: obtain video frame extraction parameters; obtain a number of consecutive video frames from the target highlight segment according to those parameters; and input the consecutive frames into a neural network model to obtain the video feature vector of the segment. The convolutional neural network mainly consists of several convolutional layers, pooling layers, activation functions, and a fully connected layer. The consecutive frames are first fed into the convolutional layers for feature extraction; a pooling layer then down-samples the result to reduce the feature dimension; an activation function applies a non-linear transformation to strengthen the model's expressive power; finally the fully connected layer performs the classification or regression task and yields the video feature vector.
The neural network model may be a 3D convolutional neural network. Using a pre-trained 3D CNN to obtain the video feature vector of the target highlight segment preserves the original spatial information of the segment well.
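As a hedged illustration of this step, the sketch below builds a small 3D convolutional feature extractor in PyTorch following the pattern just described (3D convolution, pooling, activation, fully connected output). The layer sizes, the 16-frame clip length, and the 512-dimensional output are assumed values, not parameters taken from the patent.

```python
import torch
import torch.nn as nn

class Video3DCNN(nn.Module):
    """Toy 3D CNN that maps a clip of consecutive frames to one feature vector."""
    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1),   # feature extraction over (T, H, W)
            nn.ReLU(inplace=True),
            nn.MaxPool3d(kernel_size=2),                   # pooling to reduce feature size
            nn.Conv3d(32, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool3d(1),                       # global pooling over time and space
        )
        self.fc = nn.Linear(64, feature_dim)               # fully connected head -> feature vector

    def forward(self, clip: torch.Tensor) -> torch.Tensor:
        # clip: (batch, 3, num_frames, height, width)
        x = self.backbone(clip).flatten(1)
        return self.fc(x)

# Example: 16 consecutive frames sampled from the target highlight segment.
frames = torch.randn(1, 3, 16, 112, 112)
video_feature = Video3DCNN()(frames)    # -> (1, 512) video feature vector
```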
After the video feature vector of the target highlight segment is determined, the text feature vector of the segment is determined. The text feature vector characterises the text information contained in the segment, such as bullet-screen messages, gifting messages, and what the host says.
In an optional embodiment, the text feature vector of the target highlight segment may be determined as follows: extract the audio data corresponding to the segment; perform speech recognition on the audio data to obtain first text information; split the segment to obtain the image frame sequence corresponding to it; perform text recognition on the image frame sequence to obtain second text information; and determine the text feature vector of the segment from the first and second text information.
In practice the target highlight segment usually contains the host's speech, the speech of viewers or other hosts joining via co-streaming, the live title, the host's nickname, bullet-screen messages sent by viewers, viewing information, and so on. To obtain all of this information about the current live content, the audio data containing the host's and co-streamers' speech can be extracted from the segment, a speech recognition model can then transcribe the audio, and the recognised speech is converted into text to obtain the first text information. Processing the audio with a trained speech recognition model yields more accurate text features.
The speech recognition model mainly consists of an encoder and a decoder. The encoder converts the audio data into a vector representation used for recognition; the decoder maps that representation to words, recognising everything said in the audio and finally outputting the recognition result, i.e. the text corresponding to the audio. Optionally, the encoder may be a cascade of several encoder layers, each containing two sub-layers: an attention layer and a feed-forward neural network layer. The decoder may likewise be a cascade of several decoder layers, each containing attention layers and a feed-forward neural network layer. The number of encoder layers, and similarly the number of decoder layers, can be set according to actual requirements and is not limited here. In addition, the attention layers in the decoder may include a self-attention layer and an encoder-decoder attention layer.
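Purely as an architectural sketch of such an encoder-decoder speech recogniser (not the patent's concrete model), the PyTorch skeleton below stacks cascaded encoder and decoder layers, each containing attention and feed-forward sub-layers; the acoustic feature dimension, vocabulary size, and layer counts are assumptions.

```python
import torch
import torch.nn as nn

class ToyASRModel(nn.Module):
    def __init__(self, feat_dim=80, vocab_size=5000, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)       # acoustic features -> model dim
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead, dim_feedforward=1024,
                                               batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead, dim_feedforward=1024,
                                               batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers)   # cascaded encoder layers
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers)   # cascaded decoder layers
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.out = nn.Linear(d_model, vocab_size)            # per-step token probabilities

    def forward(self, audio_feats, prev_tokens):
        # audio_feats: (batch, time, feat_dim), e.g. log-mel frames of the highlight audio
        # prev_tokens: (batch, text_len), tokens decoded so far
        memory = self.encoder(self.input_proj(audio_feats))   # vector representation of the audio
        dec = self.decoder(self.token_emb(prev_tokens), memory)
        return self.out(dec)                                  # logits over the vocabulary

logits = ToyASRModel()(torch.randn(1, 200, 80), torch.zeros(1, 10, dtype=torch.long))
```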
For the live title, the host's nickname, the bullet-screen messages, and the viewing information shown in the target highlight segment, the segment can be converted into images and text recognition performed on each image to obtain the corresponding text. Specifically, the segment is split into an image frame sequence, the frame sequence is fed into an image recognition model, and the text contained in the frames is recognised to obtain the second text information. The image recognition model may specifically be a BLIP-2 model; processing the frame sequence with a trained image recognition model yields more accurate text features. The text feature vector of the target highlight segment is then determined from the first and second text information.
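The following sketch illustrates this second branch under stated assumptions: OpenCV slices the highlight segment into an image frame sequence, and each frame is passed to an image-to-text model. It assumes the Hugging Face transformers BLIP-2 classes and the "Salesforce/blip2-opt-2.7b" checkpoint are available; the frame step is arbitrary, and a dedicated OCR model could be substituted for recognising on-screen text.

```python
import cv2
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

def highlight_to_frames(path: str, step: int = 30):
    """Split the highlight clip into an image frame sequence (every `step`-th frame)."""
    cap, frames, idx = cv2.VideoCapture(path), [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % step == 0:
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
        idx += 1
    cap.release()
    return frames

# Assumed image recognition model standing in for BLIP-2 as named in the text.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b")

def frames_to_text(frames):
    """Produce the second text information: one recognised text string per frame."""
    texts = []
    for img in frames:
        inputs = processor(images=img, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=40)
        texts.append(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
    return texts
```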
To improve the accuracy of the obtained text feature vector, in an optional embodiment a language model may be used to determine the text feature vector of the target highlight segment. Specifically, the first text information and the second text information are input into a language model that has been trained to extract text feature vectors from them. The language model may be a Transformer model; its specific type is not limited.
Analysing the first and second text information with a trained language model to obtain the text feature vector of the target highlight segment effectively guarantees the accuracy and reliability of the text feature vector, which in turn guarantees the quality and efficiency of generating the target cover image from it and further improves the stability and reliability of the method.
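A minimal sketch of this step, assuming a pre-trained sentence-embedding model stands in for the language model: the first and second text information are encoded and pooled into one text feature vector, which is then concatenated with the video feature vector to form the feature information. The model name and the mean-pooling strategy are assumptions.

```python
import numpy as np
from typing import List
from sentence_transformers import SentenceTransformer

# Assumed pre-trained language model used to embed the recognised text.
text_encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def build_feature_info(first_text: List[str], second_text: List[str],
                       video_feature: np.ndarray) -> np.ndarray:
    """Text feature vector from first+second text info, concatenated with the video vector."""
    sentences = first_text + second_text
    text_vectors = text_encoder.encode(sentences)          # (num_sentences, dim)
    text_feature = text_vectors.mean(axis=0)               # pool into one text feature vector
    return np.concatenate([video_feature, text_feature])   # feature information of the live video
```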
Finally, the feature information of the live video is determined from the video feature vector and the text feature vector. The resulting feature information contains both the video features of the live video and the various pieces of text it carries, i.e. information about the live video along several dimensions, so that a cover generated from it displays the live content more accurately, attracts viewers' attention, and improves the host's satisfaction and retention.
In the embodiments of the invention, the video feature vector and the text feature vector of the target highlight segment are determined and the feature information of the live video is derived from them. This not only captures information about the live content along several dimensions but also greatly improves the quality and expressiveness of the cover generated from that information; at the same time the cover can display the live content accurately, attract viewers' attention, and noticeably raise the click rate and click conversion rate.
After the feature information of the live video is determined, the target cover image is generated from the plurality of target images and the feature information. To better understand this step, a specific implementation of generating the target cover image is described with reference to fig. 3.
FIG. 3 is a flowchart of generating a target cover image according to an embodiment of the present invention. Referring to fig. 3, the embodiment provides a specific implementation manner of generating a target cover image according to a plurality of target images and feature information, and specifically, the method may include the following steps:
301. Input the target highlight segment into a deep learning network model to obtain the portrait pictures in the segment, the model having been trained to extract portrait pictures from video.
302. Generate the target cover image according to the portrait pictures, the plurality of target images, and the feature information.
In the embodiments of the invention, the portrait pictures contained in the target highlight segment can be used when generating the target cover image: the portrait pictures in the segment are obtained first, and the target cover image is then generated from the portrait pictures, the plurality of target images, and the feature information.
The portraits may be cut out of the target highlight segment with a portrait matting algorithm, or the segment may be processed with a pre-trained deep learning network model to obtain the portrait pictures it contains. Specifically, the target highlight segment is input into a deep learning network model that has been trained to extract portrait pictures from video.
The deep learning network model can be trained on a large number of sample portrait pictures. Optionally, it may specifically be a lightweight real-time semantic segmentation model (PP-LiteSeg), which predicts the label of every pixel in an image and can therefore cut the portrait out of the image accurately.
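For illustration, the sketch below performs the per-pixel portrait extraction with a torchvision DeepLabV3 model standing in for the PP-LiteSeg model named above; the "person" class index (15 in the VOC label set), the preprocessing, and a recent torchvision supporting the string weights argument are assumptions.

```python
import numpy as np
import torch
from PIL import Image
from torchvision import models, transforms

seg_model = models.segmentation.deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])
PERSON_CLASS = 15   # assumed 'person' index in the VOC label set used by this model

def extract_portrait(frame: Image.Image) -> Image.Image:
    """Predict a per-pixel label map and keep only the pixels labelled as a person."""
    with torch.no_grad():
        out = seg_model(preprocess(frame).unsqueeze(0))["out"][0]   # (classes, H, W)
    mask = (out.argmax(0) == PERSON_CLASS).numpy().astype(np.uint8)
    rgba = np.dstack([np.array(frame), mask * 255])                 # alpha channel from the mask
    return Image.fromarray(rgba, mode="RGBA")                       # portrait picture with background removed
```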
After the portrait pictures contained in the target highlight segment are obtained, the target cover image is generated from the portrait pictures, the plurality of target images, and the feature information. In practice a live stream is usually carried by the host, who is indispensable to it, so the cover can be generated using the host's portrait information: a portrait picture of the host is first obtained from the live video, and the target cover image is then generated from that portrait, the target images, and the feature information, so that the resulting cover contains both the host's portrait and elements related to the current live content.
In an optional embodiment, generating the target cover image from the portrait pictures, the plurality of target images, and the feature information may be implemented as follows: input the portraits, the target images, and the feature information into a conditional generative adversarial network to generate a plurality of original cover images; then determine the target cover image from the original cover images and the feature information. In the embodiments of the invention, the feature information is fed to the conditional generative adversarial network as its condition, so a conditional constraint is introduced that guides the generation of the target cover image.
The conditional generative adversarial network (CGAN) is a modification of the original generative adversarial network (GAN), which mainly consists of a generator and a discriminator; the CGAN adds extra information as a condition, feeding it to both the generator and the discriminator as part of their input layers. Compared with the original GAN, the CGAN therefore introduces a conditional constraint while generating synthetic data, and after training it can synthesise data that satisfies a specified condition. For example, if the training samples are portraits and gender is used as the condition, then after training the CGAN can synthesise a male or a female portrait simply by setting the condition.
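To make the conditional constraint concrete, here is a reduced PyTorch sketch of a conditional GAN in which the condition vector (standing in for the feature information) is concatenated to the inputs of both the generator and the discriminator. The flattened image representation, resolution, and dimensions are simplifying assumptions, not the patent's network.

```python
import torch
import torch.nn as nn

IMG_DIM, NOISE_DIM, COND_DIM = 64 * 64 * 3, 100, 512   # assumed sizes

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        # Noise and condition vector are concatenated, so generation is condition-constrained.
        self.net = nn.Sequential(
            nn.Linear(NOISE_DIM + COND_DIM, 1024), nn.ReLU(inplace=True),
            nn.Linear(1024, IMG_DIM), nn.Tanh(),
        )
    def forward(self, z, cond):
        return self.net(torch.cat([z, cond], dim=1))

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        # The same condition is fed to the discriminator as part of its input layer.
        self.net = nn.Sequential(
            nn.Linear(IMG_DIM + COND_DIM, 1024), nn.LeakyReLU(0.2, inplace=True),
            nn.Linear(1024, 1), nn.Sigmoid(),
        )
    def forward(self, img_flat, cond):
        return self.net(torch.cat([img_flat, cond], dim=1))

# One generation step: the feature information acts as the condition for several candidates.
g = Generator()
condition = torch.randn(4, COND_DIM)        # stand-in for the live video's feature information
noise = torch.randn(4, NOISE_DIM)
original_covers = g(noise, condition)       # four candidate (flattened) original cover images
```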
After the plurality of original cover images are generated, the one with the best effect can be selected as the target cover image according to the feature information. Specifically, a convolutional neural network can analyse the original cover images together with the feature information of the live video to determine a quality score for each original cover image. The convolutional neural network can be trained iteratively according to the host's personal preferences and the historical performance of previous target covers: for example, covers that the host found satisfactory among previously generated candidates can be used as training data; or the network can be trained from the historical usage of each target cover and viewers' feedback, e.g. the number of online viewers observed while each cover was in use, from which the viewers' preferences are derived.
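The ranking step can be pictured with the following sketch: a small convolutional network scores each candidate cover conditioned on the feature information, and the highest-scoring candidate becomes the target cover image. The architecture and the way the feature vector is fused are assumptions rather than the concrete network used in practice.

```python
import torch
import torch.nn as nn

class CoverScorer(nn.Module):
    """Scores a candidate cover image given the live video's feature information."""
    def __init__(self, cond_dim: int = 512):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(64 + cond_dim, 1)    # fuse image features with the condition

    def forward(self, covers, feature_info):
        # covers: (N, 3, H, W) candidate covers; feature_info: (N, cond_dim)
        img_feat = self.cnn(covers)
        return self.head(torch.cat([img_feat, feature_info], dim=1)).squeeze(1)

scorer = CoverScorer()
candidates = torch.randn(4, 3, 128, 128)
feature_info = torch.randn(4, 512)
scores = scorer(candidates, feature_info)     # quality score per candidate
target_cover = candidates[scores.argmax()]    # best-scoring candidate becomes the target cover
```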
In another optional embodiment, a cover style can be set directly by the host when the target cover image is generated: style migration is applied to the plurality of original cover images to obtain a plurality of style migration images, and the target cover image is then determined from them. Specifically, the cover style set by the host of the target live room is obtained; an image style transfer algorithm processes the original cover images according to that style to obtain the style migration images; the quality score of each style migration image is determined from the images and the feature information; and the target cover image is selected from the style migration images based on those quality scores. Combining the cover style set, or preferred, by the host in this way makes the final target cover conform to the host's personal style.
Determining the quality scores of the style migration images from the images and the feature information may specifically include: inputting the style migration images and the feature information into a convolutional neural network that has been trained to determine the quality score of a style migration image, thereby obtaining the score of each image.
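For the style-migration step, a classic Gram-matrix style-transfer loop can serve as an illustrative stand-in for whichever image style transfer algorithm is actually used; the VGG-19 layer indices, weights, learning rate, and iteration count below are assumptions, and the host-set cover style is represented simply as an image tensor.

```python
import torch
import torch.nn.functional as F
from torchvision import models

vgg = models.vgg19(weights="DEFAULT").features.eval()   # assumes a recent torchvision
for p in vgg.parameters():
    p.requires_grad_(False)

STYLE_LAYERS, CONTENT_LAYER = {0, 5, 10, 19, 28}, 21     # commonly used VGG-19 indices

def features(x):
    feats, cur = {}, x
    for i, layer in enumerate(vgg):
        cur = layer(cur)
        if i in STYLE_LAYERS or i == CONTENT_LAYER:
            feats[i] = cur
    return feats

def gram(f):
    b, c, h, w = f.shape
    f = f.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)

def style_migrate(original_cover, style_image, steps=200, style_weight=1e5):
    """Re-draw an original cover image in the host's chosen cover style.

    original_cover, style_image: (1, 3, H, W) float tensors in [0, 1] (assumption).
    """
    img = original_cover.clone().requires_grad_(True)
    opt = torch.optim.Adam([img], lr=0.02)
    style_feats = features(style_image)
    content_feats = features(original_cover)
    for _ in range(steps):
        opt.zero_grad()
        cur = features(img)
        c_loss = F.mse_loss(cur[CONTENT_LAYER], content_feats[CONTENT_LAYER])
        s_loss = sum(F.mse_loss(gram(cur[i]), gram(style_feats[i])) for i in STYLE_LAYERS)
        (c_loss + style_weight * s_loss).backward()
        opt.step()
    return img.detach()     # one style migration image, to be scored afterwards
```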
In the embodiments of the invention, the target highlight segment is input into a deep learning network model trained to extract portrait pictures from video, and the target cover image is then generated from the portrait pictures, the plurality of target images, and the feature information, so the generated cover displays the live content more accurately, attracts viewers' attention, and noticeably improves the click rate and click conversion rate.
In order to facilitate understanding of the live cover generation method, an example is described in connection with a specific application scenario.
In a concrete application, the host sends a live cover generation request to the live cover generation device, the request carrying the cover-style picture set by the host. In response, the device collects the live video of the host's live room in the current period, splits it into a plurality of video segments, determines a first heat attribute for each segment from the number of online viewers during that segment and a second heat attribute from the audience interaction information for that segment, and determines the target highlight segment from the segments based on the two heat attributes. A plurality of target images are then captured from the target highlight segment.
Next, video frame extraction parameters are obtained, a number of consecutive video frames are obtained from the target highlight segment according to those parameters, and the consecutive frames are input into the neural network model to obtain the video feature vector of the segment. The audio data of the segment is then extracted and speech recognition is performed on it to obtain the first text information. The segment is split into its image frame sequence, text recognition is performed on the frame sequence to obtain the second text information, and the text feature vector of the segment is determined from the first and second text information. The feature information of the live video is then determined from the video feature vector and the text feature vector.
The target highlight segment is input into a deep learning network model trained to extract portrait pictures from video, yielding the portrait pictures in the segment. The portraits, the plurality of target images, and the feature information are input into a conditional generative adversarial network to generate a plurality of original cover images. The cover-style image set by the host is obtained, an image style transfer algorithm processes the original cover images according to that style to obtain a plurality of style migration images, and the style migration images together with the feature information are input into a convolutional neural network, trained to determine quality scores, to obtain the score of each image. Finally, the target cover image is selected from the style migration images based on the quality scores.
When the next period arrives, the live video of the host's live room in that period is collected, a plurality of target images and the feature information are determined from it, a target cover image is generated from them, and the current cover of the live room is replaced with it. In this way the target cover image of each period is generated in turn and the current cover of the live room is switched to it, so the cover is generated dynamically according to the change of the live content and the live room's cover is adjusted dynamically.
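Tying the scenario together, the periodic behaviour can be sketched as a simple scheduling loop. Every callable referenced through the `steps` mapping below is a hypothetical stand-in for the corresponding stage described above, not an existing API, and the five-minute period is an assumption.

```python
import time

def run_cover_updater(live_room_id, steps, cover_style_image, period_s=300):
    """Periodically regenerate and swap the live cover of one live room.

    `steps` is a mapping of hypothetical callables, one per stage described above;
    the caller supplies concrete implementations (e.g. the sketches shown earlier).
    """
    while steps["room_is_active"](live_room_id):
        video = steps["collect"](live_room_id, period_s)            # live video of current period
        highlight = steps["pick_highlight"](video)                  # target highlight segment
        target_images = steps["capture_images"](highlight)          # plurality of target images
        feature_info = steps["feature_info"](highlight)             # video + text feature vectors
        portraits = steps["extract_portraits"](highlight)           # host portrait pictures
        candidates = steps["generate_covers"](portraits, target_images, feature_info)
        styled = [steps["style_migrate"](c, cover_style_image) for c in candidates]
        best = steps["pick_best"](styled, feature_info)             # highest quality score wins
        steps["replace_cover"](live_room_id, best)                  # switch the current cover
        time.sleep(period_s)                                        # wait for the next period
```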
For details of this embodiment that are not described here, reference may be made to the descriptions in the foregoing embodiments, which are not repeated.
The live cover generation device of one or more embodiments of the present invention is described in detail below. Those skilled in the art will appreciate that the device can be configured from commercially available hardware components following the steps taught in this solution.
Fig. 4 is a schematic structural diagram of a live cover generating device according to an embodiment of the present invention, where, as shown in fig. 4, the device includes: the device comprises an acquisition module 11, an extraction module 12, a determination module 13 and a generation module 14.
And the acquisition module 11 is used for acquiring the live video of the target live broadcasting room in the current period.
An extracting module 12, configured to extract a plurality of target images from the live video.
And the determining module 13 is used for determining the characteristic information corresponding to the live video.
And the generating module 14 is configured to generate a target cover image according to the plurality of target images and the characteristic information, and replace the current cover image of the target live room with the target cover image.
Alternatively, the extraction module 12 may be specifically configured to: splitting the live video into a plurality of video clips; determining a first heat attribute corresponding to each video clip according to the online population information corresponding to each video clip; determining a second heat attribute corresponding to each video clip based on audience interaction information corresponding to each video clip; determining a target highlight segment from a plurality of the video segments based on the first heat attribute and the second heat attribute; and intercepting a plurality of target images from the target highlight segments.
Alternatively, the determining module 13 may specifically be configured to: determining a video feature vector and a text feature vector corresponding to the target highlight, wherein the video feature vector is used for representing time information and space information of the target highlight, and the text feature vector is used for representing text information included in the target highlight; and determining the feature information corresponding to the live video according to the video feature vector and the text feature vector.
Alternatively, the determining module 13 may specifically be configured to: acquiring video frame extraction parameters; acquiring a plurality of continuous video frames from the target highlight according to the video frame extraction parameters; and inputting the plurality of continuous video frames into a neural network model to obtain video feature vectors corresponding to the target highlight segments.
Alternatively, the determining module 13 may specifically be configured to: extracting audio data corresponding to the target highlight; performing voice recognition on the audio data to obtain first text information; dividing the target highlight segment to obtain an image frame sequence corresponding to the target highlight segment; performing text recognition on the image frame sequence to obtain second text information; and determining a text feature vector corresponding to the target highlight according to the first text information and the second text information.
Optionally, the determining module 13 may be further specifically configured to: the first text information and the second text information are input to a language model to obtain the text feature vector, and the language model is trained to extract the text feature vector in the first text information and the second text information.
Alternatively, the generating module 14 may specifically be configured to: inputting the target highlight into a deep learning network model to obtain portrait pictures in the target highlight, wherein the deep learning network model is trained to be used for extracting portrait pictures in videos; and generating a target cover image according to the portrait picture, the target images and the characteristic information.
Alternatively, the generating module 14 may specifically be configured to: inputting the portrait picture, the plurality of target images and the characteristic information into a condition generation countermeasure network to generate a plurality of original cover images; and determining a target cover image according to the plurality of original cover images and the characteristic information.
Alternatively, the generating module 14 may specifically be configured to: acquiring a cover style set by a host, wherein the host is the host corresponding to the target live room; processing the plurality of original cover images based on the cover style by adopting an image style migration algorithm to obtain a plurality of style migration images; determining the quality scores corresponding to the style migration images according to the style migration images and the characteristic information; and determining a target cover image from the plurality of style migration images based on the quality scores.
Optionally, the generating module 14 may be further specifically configured to: and inputting the plurality of style migration images and the characteristic information into a convolutional neural network to obtain the quality scores corresponding to the plurality of style migration images, wherein the convolutional neural network is trained to be used for determining the quality scores corresponding to the style migration images.
The apparatus shown in fig. 4 may perform the steps of the live cover generation method in the foregoing embodiment, and the detailed execution and technical effects are referred to the description in the foregoing embodiment, which is not repeated herein.
In one possible design, the structure of the live cover generating apparatus shown in fig. 4 may be implemented as an electronic device. As shown in fig. 5, the electronic device may include: a first processor 21, a first memory 22, and a first communication interface 23. The first memory 22 stores executable code which, when executed by the first processor 21, causes the first processor 21 to at least implement the steps of the live cover generation method of the foregoing embodiments.
In addition, embodiments of the present invention provide a non-transitory machine-readable storage medium having executable code stored thereon which, when executed by a processor of an electronic device, causes the processor to at least implement the live cover generation method provided in the foregoing embodiments.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Those of ordinary skill in the art can understand and implement the present invention without undue burden.
From the above description of the embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of a necessary general-purpose hardware platform, or by a combination of hardware and software. Based on such understanding, the essence of the foregoing technical solutions, or the portions contributing to the prior art, may be embodied in the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) having computer-usable program code embodied therein.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims (11)

1. A live cover generation method, comprising:
collecting live video of a target live broadcasting room in a current period;
extracting a plurality of target images from the live video;
determining feature information corresponding to the live video;
and generating a target cover image according to the plurality of target images and the feature information, and replacing a current cover image of the target live broadcasting room with the target cover image.
2. The method of claim 1, wherein extracting a plurality of target images from the live video comprises:
splitting the live video into a plurality of video clips;
determining a first heat attribute corresponding to each video clip according to online viewer count information corresponding to each video clip;
determining a second heat attribute corresponding to each video clip based on audience interaction information corresponding to each video clip;
determining a target highlight segment from the plurality of video clips based on the first heat attribute and the second heat attribute;
and capturing a plurality of target images from the target highlight segment.
3. The method of claim 2, wherein the determining the feature information corresponding to the live video comprises:
determining a video feature vector and a text feature vector corresponding to the target highlight segment, wherein the video feature vector is used for representing temporal information and spatial information of the target highlight segment, and the text feature vector is used for representing text information included in the target highlight segment;
and determining the feature information corresponding to the live video according to the video feature vector and the text feature vector.
4. The method of claim 3, wherein the determining the video feature vector corresponding to the target highlight segment comprises:
acquiring video frame extraction parameters;
acquiring a plurality of consecutive video frames from the target highlight segment according to the video frame extraction parameters;
and inputting the plurality of consecutive video frames into a neural network model to obtain the video feature vector corresponding to the target highlight segment.
5. The method of claim 3, wherein the determining the text feature vector corresponding to the target highlight segment comprises:
extracting audio data corresponding to the target highlight segment;
performing speech recognition on the audio data to obtain first text information;
dividing the target highlight segment to obtain an image frame sequence corresponding to the target highlight segment;
performing text recognition on the image frame sequence to obtain second text information;
and determining the text feature vector corresponding to the target highlight segment according to the first text information and the second text information.
6. The method of claim 5, wherein the determining the text feature vector corresponding to the target highlight segment according to the first text information and the second text information comprises:
inputting the first text information and the second text information into a language model to obtain the text feature vector, wherein the language model is trained to extract text feature vectors from the first text information and the second text information.
7. The method of claim 2, wherein the generating the target cover image according to the plurality of target images and the feature information comprises:
inputting the target highlight segment into a deep learning network model to obtain a portrait picture in the target highlight segment, wherein the deep learning network model is trained to extract portrait pictures from videos;
and generating the target cover image according to the portrait picture, the plurality of target images, and the feature information.
8. The method of claim 7, wherein the generating the target cover image according to the portrait picture, the plurality of target images, and the feature information comprises:
inputting the portrait picture, the plurality of target images, and the feature information into a conditional generative adversarial network to generate a plurality of original cover images;
and determining the target cover image according to the plurality of original cover images and the feature information.
9. The method of claim 8, wherein the determining the target cover image according to the plurality of original cover images and the feature information comprises:
acquiring a cover style set by a host, wherein the host is the host corresponding to the target live broadcasting room;
processing the plurality of original cover images based on the cover style by using an image style transfer algorithm to obtain a plurality of style-transferred images;
determining quality scores corresponding to the plurality of style-transferred images according to the style-transferred images and the feature information;
and determining the target cover image from the plurality of style-transferred images based on the quality scores.
10. The method of claim 9, wherein the determining the quality scores corresponding to the plurality of style-transferred images according to the style-transferred images and the feature information comprises:
and inputting the plurality of style-transferred images and the feature information into a convolutional neural network to obtain the quality scores corresponding to the plurality of style-transferred images, wherein the convolutional neural network is trained to determine quality scores for style-transferred images.
11. An electronic device, comprising: a memory, a processor, a communication interface; wherein the memory has executable code stored thereon that, when executed by the processor, causes the processor to perform the live cover generation method of any of claims 1 to 10.
CN202310738169.6A 2023-06-20 2023-06-20 Live cover generation method and device Pending CN116708953A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310738169.6A CN116708953A (en) 2023-06-20 2023-06-20 Live cover generation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310738169.6A CN116708953A (en) 2023-06-20 2023-06-20 Live cover generation method and device

Publications (1)

Publication Number Publication Date
CN116708953A true CN116708953A (en) 2023-09-05

Family

ID=87823681

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310738169.6A Pending CN116708953A (en) 2023-06-20 2023-06-20 Live cover generation method and device

Country Status (1)

Country Link
CN (1) CN116708953A (en)

Similar Documents

Publication Publication Date Title
CN110519636B (en) Voice information playing method and device, computer equipment and storage medium
CN110085244B (en) Live broadcast interaction method and device, electronic equipment and readable storage medium
CN112740709A (en) Gated model for video analysis
CN113709384A (en) Video editing method based on deep learning, related equipment and storage medium
CN110071938B (en) Virtual image interaction method and device, electronic equipment and readable storage medium
CN113766299B (en) Video data playing method, device, equipment and medium
CN116484318B (en) Lecture training feedback method, lecture training feedback device and storage medium
CN111464834A (en) Video frame processing method and device, computing equipment and storage medium
CN112637670B (en) Video generation method and device
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
CN111954087B (en) Method and device for intercepting images in video, storage medium and electronic equipment
CN112738557A (en) Video processing method and device
CN113766268B (en) Video processing method and device, electronic equipment and readable medium
CN111263183A (en) Singing state identification method and singing state identification device
CN114139491A (en) Data processing method, device and storage medium
CN114339423A (en) Short video generation method and device, computing equipment and computer readable storage medium
CN112131431B (en) Data processing method, device and computer readable storage medium
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN112188116B (en) Video synthesis method, client and system based on object
CN116708953A (en) Live cover generation method and device
CN116229311A (en) Video processing method, device and storage medium
CN113569668A (en) Method, medium, apparatus and computing device for determining highlight segments in video
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN112165626A (en) Image processing method, resource acquisition method, related device and medium
CN111160051A (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination