CN116055762A - Video synthesis method and device, electronic equipment and storage medium

Video synthesis method and device, electronic equipment and storage medium

Info

Publication number
CN116055762A
CN116055762A (application CN202211646714.0A)
Authority
CN
China
Prior art keywords
video · target · face · initial · video stream
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211646714.0A
Other languages
Chinese (zh)
Inventor
侯顺伟
熊浩军
陈嘉莉
王政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211646714.0A
Publication of CN116055762A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/234: Processing of video elementary streams, e.g. splicing of video streams, manipulating MPEG-4 scene graphs
    • H04N 21/23424: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168: Feature extraction; Face representation
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/172: Classification, e.g. identification
    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/43: Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N 21/44: Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N 21/44016: Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip

Abstract

The disclosure discloses a video synthesis method and device, electronic equipment and a storage medium, and relates to the technical field of image processing, in particular to the field of video editing. The specific implementation scheme is as follows: acquiring an initial video stream, and encoding and decoding the initial video stream to obtain a video segment set; performing face clustering processing on the video segment set to obtain a face cluster library corresponding to the video segment set; acquiring at least one target face feature corresponding to a target face image, and determining a target video segment set corresponding to the at least one target face feature from the face cluster library; and synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set. Adopting this scheme can reduce the time and cost of synthesizing video.

Description

Video synthesis method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of image processing, and in particular relates to a video synthesis method and device, electronic equipment and a storage medium.
Background
With the development of science and technology, people's living standards have improved, and people participate in various multi-person activities to meet their daily and entertainment needs. In the related art, a corresponding highlight collection may be made for each participant, leaving a valuable record of the activity. However, the time and cost of making a corresponding highlight collection for every participant are high.
Disclosure of Invention
The present disclosure provides a video synthesizing method and apparatus, an electronic device, and a storage medium, and is mainly aimed at reducing the time and cost of synthesizing video.
According to an aspect of the present disclosure, there is provided a video compositing method, including:
acquiring an initial video stream, and encoding and decoding the initial video stream to obtain a video fragment set;
performing face clustering processing on the video segment sets to obtain face clustering libraries corresponding to the video segment sets, wherein any face feature in the face clustering libraries corresponds to one video segment subset;
acquiring at least one target face feature corresponding to a target face image, and determining a target video segment set corresponding to the at least one target face feature from the face cluster library;
and synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set.
According to another aspect of the present disclosure, there is provided a video compositing apparatus, comprising:
the video stream acquisition unit is used for acquiring an initial video stream, and encoding and decoding the initial video stream to obtain a video fragment set;
The set clustering unit is used for carrying out face clustering processing on the video segment sets to obtain face clustering libraries corresponding to the video segment sets, wherein any face feature in the face clustering libraries corresponds to one video segment subset;
the collection acquisition unit is used for acquiring at least one target face feature corresponding to the target face image and determining a target video fragment collection corresponding to the at least one target face feature from the face clustering library;
and the video synthesis unit is used for synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the preceding aspects.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method of any one of the preceding aspects.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method of any one of the preceding aspects.
In one or more embodiments of the present disclosure, a video clip set is obtained by acquiring an initial video stream and encoding and decoding the initial video stream; performing face clustering processing on the video fragment set to obtain a face clustering library corresponding to the video fragment set, wherein any face feature in the face clustering library corresponds to one video fragment subset; acquiring at least one target face feature corresponding to a target face image, and determining a target video segment set corresponding to the at least one target face feature from a face cluster library; and synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set. Thus, the time and cost of synthesizing video can be reduced.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
fig. 1 is a flow diagram of a video compositing method according to a first embodiment of the disclosure;
fig. 2 is a flow diagram of a video compositing method according to a second embodiment of the disclosure;
FIG. 3 is a flow diagram of acquiring a video clip in TS format according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of video composition according to an embodiment of the present disclosure;
fig. 5 (a) is a schematic structural diagram of a first video compositing apparatus for implementing the video compositing method of an embodiment of the disclosure;
fig. 5 (b) is a schematic structural diagram of a second video compositing apparatus for implementing the video compositing method of an embodiment of the disclosure;
fig. 6 is a block diagram of an electronic device for implementing a video compositing method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
With the development of science and technology, people's living standards have improved, and people participate in various multi-person activities to meet their daily and entertainment needs. For example, cities may hold marathon races for people to join: for the competitors a marathon is a sporting event, while for the host it is a city-promotion event. Making a highlight collection video leaves a valuable record for the participants, and at the same time deepens the impression of the host city among participants and key opinion consumers (KOC).
According to some embodiments, when such a highlight video is produced, the face features of each participant need to be collected before the activity to build a face feature library; after the activity ends, the segments corresponding to each participant are cut from the activity video according to the face feature library, and finally the segments are post-transcoded against a fixed template into the video corresponding to each participant.
It is readily appreciated that when the number of participants is large, for example tens of thousands of participants in a single marathon, the time and cost spent building a face feature library before the event are high. In addition, since one segment may contain several participants, the same segment may be cut many times, and each cut is followed by a post-transcoding pass, which is time-consuming and costly.
The present disclosure is described in detail below with reference to specific examples.
In a first embodiment, as shown in fig. 1, fig. 1 is a flow diagram of a video compositing method according to a first embodiment of the disclosure. The method may be implemented by a computer program and run on a device that performs video compositing. The computer program may be integrated into an application or run as a stand-alone tool-class application.
The video synthesizing apparatus may be an electronic device having a video synthesizing function, including but not limited to: wearable devices, handheld devices, personal computers, tablet computers, vehicle-mounted devices, smart phones, computing devices, or other processing devices connected to a wireless modem. Electronic devices in different networks may be called different names, for example: user equipment, access electronics, subscriber units, subscriber stations, mobile stations, remote electronics, mobile devices, consumer electronics, wireless communication devices, user agents or user equipment, cellular telephones, cordless telephones, personal digital assistants (PDA), or electronics in fifth-generation mobile communication technology (5G) networks, fourth-generation mobile communication technology (4G) networks, third-generation mobile communication technology (3G) networks, or future evolution networks, and the like.
Specifically, the video synthesis method comprises the following steps:
s101, acquiring an initial video stream, and encoding and decoding the initial video stream to obtain a video clip set;
according to some embodiments, the initial video stream refers to a complete video stream acquired for an activity from an activity start time to an activity end time in a certain activity scene.
In some embodiments, codec refers to both encoding and decoding. Encoding and decoding both refer to the process of converting information from one form or format to another, and they are inverse processes of each other. Decoding specifically refers to restoring a digital code to what it represents, or converting an electric pulse signal, an optical signal, a radio wave, or the like into the information or data it represents, by a specific method; it is the process of restoring received symbols or codes to information. Encoding specifically refers to turning characters, numbers, or other objects into a digital code by a preset method, or converting information and data into a preset electric pulse signal.
For example, when the initial video stream is encoded and decoded, the initial video stream may be first decoded into a file in a first format, and then the file in the first format may be encoded into a file in a second format, where the second format may be different from the file format corresponding to the initial video stream.
According to some embodiments, a video clip set refers to a collection of one or more video clips. The at least one video clip is obtained by decoding the initial video stream and then encoding it according to a duration threshold: for example, every 5 seconds of the decoded initial video stream may be encoded into one video clip, or every 10 seconds may be encoded into one video clip.
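As an illustration of this splitting step, the following is a minimal sketch using the ffmpeg command-line tool, which the embodiments below also invoke for codec work; the file names and the 5-second threshold are illustrative assumptions, not prescribed by the disclosure.

```python
import subprocess

def split_into_clips(input_url: str, seconds: int = 5) -> None:
    """Split a video stream into fixed-duration TS clips via ffmpeg's segment muxer.

    A sketch only: with "-c copy" ffmpeg cuts at keyframes, so clip durations
    are approximate; a full pipeline might re-encode for exact boundaries.
    """
    subprocess.run(
        [
            "ffmpeg", "-i", input_url,
            "-c", "copy",                 # no re-encode in this sketch
            "-f", "segment",              # ffmpeg segment muxer
            "-segment_time", str(seconds),
            "clip_%05d.ts",               # illustrative output naming
        ],
        check=True,
    )
```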
It is easy to understand that when the electronic device performs video composition, the electronic device may acquire an initial video stream, and perform encoding and decoding on the initial video stream to obtain a video clip set.
S102, performing face clustering processing on the video fragment set to obtain a face clustering library corresponding to the video fragment set;
according to some embodiments, face clustering here refers to deep-learning face clustering: features extracted by a convolutional neural network (CNN) map each face picture to a high-dimensional vector. The mapped faces are distributed in different cones of the feature space, so cosine similarity can be used to measure similarity; alternatively, if the face features are L2-normalized, they are distributed on a sphere and the L2 distance can be used. Features corresponding to the same person are merged by a clustering algorithm, yielding the face cluster library corresponding to the video clip set. In the face cluster library, the face features corresponding to the same person are grouped together as a cluster.
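The disclosure does not name a specific clustering algorithm; as a hedged illustration, the sketch below clusters L2-normalized embeddings with DBSCAN under the cosine metric, where the eps value is an assumed tuning parameter.

```python
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_faces(embeddings: np.ndarray, eps: float = 0.35) -> np.ndarray:
    """Group face embeddings by identity (illustrative sketch).

    After L2 normalization the embeddings lie on the unit sphere, so the
    cosine and L2 measurements described above agree on which faces are close.
    """
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    # Faces sharing a label are treated as the same person; -1 marks noise.
    return DBSCAN(eps=eps, min_samples=2, metric="cosine").fit_predict(normed)
```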
In some embodiments, any face feature of the same person may be identified in multiple video clips, and thus any face feature in the face cluster library corresponds to a subset of video clips. Conversely, a video clip may also correspond to a plurality of facial features of a plurality of people.
It is easy to understand that when the electronic device obtains the video segment set, the electronic device can perform face clustering processing on the video segment set to obtain a face cluster library corresponding to the video segment set.
S103, obtaining at least one target face feature corresponding to the target face image, and determining a target video segment set corresponding to the at least one target face feature from a face cluster library;
according to some embodiments, the target face image refers to the face image of any target person who wants to extract their own video from the initial video stream. The process of obtaining at least one target face feature corresponding to the target face image may be the same as the recognition process used during face clustering.
In some embodiments, when determining the set of target video segments corresponding to at least one target face feature from the face cluster library, the location of the at least one target face feature in the face cluster library may be determined first; for example, when the cluster corresponding to one person is a sphere, this means determining which sphere the at least one target face feature falls into. All face features in that sphere can then be obtained, and the video clips corresponding to those face features determined, yielding the target video segment set corresponding to the target face features. A minimal lookup sketch follows.
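This sketch assumes each cluster is summarized by an L2-normalized centroid; the data layout is an illustrative assumption, not the patent's prescribed structure.

```python
import numpy as np

def find_cluster(target_emb: np.ndarray, centroids: dict[int, np.ndarray]) -> int:
    """Return the id of the cluster ("sphere") closest to a target face feature.

    centroids maps cluster id -> L2-normalized mean embedding; the cluster
    with the highest cosine similarity is taken as the target person's.
    """
    t = target_emb / np.linalg.norm(target_emb)
    return max(centroids, key=lambda cid: float(t @ centroids[cid]))
```

The video clips recorded under the returned cluster then form the target video segment set.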
It is easy to understand that when the electronic device obtains the face clustering library, the electronic device may obtain at least one target face feature corresponding to the target face image, and determine a target video clip set corresponding to the at least one target face feature from the face clustering library.
S104, synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set.
It is easy to understand that when the electronic device obtains the target video segment set corresponding to the at least one target face feature, the electronic device may synthesize a video corresponding to the target face image according to at least one target video segment in the target video segment set.
In summary, in the method provided by the embodiment of the present disclosure, an initial video stream is obtained and encoded and decoded to obtain a video clip set; face clustering is performed on the video clip set to obtain a face cluster library corresponding to the video clip set; at least one target face feature corresponding to a target face image is acquired, and a target video segment set corresponding to the at least one target face feature is determined from the face cluster library; and the video corresponding to the target face image is synthesized from at least one target video segment in the target video segment set. By face-clustering the acquired video segment set and searching the resulting face cluster library for target video segments according to the target face image, no face feature library needs to be built in advance, and the time and cost of video synthesis can be reduced.
Referring to fig. 2, fig. 2 is a flow chart of a video compositing method according to a second embodiment of the disclosure. In particular, the method comprises the steps of,
s201, obtaining an initial video stream, and decoding the initial video stream to obtain a decoded initial video stream;
according to some embodiments, the initial video stream may be, for example, a multi-channel video stream, that is, the initial video stream may include at least one sub-video stream. At this time, the electronic device may decode the at least one sub-video stream through at least one ffmpeg process in the streaming media access subsystem, where one sub-video stream corresponds to one ffmpeg process.
It is easy to understand that when the initial video stream is obtained, at least one ffmpeg process may be used to decode the at least one sub-video stream, respectively, to obtain at least one decoded sub-video stream.
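A hedged sketch of the one-process-per-stream mapping follows, reusing the segment command shown earlier; the stream URLs and per-stream output directories are assumptions for illustration.

```python
import subprocess

def launch_decoders(stream_urls: list[str]) -> list[subprocess.Popen]:
    """Start one ffmpeg process per sub-video stream, each writing TS clips."""
    procs = []
    for i, url in enumerate(stream_urls):
        procs.append(subprocess.Popen([
            "ffmpeg", "-i", url,
            "-c", "copy", "-f", "segment", "-segment_time", "5",
            f"cam{i}/clip_%05d.ts",       # per-stream output folder (assumed)
        ]))
    return procs  # caller waits on or monitors these processes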
S202, performing compression processing on the decoded initial video stream to obtain a processed initial video stream;
according to some embodiments, the compression process refers to post-processing the decoded video. For example, when the decoded initial video stream is compressed, a watermark and a logo icon can be burned into the image track corresponding to the decoded initial video stream, yielding the processed initial video stream.
In some embodiments, an image track refers to an editable sequence window that holds the added video material; the video placed in it can be edited and effects can be added.
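As a sketch of the burn-in step, ffmpeg's overlay filter can composite a logo onto the image track; the corner position and file names are illustrative assumptions (a text watermark could be added analogously with the drawtext filter).

```python
import subprocess

def burn_in_logo(src: str, logo: str, dst: str) -> None:
    """Burn a logo icon into the video's image track (illustrative sketch)."""
    subprocess.run(
        [
            "ffmpeg", "-i", src, "-i", logo,
            "-filter_complex", "overlay=W-w-10:10",  # pin logo to the top-right corner
            "-c:a", "copy",                          # audio passes through untouched
            dst,
        ],
        check=True,
    )
```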
S203, coding and splitting the processed initial video stream into at least two video clips to obtain a video clip set;
according to some embodiments, when the processed initial video stream is encoded and split into at least two video segments, the specific process is as described above, and the processed initial video stream may be encoded according to a duration threshold. For example, a video stream every 5 seconds in the processed initial video stream may be encoded as one video clip, and a video stream every 10 seconds in the processed initial video stream may be encoded as one video clip.
In some embodiments, encoding is performed in a uniform format, which reduces coding inconsistencies when synthesizing the video. For example, the clips can be uniformly encoded as Transport Stream (TS) format video clips. TS is a container format whose full name is MPEG2-TS; MPEG2-TS is a standard data container format for transmitting and storing audio, video, and program and system information protocol data. TS-format video clips are convenient to synthesize: during synthesis, the corresponding TS-format clips can simply be copied together in time order.
In some embodiments, fig. 3 is a flow diagram of acquiring video clips in a TS format according to an embodiment of the present disclosure. As shown in fig. 3, the three sub-video streams are encoded and decoded by three ffmpeg processes in the streaming media access subsystem, so as to obtain video segments in a TS format corresponding to each sub-video stream, and all the video segments are converged into one set, so as to obtain a video segment set.
It is easy to understand that because post-processing is performed during the encoding and decoding of the initial video stream, multiple TS-format video clips can be combined directly and quickly later on; only the time to decapsulate the synthesized video is consumed, and no second encode/decode pass is needed at synthesis time. This reduces an operation of encoding complexity O(n×m) to O(1), lowering the time and cost consumed by video synthesis.
In some embodiments, decapsulation refers to disassembling protocol packets, processing the information in the packet header, and taking out the upper-layer information data in the payload. During decapsulation, the binary data only needs to be read once; for example, the TS headers of a TS-format video clip are read, and the synthesized TS-format video can be unpacked into an mp4-format file by adding an mp4 header.
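The following sketch shows why synthesis stays cheap: TS clips can be byte-concatenated in time order, and the single decapsulation step is a container rewrite with no re-encoding; the file handling details are illustrative.

```python
import subprocess
from pathlib import Path

def join_ts_clips(clips: list[Path], out_mp4: Path) -> None:
    """Concatenate time-ordered TS clips, then remux the result to MP4."""
    joined = out_mp4.with_suffix(".ts")
    with joined.open("wb") as dst:
        for clip in clips:            # clips must already be in time order
            dst.write(clip.read_bytes())
    # "-c copy" rewrites only the container (the decapsulation step above),
    # avoiding any second encode/decode pass.
    subprocess.run(["ffmpeg", "-i", str(joined), "-c", "copy", str(out_mp4)],
                   check=True)
```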
S204, monitoring at least one sub-video stream through a file monitoring mechanism, and putting the video segments with the number threshold into a video segment set to be put in storage under the condition that the number of the video segments generated by any one of the at least one sub-video stream reaches the number threshold;
according to some embodiments, encoding the processed initial video stream by the duration threshold is a dynamic process that proceeds sequentially from the start time to the end time of the processed initial video stream. Therefore, to reduce the time needed to build the face cluster library, the video segments can be processed in batches: whenever the number of video segments generated by any one of the at least one sub-video stream reaches the number threshold, that batch of video segments is placed into the video segment set to be put in storage, and the video segments in that set undergo the warehousing operation. Performing face feature recognition on the video clips in the video segment set to be put in storage is the warehousing process.
In some embodiments, the file listening (inotify) mechanism can specifically listen for the creation, modification, movement, and deletion of files and folders. Therefore, using an inotify mechanism to monitor the video clips produced while encoding and decoding the at least one sub-video stream improves the accuracy and efficiency of monitoring. The inotify mechanism may specifically be the inotify mechanism of the Linux system.
In some embodiments, the number threshold is not limited to a fixed value; it may be adjusted based on the duration of each video clip. For example, when each video clip is 5 seconds long, the number threshold may be 60; when each video clip is 10 seconds long, the number threshold may be 30.
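A minimal sketch of the monitoring-and-batching step follows, using the Python watchdog library as a portable stand-in for raw Linux inotify; the batch size of 60 follows the 5-second-clip example above, and enqueue_for_ingest is a hypothetical downstream hook.

```python
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

BATCH_SIZE = 60  # number threshold for 5-second clips, per the example above

def enqueue_for_ingest(batch: list[str]) -> None:
    """Placeholder for the warehousing (face recognition) step, S205."""
    print(f"warehousing {len(batch)} clips")

class ClipBatcher(FileSystemEventHandler):
    """Collects newly created TS clips and flushes them in fixed-size batches."""

    def __init__(self) -> None:
        self.pending: list[str] = []

    def on_created(self, event) -> None:
        if event.src_path.endswith(".ts"):
            self.pending.append(event.src_path)
            if len(self.pending) >= BATCH_SIZE:
                enqueue_for_ingest(self.pending[:BATCH_SIZE])
                del self.pending[:BATCH_SIZE]

def watch(directory: str) -> Observer:
    observer = Observer()
    observer.schedule(ClipBatcher(), directory)  # one watch per stream's folder
    observer.start()
    return observer
```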
S205, carrying out face feature recognition on an image frame corresponding to any video clip in the video clip set to be put into storage to obtain at least one face feature;
according to some embodiments, when face feature recognition is performed on the image frames of any video clip in the video segment set to be put in storage, all video clips in that set are first combined into a to-be-warehoused video. Frames are then cut from this video, and face feature recognition is performed on the cut image frames to obtain at least one face feature.
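As an illustration of the frame-cutting and feature-extraction step, the sketch below samples frames with OpenCV and extracts embeddings and boxes with the off-the-shelf face_recognition library; the patent does not prescribe these tools, and the sampling step is an assumption.

```python
import cv2
import face_recognition  # one off-the-shelf option; the disclosure names no model

def faces_in_clip(path: str, frame_step: int = 25):
    """Yield (frame_index, embedding, box) records from a clip (sketch).

    These records correspond to the (frame + loc) entries later stored
    in the face cluster library.
    """
    cap = cv2.VideoCapture(path)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % frame_step == 0:  # cut roughly one frame per second at 25 fps
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            boxes = face_recognition.face_locations(rgb)
            for emb, box in zip(face_recognition.face_encodings(rgb, boxes), boxes):
                yield idx, emb, box  # box is (top, right, bottom, left)
        idx += 1
    cap.release()
```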
S206, when face feature recognition has been completed for all video clips in the video segment set to be put in storage, emptying that set; and when the number of video clips generated by any one of the at least one sub-video stream again reaches the number threshold, placing that batch of video clips into the video segment set to be put in storage;
According to some embodiments, if the number of the final video segments generated by any one of the at least one sub-video stream does not reach the number threshold, the remaining video segments are placed directly into the video segment set to be put in storage for subsequent processing.
S207, clustering at least one face feature to obtain a face cluster library under the condition that face feature recognition is completed on all video clips in the video clip set;
in some embodiments, when the at least one face feature is clustered, the multiple on-camera appearances of the same target person may be clustered together. Specifically, the resulting face features may take the form of several groups, for example: (frame1+loc3, frame2+loc4, frame3+loc5), (frame1+loc1, frame2+loc2). Here, frame1+loc3 may be, for example, face feature 1 with its coordinates in the image frame (left coordinate 1, top coordinate 1, right coordinate 1, bottom coordinate 1); frame2+loc4 may be face feature 2 with coordinates (left coordinate 2, top coordinate 2, right coordinate 2, bottom coordinate 2); and frame3+loc5 may be face feature 3 with coordinates (left coordinate 3, top coordinate 3, right coordinate 3, bottom coordinate 3). The face cluster library therefore records which video clips each face feature appears in, the points in time at which it appears, and the positions where it appears.
S208, obtaining at least one target face feature corresponding to the target face image, and determining a target video segment set corresponding to the at least one target face feature from a face cluster library;
according to some embodiments, when determining a target video segment set corresponding to at least one target face feature, human body detection and optical character recognition can be performed on the video segment set to obtain a human body recognition information set; overlapping and fusing the human body identification information set and the human face clustering library to obtain a human body information clustering library corresponding to the video fragment set; the human body information clustering library can be used for determining a target video segment subset corresponding to the target face image.
In some embodiments, human body detection refers to human body region tracking: by tracking the body region of any person in the video segment set, that person's body extent is identified, and the optical-character-recognized information corresponding to that person can be determined. The human body identification information set and the face cluster library can thus be superimposed and fused into a human body information cluster library corresponding to the video segment set. Determining the target video segment set corresponding to at least one target face feature from the human body information cluster library improves the accuracy of acquiring the target video segment set and can satisfy different users' personalized requirements for video synthesis.
In some embodiments, optical character recognition (OCR) refers to the process in which an electronic device examines characters in an image, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method. In a marathon race, for example, the bib number carried by each competitor may be recognized.
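A hedged sketch of the bib-number reading: crop the tracked body region and run Tesseract OCR restricted to digits. pytesseract is an illustrative tool choice, not named in the disclosure.

```python
import cv2
import pytesseract  # illustrative OCR choice; the disclosure names no tool

def read_bib_number(frame, body_box) -> str:
    """OCR the race bib inside a tracked human body region (sketch)."""
    x, y, w, h = body_box                      # box from the human detector
    crop = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2GRAY)
    config = "--psm 7 -c tessedit_char_whitelist=0123456789"  # one line, digits only
    return pytesseract.image_to_string(crop, config=config).strip()
```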
In some embodiments, when the target face image is acquired, the target person may upload their face image through a mobile phone application, a computer browser, or the like. When the electronic equipment acquires the target face image, it can generate the video corresponding to the target face image and deliver the generated video to the target person through a mobile phone application, a computer browser, or the like.
S209, synthesizing a video corresponding to the target face image according to at least one target video segment in the target video segment set.
According to some embodiments, according to a timestamp corresponding to any one of the at least one target video segment, any one target video segment is inserted into a corresponding position of a preset template video, so as to obtain a video corresponding to a target face image.
In some embodiments, fig. 4 is a flow diagram of video composition according to an embodiment of the present disclosure. As shown in fig. 4, first, at least one video clip may be selected from the target video clip set as required; in a marathon race, for example, the TS-format target video clips corresponding to the opening, on-camera appearance 1, a distant shot, on-camera appearance 2, and the ending may be selected. A first TS-format video corresponding to the target face image is thereby obtained. Background music can then be merged into the first TS-format video to obtain a second TS-format video. Finally, the second TS-format video can be decapsulated to obtain the MP4-format video corresponding to the target face image.
For example, in a marathon race, a plurality of live camera positions may be arranged uniformly along the race route, each live camera position acquiring one sub-video stream. For example, when the marathon course is 10 kilometers long, a live camera may be set every kilometer. The target video segments corresponding to each live camera position are then looked up in the target video segment set, but not every position will yield a target video segment; for example, only the target video segments corresponding to the first, third, and seventh live camera positions may be found in the target video segment set. The preset template video contains the template video segment corresponding to each live camera position and its corresponding timestamp, and for any live camera position where no target video segment was acquired, the corresponding template segment is used directly, as shown in the sketch below.
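The fill-in logic reduces to choosing, per camera position, the runner's own clip when one exists and the template clip otherwise; a minimal sketch with assumed data shapes:

```python
def assemble_timeline(template: dict[int, str], target_clips: dict[int, str]) -> list[str]:
    """Fill a template timeline with a runner's own clips where available.

    template maps camera position -> default TS clip path; target_clips
    maps positions where the runner appeared -> their clip paths.
    """
    return [target_clips.get(pos, default_clip)
            for pos, default_clip in sorted(template.items())]
```

The resulting list can then be concatenated as in the join_ts_clips sketch earlier.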
In addition, because the timestamp corresponding to each target video segment is available, when generating the video corresponding to a target person in a marathon race, the target person's running speed can be determined from the timestamps of two adjacent target video segments and the distance between the live camera positions corresponding to them. The running information can then be superimposed onto the synthesized video to improve the quality of video generation.
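The speed computation itself is simple arithmetic over the clip timestamps and the known inter-camera distance; a one-function sketch:

```python
def running_speed(ts_a: float, ts_b: float, distance_m: float) -> float:
    """Average speed in m/s between two adjacent camera positions (sketch).

    ts_a and ts_b are the timestamps (seconds) of the runner's clips at
    consecutive camera positions; distance_m is the gap between them,
    e.g. 1000.0 for cameras placed every kilometer.
    """
    return distance_m / (ts_b - ts_a)
```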
In some embodiments, a human review policy may also be added when generating the video corresponding to the target face image. Specifically, an initial video corresponding to the target face image is synthesized from the at least one target video segment; if the initial video does not meet the video composition requirement, the initial video is modified according to modification information input for it, yielding the video corresponding to the target face image. Thus, when the initial video does not meet the requirement, a reviewer can select higher-quality target video segments for synthesis based on the target person's on-camera information in the target video segment set, correcting recognition errors of the electronic equipment and serving as a fallback strategy, which improves the quality of video generation.
In the embodiment of the disclosure, first, an initial video stream is obtained and decoded to obtain a decoded initial video stream; compression processing is performed on the decoded initial video stream to obtain a processed initial video stream; and the processed initial video stream is encoded and split into at least two video clips to obtain a video clip set. Therefore, the video does not need a second encode/decode pass during final synthesis, and the time and cost of generating the video can be reduced. Then, the at least one sub-video stream is monitored through a file monitoring mechanism; when the number of video clips generated by any one sub-video stream reaches the number threshold, that batch of video clips is placed into the video segment set to be put in storage; face feature recognition is performed on the image frames corresponding to any video clip in that set to obtain at least one face feature; when face feature recognition has been completed for all video clips in the set, the set is emptied, and the next batch is placed into it when the number threshold is reached again; and when face feature recognition has been completed for all video clips in the video clip set, the at least one face feature is clustered to obtain a face cluster library. Performing face clustering while the video clips are being generated improves the efficiency of building the face cluster library and reduces video generation time. Finally, at least one target face feature corresponding to the target face image is obtained, a target video segment set corresponding to the at least one target face feature is determined from the face cluster library, and the video corresponding to the target face image is synthesized from at least one target video segment in the target video segment set. By face-clustering the acquired video segment set and searching the resulting face cluster library for target video segments according to the target face image, no face feature library needs to be built in advance, and the time and cost of video synthesis can be reduced.
In the technical scheme of the disclosure, the collection, storage, use, processing, transmission, provision, disclosure, and other handling of the user's personal information comply with the relevant laws and regulations and do not violate public order and good morals.
The following are device embodiments of the present disclosure that may be used to perform method embodiments of the present disclosure. For details not disclosed in the embodiments of the apparatus of the present disclosure, please refer to the embodiments of the method of the present disclosure.
Referring to fig. 5 (a), a schematic structural diagram of a first video compositing apparatus for implementing the video compositing method of an embodiment of the disclosure is shown. The video compositing device may be implemented as all or part of the device by software, hardware, or a combination of both. The video composition apparatus 500 includes a video stream acquisition unit 501, a set clustering unit 502, a set acquisition unit 503, and a video composition unit 504, wherein:
a video stream obtaining unit 501, configured to obtain an initial video stream, and encode and decode the initial video stream to obtain a video clip set;
the set clustering unit 502 is configured to perform face clustering on the video segment set to obtain a face clustering library corresponding to the video segment set, where any face feature in the face clustering library corresponds to a subset of video segments;
A set obtaining unit 503, configured to obtain at least one target face feature corresponding to the target face image, and determine a target video segment set corresponding to the at least one target face feature from a face cluster library;
the video synthesis unit 504 is configured to synthesize a video corresponding to the target face image according to at least one target video clip in the target video clip set.
Optionally, the video stream obtaining unit 501 is configured to perform encoding and decoding on an initial video stream, and is specifically configured to:
decoding the initial video stream to obtain a decoded initial video stream;
performing compression processing on the decoded initial video stream to obtain a processed initial video stream;
and encoding and splitting the processed initial video stream into at least two video clips to obtain a video clip set, wherein the file format of the video clips is a transport stream format.
Optionally, the video stream obtaining unit 501 is configured to perform compression processing on the decoded initial video stream, and when obtaining the processed initial video stream, the video stream obtaining unit is specifically configured to:
and burning the watermark and the icon in the image track corresponding to the decoded initial video stream to obtain the processed initial video stream.
Alternatively, fig. 5 (b) is a schematic structural diagram of a second video compositing apparatus for implementing the video compositing method of an embodiment of the disclosure. As shown in fig. 5 (b), the video compositing apparatus 500 further includes:
an information obtaining unit 505, configured to perform human body detection and optical character recognition on the video clip set to obtain a human body identification information set;
the set stacking unit 506 is configured to stack and fuse the set of human body identification information and the face cluster library to obtain a human body information cluster library corresponding to the set of video segments, where the human body information cluster library is used to determine a target video segment subset corresponding to the target face image.
Optionally, the initial video stream includes at least one sub-video stream, and the set clustering unit 502 is configured to perform face clustering on the video segment set to obtain a face cluster library corresponding to at least two video segments, which is specifically configured to:
monitoring at least one sub-video stream through a file monitoring mechanism, and putting the video segments with the number threshold into a video segment set to be put in storage under the condition that the number of the video segments generated by any one of the at least one sub-video stream reaches the number threshold;
carrying out face feature recognition on the image frames corresponding to any video clip in the video clip set to be put into storage to obtain at least one face feature;
when face feature recognition has been completed for all video clips in the video segment set to be put in storage, emptying that set; and when the number of video clips generated by any one of the at least one sub-video stream again reaches the number threshold, placing that batch of video clips into the video segment set to be put in storage;
and under the condition that the face feature recognition is completed on all the video clips in the video clip set, clustering at least one face feature to obtain a face cluster library.
Optionally, the video synthesis unit 504 is configured to synthesize, according to at least one target video clip in the target video clip set, a video corresponding to the target face image, specifically configured to:
and inserting any target video segment into the corresponding position of the preset template video according to the timestamp corresponding to any target video segment in at least one target video segment, so as to obtain the video corresponding to the target face image.
Optionally, the video synthesis unit 504 is configured to synthesize, according to at least one target video clip in the target video clip set, a video corresponding to the target face image, specifically configured to:
synthesizing an initial video corresponding to the target face image according to at least one target video segment;
If the initial video does not meet the video composition requirement, modifying the initial video according to modification information input for the initial video to obtain a video corresponding to the target face image.
It should be noted that the video compositing apparatus provided in the above embodiment uses the division of the above functional modules only as an example when executing the video compositing method; in practical applications, the above functions may be allocated to different functional modules as needed, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the video synthesizing device and the video synthesizing method provided in the above embodiments belong to the same concept; their detailed implementation is described in the method embodiments and is not repeated here.
The foregoing embodiment numbers of the present disclosure are merely for description and do not represent advantages or disadvantages of the embodiments.
In summary, in the device provided by the embodiments of the present disclosure, the video stream acquisition unit obtains an initial video stream and encodes and decodes it to obtain a video clip set; the set clustering unit performs face clustering on the video clip set to obtain a face cluster library corresponding to the video clip set, where any face feature in the face cluster library corresponds to a subset of video clips; the set acquisition unit acquires at least one target face feature corresponding to a target face image and determines a target video clip set corresponding to the at least one target face feature from the face cluster library; and the video synthesis unit synthesizes the video corresponding to the target face image from at least one target video clip in the target video clip set. By face-clustering the acquired video clip set and searching the resulting face cluster library for target video clips according to the target face image, no face feature library needs to be built in advance, and the time and cost of video synthesis can be reduced.
In the technical scheme of the disclosure, the acquisition, storage, and application of the user's personal information involved all comply with the relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 6 illustrates a schematic block diagram of an example electronic device 600 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 6, the apparatus 600 includes a computing unit 601 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 602 or a computer program loaded from a storage unit 608 into a Random Access Memory (RAM) 603. In the RAM 603, various programs and data required for the operation of the device 600 may also be stored. The computing unit 601, ROM 602, and RAM 603 are connected to each other by a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
Various components in the device 600 are connected to the I/O interface 605, including: an input unit 606 such as a keyboard, mouse, etc.; an output unit 607 such as various types of displays, speakers, and the like; a storage unit 608, such as a magnetic disk, optical disk, or the like; and a communication unit 609 such as a network card, modem, wireless communication transceiver, etc. The communication unit 609 allows the device 600 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunication networks.
The computing unit 601 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 601 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 601 performs the respective methods and processes described above, such as a video composition method. For example, in some embodiments, the video compositing method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 608. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 600 via the ROM 602 and/or the communication unit 609. When the computer program is loaded into RAM 603 and executed by the computing unit 601, one or more steps of the video compositing method described above may be performed. Alternatively, in other embodiments, the computing unit 601 may be configured to perform the video compositing method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs, which may be executed and/or interpreted on a programmable system including at least one programmable processor; the programmable processor may be a special-purpose or general-purpose programmable processor and may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include a client and a server. The client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship with each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of high management difficulty and weak service scalability in traditional physical hosts and Virtual Private Server (VPS) services. The server may also be a server of a distributed system or a server combined with a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (17)

1. A video synthesis method, comprising:
acquiring an initial video stream, and encoding and decoding the initial video stream to obtain a video segment set;
performing face clustering processing on the video segment set to obtain a face clustering library corresponding to the video segment set, wherein any face feature in the face clustering library corresponds to one video segment subset;
acquiring at least one target face feature corresponding to a target face image, and determining a target video segment set corresponding to the at least one target face feature from the face clustering library; and
synthesizing a video corresponding to the target face image according to at least one target video segment in the target video segment set.
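By way of illustration only, the following Python sketch shows the matching core of this flow: given keyframes already extracted from each video segment, it selects the segments whose frames contain the target face. The face_recognition package stands in for the unspecified feature extractor, and keyframes_by_segment is an assumed precomputed mapping; neither is part of the disclosure.

# Sketch only: face_recognition is a stand-in feature extractor, and
# keyframes_by_segment (segment id -> list of keyframe image paths) is an
# assumed input, not a structure defined by the disclosure.
import face_recognition
import numpy as np

def find_target_segments(keyframes_by_segment, target_image_path, tol=0.6):
    target = face_recognition.load_image_file(target_image_path)
    target_encs = face_recognition.face_encodings(target)
    matched = []
    for seg_id, frame_paths in keyframes_by_segment.items():
        for frame_path in frame_paths:
            frame = face_recognition.load_image_file(frame_path)
            encs = face_recognition.face_encodings(frame)
            if any(np.any(face_recognition.face_distance(target_encs, e) <= tol)
                   for e in encs):
                matched.append(seg_id)  # one hit selects the whole segment
                break
    return matched

The matched segments would then feed the synthesis step of the final claim element, e.g., concatenated in timestamp order.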
2. The method of claim 1, wherein encoding and decoding the initial video stream to obtain the video segment set comprises:
decoding the initial video stream to obtain a decoded initial video stream;
performing compression processing on the decoded initial video stream to obtain a processed initial video stream; and
encoding and splitting the processed initial video stream into at least two video segments to obtain the video segment set, wherein the file format of each video segment is a transport stream format.
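As one possible concretization of this claim, the ffmpeg command-line tool can decode, re-encode (the compression step), and split a stream into transport-stream segments in a single pass; the sketch below assumes ffmpeg is installed, and the paths and ten-second duration are illustrative.

# Sketch: decode, re-encode, and split into MPEG-TS segments with ffmpeg.
import subprocess

def split_to_ts_segments(input_url, out_pattern="seg_%03d.ts", seconds=10):
    subprocess.run([
        "ffmpeg", "-i", input_url,         # decode the initial video stream
        "-c:v", "libx264", "-c:a", "aac",  # re-encode (the compression step)
        "-f", "segment",                   # split while encoding
        "-segment_time", str(seconds),     # target duration per segment
        "-segment_format", "mpegts",       # transport stream file format
        out_pattern,                       # e.g. seg_000.ts, seg_001.ts, ...
    ], check=True)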
3. The method of claim 2, wherein performing compression processing on the decoded initial video stream to obtain the processed initial video stream comprises:
burning a watermark and an icon into the image track corresponding to the decoded initial video stream to obtain the processed initial video stream.
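For the burn-in step, ffmpeg's overlay filter can composite an icon onto every frame of the image track; a text watermark could be added the same way with the drawtext filter. The file names below are illustrative.

# Sketch: burn an icon into the video (image) track; audio passes through.
import subprocess

def burn_icon(video_in, icon_png, video_out):
    subprocess.run([
        "ffmpeg", "-i", video_in, "-i", icon_png,
        "-filter_complex", "overlay=10:10",  # icon 10px from top-left corner
        "-c:a", "copy",                      # leave the audio track untouched
        video_out,
    ], check=True)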
4. The method of claim 1, further comprising:
performing human body detection and optical character recognition on the video segment set to obtain a human body recognition information set; and
superimposing and fusing the human body recognition information set with the face clustering library to obtain a human body information clustering library corresponding to the video segment set, wherein the human body information clustering library is used for determining the target video segment subset corresponding to the target face image.
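A minimal sketch of the superposition-fusion step, modeling both libraries as plain dictionaries joined on segment ids; these data shapes are assumptions of the sketch, not structures defined by the disclosure.

# Sketch: attach body-detection and OCR findings to each face cluster via
# the segments they share. All shapes here are illustrative assumptions.
def fuse_libraries(face_library, body_info):
    # face_library: face id -> list of segment ids containing that face
    # body_info:    segment id -> {"bodies": [...], "ocr_text": "..."}
    fused = {}
    for face_id, segment_ids in face_library.items():
        fused[face_id] = {
            "segments": segment_ids,
            "body_info": [body_info[s] for s in segment_ids if s in body_info],
        }
    return fused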
5. The method of claim 1, wherein the initial video stream comprises at least one sub-video stream, and performing face clustering processing on the video segment set to obtain the face clustering library comprises:
monitoring the at least one sub-video stream through a file monitoring mechanism, and, when the number of video segments generated by any one of the at least one sub-video stream reaches a number threshold, placing that number of video segments into a to-be-stored video segment set;
performing face feature recognition on the image frames corresponding to each video segment in the to-be-stored video segment set to obtain at least one face feature;
when face feature recognition has been completed for all video segments in the to-be-stored video segment set, emptying the to-be-stored video segment set, and, when the number of video segments generated by any one of the at least one sub-video stream again reaches the number threshold, placing that number of video segments into the to-be-stored video segment set; and
clustering the at least one face feature when face feature recognition has been completed for all video segments in the video segment set, so as to obtain the face clustering library.
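One way such a file monitoring mechanism could be realized is with the third-party watchdog package, batching newly written segments until the number threshold is reached and then handing the batch to face feature recognition; the batch size, watched directory, and ingest hook below are illustrative assumptions.

# Sketch: watch a directory for new .ts segments and ingest them in batches.
from watchdog.events import FileSystemEventHandler
from watchdog.observers import Observer

BATCH = 8  # illustrative "number threshold"

class SegmentBatcher(FileSystemEventHandler):
    def __init__(self, ingest):
        self.pending = []
        self.ingest = ingest  # callback that runs face feature recognition

    def on_created(self, event):
        if event.src_path.endswith(".ts"):
            self.pending.append(event.src_path)
            if len(self.pending) >= BATCH:
                self.ingest(self.pending[:BATCH])  # recognize, then empty
                del self.pending[:BATCH]           # the to-be-stored set

observer = Observer()
observer.schedule(SegmentBatcher(print), "/tmp/segments", recursive=False)
observer.start()
observer.join()  # blocks; a real service would run this in the background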
6. The method of claim 1, wherein synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set comprises:
inserting each target video segment of the at least one target video segment into a corresponding position of a preset template video according to the timestamp corresponding to that target video segment, to obtain the video corresponding to the target face image.
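A sketch of the timestamp-driven insertion, modeling the preset template as an ordered list of (start_second, clip_path) entries; that representation is an assumption of the sketch, since the claim leaves the template structure abstract.

# Sketch: merge target segments into a template timeline by timestamp.
def insert_by_timestamp(template, target_segments):
    # template:        list of (start_second, clip_path) pairs, assumed
    # target_segments: list of (timestamp_second, clip_path) pairs
    timeline = list(template) + list(target_segments)
    timeline.sort(key=lambda item: item[0])  # position follows the timestamp
    return [path for _, path in timeline]    # ordered clip list to render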
7. The method of claim 1, wherein synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set comprises:
synthesizing an initial video corresponding to the target face image according to the at least one target video segment; and
if the initial video does not meet the video synthesis requirement, modifying the initial video according to modification information input for the initial video, to obtain the video corresponding to the target face image.
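The revise-until-accepted loop of this claim can be sketched as below, where synthesize, review, and apply_edits are hypothetical callables standing in for the composition backend and the user-facing review step.

# Sketch of the review loop; all three callables are placeholders.
def compose_with_review(target_segments, synthesize, review, apply_edits):
    video = synthesize(target_segments)   # initial composition
    while True:
        feedback = review(video)          # None signals acceptance
        if feedback is None:
            return video
        video = apply_edits(video, feedback)  # modify per the input info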
8. A video synthesis apparatus, comprising:
a video stream acquisition unit configured to acquire an initial video stream, and to encode and decode the initial video stream to obtain a video segment set;
a set clustering unit configured to perform face clustering processing on the video segment set to obtain a face clustering library corresponding to the video segment set, wherein any face feature in the face clustering library corresponds to one video segment subset;
a set acquisition unit configured to acquire at least one target face feature corresponding to a target face image, and to determine a target video segment set corresponding to the at least one target face feature from the face clustering library; and
a video synthesis unit configured to synthesize a video corresponding to the target face image according to at least one target video segment in the target video segment set.
9. The apparatus of claim 8, wherein the video stream acquisition unit, when encoding and decoding the initial video stream to obtain the video segment set, is specifically configured to:
decode the initial video stream to obtain a decoded initial video stream;
perform compression processing on the decoded initial video stream to obtain a processed initial video stream; and
encode and split the processed initial video stream into at least two video segments to obtain the video segment set, wherein the file format of each video segment is a transport stream format.
10. The apparatus of claim 9, wherein the video stream acquisition unit, when performing compression processing on the decoded initial video stream to obtain the processed initial video stream, is specifically configured to:
burn a watermark and an icon into the image track corresponding to the decoded initial video stream to obtain the processed initial video stream.
11. The apparatus of claim 8, further comprising:
an information acquisition unit configured to perform human body detection and optical character recognition on the video segment set to obtain a human body recognition information set; and
a set superposition unit configured to superimpose and fuse the human body recognition information set with the face clustering library to obtain a human body information clustering library corresponding to the video segment set, wherein the human body information clustering library is used for determining the target video segment subset corresponding to the target face image.
12. The apparatus of claim 8, wherein the initial video stream comprises at least one sub-video stream, and the set clustering unit, when performing face clustering processing on the video segment set to obtain the face clustering library, is specifically configured to:
monitor the at least one sub-video stream through a file monitoring mechanism, and, when the number of video segments generated by any one of the at least one sub-video stream reaches a number threshold, place that number of video segments into a to-be-stored video segment set;
perform face feature recognition on the image frames corresponding to each video segment in the to-be-stored video segment set to obtain at least one face feature;
when face feature recognition has been completed for all video segments in the to-be-stored video segment set, empty the to-be-stored video segment set, and, when the number of video segments generated by any one of the at least one sub-video stream again reaches the number threshold, place that number of video segments into the to-be-stored video segment set; and
cluster the at least one face feature when face feature recognition has been completed for all video segments in the video segment set, so as to obtain the face clustering library.
13. The apparatus of claim 8, wherein the video synthesis unit, when synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set, is specifically configured to:
insert each target video segment of the at least one target video segment into a corresponding position of a preset template video according to the timestamp corresponding to that target video segment, to obtain the video corresponding to the target face image.
14. The apparatus of claim 8, wherein the video synthesis unit, when synthesizing the video corresponding to the target face image according to at least one target video segment in the target video segment set, is specifically configured to:
synthesize an initial video corresponding to the target face image according to the at least one target video segment; and
if the initial video does not meet the video synthesis requirement, modify the initial video according to modification information input for the initial video, to obtain the video corresponding to the target face image.
15. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-7.
16. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-7.
17. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-7.
CN202211646714.0A 2022-12-19 2022-12-19 Video synthesis method and device, electronic equipment and storage medium Pending CN116055762A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211646714.0A CN116055762A (en) 2022-12-19 2022-12-19 Video synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116055762A (en) 2023-05-02

Family

ID=86112584

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211646714.0A Pending CN116055762A (en) 2022-12-19 2022-12-19 Video synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116055762A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117615084A (en) * 2024-01-22 2024-02-27 南京爱照飞打影像科技有限公司 Video synthesis method and computer readable storage medium
CN117615084B (en) * 2024-01-22 2024-03-29 南京爱照飞打影像科技有限公司 Video synthesis method and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN108833973B (en) Video feature extraction method and device and computer equipment
CN112616063B (en) Live broadcast interaction method, device, equipment and medium
US10356022B2 (en) Systems and methods for manipulating and/or concatenating videos
CN111988658B (en) Video generation method and device
CN104700836A (en) Voice recognition method and voice recognition system
CN202998337U (en) Video program identification system
JP7394809B2 (en) Methods, devices, electronic devices, media and computer programs for processing video
CN109274999A (en) A kind of video playing control method, device, equipment and medium
CN105474212A (en) Method and apparatus for classifying data items based on sound tags
CN111882625B (en) Method, device, electronic equipment and storage medium for generating dynamic diagram
CN112771881A (en) Bullet screen processing method and device, electronic equipment and computer readable storage medium
CN116055762A (en) Video synthesis method and device, electronic equipment and storage medium
CN114880062B (en) Chat expression display method, device, electronic device and storage medium
CN104935496A (en) Instant messaging method, system, device and instant messaging terminal
WO2023029389A1 (en) Video fingerprint generation method and apparatus, electronic device, storage medium, computer program, and computer program product
CN106209575A (en) Method for sending information, acquisition methods, device and interface system
CN112843681B (en) Virtual scene control method and device, electronic equipment and storage medium
CN106937127B (en) Display method and system for intelligent search preparation
CN107733874A (en) Information processing method, device, computer equipment and storage medium
CN112823519B (en) Video decoding method, device, electronic equipment and computer readable storage medium
WO2023045430A1 (en) Two dimensional code-based data processing method, apparatus, and system
CN114245229B (en) Short video production method, device, equipment and storage medium
CN108206775B (en) Instant message pushing method, client and system
CN113593587B (en) Voice separation method and device, storage medium and electronic device
CN112738564B (en) Data processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination