CN114845136B - Video synthesis method, device, equipment and storage medium

Video synthesis method, device, equipment and storage medium

Info

Publication number
CN114845136B
Authority
CN
China
Prior art keywords
video stream
user
target
scene
video
Prior art date
Legal status
Active
Application number
CN202210740529.1A
Other languages
Chinese (zh)
Other versions
CN114845136A (en)
Inventor
谢炜航
Current Assignee
Beijing Xintang Sichuang Educational Technology Co Ltd
Original Assignee
Beijing Xintang Sichuang Educational Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Xintang Sichuang Educational Technology Co Ltd
Priority to CN202210740529.1A
Publication of CN114845136A
Application granted
Publication of CN114845136B
Priority to PCT/CN2023/097738 (WO2024001661A1)


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23 Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/234 Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs
    • H04N21/23424 Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20 Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/21 Server components or server architectures
    • H04N21/218 Source of audio or video content, e.g. local disk arrays
    • H04N21/21805 Source of audio or video content, e.g. local disk arrays, enabling multiple viewpoints, e.g. using a plurality of cameras

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Databases & Information Systems (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Television Signal Processing For Recording (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The disclosure relates to the technical field of computers, and discloses a video synthesis method, a video synthesis device, video synthesis equipment and a storage medium. The method is applied to a server and comprises the following steps: receiving a user video stream; the user video stream is a video stream obtained by shooting through a camera of a user terminal; recording a target virtual scene by using a target visual angle camera independent of a user visual angle camera to generate a scene video stream under a target visual angle; the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal; and fusing the user video stream and the scene video stream to generate a composite video stream. By the technical scheme, the requirements on the equipment performance and the network of the user terminal are reduced, and the video synthesis efficiency and the fluency of the synthesized video stream are improved because the scene video stream has no problems of slow uploading, frame loss and the like; and the content consistency of the composite video stream and the target virtual scene is improved.

Description

Video synthesis method, device, equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a video synthesis method, apparatus, device, and storage medium.
Background
With the development of internet technology, resource sharing platforms provide a plurality of video-related functions. For example, video fusion is performed on a real camera picture of a user and virtual scene content in a specific theme scene to generate a composite video for later consumption by the user.
The current video synthesis scheme mainly comprises a manual editing mode and a server-side automatic synthesis mode. The manual editing mode is to use video editing software to edit the real camera picture and the virtual scene content of the user. The automatic synthesis mode of the server is that the user terminal obtains the real camera picture and the virtual scene content of the user and sends the real camera picture and the virtual scene content to the server for automatic synthesis processing.
However, the manual editing mode is time-consuming and labor-consuming, and cannot meet the requirement of batch video synthesis processing; the server side automatic synthesis mode has high requirements on the network and the user terminal performance, and the phenomenon of unsmooth synthesized video picture is easily caused.
Disclosure of Invention
In order to solve the above technical problem, the present disclosure provides a video composition method, apparatus, device, and storage medium.
In a first aspect, the present disclosure provides a video synthesis method applied to a server, including:
receiving a user video stream; the user video stream is a video stream obtained by shooting through a camera of a user terminal;
recording a target virtual scene by using a target visual angle camera independent of a user visual angle camera to generate a scene video stream under a target visual angle; the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal;
and fusing the user video stream and the scene video stream to generate a composite video stream.
In a second aspect, the present disclosure provides a video compositing apparatus configured at a server, the apparatus including:
the user video stream receiving module is used for receiving a user video stream; the user video stream is a video stream obtained by shooting through a camera of a user terminal;
the scene video stream generation module is used for recording a target virtual scene by using a target visual angle camera independent of a user visual angle camera to generate a scene video stream under a target visual angle; the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal;
and the first composite video stream generation module is used for fusing the user video stream and the scene video stream to generate a composite video stream.
In a third aspect, the present disclosure provides an electronic device, including:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to perform a video compositing method as described in any embodiment of the present disclosure.
In a fourth aspect, the present disclosure provides a non-transitory computer-readable storage medium having stored thereon computer instructions for causing a computer to perform a video compositing method as described in any embodiment of the present disclosure.
One or more technical solutions provided in the embodiments of the present disclosure can receive a user video stream obtained by shooting through a camera of a user terminal, and record a target virtual scene corresponding to a theme virtual space displayed in the user terminal by using a target view camera independent of the user view camera, so as to generate a scene video stream at a target view; fusing the user video stream and the scene video stream to generate a composite video stream; on one hand, the method realizes the automatic generation of the synthesized video stream in the server, and avoids the problems of time and labor waste of artificially synthesized video; on the other hand, the scene video stream is recorded by the server, so that the problem of unsmooth composite video due to equipment performance, network and the like in the process of recording the scene video stream by the user terminal and uploading the scene video stream to the server is avoided, the requirements on the equipment performance and the network of the user terminal are reduced, the problems of slow uploading, frame loss and the like of the scene video stream are solved, and the video synthesis efficiency and the smoothness of the composite video stream are improved; in another aspect, the scene video stream is obtained by recording the target virtual scene, so that the content consistency of the composite video stream and the target virtual scene is improved.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and together with the description, serve to explain the principles of the disclosure.
In order to more clearly illustrate the technical solutions in the embodiments of the present disclosure or in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below; it is obvious that other drawings can be derived from these drawings by those skilled in the art without inventive effort.
Fig. 1 is a flowchart of a video synthesis method provided by an embodiment of the present disclosure;
FIG. 2 is a schematic diagram illustrating a display of a user video stream provided by an embodiment of the present disclosure;
fig. 3 is a schematic diagram illustrating a display of a composite video stream provided by an embodiment of the present disclosure;
fig. 4 is a flowchart of another video composition method provided by the embodiments of the present disclosure;
fig. 5 is a flowchart of another video composition method provided by the embodiments of the present disclosure;
fig. 6 is a schematic structural diagram of a video synthesizing apparatus provided in an embodiment of the present disclosure;
fig. 7 is a schematic structural diagram of an electronic device provided in an embodiment of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present disclosure are shown in the drawings, it is to be understood that the present disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and the embodiments of the disclosure are for illustration purposes only and are not intended to limit the scope of the disclosure.
It should be understood that the various steps recited in the method embodiments of the present disclosure may be performed in a different order, and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the present disclosure is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description. It should be noted that the terms "first", "second", and the like in the present disclosure are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in this disclosure are intended to be illustrative rather than limiting; those skilled in the art will understand them as "one or more" unless the context clearly indicates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present disclosure are for illustrative purposes only, and are not intended to limit the scope of the messages or information.
The video synthesis method provided by the embodiments of the present disclosure is mainly applicable to synthesizing a user video stream acquired by a camera of a user terminal with a scene video stream corresponding to a virtual scene. In some embodiments, the video synthesis method can be applied to fusing a real camera picture of a user with special-effect audio and video content to generate a synthesized special-effect video in a short-video theme scene. In other embodiments, the video synthesis method may be applied to seamlessly merging real camera pictures of users into virtual scenes of the corresponding themes under an educational theme, a game theme, or a live-streaming theme, and generating a synthetic video (e.g., a playback video containing user pictures) under the corresponding theme.
The video synthesis method provided by the embodiment of the present disclosure may be executed by a video synthesis apparatus, where the apparatus may be implemented in a software and/or hardware manner, and the apparatus may be integrated in an electronic device corresponding to a server, for example, a notebook computer, a desktop computer, a server, or a server cluster.
Fig. 1 is a flowchart of a video synthesis method provided in an embodiment of the present disclosure. Referring to fig. 1, the video synthesis method specifically includes:
and S110, receiving the video stream of the user.
The user video stream is a video stream obtained by shooting through a camera of the user terminal.
Specifically, according to the above description, the video composition in the embodiment of the present disclosure is to fuse a real picture acquired by a camera of a user terminal and a scene picture corresponding to a virtual scene. Therefore, the server receives the user video stream transmitted by the user terminal.
In some embodiments, S110 comprises: a user video stream is received from a user terminal over a real-time communication transport protocol.
Specifically, in the related art, the user video stream is transmitted from the user terminal to the server according to the Transmission Control Protocol (TCP). However, the data volume of the user video stream is relatively large, and TCP requires a three-way handshake, which easily causes transmission delay and frame loss. Therefore, in this embodiment, a Real-Time Communication (RTC) protocol is used to transmit the user video stream. The RTC transport carries redundant fields that can be used to accurately determine whether packet loss exists, and the UDP transmission on its link is unidirectional, requiring no three-way handshake; the transport therefore places low requirements on the network, so that transmission of the user video stream is highly resistant to weak networks, the network delay of user video stream transmission is reduced, and the problem of frame loss is avoided to a certain extent.
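By way of non-limiting illustration, a minimal server-side ingest sketch is given below. It assumes the Python aiortc library as the RTC implementation and a hypothetical signaling channel that delivers the SDP offer of the user terminal; neither choice is mandated by the embodiments.

```python
# Illustrative sketch: receive the user video stream over RTC at the server.
# Assumes the aiortc library; how offer_sdp reaches the server (signaling) is
# application-specific and outside this sketch.
import asyncio
from aiortc import RTCPeerConnection, RTCSessionDescription
from aiortc.contrib.media import MediaRecorder


async def receive_user_video_stream(offer_sdp: str) -> str:
    pc = RTCPeerConnection()
    recorder = MediaRecorder("user_video_stream.mp4")  # hypothetical output path

    @pc.on("track")
    def on_track(track):
        # The incoming video track carries the real camera picture of the user
        # terminal; UDP-based RTC avoids the TCP three-way handshake delay.
        if track.kind == "video":
            recorder.addTrack(track)
            asyncio.ensure_future(recorder.start())

    # Complete the offer/answer exchange started by the user terminal.
    await pc.setRemoteDescription(RTCSessionDescription(sdp=offer_sdp, type="offer"))
    answer = await pc.createAnswer()
    await pc.setLocalDescription(answer)
    return pc.localDescription.sdp  # the answer is returned to the terminal via signaling
```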
And S120, recording the target virtual scene by using a target visual angle camera independent of the user visual angle camera to generate a scene video stream under the target visual angle.
The user visual angle camera is a virtual camera in the rendering engine corresponding to a viewing visual angle when a user views the target virtual scene through the user terminal. The target viewing angle is a viewing angle required for the composite video stream, and may be, for example, a viewing angle at which an observer other than the user is present. The target perspective camera is a virtual camera in the rendering engine corresponding to the target perspective. The target virtual scene is a virtual scene corresponding to the theme virtual space displayed in the user terminal. The theme virtual space is a network space corresponding to the application scenario. Illustratively, the theme virtual space includes an online live room, a virtual game room, or a virtual educational space. The scene video stream is a video stream generated by recording a target virtual scene.
Specifically, in the related art, a scene video stream is recorded through a user terminal, so that the user terminal is required to upload the scene video stream, and the problems of uploading delay and frame loss of the scene video stream exist, thereby causing video blocking. Therefore, in the embodiment of the present disclosure, a target view camera is directly started in a server corresponding to a user terminal, and a target virtual scene running in the server is recorded along a target view by using the target view camera, so as to generate a scene video stream at the target view.
For example, for an application scene (such as a cloud game, cloud live streaming, a cloud classroom, and the like) in which the application program main body runs in the cloud, a target virtual scene synchronized with the user terminal already runs in the server corresponding to the cloud; at this time, a target view camera can be directly started in that server to record the target virtual scene, so as to obtain the scene video stream.
For another example, for an application scenario (e.g., a general game, online education, etc.) in which an application main body runs on a user terminal, a target virtual scenario may not be run in a server because the main body part of the application is not run in the server, and at this time, a service needs to be started in the server corresponding to the user terminal to run the target virtual scenario, and a target view camera is started in the service. And when the server receives a scene recording instruction, the server starts to record the target virtual scene by using the target visual angle camera to obtain a scene video stream.
It should be noted that, in order to avoid the influence of recording the scene video stream on the application program function corresponding to the application scene normally used by the user, the server may record and render the target virtual scene in a back-end processing manner, that is, the generation process of the scene video stream is independent of the operation process of the application program main body corresponding to the application scene. As for the execution main body of the process of generating the scene video stream, it may be an independent thread opened in the execution server of the application main body, or may be a restarted server.
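A minimal sketch of this back-end recording arrangement follows. The SceneStub and ViewCamera classes merely stand in for the rendering engine that actually runs the target virtual scene; their names and methods are assumptions used only to show that recording runs on a thread independent of the application program main body.

```python
# Illustrative sketch: record the target virtual scene from a target view camera
# on an independent thread, so the application main body is not affected.
import queue
import threading


class ViewCamera:
    """Stand-in for a rendering-engine virtual camera; only the pose matters here."""
    def __init__(self, pose):
        self.pose = pose


class SceneStub:
    """Stand-in for the target virtual scene running in the server."""
    def render(self, camera):
        # A real engine would rasterize the scene from `camera`;
        # a placeholder frame keeps the sketch self-contained.
        return {"view": camera.pose}


class TargetViewRecorder:
    def __init__(self, scene, target_pose, fps=30):
        self.scene = scene
        self.camera = ViewCamera(target_pose)  # target view camera, independent of the user view camera
        self.frames = queue.Queue()            # the scene video stream, frame by frame
        self.fps = fps
        self._stop = threading.Event()

    def _loop(self):
        while not self._stop.is_set():
            self.frames.put(self.scene.render(self.camera))
            self._stop.wait(1.0 / self.fps)    # simple frame-rate pacing

    def start(self):
        threading.Thread(target=self._loop, daemon=True).start()

    def stop(self):
        self._stop.set()
```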
Referring to fig. 2, taking an online lecture application scene in online education as an example, a video stream of a three-dimensional virtual lecture hall scene rendered by a user perspective camera is displayed in a user terminal, and a real user picture collected by a camera of the user terminal is displayed at a position of an upper left corner. The server may record the target virtual scene from the target view in addition to responding to the display request of the user terminal, as shown in fig. 3. In fig. 3, the server records a three-dimensional virtual lecture hall scene with a target view angle camera corresponding to a viewer view angle, so as to generate a scene video stream at the viewer view angle.
And S130, fusing the user video stream and the scene video stream to generate a composite video stream.
Specifically, the user video stream is embedded in the scene video stream at a certain position in the server side, so as to generate a composite video stream containing a user real picture and a virtual scene picture.
In some embodiments, a preset view is included in the target virtual scene. The preset view refers to a view layer which is preset in a target virtual scene and is used for bearing a user video stream. The position of the preset view can be set by self-definition; the position of the preset view may also be determined according to the type and/or spatial position of each virtual object contained in the target virtual scene. For example, for the above example of the three-dimensional virtual lecture hall scene, the target virtual scene includes a virtual screen for playing lecture-related information, and a preset view may be set at the position of the virtual screen. As another example, the preset view may be set at a free area where there are few virtual objects in the target virtual scene.
Accordingly, S130 includes: and fusing the user video stream to a preset view in the scene video stream to generate a composite video stream.
Specifically, the server may input the user video stream into a preset view to embed the user video stream into the scene video stream, and the result is a composite video stream. As shown in fig. 3, the virtual screen is set to be a preset view, and then the server embeds the user video stream at the virtual screen in the three-dimensional virtual lecture hall scene to generate an online lecture playback video of the audience view angle.
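The per-frame fusion can be pictured with the following sketch, in which the preset view is modeled as a rectangle (x, y, w, h) inside each scene frame, for example the virtual screen of the lecture hall; OpenCV and NumPy are assumed only for resizing and array handling, not required by the embodiments.

```python
# Illustrative sketch: embed one user-video frame into the preset view of one
# scene frame; repeating this per frame yields the composite video stream.
import cv2
import numpy as np


def fuse_frame(scene_frame: np.ndarray, user_frame: np.ndarray,
               preset_view: tuple) -> np.ndarray:
    x, y, w, h = preset_view
    composite = scene_frame.copy()
    # Scale the real camera picture to the preset view and overwrite that region.
    composite[y:y + h, x:x + w] = cv2.resize(user_frame, (w, h))
    return composite


if __name__ == "__main__":
    scene = np.zeros((720, 1280, 3), dtype=np.uint8)    # stand-in scene frame
    user = np.full((480, 640, 3), 255, dtype=np.uint8)  # stand-in user frame
    out = fuse_frame(scene, user, preset_view=(40, 40, 320, 180))
```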
The video synthesis method provided by the embodiment of the disclosure can receive a user video stream obtained by shooting through a camera of a user terminal, and record a target virtual scene corresponding to a theme virtual space displayed in the user terminal by using a target view camera independent of the user view camera to generate a scene video stream under a target view; fusing the user video stream and the scene video stream to generate a composite video stream; on one hand, the method realizes the automatic generation of the synthesized video stream in the server, and avoids the problems of time and labor waste of artificially synthesized video; on the other hand, the scene video stream is recorded by the server, so that the problem of unsmooth composite video due to equipment performance, network and the like in the process of recording the scene video stream by the user terminal and uploading the scene video stream to the server is avoided, the requirements on the equipment performance and the network of the user terminal are reduced, the problems of slow uploading, frame loss and the like of the scene video stream are solved, and the video synthesis efficiency and the smoothness of the composite video stream are improved; in another aspect, the scene video stream is obtained by recording the target virtual scene, so that the content consistency of the composite video stream and the target virtual scene is improved.
Fig. 4 is a flowchart of another video composition method provided by the embodiment of the present disclosure. The method adds a related step of generating a response containing the action of the virtual object according to the operation instruction of the user. Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 4, the video composition method includes:
and S410, receiving the video stream of the user.
And S420, receiving a user operation instruction.
The user operation instruction is an operation instruction generated by the user manipulating the user terminal within the theme virtual space, and is used for controlling an action, such as moving or jumping, performed by the virtual character corresponding to the user in the theme virtual space.
Specifically, in the running process of the application program, a user may execute some operations for controlling the virtual object in the theme virtual space by operating the user terminal, and the user terminal may convert the user operation into a corresponding user operation instruction, and trigger the application program to control the virtual object to execute a corresponding action response (i.e., a virtual object action response) according to the user operation instruction.
Based on the above description, the process of recording the scene video stream by the server and the process of responding to the user operation instruction by the application program are independent. Then, in order to make the recorded scene video stream consistent with the operation result of the application program viewed by the user, the server may pull the user operation instruction so as to recover the same virtual object action response in the target virtual scene.
In some embodiments, the server may establish a communication connection between the process of recording the scene video stream and the process of running the application program in response to the user operation instruction, so as to transmit the user operation instruction generated in the application program to the process of recording the scene video stream.
For example, for an application scene in which the main body of the application program runs in the cloud, the server may establish communication connection between the main bodies, such as services or threads, which respectively run the two processes, so as to transmit a user operation instruction generated by the application program to the process of recording the scene video stream.
For another example, for an application scenario in which the application program main body runs on the user terminal, a communication connection may be established between the user terminal and a server running the target virtual scenario, so as to send a user operation instruction generated in the user terminal to the server.
In other embodiments, the server creates a virtual user, associates the virtual user with the theme virtual space, and shares the user operation instruction from the theme virtual space.
Specifically, in order to improve the obtaining efficiency and the synchronism of the user operation instruction, the server may create a new virtual user, and associate the virtual user with the theme virtual space corresponding to the user terminal, for example, add the virtual user to the virtual game room in the spectator identity. Therefore, the virtual user corresponding to the user terminal and the new virtual user are in the same theme virtual space. Therefore, the server can obtain the user operation instruction from the theme virtual space in a sharing mode in real time.
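A minimal sketch of this sharing mode is shown below; ThemeVirtualSpace is only a stand-in for the room or space service (online live room, virtual game room, and so on), and its join/broadcast surface is an assumption rather than a real SDK.

```python
# Illustrative sketch: the server joins the theme virtual space as an extra
# virtual (spectator) user and caches every shared user operation instruction.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class OperationInstruction:
    user_id: str
    action: str        # e.g. "move", "jump"
    timestamp: float   # the second timestamp carried by the instruction


@dataclass
class ThemeVirtualSpace:
    """Stand-in for the theme virtual space shared by all of its members."""
    members: List[Callable[[OperationInstruction], None]] = field(default_factory=list)

    def join(self, on_instruction: Callable[[OperationInstruction], None]) -> None:
        self.members.append(on_instruction)

    def broadcast(self, instruction: OperationInstruction) -> None:
        for handler in self.members:
            handler(instruction)


instruction_cache: List[OperationInstruction] = []
space = ThemeVirtualSpace()
space.join(instruction_cache.append)  # the newly created virtual user (spectator)
space.broadcast(OperationInstruction("user-1", "move", 12.5))
```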
And S430, executing a virtual object action response corresponding to the user operation instruction in the target virtual scene.
Specifically, in the process of recording a scene video stream, the server executes a corresponding virtual object action response in the target virtual scene according to the obtained user operation instruction, so that the virtual object action response identical to that of the application program is presented in the target virtual scene.
And S440, recording the target virtual scene by using the target visual angle camera to generate a scene video stream containing the virtual object action response.
Specifically, the server records the target virtual scene executing the virtual object action response by using the target view camera, and obtains a scene video stream under the target view, which includes the virtual object action response.
And S450, fusing the user video stream and the scene video stream to generate a composite video stream.
According to the video synthesis method provided by the embodiment of the disclosure, a virtual object action response corresponding to a user operation instruction generated by a user terminal is executed in a target virtual scene, so that the target virtual scene also contains the virtual object action response, and a target view camera is used for recording the target virtual scene to generate a scene video stream containing the virtual object action response; the consistency between the scene video stream and the operation result of the application program watched by the user is further improved, so that the content consistency between the composite video stream and the target virtual scene is further improved.
In some embodiments, a first timestamp is carried in the user video stream and a second timestamp is carried in the user operation instruction. Here, the first timestamp and the second timestamp are each the time at which a user operation instruction is generated (also referred to as an instruction timestamp); the first timestamp is the instruction timestamp recorded in the user video stream, and the second timestamp is the instruction timestamp recorded in the user operation instruction itself. Because the data volumes of the user video stream and the user operation instruction differ, the user operation instruction reaches the server before the user video stream. If each instruction were responded to as soon as it reached the server, the virtual object action response recovered in the target virtual scene would not match the user video stream, and the content of the composite video stream would be disordered. Therefore, in this embodiment, both the user video stream and the user operation instruction carry the instruction timestamp, so that the virtual object action response is subsequently executed according to the timestamps.
Accordingly, after S420, the video composition method further includes: and caching the user operation instruction. Based on the above description, after the user operation instruction reaches the server, the server cannot directly respond, so the server will cache the user operation instruction first.
Accordingly, S430 includes: screening out target operation instructions of which the second time stamps are less than or equal to the first time stamps from all the user operation instructions; and executing the virtual object action response corresponding to the target operation instruction in the target virtual scene.
Specifically, after receiving the user video stream, the server side extracts a first time stamp in the user video stream. And then, second time stamps of all user operation instructions are obtained from the buffer space, the first time stamps are compared with the second time stamps, and at least one second time stamp which is smaller than or equal to the first time stamp is screened out. And then, the server side takes the user operation instruction corresponding to each screened second timestamp as a target operation instruction, and executes a virtual object action response corresponding to the target operation instruction in the target virtual scene so as to recover the user video stream and the virtual object action response at the previous moment in the target virtual scene. Therefore, the method and the device can ensure that the subsequently recorded scene video stream and the operation result watched by the user contain the same virtual object action response, and further ensure the time consistency between the virtual object action response in the scene video stream and the virtual object action response in the operation result watched by the user, thereby further improving the synchronization between the scene video stream and the user video stream.
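The screening step can be summarized by the short sketch below, where each cached instruction is a plain dictionary carrying its second timestamp; the field names are illustrative assumptions.

```python
# Illustrative sketch: pick every cached user operation instruction whose second
# timestamp is not later than the first timestamp carried by the user video stream.
from typing import Dict, List


def select_target_instructions(cache: List[Dict], first_timestamp: float) -> List[Dict]:
    due = [ins for ins in cache if ins["second_timestamp"] <= first_timestamp]
    # Keep only the not-yet-due instructions in the cache.
    cache[:] = [ins for ins in cache if ins["second_timestamp"] > first_timestamp]
    return due


cache = [{"action": "move", "second_timestamp": 3.0},
         {"action": "jump", "second_timestamp": 7.5}]
for ins in select_target_instructions(cache, first_timestamp=5.0):
    pass  # scene.apply(ins) would replay the action response here (hypothetical call)
```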
Fig. 5 is a flowchart of another video composition method provided by the embodiment of the present disclosure. The video compositing method adds the relevant steps of generating a composite video stream from a video template. Wherein explanations of the same or corresponding terms as those of the above embodiments are omitted. Referring to fig. 5, the video composition method includes:
and S510, receiving the user video stream.
Specifically, the server can continue to execute S520-S530 or execute S540-S550 according to the application requirements (such as video synthesis speed, video synthesis precision, etc.).
S520, recording the target virtual scene by using a target visual angle camera independent of the user visual angle camera, and generating a scene video stream under the target visual angle.
And S530, fusing the user video stream and the scene video stream to generate a composite video stream.
And S540, determining a target video template corresponding to the target virtual scene from all preset video templates based on the template screening conditions.
The template screening condition is a preset dimension for screening the preset video templates. A preset video template is a pre-stored video template that comprises a blank part into which an external video can be fused and a fixed video part, where the fixed video part may include a preset character image, preset special effect components, and the like. In the embodiments of the present disclosure, the template screening condition includes at least one of the video duration of the user video stream, user information, a user operation instruction, and the played audio. The user information is information related to the user, for example the user emotion and/or the user age, and is used to match the character image in a preset video template. The user operation instruction is used to match the recording view angle in the preset video template. The played audio is used to match the special effect components in the preset video template.
Specifically, a plurality of preset video templates are pre-stored in the server. After receiving the user video stream, the server may screen an adaptive preset video template from the plurality of preset video templates according to the template screening condition, and use the adaptive preset video template as a target video template.
For example, if the template screening condition includes the video duration of the user video stream, the duration of a blank portion in the preset video template may be matched according to the video duration, so as to ensure that the screened target video template may be merged into the user video stream.
For another example, if the template screening condition includes user information, the server may screen, from the preset video templates, a target video template whose video style is adapted to the user emotion according to the user emotion and/or the user age in the user information, and/or screen, from the preset video templates, a target video template whose character in the video is adapted to the user age.
For another example, if the template screening condition includes a user operation instruction, the server may determine a recording view angle according to a user view angle corresponding to the user operation instruction, and screen out a target video template that is consistent with the recording view angle from each preset video template. For example, for the example of the three-dimensional virtual lecture hall scene, a user operation instruction in the recording process is collected, when the user operation instruction indicates that the user walks to a specific area, the recording view angle corresponding to the specific area is switched, and a preset video template corresponding to the recording view angle is switched and selected, so that transition in a video is completed.
For another example, if the template filtering condition includes the played audio, the server selects a target video template having the same or similar audio characteristics according to the audio characteristics, such as the audio pause position and the pause duration, of the played audio, and may add special effect components, such as fireworks display and applause, at corresponding positions of the target video template to optimize the target video template.
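A compact sketch of such screening is given below; the template fields and the order in which the conditions are checked are assumptions chosen for illustration only.

```python
# Illustrative sketch: screen the target video template by video duration,
# user age and recording view angle; audio-based matching would follow the same pattern.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class PresetVideoTemplate:
    name: str
    blank_duration: float       # seconds of blank part that can carry the user video
    character_age_range: tuple  # ages the preset character image is suited to
    recording_view: str         # e.g. "audience", "stage-left"


def screen_target_template(templates: List[PresetVideoTemplate],
                           video_duration: float,
                           user_age: Optional[int] = None,
                           recording_view: Optional[str] = None
                           ) -> Optional[PresetVideoTemplate]:
    for tpl in templates:
        if tpl.blank_duration < video_duration:
            continue  # the user video stream would not fit into the blank part
        if user_age is not None and not (tpl.character_age_range[0] <= user_age <= tpl.character_age_range[1]):
            continue
        if recording_view is not None and tpl.recording_view != recording_view:
            continue
        return tpl
    return None
```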
And S550, fusing the user video stream and the target video template to generate a composite video stream.
Specifically, the user video stream is added to a blank portion of the target video template, or the user video stream is embedded at a certain position of the target video template, and a composite video stream is generated.
In some embodiments, S550 may be implemented by step a and/or step B below.
And step A, fusing the user video stream to the green screen position in the target video template to generate a composite video stream.
Specifically, the position of the green screen is preset in the target video template. The server may embed the user video stream at the green screen position in the target video template to generate a composite video stream.
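Step A can be sketched as an HSV chroma key, as below; the green hue range is a common default rather than a value mandated by the embodiments, and would normally be tuned per template.

```python
# Illustrative sketch: replace the green-screen pixels of a template frame with
# the corresponding pixels of the user frame (step A).
import cv2
import numpy as np


def fuse_at_green_screen(template_frame: np.ndarray, user_frame: np.ndarray) -> np.ndarray:
    h, w = template_frame.shape[:2]
    user = cv2.resize(user_frame, (w, h))
    hsv = cv2.cvtColor(template_frame, cv2.COLOR_BGR2HSV)
    lower = np.array([35, 80, 80], dtype=np.uint8)    # assumed lower bound of "green"
    upper = np.array([85, 255, 255], dtype=np.uint8)  # assumed upper bound of "green"
    green = cv2.inRange(hsv, lower, upper)
    composite = template_frame.copy()
    composite[green > 0] = user[green > 0]            # keep user pixels where the screen is green
    return composite
```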
And step B, determining a video synthesis position in the target video template based on at least one preset time point in the target video template, and fusing the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.
Specifically, at least one preset time point, such as an opening time point, a middle time point, and an ending time point, may be preset in the target video template, and each preset time point may be associated with a position (i.e., a video synthesis position) for embedding the video stream; for example, the opening time point corresponds to a video synthesis position at the upper left corner, the middle time point corresponds to a video synthesis position in the middle, and the ending time point corresponds to a video synthesis position at the lower right corner. In each time period, the server embeds the user video stream at the video synthesis position corresponding to the relevant preset time point, generating the composite video stream.
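Step B then reduces to a lookup from preset time points to video synthesis positions, as in the sketch below; the concrete time points and rectangles are illustrative assumptions.

```python
# Illustrative sketch: map preset time points of the target video template to
# video synthesis positions (x, y, w, h) and pick the one currently in effect (step B).
PRESET_POSITIONS = {
    0.0:  (40, 40, 320, 180),    # opening time point -> upper-left corner
    30.0: (480, 270, 320, 180),  # middle time point  -> centre
    55.0: (920, 500, 320, 180),  # ending time point  -> lower-right corner
}


def position_for(timestamp: float) -> tuple:
    due = [t for t in PRESET_POSITIONS if t <= timestamp]
    return PRESET_POSITIONS[max(due)] if due else PRESET_POSITIONS[min(PRESET_POSITIONS)]


# Usage: per frame, embed the user frame at position_for(frame_time) in the template frame.
```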
According to the video synthesis method provided by the embodiment of the disclosure, a target video template corresponding to a target virtual scene is determined from all preset video templates according to template screening conditions, and a user video stream and the target video template are fused to generate a synthesized video stream; the method and the system realize the synthesis of the real picture and the virtual scene picture of the user through the preset video template, reduce the resource consumption of the server side and further improve the generation efficiency of the synthesized video stream.
Fig. 6 is a schematic structural diagram of a video compositing apparatus according to an embodiment of the present disclosure. The video synthesis device is configured in the server. Referring to fig. 6, the video compositing apparatus 600 specifically includes:
a user video stream receiving module 610, configured to receive a user video stream; the user video stream is a video stream obtained by shooting through a camera of a user terminal;
a scene video stream generating module 620, configured to record a target virtual scene by using a target view camera independent of a user view camera, and generate a scene video stream at a target view; the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal;
and a first composite video stream generating module 630, configured to fuse the user video stream and the scene video stream to generate a composite video stream.
The video synthesis device provided by the embodiment of the disclosure can receive a user video stream obtained by shooting through a camera of a user terminal, and record a target virtual scene corresponding to a theme virtual space displayed in the user terminal by using a target view camera independent of the user view camera to generate a scene video stream at a target view; fusing the user video stream and the scene video stream to generate a composite video stream; on one hand, the method realizes the automatic generation of the synthesized video stream in the server, and avoids the problems of time and labor waste of artificially synthesized video; on the other hand, the scene video stream is recorded by the server, so that the problem of unsmooth composite video due to equipment performance, network and the like in the process of recording the scene video stream by the user terminal and uploading the scene video stream to the server is avoided, the requirements on the equipment performance and the network of the user terminal are reduced, the problems of slow uploading, frame loss and the like of the scene video stream are solved, and the video synthesis efficiency and the smoothness of the composite video stream are improved; in another aspect, the scene video stream is obtained by recording the target virtual scene, so that the content consistency of the composite video stream and the target virtual scene is improved.
In some embodiments, the video compositing apparatus 600 further comprises a user operation instruction receiving module for:
receiving a user operation instruction before fusing a user video stream and a scene video stream to generate a composite video stream;
accordingly, the scene video stream generation module 620 includes:
the action response execution submodule is used for executing a virtual object action response corresponding to the user operation instruction in the target virtual scene;
and the scene video stream generation submodule is used for recording the target virtual scene by using the target visual angle camera and generating a scene video stream containing the action response of the virtual object.
In some embodiments, the first timestamp is carried in the user video stream, and the second timestamp is carried in the user operation instruction;
accordingly, the video synthesizing apparatus 600 further comprises a user operation instruction cache module, configured to:
after receiving the user operation instruction, caching the user operation instruction;
accordingly, the action response execution submodule is specifically configured to:
screening out target operation instructions of which the second time stamps are less than or equal to the first time stamps from all the user operation instructions;
and executing the virtual object action response corresponding to the target operation instruction in the target virtual scene.
In some embodiments, the user operation instruction receiving module is specifically configured to:
creating a virtual user and associating the virtual user to a theme virtual space;
sharing user operation instructions from the theme virtual space.
In some embodiments, the target virtual scene includes a preset view;
accordingly, the first composite video stream generating module 630 is specifically configured to:
and fusing the user video stream to a preset view in the scene video stream to generate a composite video stream.
In some embodiments, the video compositing device 600 further comprises:
the target video template determining module is used for determining a target video template corresponding to a target virtual scene from all preset video templates based on template screening conditions after receiving the user video stream; the template screening condition comprises at least one of video time length of a user video stream, user information, a user operation instruction and playing audio, the user information comprises user emotion and/or user age, and the user information is used for matching character images in a preset video template; the user operation instruction is used for matching a recording visual angle in a preset video template; playing audio for matching special effect components in a preset video template;
and the second composite video stream generation module is used for fusing the user video stream and the target video template to generate a composite video stream.
Further, the second composite video stream generating module is specifically configured to:
fusing the user video stream to a green screen position in a target video template to generate a composite video stream;
and/or determining a video synthesis position in the target video template based on at least one preset time point in the target video template, and fusing the user video stream to the video synthesis position in the target video template to generate a synthesized video stream.
In some embodiments, the user video stream receiving module 610 is specifically configured to:
a user video stream is received from a user terminal via a real-time communication transport protocol.
In some embodiments, the theme virtual space comprises a live online room, a virtual game room, or a virtual educational space.
The video synthesis device provided by the embodiment of the disclosure can execute the video synthesis method provided by any embodiment of the disclosure, and has corresponding functional modules and beneficial effects of the execution method.
It should be noted that, in the embodiment of the video synthesis apparatus, the modules and the sub-modules included in the embodiment are only divided according to functional logic, but are not limited to the above division as long as the corresponding functions can be implemented; in addition, the specific names of the functional modules/sub-modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present disclosure.
An exemplary embodiment of the present disclosure also provides an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor. The memory stores a computer program executable by the at least one processor, the computer program, when executed by the at least one processor, is for causing the electronic device to perform a video compositing method as illustrated in any embodiment of the present disclosure.
The exemplary embodiments of the present disclosure also provide a non-transitory computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor of a computer, is configured to cause the computer to perform the video composition method explained in any of the embodiments of the present disclosure.
Exemplary embodiments of the present disclosure also provide a computer program product comprising a computer program, wherein the computer program, when executed by a processor of a computer, is adapted to cause the computer to perform the video composition method explained in any of the embodiments of the present disclosure.
Referring to fig. 7, a block diagram of an electronic device 700, which may be a server or a client of the present disclosure and is an example of a hardware device that may be applied to aspects of the present disclosure, will now be described. The electronic device is intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other suitable computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706, an output unit 707, a storage unit 708, and a communication unit 709. The input unit 706 may be any type of device capable of inputting information to the electronic device 700, and the input unit 706 may receive input numeric or character information and generate key signal inputs related to user settings and/or function controls of the electronic device. Output unit 707 may be any type of device capable of presenting information and may include, but is not limited to, a display, speakers, a video/audio output terminal, a vibrator, and/or a printer. Storage unit 708 may include, but is not limited to, magnetic or optical disks. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network, such as the internet, and/or various telecommunications networks, and may include, but is not limited to, modems, network cards, infrared communication devices, wireless communication transceivers and/or chipsets, such as bluetooth (TM) devices, WiFi devices, WiMax devices, cellular communication devices, and/or the like.
Computing unit 701 may be a variety of general purpose and/or special purpose processing components with processing and computing capabilities. Some examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and so forth. The computing unit 701 performs the respective methods and processes described above. For example, in some embodiments, the video compositing methods described in any of the embodiments of the present disclosure may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. In some embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the video composition method illustrated by any embodiment of the present disclosure.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
As used in this disclosure, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic discs, optical disks, memory, Programmable Logic Devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal. The term "machine-readable signal" refers to any signal used to provide machine instructions and/or data to a programmable processor.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

Claims (10)

1. A video synthesis method, applied to a server side, the method comprising:
receiving a user video stream; the user video stream is a video stream captured by a camera of a user terminal;
recording a target virtual scene by using a target view camera independent of a user view camera to generate a scene video stream at a target view; the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal;
fusing the user video stream and the scene video stream to generate a composite video stream;
wherein, before the fusing of the user video stream and the scene video stream to generate the composite video stream, the method further comprises:
receiving a user operation instruction;
wherein the recording of the target virtual scene by using the target view camera independent of the user view camera to generate the scene video stream at the target view comprises:
executing, in the target virtual scene, a virtual object action response corresponding to the user operation instruction;
recording the target virtual scene by using the target view camera to generate the scene video stream containing the virtual object action response;
wherein the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp;
wherein, after the receiving of the user operation instruction, the method further comprises:
caching the user operation instruction;
wherein the executing of the virtual object action response corresponding to the user operation instruction in the target virtual scene comprises:
screening out, from all the user operation instructions, target operation instructions whose second timestamps are less than or equal to the first timestamp;
and executing, in the target virtual scene, the virtual object action response corresponding to the target operation instruction.
2. The method of claim 1, wherein the receiving of the user operation instruction comprises:
creating a virtual user and associating the virtual user with the theme virtual space;
and sharing the user operation instruction from the theme virtual space.
3. The method of claim 1, wherein the target virtual scene comprises a preset view;
and the fusing of the user video stream and the scene video stream to generate the composite video stream comprises:
fusing the user video stream into the preset view in the scene video stream to generate the composite video stream.
4. The method of claim 1, wherein, after the receiving of the user video stream, the method further comprises:
determining a target video template corresponding to the target virtual scene from all preset video templates based on a template screening condition; the template screening condition comprises at least one of a video duration, user information, a user operation instruction and playing audio of the user video stream; the user information comprises a user emotion and/or a user age, and the user information is used for matching character images in a preset video template; the user operation instruction is used for matching a recording view angle in the preset video template; and the playing audio is used for matching a special effect component in the preset video template;
and fusing the user video stream and the target video template to generate the composite video stream.
5. The method of claim 4, wherein the fusing of the user video stream and the target video template to generate the composite video stream comprises:
fusing the user video stream into a green screen position in the target video template to generate the composite video stream;
and/or determining a video synthesis position in the target video template based on at least one preset time point in the target video template, and fusing the user video stream into the video synthesis position in the target video template to generate the composite video stream.
6. The method of any one of claims 1 to 5, wherein the receiving of the user video stream comprises:
receiving the user video stream from the user terminal via a real-time communication transport protocol.
7. The method of any one of claims 1 to 5, wherein the theme virtual space comprises an online live room, a virtual game room, or a virtual educational space.
8. A video synthesis apparatus, configured at a server side, comprising:
a user video stream receiving module, configured to receive a user video stream; the user video stream is a video stream captured by a camera of a user terminal;
a scene video stream generation module, configured to record a target virtual scene by using a target view camera independent of a user view camera to generate a scene video stream at a target view; the target virtual scene is a virtual scene corresponding to a theme virtual space displayed in the user terminal;
a first composite video stream generating module, configured to fuse the user video stream and the scene video stream to generate a composite video stream;
wherein the video synthesis apparatus further comprises a user operation instruction receiving module, configured to:
receive a user operation instruction before the fusing of the user video stream and the scene video stream to generate the composite video stream;
correspondingly, the scene video stream generation module comprises:
an action response execution submodule, configured to execute, in the target virtual scene, a virtual object action response corresponding to the user operation instruction;
and a scene video stream generation submodule, configured to record the target virtual scene by using the target view camera and generate the scene video stream containing the virtual object action response;
wherein the user video stream carries a first timestamp, and the user operation instruction carries a second timestamp;
correspondingly, the video synthesis apparatus further comprises a user operation instruction cache module, configured to:
cache the user operation instruction after the user operation instruction is received;
correspondingly, the action response execution submodule is specifically configured to:
screen out, from all the user operation instructions, target operation instructions whose second timestamps are less than or equal to the first timestamp;
and execute, in the target virtual scene, the virtual object action response corresponding to the target operation instruction.
9. An electronic device, comprising:
a processor; and
a memory for storing a program,
wherein the program comprises instructions which, when executed by the processor, cause the processor to carry out the video synthesis method according to any one of claims 1 to 7.
10. A non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the video synthesis method of any one of claims 1 to 7.
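To make the method of claims 1 and 8 easier to follow, the sketch below gives one possible reading of the server-side flow in Python. Every name in it (VideoSynthesizer, OperationInstruction, TargetViewCamera and the field names) is an assumption made for illustration; the sketch only mirrors the recited steps of caching user operation instructions, screening out those whose second timestamp is less than or equal to the first timestamp of the incoming user video frame, executing the corresponding virtual object action responses, recording the target virtual scene with the target view camera, and fusing the user frame into the scene frame. It is not the patented implementation.

```python
# Hypothetical sketch of the server-side flow recited in claims 1 and 8.
# All class, method, and field names are illustrative, not the patent's own.
from dataclasses import dataclass, field
from typing import List

@dataclass
class UserFrame:
    first_timestamp: float   # timestamp carried by the user video stream
    image: str               # stand-in for pixel data

@dataclass
class OperationInstruction:
    second_timestamp: float  # timestamp carried by the user operation instruction
    action: str              # e.g. "wave", "jump"

@dataclass
class TargetViewCamera:
    """Independent of the user view camera; renders the scene at the target view."""
    def record(self, scene_state: List[str]) -> str:
        return f"scene[{', '.join(scene_state)}]"

@dataclass
class VideoSynthesizer:
    camera: TargetViewCamera = field(default_factory=TargetViewCamera)
    instruction_cache: List[OperationInstruction] = field(default_factory=list)
    scene_state: List[str] = field(default_factory=list)

    def receive_instruction(self, instruction: OperationInstruction) -> None:
        # Cache the user operation instruction after receiving it.
        self.instruction_cache.append(instruction)

    def synthesize(self, frame: UserFrame) -> str:
        # Screen out target instructions whose second timestamp is less than or
        # equal to the first timestamp of the current user video frame.
        due = [i for i in self.instruction_cache
               if i.second_timestamp <= frame.first_timestamp]
        self.instruction_cache = [i for i in self.instruction_cache if i not in due]

        # Execute the corresponding virtual object action responses in the scene.
        for instruction in sorted(due, key=lambda i: i.second_timestamp):
            self.scene_state.append(instruction.action)

        # Record the target virtual scene with the target view camera, then fuse
        # the user frame into the scene frame to obtain the composite frame.
        scene_frame = self.camera.record(self.scene_state)
        return f"composite({scene_frame} + user:{frame.image})"

if __name__ == "__main__":
    synthesizer = VideoSynthesizer()
    synthesizer.receive_instruction(OperationInstruction(second_timestamp=0.5, action="wave"))
    synthesizer.receive_instruction(OperationInstruction(second_timestamp=2.0, action="jump"))
    print(synthesizer.synthesize(UserFrame(first_timestamp=1.0, image="frame-001")))
```

Running the example composites only the "wave" action into the frame stamped 1.0, while the later "jump" instruction remains cached for a subsequent frame, which is the screening behaviour recited in claim 1.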
CN202210740529.1A 2022-06-28 2022-06-28 Video synthesis method, device, equipment and storage medium Active CN114845136B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210740529.1A CN114845136B (en) 2022-06-28 2022-06-28 Video synthesis method, device, equipment and storage medium
PCT/CN2023/097738 WO2024001661A1 (en) 2022-06-28 2023-06-01 Video synthesis method and apparatus, device, and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210740529.1A CN114845136B (en) 2022-06-28 2022-06-28 Video synthesis method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114845136A CN114845136A (en) 2022-08-02
CN114845136B true CN114845136B (en) 2022-09-16

Family

ID=82573818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210740529.1A Active CN114845136B (en) 2022-06-28 2022-06-28 Video synthesis method, device, equipment and storage medium

Country Status (2)

Country Link
CN (1) CN114845136B (en)
WO (1) WO2024001661A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114845136B (en) * 2022-06-28 2022-09-16 北京新唐思创教育科技有限公司 Video synthesis method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110099195A (en) * 2019-05-13 2019-08-06 安徽澳视科技有限公司 A kind of campus virtual studio system and method based on cell phone application
CN111544897A (en) * 2020-05-20 2020-08-18 腾讯科技(深圳)有限公司 Video clip display method, device, equipment and medium based on virtual scene
CN113115110A (en) * 2021-05-20 2021-07-13 广州博冠信息科技有限公司 Video synthesis method and device, storage medium and electronic equipment

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190108558A1 (en) * 2017-07-28 2019-04-11 Magical Technologies, Llc Systems, Methods and Apparatuses Of Multidimensional Mapping Of Universal Locations Or Location Ranges For Alternate Or Augmented Digital Experiences
CN109639933B (en) * 2018-12-07 2021-12-24 北京美吉克科技发展有限公司 Method and system for making 360-degree panoramic program in virtual studio
CN113784148A (en) * 2020-06-10 2021-12-10 阿里巴巴集团控股有限公司 Data processing method, system, related device and storage medium
CN111935491B (en) * 2020-06-28 2023-04-07 百度在线网络技术(北京)有限公司 Live broadcast special effect processing method and device and server
CN113347373B (en) * 2021-06-16 2022-06-03 潍坊幻视软件科技有限公司 Image processing method for making special-effect video in real time through AR space positioning
CN114845136B (en) * 2022-06-28 2022-09-16 北京新唐思创教育科技有限公司 Video synthesis method, device, equipment and storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110099195A (en) * 2019-05-13 2019-08-06 安徽澳视科技有限公司 A kind of campus virtual studio system and method based on cell phone application
CN111544897A (en) * 2020-05-20 2020-08-18 腾讯科技(深圳)有限公司 Video clip display method, device, equipment and medium based on virtual scene
CN113115110A (en) * 2021-05-20 2021-07-13 广州博冠信息科技有限公司 Video synthesis method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2024001661A1 (en) 2024-01-04
CN114845136A (en) 2022-08-02

Similar Documents

Publication Publication Date Title
CN112562433B (en) Working method of 5G strong interaction remote delivery teaching system based on holographic terminal
CN110393921B (en) Cloud game processing method and device, terminal, server and storage medium
JP6310073B2 (en) Drawing system, control method, and storage medium
WO2018045927A1 (en) Three-dimensional virtual technology based internet real-time interactive live broadcasting method and device
WO2021114708A1 (en) Method and apparatus for implementing multi-person video live-streaming service, and computer device
CN106028092B (en) A kind of sharing method and device of TV screenshotss
EP3975126A1 (en) Method and system for cloud-native 3d-scene game
CN105637472B (en) The frame of screen content shared system with the description of broad sense screen
US11451858B2 (en) Method and system of processing information flow and method of displaying comment information
KR20080082759A (en) System and method for realizing vertual studio via network
KR102441514B1 (en) Hybrid streaming
US8860720B1 (en) System and method for delivering graphics over network
CN104998412A (en) Single-player game realization method and apparatus
JP6379107B2 (en) Information processing apparatus, control method therefor, and program
CN114845136B (en) Video synthesis method, device, equipment and storage medium
CN112492231A (en) Remote interaction method, device, electronic equipment and computer readable storage medium
CN112153472A (en) Method and device for generating special picture effect, storage medium and electronic equipment
Sun et al. A hybrid remote rendering method for mobile applications
WO2019118890A1 (en) Method and system for cloud video stitching
WO2021088973A1 (en) Live stream display method and apparatus, electronic device, and readable storage medium
CN112261422A (en) Simulation remote live broadcast stream data processing method suitable for broadcasting and television field
Gül et al. Interactive volumetric video from the cloud
US11847729B2 (en) Remote production collaboration tools
CN112019922A (en) Facial expression data processing method applied to virtual studio
CN117456113B (en) Cloud offline rendering interactive application implementation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant