CN113949891B - Video processing method and device, server and client - Google Patents
- Publication number: CN113949891B (application number CN202111191080.XA)
- Authority
- CN
- China
- Prior art keywords
- user
- video
- target
- videos
- user video
- Prior art date
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/2187—Live feed
- H04N21/23424—Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for inserting or substituting an advertisement
- H04N21/2343—Processing of video elementary streams involving reformatting operations of video signals for distribution or compliance with end-user requests or end-user device requirements
- H04N21/25866—Management of end-user data
- H04N21/44016—Processing of video elementary streams involving splicing one content stream with another content stream, e.g. for substituting a video clip
- H04N21/4402—Processing of video elementary streams involving reformatting operations of video signals for household redistribution, storage or real-time display
- H04N21/44218—Detecting physical presence or behaviour of the user, e.g. using sensors to detect if the user is leaving the room or changes his face expression during a TV program
- H04N21/4508—Management of client data or end-user data
- H04N21/4788—Supplemental services communicating with other users, e.g. chatting
Abstract
The invention discloses a video processing method and device, a server, and a client, relates to the technical field of video processing, and aims to solve the problem in the related art that users can interact through electronic devices in only a limited way. The method comprises the following steps: acquiring a user video, wherein the user video is captured of a watching user by a first client that is playing a live video; identifying the user video and determining whether the user in the user video is in a target state; loading the user video into the live video when it is determined that the user in the user video is in the target state; and sending the live video loaded with the user video to a second client. In this way, users can interact through video while watching the live video, and the interaction becomes more interesting and varied.
Description
Technical Field
The present invention relates to the field of video processing technologies, and in particular, to a video processing method, a device, a server, and a client.
Background
With the development of the mobile internet, people often use electronic devices such as mobile phones to handle daily affairs, and users can also interact with one another through these devices. However, in the related art, interaction through electronic devices is mostly limited to voice and text; for example, when watching a video, users interact by posting a barrage (bullet comments), so the interaction mode is monotonous.
Disclosure of Invention
The embodiment of the invention provides a video processing method, a video processing device, a server, and a client, which are used to solve the problem in the related art that the modes of user interaction through electronic devices are limited.
In a first aspect, an embodiment of the present invention provides a video processing method, applied to a server, where the method includes:
acquiring a user video, wherein the user video is captured of a watching user by a first client that is playing a live video;
identifying the user video, and determining whether a user in the user video is in a target state;
loading the user video into the live video under the condition that the user in the user video is in the target state;
and transmitting the live video loaded with the user video to a second client.
Optionally, the identifying the user video, determining whether the user in the user video is in a target state includes:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotion state.
Optionally, the target emotional state includes cheering state;
the identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state comprises the following steps:
determining an included angle between an arm and a trunk of a user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the plurality of user videos are obtained by different clients playing the live video respectively capturing video of their respective watching users;
The loading the user video into the live video if the user in the user video is determined to be in the target state comprises the following steps:
determining the action change amplitude of the user in each user video under the condition that the user in at least two user videos is in the target state;
determining target user videos from the at least two user videos, wherein the action change amplitude of the users in the target user videos is the largest;
loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
Optionally, the other user videos except the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in an intermediate position of the preset area in a second size, wherein the second size is larger than the first size.
Optionally, after the determining the target user video from the at least two user videos, the method further includes:
identifying the outline of the action core part of the user in the target user video;
Based on the contour, capturing a dynamic image of the action core part from the target user video;
loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is a display size of the target user video.
Optionally, after loading the dynamic image of the action core part in the third size at the position associated with the target user video in the play frame of the live video, the method further includes:
and identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action, wherein the first user video is at least one of the other user videos.
Optionally, the identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action includes:
Acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
and generating a collision effect graph of the dynamic image of the action core part and the first user video and loading the collision effect graph into the live video under the condition that the dynamic image of the action core part and the position information of the first user video are detected to be overlapped.
Optionally, the generating a collision effect map of the dynamic image of the action core part and the first user video includes:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to an action direction of the dynamic image of the action core part;
or determining a target motion speed and a target motion direction of the first user video after collision with the dynamic image of the action core part based on the motion speed and the motion direction of the action core part in the dynamic image of the action core part; and generating a flick effect diagram of the first user video in a playing picture of the live video according to the target moving speed and the target moving direction.
Optionally, the identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action includes:
gesture recognition is carried out on the target user video;
determining a second user video overlapping with the dynamic image existence position information of the action core part under the condition that the grabbing gesture of the user in the target user video is recognized, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with an end position of the moving track.
Optionally, the loading the user video into the live video if it is determined that the user in the user video is in the target state includes at least one of:
under the condition that the user in the user video is in the target state, carrying out background segmentation on the user video based on the character image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
Under the condition that the user in the user video is in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the plurality of user videos are obtained by different clients playing the live video respectively capturing video of their respective watching users;
after the capturing the plurality of user videos, the method further includes:
identifying, classifying, and counting the sounds in the plurality of user videos;
determining the proportion of sounds of a target category among the sounds in the plurality of user videos;
and processing the sound volume of the target category when the proportion is greater than a preset value.
Optionally, the processing the sound volume of the target category includes:
determining a sound volume of the target category based on the proportion, wherein the sound volume of the target category is positively correlated with the proportion;
and superimposing the sound of the target category onto the plurality of user videos according to the sound volume of the target category.
Optionally, the loading the user video into the live video if it is determined that the user in the user video is in the target state includes:
under the condition that the user in the user video is in the target state, carrying out face recognition on the user video, and determining the face position in the user video;
based on the face position in the user video, cutting the user video;
and loading the cut user video into the live video.
In a second aspect, an embodiment of the present invention further provides a video processing method, applied to a client, where the method includes:
receiving live video loaded with user video and issued by a server;
and playing the live video loaded with the user video.
Optionally, the method further comprises:
in the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
and uploading the third user video to the server so that the server can identify the third user video.
In a third aspect, an embodiment of the present invention further provides a video processing apparatus, applied to a server, where the video processing apparatus includes:
the first acquisition module is used for acquiring a user video, wherein the user video is captured of a watching user by a first client that is playing a live video;
the first identification module is used for identifying the user video and determining whether a user in the user video is in a target state or not;
the first processing module is used for loading the user video into the live video under the condition that the user in the user video is in the target state;
and the sending module is used for sending the live video loaded with the user video to the second client.
Optionally, the first recognition module is configured to recognize motion amplitude and/or facial expression of a user in the user video, and determine whether the user in the user video is in a target emotional state.
Optionally, the first identification module includes:
the first determining unit is used for determining an included angle between an arm and a trunk of a user in each frame of user image in the user video;
and the second determining unit is used for determining that the user in the user video is in the cheering state under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, wherein the target frame user image is any frame image in the user video.
Optionally, the first obtaining module is configured to obtain a plurality of user videos, wherein the plurality of user videos are obtained by different clients playing the live video respectively capturing video of their respective watching users;
the first processing module includes:
a third determining unit, configured to determine a motion variation amplitude of a user in each user video when determining that the user in at least two user videos is in the target state;
a fourth determining unit, configured to determine a target user video from the at least two user videos, where a motion variation amplitude of a user in the target user video is the largest;
the first processing unit is used for loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
Optionally, the other user videos except the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in an intermediate position of the preset area in a second size, wherein the second size is larger than the first size.
Optionally, the video processing device further includes:
the second identification module is used for identifying the outline of the action core part of the user in the target user video;
the intercepting module is used for intercepting the dynamic image of the action core part from the target user video based on the outline;
and the second processing module is used for loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
Optionally, the video processing device further includes:
and a seventh processing module, configured to identify an interaction action of the dynamic image of the action core part on the first user video, and generate an interactive display effect of the first user video in a play frame of the live video based on the interaction action, where the first user video is at least one of the other user videos.
Optionally, the video processing device further includes:
the second acquisition module is used for acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
And the third processing module is used for generating a collision effect graph of the dynamic image of the action core part and the first user video and loading the collision effect graph into the live video under the condition that the dynamic image of the action core part and the position information of the first user video are detected to be overlapped.
Optionally, the third processing module is configured to retract the first user video in a first direction of a play frame of the live video, where the first direction is related to an action direction of the dynamic image of the action core part;
or the third processing module is used for determining the target movement speed and the target movement direction of the first user video after collision with the dynamic image of the action core part based on the movement speed and the movement direction of the action core part in the dynamic image of the action core part; and generating a flick effect diagram of the first user video in a playing picture of the live video according to the target moving speed and the target moving direction.
Optionally, the video processing device further includes:
the third recognition module is used for carrying out gesture recognition on the target user video;
The first determining module is used for determining a second user video which is overlapped with the dynamic image of the action core part in the presence of position information under the condition that the grabbing gesture of the user in the target user video is recognized, wherein the second user video is one of the other user videos;
and a fourth processing module, configured to move the second user video to a target position based on a movement track of the moving image of the action core part in the play frame of the live video, where the target position is associated with an end position of the movement track.
Optionally, the first processing module includes:
the second processing unit is used for carrying out background segmentation on the user video based on the figure image outline in the user video under the condition that the user in the user video is in the target state, and determining a background area in the user video;
the third processing unit is used for filling a background area in the user video by using a preset color;
the fourth processing unit is used for loading the processed user video into the live video;
and/or, the first processing module comprises:
The identification unit is used for carrying out face identification on the user video and determining the face position in the user video under the condition that the user in the user video is in the target state;
a sixth processing unit, configured to clip the user video based on a face position in the user video;
and the seventh processing unit is used for loading the cut user video into the live video.
Optionally, the first obtaining module is configured to obtain a plurality of user videos, wherein the plurality of user videos are obtained by different clients playing the live video respectively capturing video of their respective watching users;
the video processing apparatus further includes:
the fifth processing module is used for identifying, classifying, and counting the sounds in the plurality of user videos;
the second determining module is configured to determine the proportion of sounds of the target category among the sounds in the plurality of user videos;
and the sixth processing module is used for processing the sound volume of the target category when the proportion is greater than a preset value.
Optionally, the sixth processing module includes:
a fifth determining unit, configured to determine a sound volume of the target category based on the proportion, wherein the sound volume of the target category is positively correlated with the proportion;
and a fifth processing unit, configured to superimpose the sound of the target category onto the plurality of user videos according to the sound volume of the target category.
In a fourth aspect, an embodiment of the present invention further provides a video processing apparatus, applied to a client, where the video processing apparatus includes:
the receiving module is used for receiving live video loaded with user video and issued by the server;
and the playing module is used for playing the live video loaded with the user video.
Optionally, the video processing device further includes:
the acquisition module is used for capturing video of a watching user during playing of the live video to obtain a third user video;
and the uploading module is used for uploading the third user video to the server so that the server can identify the third user video.
In a fifth aspect, an embodiment of the present invention further provides an electronic device, including: a transceiver, a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing the steps in the video processing method as described in the first aspect above or the steps in the video processing method as described in the second aspect above when the computer program is executed.
In a sixth aspect, embodiments of the present invention further provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the video processing method as described in the first aspect above; or to implement the steps in the video processing method as described in the second aspect above.
In the embodiment of the invention, a user video is acquired, wherein the user video is acquired by a first client for playing live video and acquiring the video of a watching user; identifying the user video, and determining whether a user in the user video is in a target state; loading the user video into the live video under the condition that the user in the user video is in the target state; and transmitting the live video loaded with the user video to a second client. Therefore, the user can interact through the video in the process of watching the live video, and the interaction mode is more interesting and rich.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.
Fig. 1 is a flowchart of a video processing method applied to a server provided in an embodiment of the present invention;
fig. 2 is a flowchart of audio and video data acquisition of a client according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of identifying user action magnitudes according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a live video frame after loading a user video according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of an enlarged video of a target user according to an embodiment of the present invention;
FIG. 6 is a schematic diagram of interaction effect between a user in a target user video and other user videos according to an embodiment of the present invention;
FIG. 7 is a second schematic diagram of interaction effects between a user in a target user video and other user videos according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of clipping a user video based on a user face according to an embodiment of the present invention;
fig. 9 is a flowchart of a video processing method applied to a client according to an embodiment of the present invention;
fig. 10 is a block diagram of a video processing apparatus according to an embodiment of the present invention;
FIG. 11 is a second block diagram of a video processing apparatus according to an embodiment of the present invention;
FIG. 12 is a block diagram of a server provided by an embodiment of the present invention;
Fig. 13 is a block diagram of a client according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are some, but not all embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Referring to fig. 1, fig. 1 is a flowchart of a video processing method provided by an embodiment of the present invention, applied to a server, as shown in fig. 1, and the method includes the following steps:
step 101, obtaining a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client for playing live video.
In the embodiment of the invention, the user video may be obtained by the first client turning on a camera and a microphone to capture video of the user watching the live video while the live video is being played, and the first client may be a client that is playing the live video and whose user has chosen to join the live interaction. The live video may be of any type, in particular a highly interactive live stream such as a live sports broadcast or a live e-sports broadcast.
For example, when a client user chooses to access live interaction with a sporting event, the client enables a camera and microphone to capture video of the user.
The obtaining the user video may be receiving the user video uploaded by the first client.
When the client captures the user video, in order to keep the audio and video data synchronized, timestamp verification may be performed on the captured audio data and video data. The specific flow may be as shown in fig. 2: audio data is captured through the microphone and video data through the camera, and the capture timestamps are recorded, yielding timestamped audio and video data; the timestamp of each audio packet is then checked against the timestamp of the corresponding video frame. When they are out of sync, for example when the timestamp of a video frame is ahead of the timestamp of the audio packet, video frames are dropped, and when the timestamp of a video frame falls behind the timestamp of the audio packet, frames are duplicated. The audio and video data that pass the verification and are timestamp-synchronized may then be transmitted to the server by means such as Web Real-Time Communication (WebRTC).
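The timestamp check described above can be expressed compactly. The following is a minimal sketch rather than the patent's implementation; the 40 ms tolerance, the data structure, and the function name are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class MediaUnit:
    timestamp_ms: int   # capture timestamp recorded by the client
    payload: bytes

def sync_video_to_audio(video_frames, audio_packets, tolerance_ms=40):
    """Drop or duplicate video frames so their timestamps track the audio clock."""
    synced = []
    ai = 0
    for frame in video_frames:
        # advance to the audio packet closest in time to this frame
        while ai + 1 < len(audio_packets) and \
              audio_packets[ai + 1].timestamp_ms <= frame.timestamp_ms:
            ai += 1
        drift = frame.timestamp_ms - audio_packets[ai].timestamp_ms
        if drift > tolerance_ms:
            continue                 # video ahead of audio: drop the frame
        synced.append(frame)
        if drift < -tolerance_ms:
            synced.append(frame)     # video behind audio: duplicate the frame
    return synced
```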
Step 102, identifying the user video, and determining whether the user in the user video is in a target state.
After the server acquires the user video, it may identify the interaction information in the user video, that is, recognize the user video, so as to determine whether the user in the user video is in a target state. The target state may be a state determined according to actual needs, for example a cheering, excited, angry, or sad state. When the user in the user video is in the target state, the user video may be considered to contain valid user interaction information.
Specifically, depending on the target state to be identified, motion recognition, emotion recognition, and the like may be performed on the user images in the user video, semantic understanding may be performed on the user's voice in the user video, and whether the user is in the target state may be determined by combining the recognition of the user images with the recognition of the user's voice.
Optionally, the identifying the user video, determining whether the user in the user video is in a target state includes:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotion state.
That is, in one embodiment, the target state may be a target emotional state, for example cheering, sad, excited, or angry, and the target emotional state may be set according to actual needs. In this embodiment, whether the user in the user video is in the target emotional state may be determined by identifying the motion amplitude and/or facial expression of the user in the user video. Specifically, the motion amplitude of a target part of the user, such as an arm, a leg, or the lips, may be identified to determine whether the user performs a target action such as swinging the arms, jumping, or shouting; or changes in the user's facial expression may be identified to determine whether the user shows a target expression such as sadness, anger, or happiness. When the user in the user video is identified as performing the target action or showing the target expression, the user may be determined to be in the target emotional state.
Therefore, the target emotion state of the user can be accurately identified by identifying the action amplitude and/or the facial expression of the user in the user video, and the quality of the user interaction video displayed in the live video is further ensured.
Optionally, the target emotional state includes cheering state;
the identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state comprises the following steps:
determining an included angle between an arm and a trunk of a user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
In an embodiment, the target emotional state may include a cheering state. Identifying the motion amplitude and/or facial expression of the user in the user video may then specifically mean identifying the angle between the arm and the torso of the user in each frame of the user video; when the angle between the arm and the torso in some frame is identified as being greater than a preset angle, the user in the user video may be considered to be in the cheering state. For example, users usually raise their arms when cheering, so the user may be determined to be cheering when the angle between the arm and the torso is greater than 90 degrees, 120 degrees, or the like.
Specifically, the angle between the arm and the torso of the user may be calculated with a skeleton-keypoint angle method. For example, referring to fig. 3, if the keypoints of the left shoulder, left arm, and left torso are a, b, and c, and the keypoints of the right shoulder, right arm, and right torso are a', b', and c', respectively, then the angles between the arms and the torso are ∠abc and ∠a'b'c'; when ∠abc or ∠a'b'c' is greater than the preset angle, the user is considered to be in the cheering state.
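As a concrete illustration of the skeleton-keypoint angle calculation, the sketch below computes the angle at the shoulder between the shoulder-to-elbow and shoulder-to-hip vectors from 2D keypoints. The keypoint names (from an arbitrary pose estimator), the choice of shoulder as the angle vertex, and the 120-degree threshold are assumptions for illustration, not values fixed by the patent.

```python
import math

def angle_deg(vertex, p1, p2):
    """Angle at `vertex` (in degrees) between rays vertex->p1 and vertex->p2."""
    v1 = (p1[0] - vertex[0], p1[1] - vertex[1])
    v2 = (p2[0] - vertex[0], p2[1] - vertex[1])
    dot = v1[0] * v2[0] + v1[1] * v2[1]
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / norm))))

def is_cheering(keypoints, threshold_deg=120.0):
    """keypoints: dict of 2D points from any pose estimator (assumed key names)."""
    left = angle_deg(keypoints["left_shoulder"],
                     keypoints["left_elbow"], keypoints["left_hip"])
    right = angle_deg(keypoints["right_shoulder"],
                      keypoints["right_elbow"], keypoints["right_hip"])
    return left > threshold_deg or right > threshold_deg
```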
Thus, according to this embodiment, the swing motion of the user can be accurately recognized, and the cheering state of the user can be recognized.
It should be noted that the interaction information, i.e., the target state, is identified and extracted from the user video data. The recognition may be performed at the server, where it is carried out for each path of video data (the video data uploaded by the different clients), or it may be performed separately at each client. Unified processing at the server offers better compatibility, whereas recognition at the client places certain demands on the client's processing power but can greatly reduce the load on the server.
Step 103, loading the user video into the live video under the condition that the user in the user video is in the target state.
In the embodiment of the present invention, when it is identified in step 102 that the user in the user video is in the target state, it may be determined that valid interaction information exists in the user video, and the user video may then be loaded into the live video, that is, displayed within the live video. To prevent the user video from blocking the live video and affecting the viewing experience, the user video may be displayed as a thumbnail, reduced to a certain size, at a corner position of the live video picture, such as the bottom, the top, the lower left corner, or the lower right corner. In this way, users watching the live video can see their own video within the live picture, realizing user interaction in the live video. A user may also deliberately assume the target state while watching the live video so that the server displays his or her video in the live picture, which makes the interaction more engaging.
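How the user video might be composited into the live picture can be sketched with OpenCV-style array slicing. The thumbnail size, margin, and bottom placement below are illustrative assumptions, not values from the patent.

```python
import cv2

def overlay_user_video(live_frame, user_frame, thumb_w=160, thumb_h=90, margin=10):
    """Paste a shrunken user-video frame near the bottom center of the live frame."""
    thumb = cv2.resize(user_frame, (thumb_w, thumb_h))
    H, W = live_frame.shape[:2]
    x = (W - thumb_w) // 2                      # horizontally centered
    y = H - thumb_h - margin                    # just above the bottom edge
    live_frame[y:y + thumb_h, x:x + thumb_w] = thumb
    return live_frame
```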
Optionally, the step 101 includes:
acquiring a plurality of user videos, wherein the plurality of user videos are obtained by different clients playing the live video respectively capturing video of their respective watching users;
The step 103 includes:
determining the action change amplitude of the user in each user video under the condition that the user in at least two user videos is in the target state;
determining target user videos from the at least two user videos, wherein the action change amplitude of the users in the target user videos is the largest;
loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
In one embodiment, the server may receive a plurality of user videos sent by a plurality of clients; the plurality of clients playing the live video may each capture video of their respective watching users and upload the captured user videos to the server.
The server may identify the user state in each user video to determine whether the user in each user video is in the target state. When it is determined that the users in at least two of the user videos are in the target state, the motion change amplitude, that is, the motion amplitude, of the user in each of those videos may be determined. Specifically, the motion change amplitude of the user in a user video may be determined by identifying the differences between the user's actions in different frames over a period of time. For example, for the cheering state, the difference between the arm-torso angles of the user in two frames captured at different times may be calculated: if the angle between the arm and the torso is 90 degrees in one second and 150 degrees in the next, the motion amplitude of the user's arm may be determined to be 60 degrees per second.
In this way, after the motion variation amplitude of the user in each user video is calculated, the at least two user videos can be ranked according to the motion variation amplitude, and the user video with the largest motion variation amplitude is selected as the target user video.
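The ranking step could look like the following sketch: each video's amplitude is taken as the largest per-second change of the arm-torso angle over a recent window, and the video with the maximum amplitude becomes the target. The data layout (one angle sample per second per video) and function names are assumptions.

```python
def motion_amplitude(angles_per_second):
    """angles_per_second: arm-torso angle (degrees) sampled once per second."""
    if len(angles_per_second) < 2:
        return 0.0
    return max(abs(b - a) for a, b in zip(angles_per_second, angles_per_second[1:]))

def pick_target_video(videos_in_target_state):
    """videos_in_target_state: list of (video_id, angle_samples) tuples."""
    ranked = sorted(videos_in_target_state,
                    key=lambda v: motion_amplitude(v[1]), reverse=True)
    return ranked[0][0] if ranked else None
```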
For the at least two user videos, they may be loaded in a preset area of the playing picture of the live video, such as an area that has little impact on the viewing experience (the bottom, top, left side, or right side of the playing picture), and they may be displayed centered within that area. When the number of user videos exceeds the maximum number that can be shown on one screen, the at least two user videos may be displayed cyclically, for example along the bottom of the live picture. The target user video, being the user video with the largest motion amplitude, may be displayed at the middle of the preset area, for example in the middle of the bottom of the live picture, that is, at the center of all the user videos. Because the motion amplitude of the users changes in real time, the motion change amplitude of each user video may be monitored in real time or periodically, and the user video currently showing the highest motion amplitude may be quickly moved to the middle position at the bottom of the live picture while the display order of the remaining user videos stays unchanged (unless user videos are added or removed).
For example, as shown in fig. 4, a plurality of user videos 41 that have joined the interaction may be displayed at the bottom of the play screen 40 of the live video, and the target user video 42 with the largest motion amplitude may be displayed at the center of the bottom of the screen. Because the user video with the highest motion amplitude changes, its display area may move to the left or the right; when the user videos exactly fill one row of the screen, a cyclic display mode may be adopted, so that when the videos shift to the left, the leftmost user video moves to the rightmost position.
Therefore, through the implementation mode, the user video with the largest motion amplitude (the best interaction performance) can be ensured to be displayed in the center of the live broadcast picture, so that the interaction enthusiasm of the user is improved, and the watching user is stimulated to actively interact in the process of watching the video so as to preempt the central position.
Optionally, the other user videos except the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in an intermediate position of the preset area in a second size, wherein the second size is larger than the first size.
In other words, in one embodiment, the user video with the largest motion amplitude may be displayed in an enlarged manner to highlight its interaction effect. Specifically, the target user video may be displayed at the middle position of the preset area in a second size, for example in the middle of the bottom of the play frame of the live video, while the other user videos are displayed in the preset area in a first size, for example along the bottom of the play frame. The second size is greater than the first size and may be 1.2 to 2 times the first size, adjusted according to the visual effect; the first size may be a default size determined from experiments with different display sizes, ensuring that the user video display area is neither too small nor too large.
For example, as shown in fig. 5, a target user video 42 in the bottom center of the live video screen 40 may be displayed in enlargement, and the other user videos 41 remain displayed in a default size.
Therefore, the user with good interaction performance can be highlighted by amplifying and displaying the user video with the largest motion amplitude, and the user is stimulated to perform active interaction in the watching process.
Optionally, after the determining the target user video from the at least two user videos, the method further includes:
identifying the outline of the action core part of the user in the target user video;
based on the contour, capturing a dynamic image of the action core part from the target user video;
loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is a display size of the target user video.
In still another embodiment, the user video with the largest motion amplitude may be highlighted in other ways. Specifically, the contour of the action core part of the user in the target user video may be identified, where the action core part may be a part, such as an arm or the face, that expresses the user's target action. Taking arm contour recognition as an example, a limb recognition algorithm may be used to locate the user's hands in the target user video, and a contour detection algorithm may then be used to detect the contours of the hands. Based on the identified arm contour, a dynamic image of the arm is cropped from the target user video, enlarged, and displayed at a position associated with the target user video in the play frame of the live video, for example above the target user video. The size of the arm dynamic image may be 2 to 5 times the size of the target user video below it, that is, the third size may be 2 to 5 times the second size, provided the arm dynamic image does not extend beyond the live video frame.
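One plausible way to realize the contour-and-crop step with off-the-shelf tools is sketched below: a binary mask of the action core part (assumed to come from whatever limb-recognition model is used) is turned into a contour with OpenCV, and the bounding region is cropped and enlarged each frame. The mask source and the 3x enlargement factor are assumptions.

```python
import cv2

def crop_action_part(frame, part_mask, scale=3.0):
    """Crop the action core part (e.g. the arms) from a frame and enlarge it.

    frame:     BGR image of the target user video.
    part_mask: uint8 binary mask of the action core part (assumed to be
               produced by a separate limb-recognition model).
    """
    contours, _ = cv2.findContours(part_mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    x, y, w, h = cv2.boundingRect(max(contours, key=cv2.contourArea))
    crop = frame[y:y + h, x:x + w]
    return cv2.resize(crop, (int(w * scale), int(h * scale)))
```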
For example, as shown in fig. 5, after capturing the arm dynamic image of the user in the target user video 42, the image may be displayed above the target user video 42 in an enlarged manner.
Therefore, more interesting interaction effects can be generated, and the interaction interest and enthusiasm of users are improved.
Optionally, after loading the dynamic image of the action core part in the third size at the position associated with the target user video in the play frame of the live video, the method further includes:
and identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action, wherein the first user video is at least one of the other user videos.
In other words, in one embodiment, the user in the target user video may also interact with the other user videos loaded in the live video through the action of the action core part. Specifically, when the user in the target user video performs an interaction action, such as touching, grabbing, or flicking, on another user video (referred to as a first user video) through the action core part, the server may identify the interaction action and, based on it, generate a corresponding interactive display effect for the first user video in the play frame of the live video. For example, when the interaction action is touching, an effect of the first user video being touched may be generated; when the interaction action is grabbing, an effect of the first user video being picked up and its display position exchanged may be generated; when the interaction action is flicking, an effect of the first user video being launched may be generated; and so on.
Thus, through the implementation mode, the user interaction effect in live broadcast can be further enhanced, the interaction interestingness is improved, and the enthusiasm of users for participating in live broadcast interaction is improved.
Optionally, the identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action includes:
acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
and generating a collision effect graph of the dynamic image of the action core part and the first user video and loading the collision effect graph into the live video under the condition that the dynamic image of the action core part and the position information of the first user video are detected to be overlapped.
In other words, in one embodiment, the user corresponding to the target user video may perform playful interactions with the other user videos through specific actions; for example, the user may touch, flick, and so on, the other user videos by controlling arm movements, so as to produce a physical collision effect.
The dynamic image of the action core part may be generated by capturing the action core part of the user in the target user video in real time. For example, when the user in the target user video moves his or her arm in different ways, the arm motion is correspondingly displayed above the target user video, so that the user can use the arm to touch other user video areas (each user video area may be referred to as a user video head portrait), flick other user video head portraits, and so on.
Taking the action core part as the arm as an example, to identify these actions and generate the corresponding interaction effects, the server may obtain, in real time, the position information of the arm dynamic image in the play frame of the live video, for example the coordinates of each point of the arm dynamic image in the play frame, as well as the coordinates of each of the at least two user videos in the play frame. It may then detect whether the coordinates of the arm dynamic image overlap those of any user video. When the coordinates of the arm dynamic image overlap those of a certain user video, this indicates that the user's arm touches that user video, that is, the arm collides with the user video, so a collision effect map of the arm dynamic image and the first user video may be generated. For example, when the user's arm bumps a user video head portrait, the bumped head portrait may be knocked aside to the left or right, presenting a physical collision effect, and the bumped user video may be restored to its original position after a certain period of time, such as 1 to 3 s.
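The overlap test between the arm's dynamic image and a user video head portrait amounts to an axis-aligned rectangle intersection. A minimal sketch follows; the coordinate layout and names are assumptions.

```python
from typing import NamedTuple

class Box(NamedTuple):
    x: float      # left edge in the live-video play picture
    y: float      # top edge
    w: float
    h: float

def boxes_overlap(a: Box, b: Box) -> bool:
    """True if the two screen-space rectangles intersect."""
    return (a.x < b.x + b.w and b.x < a.x + a.w and
            a.y < b.y + b.h and b.y < a.y + a.h)

def detect_collisions(arm_box: Box, avatar_boxes: dict) -> list:
    """Return the ids of user-video head portraits touched by the arm image."""
    return [vid for vid, box in avatar_boxes.items() if boxes_overlap(arm_box, box)]
```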
In this manner, the user corresponding to the target user video can also flick away other user video head portraits, or touch other user videos, and the server can accordingly generate an effect diagram in which the corresponding user video head portrait is flicked away or touched.
Thus, through the embodiment, the user occupying the central position can perform interesting interaction on the video head portraits of other users, and the interactive interest of the user in watching the live video is improved.
Optionally, the generating a collision effect map of the dynamic image of the action core part and the first user video includes:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to an action direction of the dynamic image of the action core part;
or determining a target motion speed and a target motion direction of the first user video after collision with the dynamic image of the action core part based on the motion speed and the motion direction of the action core part in the dynamic image of the action core part; and generating a flick effect diagram of the first user video in a playing picture of the live video according to the target moving speed and the target moving direction.
In other words, in one embodiment, the generation of the collision effect map of the dynamic image of the action core part and the first user video may be combined with motion recognition of the action core part in that dynamic image, such as recognition of the arm motion in an arm dynamic image. When the user's arm is recognized as extending to a certain user video head portrait, an effect map in which that user video head portrait is touched is generated; specifically, the touched user video is retracted towards the first direction of the play frame of the live video, such as the bottom direction. When the action core part, such as the arm, in the dynamic image touches a plurality of user video head portraits, downward retraction effects may be generated for all of them, and retraction effect maps with inconsistent heights may be generated according to the distance between each user video head portrait and the user's finger. For example, as shown in fig. 6, when the user's hand touches the left user video heads 43, these user video heads 43 may exhibit a physical collision effect of retracting downward.
The generation of the collision effect map of the dynamic image of the action core part and the first user video may further be combined with motion recognition of the action core part in that dynamic image: for example, when it is recognized that the user's finger flicks a certain user video head portrait, an effect map in which that user video head portrait is flicked away is generated. Specifically, taking an arm dynamic image as an example, the motion speed and motion direction of the arm in the arm dynamic image may be obtained; for instance, the number of pixels the arm moves per second in the play frame may be obtained to determine the arm motion speed, and the arm motion direction may be determined from the pixel positions of the arm before and after it moves in the play frame.
Then, treating the interaction as a one-dimensional elastic collision (in which both momentum and kinetic energy are conserved), the target motion speed and target motion direction of the flicked user video head portrait after collision with the arm dynamic image can be calculated. For example, it can be assumed that the mass of the first user video head portrait is m1 and the mass of the user's arm is m2, where m1 is equal to 1 and m2 is equal to the area ratio of the user's arm image in the target user video, that is, m2 equals the area of the arm dynamic image divided by the area of the target user video. Let the speed of the first user video head portrait before collision be v1 and the speed of the user's arm be v2; then from the momentum conservation equation m1v1 + m2v2 = m1v1' + m2v2' together with kinetic energy conservation, it follows that v1' = [(m1 - m2)v1 + 2m2v2]/(m1 + m2) and v2' = [(m2 - m1)v2 + 2m1v1]/(m1 + m2), where v1' and v2' are the post-collision velocities corresponding to v1 and v2 (a sign change indicating a reversed direction), and v1 and v2 may be given initial values or be actually measured speeds. In this way, the target motion speed and target motion direction of the first user video after collision with the arm dynamic image, namely v1', can be calculated, so that a flick effect diagram of the first user video in the play frame of the live video can be generated according to the target motion speed and target motion direction, that is, an effect diagram of the user video head portrait being flicked away by the user's finger.
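The post-collision velocities above are just the standard one-dimensional elastic collision result. A minimal sketch, assuming m1 = 1 and m2 equal to the arm-to-video area ratio as described, could look like the following (the function and parameter names are hypothetical):

def post_collision_velocities(v1: float, v2: float,
                              arm_area: float, video_area: float):
    # m1 is the head portrait "mass" (fixed at 1); m2 is the arm "mass",
    # taken as the arm dynamic image area divided by the target user video area.
    m1 = 1.0
    m2 = arm_area / video_area
    # One-dimensional elastic collision: momentum and kinetic energy conserved.
    v1_after = ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2)
    v2_after = ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2)
    return v1_after, v2_after

# Example: head portrait initially at rest (v1 = 0), arm moving at 200 px/s,
# arm covering a quarter of the target user video area.
print(post_collision_velocities(0.0, 200.0, arm_area=0.25, video_area=1.0))  # (80.0, -120.0)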
In addition, the influence of gravity on the motion speed of the user video head portrait may be ignored; instead, a global deceleration factor a, with a smaller than 1, can be set, so that the current speed of the user video head portrait is attenuated to v1' × a at each step. When the speed falls below a certain threshold, the flick effect disappears, and the user video head portrait returns to its original position along a straight line at a fixed speed. As for the direction in which the user video head portrait is flicked, the four sides of the play frame of the live video can be taken as mirror surfaces: when the user video head portrait touches one of the four sides, its motion direction is treated as a ray reflected by the mirror surface, so the head portrait moves along the mirror-reflection trajectory. When the user's touch action is identified, the value of the global deceleration factor a can be reduced for that physical collision, so that the speed of the head portrait is restrained within a certain range; that is, the smaller the value of a, the smaller the distance the user video head portrait moves after a collision.
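A per-frame update of the flicked head portrait, combining the global deceleration factor a with the mirror-like bounce off the four sides of the play frame described above, could be sketched as follows (a simplified illustration; the threshold values and names are assumptions):

def step_avatar(pos, vel, frame_w, frame_h, a=0.9, stop_speed=5.0):
    # Advance the flicked head portrait by one tick: move it, reflect its
    # velocity off the frame edges as if they were mirrors, and attenuate
    # the speed by the global deceleration factor a (0 < a < 1). A None
    # velocity is returned once the speed falls below stop_speed, meaning
    # the bounce effect ends and the head portrait should glide straight
    # back to its original position at a fixed speed.
    x, y = pos
    vx, vy = vel
    x, y = x + vx, y + vy
    if x < 0 or x > frame_w:      # hit the left or right edge: reflect horizontally
        vx = -vx
        x = max(0.0, min(x, float(frame_w)))
    if y < 0 or y > frame_h:      # hit the top or bottom edge: reflect vertically
        vy = -vy
        y = max(0.0, min(y, float(frame_h)))
    vx, vy = vx * a, vy * a       # global deceleration
    if (vx * vx + vy * vy) ** 0.5 < stop_speed:
        return (x, y), None
    return (x, y), (vx, vy)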
Thus, according to the embodiment, the physical collision effect of the dynamic image of the user action core part and the user video can be generated, and the quality of the effect diagram can be ensured.
Optionally, the identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action includes:
Gesture recognition is carried out on the target user video;
determining a second user video overlapping with the dynamic image existence position information of the action core part under the condition that the grabbing gesture of the user in the target user video is recognized, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with an end position of the moving track.
In an embodiment, the user corresponding to the target user video may further adjust the display positions of the user video head portraits in the preset area of the live video frame by controlling the arm action. One implementation is that the user's finger performs a grabbing action, and after a certain user video head portrait is grabbed, the display order of the user videos can be changed.
Specifically, the action core part may be an arm. The server may identify the grabbing action of the user in the target user video through a gesture recognition algorithm. When the user in the target user video performs the grabbing action, the user's finger in the arm dynamic image above the target user video will overlap the coordinates of a certain user video, so when it is identified that a certain user video overlaps the arm dynamic image in position information, the ID of that user video may be obtained. Then, based on the movement track of the arm dynamic image in the play frame of the live video, that user video may be moved along the track to a target position associated with the end position of the track. Specifically, after the user's finger grabs a user video head portrait, the adjusted order of the grabbed head portrait may be determined from the centre point coordinates of the grabbed video and the centre point coordinates of the other nearby user videos.
For example, as shown in fig. 7, when the user video a is grabbed to a certain position between the user video B and the user video C, it may be determined to insert the user video a between the user video B and the user video C by identifying the current center point coordinate position of the user video a and the center point coordinate positions of the user video B and the user video C.
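The reordering by centre-point coordinates can be illustrated with a short sketch (the names are assumed): the grabbed head portrait is removed from the row and re-inserted before the first remaining head portrait whose centre lies to the right of the drop position.

def insert_by_center(grabbed_id, drop_center_x, ordered_ids, centers):
    # ordered_ids: current left-to-right order of the user video head portraits;
    # centers: centre x-coordinate of each head portrait in the play frame;
    # drop_center_x: x-coordinate where the drag trajectory ends.
    row = [vid for vid in ordered_ids if vid != grabbed_id]
    for i, vid in enumerate(row):
        if drop_center_x < centers[vid]:
            row.insert(i, grabbed_id)
            break
    else:
        row.append(grabbed_id)
    return row

# Example mirroring fig. 7: user video A is dropped between B and C.
order = ["A", "B", "C", "D"]
centers = {"A": 100, "B": 220, "C": 340, "D": 460}
print(insert_by_center("A", 300, order, centers))  # ['B', 'A', 'C', 'D']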
Therefore, the user occupying the central position can perform interesting interaction in various modes on video head portraits of other users, and the interactive interest of the user in watching live videos is improved.
Optionally, the step 103 includes at least one of:
under the condition that the user in the user video is in the target state, carrying out background segmentation on the user video based on the character image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
under the condition that the user in the user video is in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
In one embodiment, the background in the user video can be segmented and filled, so that the purposes of unifying the background in the user interactive video and avoiding background interference are achieved.
Specifically, a background segmentation mode may be used to segment a person from a background of the user video, that is, a person image contour in the user video may be identified, then the background segmentation is performed on the user video based on the person image contour, a background area in the user video is determined, and a preset color is used to fill the background area in the user video, that is, a monochromatic system may be used to fill the background. When the number of the user videos is multiple, the background areas of different user videos can be respectively filled with different colors, so that the background is prevented from being too monotonous.
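Assuming a per-pixel person mask has already been produced by some segmentation model (not shown here), the background filling step itself is a simple masked assignment; the NumPy sketch below uses illustrative names and an arbitrary preset colour:

import numpy as np

def fill_background(frame: np.ndarray, person_mask: np.ndarray,
                    color=(32, 96, 160)) -> np.ndarray:
    # frame: H x W x 3 user video frame; person_mask: H x W boolean mask that
    # is True on person pixels. All background pixels are replaced with one
    # preset colour so that every interaction video shares a clean backdrop.
    out = frame.copy()
    out[~person_mask] = color
    return out

# Example with a dummy 4 x 4 frame in which the left half is "person".
frame = np.zeros((4, 4, 3), dtype=np.uint8)
mask = np.zeros((4, 4), dtype=bool)
mask[:, :2] = True
print(fill_background(frame, mask)[0])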
In this way, by performing color filling processing on the background area in the user video, the problems of unclear pictures or poor viewing experience caused by excessively complex background can be reduced.
In another embodiment, clipping processing may further be performed on the user video, so as to ensure that the face area of the user is the main content displayed in the play frame of the live video. Specifically, face recognition may be performed on the user video to determine the coordinate position of the user's face, and then each frame of user image in the user video may be cropped at a certain aspect ratio, for example 1:1, based on the recognised face position. The cropping rule may be as shown in fig. 8: the user's face 80 is kept horizontally centred in the picture and vertically located between the middle of the picture and one third from its top, where the frame line in fig. 8 marks the cropping area. Finally, the cropped user video may be loaded into the live video; for example, as shown in fig. 4, the cropped user video 41 may be displayed at the bottom of the play frame 40 of the live video.
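Assuming a face detector has already returned the centre coordinates of the face (the detection itself is not shown), the 1:1 crop that keeps the face horizontally centred and roughly one third from the top can be sketched as follows (illustrative names only):

import numpy as np

def crop_around_face(frame: np.ndarray, face_cx: int, face_cy: int,
                     crop_size: int) -> np.ndarray:
    # Square (1:1) crop placing the face centre in the horizontal middle of
    # the crop and about one third from its top, clamped to the frame borders.
    h, w = frame.shape[:2]
    left = face_cx - crop_size // 2
    top = face_cy - crop_size // 3
    left = max(0, min(left, w - crop_size))
    top = max(0, min(top, h - crop_size))
    return frame[top:top + crop_size, left:left + crop_size]

# Example: a 360 x 360 crop around a face detected at (400, 200) in a 1280 x 720 frame.
frame = np.zeros((720, 1280, 3), dtype=np.uint8)
print(crop_around_face(frame, 400, 200, 360).shape)  # (360, 360, 3)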
Therefore, by cutting the user video, the user interactive video displayed in the live video can be ensured to highlight the head portrait characteristics of the interactive user, and the display effect of the interactive video is ensured.
Optionally, the step 101 includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively acquiring videos of respective watching users for different clients playing live videos;
after the step 101, the method further includes:
identifying and classifying and counting the sounds in the plurality of user videos;
determining a number duty cycle of a sound of a target category in the plurality of user videos;
and processing the sound volume of the target class under the condition that the quantity ratio is larger than a preset value.
In one embodiment, the server may receive a plurality of user videos sent by a plurality of clients, and the plurality of clients playing the live video may respectively perform video acquisition on respective watching users, and upload the respective acquired user videos to the server.
The server may obtain the total number of users currently participating in the video interaction of the current live scene, denoted u, that is, the number of obtained user videos; for example, when 100 users are currently watching a match in the current live room and have joined the interaction, u = 100.
The server may also identify and classify the user sound in each user video and determine the sound category of the user in each user video, such as cheering, speaking, crying, sneezing or non-human sound. Specifically, a sound classification algorithm based on deep learning or the like may be used to analyse in real time all the user sounds accessing the live video interaction and determine the sound category of each user, while irrelevant sounds, such as sneezing and non-human sounds, are filtered out without being counted. The numbers of the remaining sound categories are then counted; for example, it may be determined that 20 users currently make no sound, 45 users speak quietly and 35 users speak loudly, namely cheer. In this way, the number of sounds of a target category can be determined, where the target category may be determined according to the user state to be identified; for example, the sound of the target category may be cheering, crying and the like. Denoting the current number of target-category sounds as S, with the target category being cheering, S = 35, and the number ratio of the sound of the target category, namely S/u, can be determined.
In the case where the number ratio of the sound of the target category is larger than the preset value, the current atmosphere may be identified as the target atmosphere; for example, when the ratio of cheering sounds is recognised to exceed the preset value, it may be determined that the current atmosphere is a cheering atmosphere. The preset value may differ according to the type of the live video; for example, for a World Cup live video the preset value may be set higher, such as 0.8, while for an ordinary live video it may be set lower, such as 0.3.
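The counting and the ratio threshold can be illustrated with a brief sketch (category labels, threshold values and function names are assumed for illustration):

from collections import Counter

def target_ratio(sound_labels, target="cheer", ignore=("non_human", "sneeze")):
    # sound_labels: one classified label per connected user video, produced
    # by an audio classification model that is not shown here. Irrelevant
    # categories are dropped before any statistics are taken.
    u = len(sound_labels)
    counts = Counter(lbl for lbl in sound_labels if lbl not in ignore)
    return counts[target] / u if u else 0.0

def is_target_atmosphere(ratio, live_type="regular"):
    # The preset value may differ per live video type, e.g. higher for a
    # World Cup stream than for an ordinary live room (values assumed).
    thresholds = {"world_cup": 0.8, "regular": 0.3}
    return ratio > thresholds.get(live_type, 0.3)

labels = ["cheer"] * 35 + ["talk"] * 45 + ["silent"] * 20
r = target_ratio(labels)             # 35 / 100 = 0.35
print(r, is_target_atmosphere(r))    # 0.35 True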
In the case where it is determined that the current atmosphere is the target atmosphere, the sound volume of the target category may be processed, for example, the sound volume of the target category may be set to match the current atmosphere, the cheering sound volume may be set to be larger for cheering atmosphere, the crying sound volume may be set to be smaller for sad atmosphere, and so on.
Therefore, through identifying and classifying the sound in the user video accessed to the interaction, the sound volume of the target category is determined, and the playing effect of the user interaction video can be matched with the current atmosphere, so that the interaction effect in the live video is ensured.
Optionally, the processing the sound volume of the target class includes:
determining a sound volume of the target class based on the quantity duty cycle, wherein the sound volume of the target class is positively correlated with the quantity duty cycle;
and superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
That is, in one embodiment, the sound volume of the target category may be determined according to the number ratio of the sound of the target category; specifically, the larger the number ratio of the target category is, the larger its sound volume may be, and the determined sound volume of the target category may be superimposed on the plurality of user videos, so as to achieve the effect that the volume of the target-category sound gradually increases as its number ratio increases. For example, as the number of users currently cheering grows, the cheering volume emitted from the plurality of user videos increases when the client plays the live video.
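One simple way to realise the positive correlation, together with a gradual ramp-up rather than an instantaneous jump, is sketched below (the linear mapping and the step size are assumptions, not values from the embodiments):

def target_volume(ratio: float, base: float = 0.2, max_gain: float = 1.0) -> float:
    # Map the number ratio of the target-category sound to a playback volume
    # in [base, max_gain]: the more users cheering, the louder the cheer track.
    return min(max_gain, base + (max_gain - base) * ratio)

def smooth_volume(current: float, desired: float, step: float = 0.05) -> float:
    # Move the actual volume towards the desired value a little per tick so
    # the superimposed sound does not jump to full loudness instantly.
    if desired > current:
        return min(desired, current + step)
    return max(desired, current - step)

vol = 0.2
for _ in range(5):                    # five ticks towards the desired volume
    vol = smooth_volume(vol, target_volume(0.35))
print(round(vol, 2))                  # 0.45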
In this way, the target-category sound can be prevented from becoming loud instantaneously and startling users who are not psychologically prepared, achieving a smooth transition in sound volume and ensuring the user interaction effect.
And 104, transmitting the live video loaded with the user video to a second client.
The server may send the live video loaded with the user video to the second client, where the second client may include the first client, or may include clients other than the first client; that is, users who do not participate in the interaction may also see the interaction videos of other users in the live video, and of course, a user who does not participate in the interaction may also choose not to receive the interaction videos of other users and only watch the live video.
And after receiving the live video loaded with the user video, the second client can display the user video in a live video picture, namely, the second client user can watch the live video and watch the interactive video of the second client or other users.
The video processing method of the embodiment of the invention obtains the user video, wherein the user video is obtained by video acquisition of a watching user by a first client for playing live video; identifying the user video, and determining whether a user in the user video is in a target state; loading the user video into the live video under the condition that the user in the user video is in the target state; and transmitting the live video loaded with the user video to a second client. Therefore, the user can interact through the video in the process of watching the live video, and the interaction mode is more interesting and rich.
Referring to fig. 9, fig. 9 is a flowchart of a video processing method provided by an embodiment of the present invention, applied to a client, as shown in fig. 9, the method includes the following steps:
step 901, receiving live video loaded with user video and issued by a server.
Step 902, playing the live video loaded with the user video.
The user video is obtained by acquiring video of a watching user by a first client for playing live video, and the user in the user video is in a target state.
Optionally, the method further comprises:
in the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
and uploading the third user video to the server so that the server can identify the third user video.
It should be noted that this embodiment is the client-side counterpart of the embodiment shown in fig. 1; for its specific implementation, reference may be made to the related description of the embodiment shown in fig. 1, which is not repeated herein.
According to the video processing method, live video loaded with user video and issued by a server is received; and playing the live video loaded with the user video. Therefore, the user can interact through the video in the process of watching the live video, and the interaction mode is more interesting and rich.
The embodiment of the invention also provides a video processing device. Referring to fig. 10, fig. 10 is a block diagram of a video processing apparatus according to an embodiment of the present invention, which is applied to a server. Since the principle of the video processing device for solving the problem is similar to that of the video processing method in the embodiment of the present invention, the implementation of the video processing device can refer to the implementation of the method, and the repetition is omitted.
As shown in fig. 10, the video processing apparatus 1000 includes:
a first obtaining module 1001, configured to obtain a user video, where the user video is obtained by performing video collection on a watching user by a first client playing a live video;
a first identifying module 1002, configured to identify the user video, and determine whether a user in the user video is in a target state;
a first processing module 1003, configured to load the user video into the live video if it is determined that the user in the user video is in the target state;
and a sending module 1004, configured to send the live video loaded with the user video to a second client.
Optionally, the first identifying module 1002 is configured to identify an action amplitude and/or a facial expression of a user in the user video, and determine whether the user in the user video is in a target emotional state.
Optionally, the first identification module 1002 includes:
the first determining unit is used for determining an included angle between an arm and a trunk of a user in each frame of user image in the user video;
and the second determining unit is used for determining that the user in the user video is in the cheering state under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, wherein the target frame user image is any frame image in the user video.
Optionally, the first obtaining module 1001 is configured to obtain a plurality of user videos, where the plurality of user videos are obtained by respectively performing video collection on respective watching users by different clients playing live videos;
the first processing module 1003 includes:
a third determining unit, configured to determine a motion variation amplitude of a user in each user video when determining that the user in at least two user videos is in the target state;
a fourth determining unit, configured to determine a target user video from the at least two user videos, where a motion variation amplitude of a user in the target user video is the largest;
the first processing unit is used for loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
Optionally, the other user videos except the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in an intermediate position of the preset area in a second size, wherein the second size is larger than the first size.
Optionally, the video processing apparatus 1000 further includes:
the second identification module is used for identifying the outline of the action core part of the user in the target user video;
the intercepting module is used for intercepting the dynamic image of the action core part from the target user video based on the outline;
and the second processing module is used for loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is the display size of the target user video.
Optionally, the video processing apparatus 1000 further includes:
and a seventh processing module, configured to identify an interaction action of the dynamic image of the action core part on the first user video, and generate an interactive display effect of the first user video in a play frame of the live video based on the interaction action, where the first user video is at least one of the other user videos.
Optionally, the video processing apparatus 1000 further includes:
the second acquisition module is used for acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
and the third processing module is used for generating a collision effect graph of the dynamic image of the action core part and the first user video and loading the collision effect graph into the live video under the condition that the dynamic image of the action core part and the position information of the first user video are detected to be overlapped.
Optionally, the third processing module is configured to retract the first user video in a first direction of a play frame of the live video, where the first direction is related to an action direction of the dynamic image of the action core part;
or the third processing module is used for determining the target movement speed and the target movement direction of the first user video after collision with the dynamic image of the action core part based on the movement speed and the movement direction of the action core part in the dynamic image of the action core part; and generating a flick effect diagram of the first user video in a playing picture of the live video according to the target moving speed and the target moving direction.
Optionally, the video processing apparatus 1000 further includes:
the third recognition module is used for carrying out gesture recognition on the target user video;
the first determining module is used for determining a second user video which is overlapped with the dynamic image of the action core part in the presence of position information under the condition that the grabbing gesture of the user in the target user video is recognized, wherein the second user video is one of the other user videos;
and a fourth processing module, configured to move the second user video to a target position based on a movement track of the moving image of the action core part in the play frame of the live video, where the target position is associated with an end position of the movement track.
Optionally, the first processing module 1003 includes:
the second processing unit is used for carrying out background segmentation on the user video based on the figure image outline in the user video under the condition that the user in the user video is in the target state, and determining a background area in the user video;
the third processing unit is used for filling a background area in the user video by using a preset color;
The fourth processing unit is used for loading the processed user video into the live video;
and/or, the first processing module 1003 includes:
the identification unit is used for carrying out face identification on the user video and determining the face position in the user video under the condition that the user in the user video is in the target state;
a sixth processing unit, configured to clip the user video based on a face position in the user video;
and the seventh processing unit is used for loading the cut user video into the live video.
Optionally, the first obtaining module 1001 is configured to obtain a plurality of user videos, where the plurality of user videos are obtained by respectively performing video collection on respective watching users by different clients playing live videos;
the video processing apparatus 1000 further includes:
the fifth processing module is used for identifying and classifying and counting the sounds in the plurality of user videos;
a second determining module, configured to determine a number ratio of the sounds of the target categories in the plurality of user videos;
and the sixth processing module is used for processing the sound volume of the target class under the condition that the quantity ratio is larger than a preset value.
Optionally, the sixth processing module includes:
a fifth determining unit configured to determine a sound volume of the target category based on the number duty ratio, wherein the sound volume of the target category is positively correlated with the number duty ratio;
and a fifth processing unit, configured to superimpose the sound of the target category on the plurality of user videos according to the sound volume of the target category.
The video processing device provided by the embodiment of the present invention may execute the above method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein.
The video processing device 1000 of the embodiment of the invention acquires a user video, wherein the user video is obtained by performing video acquisition on a watching user by a first client for playing live video; identifying the user video, and determining whether a user in the user video is in a target state; loading the user video into the live video under the condition that the user in the user video is in the target state; and transmitting the live video loaded with the user video to a second client. Therefore, the user can interact through the video in the process of watching the live video, and the interaction mode is more interesting and rich.
The embodiment of the invention also provides a video processing device. Referring to fig. 11, fig. 11 is a block diagram of a video processing apparatus according to an embodiment of the present invention, which is applied to a client. Since the principle of the video processing device for solving the problem is similar to that of the video processing method in the embodiment of the present invention, the implementation of the video processing device can refer to the implementation of the method, and the repetition is omitted.
As shown in fig. 11, the video processing apparatus 1100 includes:
the receiving module 1101 is configured to receive a live video loaded with a user video, which is issued by a server;
and a playing module 1102, configured to play the live video loaded with the user video.
The user video is obtained by acquiring video of a watching user by a first client for playing live video, and the user in the user video is in a target state.
Optionally, the video processing apparatus 1100 further includes:
the acquisition module is used for acquiring video of a watching user in the process of playing the live video to obtain a third user video;
and the uploading module is used for uploading the third user video to the server so that the server can identify the third user video.
The video processing device provided by the embodiment of the present invention may execute the above method embodiment, and its implementation principle and technical effects are similar, and this embodiment will not be described herein.
The video processing device 1100 of the embodiment of the invention receives live video loaded with user video issued by a server; and plays the live video loaded with the user video. Therefore, the user can interact through the video in the process of watching the live video, and the interaction mode is more interesting and rich.
The embodiment of the invention also provides electronic equipment. Because the principle of solving the problem of the electronic device is similar to that of the video processing method in the embodiment of the present invention, the implementation of the electronic device can refer to the implementation of the method, and the repetition is omitted. In one embodiment, the electronic device may be a server, as shown in fig. 12, where the server includes:
processor 1200 for reading the program in memory 1220, performs the following process:
acquiring a user video, wherein the user video is acquired by acquiring a video of a watching user by a first client for playing live video;
identifying the user video, and determining whether a user in the user video is in a target state;
Loading the user video into the live video under the condition that the user in the user video is in the target state;
live video loaded with the user video is delivered to a second client via transceiver 1210.
A transceiver 1210 for receiving and transmitting data under the control of the processor 1200.
In fig. 12, the bus architecture may comprise any number of interconnected buses and bridges, linking together one or more processors represented by processor 1200 and various circuits of the memory represented by memory 1220. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and are therefore not described further herein. The bus interface provides an interface. The transceiver 1210 may be a number of elements, that is, include a transmitter and a receiver, providing a means for communicating with various other apparatuses over a transmission medium. The processor 1200 is responsible for managing the bus architecture and general processing, and the memory 1220 may store data used by the processor 1200 in performing operations.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
And identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotion state.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
determining an included angle between an arm and a trunk of a user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively acquiring videos of respective watching users for different clients playing live videos;
determining the action change amplitude of the user in each user video under the condition that the user in at least two user videos is in the target state;
determining target user videos from the at least two user videos, wherein the action change amplitude of the users in the target user videos is the largest;
Loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
Optionally, the other user videos except the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in an intermediate position of the preset area in a second size, wherein the second size is larger than the first size.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
identifying the outline of the action core part of the user in the target user video;
based on the contour, capturing a dynamic image of the action core part from the target user video;
loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is a display size of the target user video.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
And identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action, wherein the first user video is at least one of the other user videos.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
and generating a collision effect graph of the dynamic image of the action core part and the first user video and loading the collision effect graph into the live video under the condition that the dynamic image of the action core part and the position information of the first user video are detected to be overlapped.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to an action direction of the dynamic image of the action core part;
Or determining a target motion speed and a target motion direction of the first user video after collision with the dynamic image of the action core part based on the motion speed and the motion direction of the action core part in the dynamic image of the action core part; and generating a flick effect diagram of the first user video in a playing picture of the live video according to the target moving speed and the target moving direction.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
gesture recognition is carried out on the target user video;
determining a second user video overlapping with the dynamic image existence position information of the action core part under the condition that the grabbing gesture of the user in the target user video is recognized, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with an end position of the moving track.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform at least one of the following steps:
Under the condition that the user in the user video is in the target state, carrying out background segmentation on the user video based on the character image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
under the condition that the user in the user video is in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively acquiring videos of respective watching users for different clients playing live videos;
identifying and classifying and counting the sounds in the plurality of user videos;
determining a number duty cycle of a sound of a target category in the plurality of user videos;
And processing the sound volume of the target class under the condition that the quantity ratio is larger than a preset value.
Optionally, the processor 1200 is further configured to read the program in the memory 1220, and perform the following steps:
determining a sound volume of the target class based on the quantity duty cycle, wherein the sound volume of the target class is positively correlated with the quantity duty cycle;
and superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
When the electronic device provided by the embodiment of the present invention is used as a server, the method embodiment shown in fig. 1 may be executed, and the implementation principle and the technical effect are similar, so that the description of this embodiment is omitted here.
In another embodiment, the electronic device may be a client, as shown in fig. 13, including:
processor 1300, for reading the program in memory 1320, performs the following procedure:
receiving live video loaded with user video and issued by a server through a transceiver 1310;
and playing the live video loaded with the user video.
The user video is obtained by acquiring video of a watching user by a first client for playing live video, and the user in the user video is in a target state.
A transceiver 1310 for receiving and transmitting data under the control of the processor 1300.
In fig. 13, the bus architecture may comprise any number of interconnected buses and bridges, linking together various circuits of one or more processors represented by processor 1300 and of the memory represented by memory 1320. The bus architecture may also link together various other circuits such as peripheral devices, voltage regulators and power management circuits, which are well known in the art and are therefore not described further herein. The bus interface provides an interface. The transceiver 1310 may be a number of elements, that is, include a transmitter and a receiver, providing a means for communicating with various other apparatuses over a transmission medium. For different user equipment, the user interface 1330 may also be an interface capable of externally connecting needed devices, including but not limited to a keypad, a display, a speaker, a microphone, a joystick, and the like.
The processor 1300 is responsible for managing the bus architecture and general processing, and the memory 1320 may store data used by the processor 1300 in performing operations.
Optionally, the processor 1300 is further configured to read the program in the memory 1320, and perform the following steps:
In the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
and uploading the third user video to the server through a transceiver 1310 so that the server can identify the third user video.
When the electronic device provided by the embodiment of the present invention is used as a client, the method embodiment shown in fig. 9 may be executed, and the implementation principle and the technical effect are similar, so that the description of this embodiment is omitted here.
Furthermore, a computer readable storage medium according to an embodiment of the present invention stores a computer program, which in one implementation is executable by a processor to implement the steps of:
acquiring a user video, wherein the user video is acquired by acquiring a video of a watching user by a first client for playing live video;
identifying the user video, and determining whether a user in the user video is in a target state;
loading the user video into the live video under the condition that the user in the user video is in the target state;
and transmitting the live video loaded with the user video to a second client.
Optionally, the identifying the user video, determining whether the user in the user video is in a target state includes:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotion state.
Optionally, the identifying the motion amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in the target emotional state includes:
determining an included angle between an arm and a trunk of a user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively acquiring videos of respective watching users for different clients playing live videos;
the loading the user video into the live video if the user in the user video is determined to be in the target state comprises the following steps:
Determining the action change amplitude of the user in each user video under the condition that the user in at least two user videos is in the target state;
determining target user videos from the at least two user videos, wherein the action change amplitude of the users in the target user videos is the largest;
loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
Optionally, the other user videos except the target user video in the at least two user videos are loaded in a preset area of a playing picture of the live video in a first size, and the target user video is loaded in an intermediate position of the preset area in a second size, wherein the second size is larger than the first size.
Optionally, after the determining the target user video from the at least two user videos, the method further includes:
identifying the outline of the action core part of the user in the target user video;
based on the contour, capturing a dynamic image of the action core part from the target user video;
Loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video in a third size, wherein the third size is larger than a second size, and the second size is a display size of the target user video.
Optionally, after loading the dynamic image of the action core part in the third size at the position associated with the target user video in the play frame of the live video, the method further includes:
and identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action, wherein the first user video is at least one of the other user videos.
Optionally, the identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action includes:
acquiring the position information of the dynamic image of the action core part in the playing picture of the live video;
And generating a collision effect graph of the dynamic image of the action core part and the first user video and loading the collision effect graph into the live video under the condition that the dynamic image of the action core part and the position information of the first user video are detected to be overlapped.
Optionally, the generating a collision effect map of the dynamic image of the action core part and the first user video includes:
retracting the first user video to a first direction of a playing picture of the live video, wherein the first direction is related to an action direction of the dynamic image of the action core part;
or determining a target motion speed and a target motion direction of the first user video after collision with the dynamic image of the action core part based on the motion speed and the motion direction of the action core part in the dynamic image of the action core part; and generating a flick effect diagram of the first user video in a playing picture of the live video according to the target moving speed and the target moving direction.
Optionally, the identifying the interaction action of the dynamic image of the action core part on the first user video, and generating the interaction display effect of the first user video in the play picture of the live video based on the interaction action includes:
Gesture recognition is carried out on the target user video;
determining a second user video overlapping with the dynamic image existence position information of the action core part under the condition that the grabbing gesture of the user in the target user video is recognized, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a moving track of the dynamic image of the action core part in a playing picture of the live video, wherein the target position is associated with an end position of the moving track.
Optionally, the loading the user video into the live video if it is determined that the user in the user video is in the target state includes at least one of:
under the condition that the user in the user video is in the target state, carrying out background segmentation on the user video based on the character image outline in the user video, and determining a background area in the user video; filling a background area in the user video by using a preset color; loading the processed user video into the live video;
Under the condition that the user in the user video is in the target state, carrying out face recognition on the user video, and determining the face position in the user video; based on the face position in the user video, cutting the user video; and loading the cut user video into the live video.
Optionally, the acquiring the user video includes:
acquiring a plurality of user videos, wherein the user videos are obtained by respectively acquiring videos of respective watching users for different clients playing live videos;
after the capturing the plurality of user videos, the method further includes:
identifying and classifying and counting the sounds in the plurality of user videos;
determining a number duty cycle of a sound of a target category in the plurality of user videos;
and processing the sound volume of the target class under the condition that the quantity ratio is larger than a preset value.
Optionally, the processing the sound volume of the target class includes:
determining a sound volume of the target class based on the quantity duty cycle, wherein the sound volume of the target class is positively correlated with the quantity duty cycle;
And superposing the sound of the target category in the plurality of user videos according to the sound volume of the target category.
In another embodiment, the computer program is executable by a processor to perform the steps of:
receiving live video loaded with user video and issued by a server;
playing the live video loaded with the user video;
the user video is obtained by acquiring video of a watching user by a first client for playing live video, and the user in the user video is in a target state.
Optionally, the method further comprises:
in the process of playing the live video, video acquisition is carried out on a watching user to obtain a third user video;
and uploading the third user video to the server so that the server can identify the third user video.
In the several embodiments provided in the present application, it should be understood that the disclosed methods and apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing unit, or each unit may be physically included separately, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional unit is stored in a storage medium, and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform part of the steps of the transceiving method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
While the foregoing is directed to the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the present invention.
Claims (16)
1. A video processing method applied to a server, the method comprising:
acquiring a user video, wherein the user video is acquired by acquiring a video of a watching user by a first client for playing live video;
identifying the user video, and determining whether a user in the user video is in a target state;
loading the user video into the live video under the condition that the user in the user video is in the target state;
transmitting the live video loaded with the user video to a second client;
the obtaining the user video comprises the following steps: acquiring a plurality of user videos, wherein the user videos are obtained by respectively acquiring videos of respective watching users for different clients playing live videos;
the loading the user video into the live video if the user in the user video is determined to be in the target state comprises the following steps: determining target user videos from the at least two user videos, wherein the action change amplitude of the users in the target user videos is the largest;
identifying the outline of the action core part of the user in the target user video; based on the contour, capturing a dynamic image of the action core part from the target user video; loading the dynamic image of the action core part at the position associated with the target user video in the play picture of the live video;
Wherein: determining a target movement speed and a target movement direction of a first user video after collision with the dynamic image of the action core part based on the movement speed and the movement direction of the action core part in the dynamic image of the action core part; according to the target movement speed and the target movement direction, generating a flick effect diagram of the first user video in a play picture of the live video; the first user video is at least one of the at least two user videos other than the target user video.
2. The method of claim 1, wherein the identifying the user video, determining whether a user in the user video is in a target state, comprises:
and identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotion state.
3. The method of claim 2, wherein the target emotional state comprises an cheering state;
the identifying the action amplitude and/or facial expression of the user in the user video, and determining whether the user in the user video is in a target emotional state comprises the following steps:
Determining an included angle between an arm and a trunk of a user in each frame of user image in the user video;
and under the condition that the included angle between the arm and the trunk of the user in the target frame user image is larger than a preset angle, determining that the user in the user video is in the cheering state, wherein the target frame user image is any frame image in the user video.
4. The method of claim 1, wherein loading the user video into the live video if it is determined that a user in the user video is in the target state, further comprises:
determining an action change amplitude of the user in each user video under the condition that the users in at least two of the user videos are in the target state;
loading the at least two user videos in a preset area of a playing picture of the live video, wherein the target user video is located in the middle of the preset area.
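One simple way to score the action change amplitude used in claims 1 and 4 is a mean inter-frame difference; the sketch below assumes decoded BGR frames and is only an illustrative stand-in for whatever motion measure an implementation actually uses.

```python
import cv2
import numpy as np

def action_change_amplitude(frames_bgr):
    """Mean absolute inter-frame difference as an illustrative amplitude score."""
    if len(frames_bgr) < 2:
        return 0.0
    grays = [cv2.cvtColor(f, cv2.COLOR_BGR2GRAY) for f in frames_bgr]
    diffs = [float(np.mean(cv2.absdiff(a, b))) for a, b in zip(grays, grays[1:])]
    return float(np.mean(diffs))

def pick_target_video(videos_frames):
    """Index of the user video whose user shows the largest action change amplitude."""
    scores = [action_change_amplitude(frames) for frames in videos_frames]
    return int(np.argmax(scores))
```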
5. The method of claim 4, wherein:
the dynamic image of the action core part is loaded, at a third size, at the position associated with the target user video in the play picture of the live video, wherein the third size is larger than a second size, and the second size is a display size of the target user video.
6. The method of claim 5, wherein after the loading, at the third size, the dynamic image of the action core part at the position associated with the target user video in the play picture of the live video, the method further comprises:
acquiring position information of the dynamic image of the action core part in the play picture of the live video;
and under the condition that the position information of the dynamic image of the action core part is detected to overlap with position information of the first user video, generating a collision effect image of the dynamic image of the action core part and the first user video, and loading the collision effect image into the live video.
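The overlap check that triggers the collision effect of claim 6 could be as simple as an axis-aligned bounding-box test over the position information of the two elements; the (x, y, w, h) representation below is an assumption for illustration only.

```python
def boxes_overlap(a, b):
    """a, b: (x, y, w, h) bounding boxes in play-picture coordinates."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah

# Example: core-part image at (400, 120) sized 200x200 versus a user-video tile.
if boxes_overlap((400, 120, 200, 200), (550, 180, 160, 90)):
    print("overlap detected -> render collision effect and load it into the live video")
```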
7. The method of claim 5, wherein after the loading, at the third size, the dynamic image of the action core part at the position associated with the target user video in the play picture of the live video, the method further comprises:
performing gesture recognition on the target user video;
determining, under the condition that a grabbing gesture of the user in the target user video is recognized, a second user video whose position information overlaps with that of the dynamic image of the action core part, wherein the second user video is one of the other user videos;
and moving the second user video to a target position based on a movement track of the dynamic image of the action core part in the play picture of the live video, wherein the target position is associated with an end position of the movement track.
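For the grab interaction of claim 7, the grabbed second user video can be animated along the movement track of the action core part so that it ends at the track's end position. A hypothetical interpolation helper is sketched below; the step count and the point-list representation of the track are assumptions.

```python
def follow_track(tile_start, track_points, steps_per_segment=3):
    """Yield per-frame positions that move a grabbed tile along the core part's
    movement track; the final position coincides with the track's end point."""
    prev = tile_start
    for point in track_points:
        for step in range(1, steps_per_segment + 1):
            t = step / steps_per_segment
            yield (prev[0] + (point[0] - prev[0]) * t,
                   prev[1] + (point[1] - prev[1]) * t)
        prev = point
```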
8. The method of claim 1, wherein the loading the user video into the live video if it is determined that a user in the user video is in the target state comprises at least one of:
under the condition that the user in the user video is in the target state, performing background segmentation on the user video based on a person contour in the user video to determine a background area in the user video; filling the background area in the user video with a preset color; and loading the processed user video into the live video;
under the condition that the user in the user video is in the target state, performing face recognition on the user video to determine a face position in the user video; cropping the user video based on the face position in the user video; and loading the cropped user video into the live video.
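The two processing branches of claim 8 (background filling and face-based cropping) could look roughly like the OpenCV sketch below. The person mask is assumed to come from an upstream segmentation step, and the Haar-cascade face detector and margin value are illustrative choices, not the claimed method.

```python
import cv2
import numpy as np

def fill_background(frame_bgr, person_mask, color=(0, 255, 0)):
    """Fill everything outside the person contour with a preset color.

    person_mask: uint8 mask (255 = person) produced by a segmentation step
    that is outside the scope of this sketch.
    """
    out = frame_bgr.copy()
    out[person_mask == 0] = color
    return out

_face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_to_face(frame_bgr, margin=0.4):
    """Crop the frame around the first detected face, keeping a margin."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = _face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return frame_bgr
    x, y, w, h = faces[0]
    dx, dy = int(w * margin), int(h * margin)
    img_h, img_w = frame_bgr.shape[:2]
    return frame_bgr[max(0, y - dy):min(img_h, y + h + dy),
                     max(0, x - dx):min(img_w, x + w + dx)]
```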
9. The method of claim 1, wherein the obtaining the user video comprises:
acquiring a plurality of user videos, wherein the plurality of user videos are obtained by different clients playing the live video respectively capturing video of their respective watching users;
after the capturing the plurality of user videos, the method further includes:
identifying the sounds in the plurality of user videos, and classifying and counting the identified sounds;
determining a number proportion of sounds of a target category among the sounds in the plurality of user videos;
and processing a volume of the sounds of the target category under the condition that the number proportion is larger than a preset value.
10. The method of claim 9, wherein the processing the volume of the sounds of the target category comprises:
determining the volume of the sounds of the target category based on the number proportion, wherein the volume of the sounds of the target category is positively correlated with the number proportion;
and superposing the sounds of the target category in the plurality of user videos according to the determined volume.
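Claims 9 and 10 describe counting classified sounds and superposing the target category with a volume that grows with its share. Below is a minimal sketch under the assumption that the sounds are already classified and share a sample rate; the linear gain law and the default threshold are illustrative, not taken from the patent.

```python
import numpy as np

def mix_target_category(sounds, labels, target="cheer", preset_ratio=0.5, max_gain=1.0):
    """Superpose the sounds of the target category, with a volume positively
    correlated with the category's number proportion among all user-video sounds.

    sounds: list of 1-D float arrays in [-1, 1]; labels: classifier output per sound.
    Returns the mixed signal, or None if the proportion is not above the preset value.
    """
    target_sounds = [s for s, lab in zip(sounds, labels) if lab == target]
    proportion = len(target_sounds) / max(len(sounds), 1)
    if proportion <= preset_ratio or not target_sounds:
        return None  # below the preset value: the mix is left unchanged
    gain = max_gain * proportion          # volume grows with the number proportion
    length = min(len(s) for s in target_sounds)
    mix = np.mean([s[:length] for s in target_sounds], axis=0)
    return np.clip(gain * mix, -1.0, 1.0)
```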
11. A video processing method applied to a client, the method comprising:
receiving a live video loaded with a user video and delivered by a server; wherein a plurality of user videos are acquired by the server, the plurality of user videos being obtained by different clients playing the live video respectively capturing video of their respective watching users; the loading the user video into the live video under the condition that a user in the user video is determined to be in a target state comprises: determining a target user video from at least two of the user videos, wherein the user in the target user video has the largest action change amplitude; identifying a contour of an action core part of the user in the target user video; capturing, based on the contour, a dynamic image of the action core part from the target user video; and loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video;
playing the live video loaded with the user video;
wherein: determining, based on a movement speed and a movement direction of the action core part in the dynamic image of the action core part, a target movement speed and a target movement direction of a first user video after the first user video collides with the dynamic image of the action core part; and generating, according to the target movement speed and the target movement direction, a flick-away effect image of the first user video in the play picture of the live video; the first user video being at least one of the at least two user videos other than the target user video.
12. The method of claim 11, wherein the method further comprises:
performing video capture of a watching user in the process of playing the live video to obtain a third user video;
and uploading the third user video to the server so that the server can identify the third user video.
13. A video processing apparatus applied to a server, the video processing apparatus comprising:
the first acquisition module is used for acquiring a user video, wherein the user video is obtained by a first client that plays a live video capturing video of a watching user;
the first identification module is used for identifying the user video and determining whether a user in the user video is in a target state or not;
the first processing module is used for loading the user video into the live video under the condition that the user in the user video is in the target state;
the sending module is used for sending the live video loaded with the user video to a second client;
wherein the acquiring the user video comprises: acquiring a plurality of user videos, wherein the plurality of user videos are obtained by different clients playing the live video respectively capturing video of their respective watching users;
wherein the loading the user video into the live video under the condition that the user in the user video is determined to be in the target state comprises: determining a target user video from at least two of the user videos, wherein the user in the target user video has the largest action change amplitude;
identifying a contour of an action core part of the user in the target user video; capturing, based on the contour, a dynamic image of the action core part from the target user video; and loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video;
wherein: determining, based on a movement speed and a movement direction of the action core part in the dynamic image of the action core part, a target movement speed and a target movement direction of a first user video after the first user video collides with the dynamic image of the action core part; and generating, according to the target movement speed and the target movement direction, a flick-away effect image of the first user video in the play picture of the live video; the first user video being at least one of the at least two user videos other than the target user video.
14. A video processing apparatus for use in a client, the video processing apparatus comprising:
the receiving module is used for receiving a live video loaded with a user video and delivered by the server; wherein a plurality of user videos are acquired by the server, the plurality of user videos being obtained by different clients playing the live video respectively capturing video of their respective watching users; the loading the user video into the live video under the condition that a user in the user video is determined to be in a target state comprises: determining a target user video from at least two of the user videos, wherein the user in the target user video has the largest action change amplitude; identifying a contour of an action core part of the user in the target user video; capturing, based on the contour, a dynamic image of the action core part from the target user video; and loading the dynamic image of the action core part at a position associated with the target user video in a play picture of the live video;
the playing module is used for playing the live video loaded with the user video;
wherein: determining, based on a movement speed and a movement direction of the action core part in the dynamic image of the action core part, a target movement speed and a target movement direction of a first user video after the first user video collides with the dynamic image of the action core part; and generating, according to the target movement speed and the target movement direction, a flick-away effect image of the first user video in the play picture of the live video; the first user video being at least one of the at least two user videos other than the target user video.
15. An electronic device, comprising: a transceiver, a memory, a processor, and a computer program stored on the memory and executable on the processor; wherein the processor is configured to read the program in the memory to implement the steps in the video processing method according to any one of claims 1 to 10, or to implement the steps in the video processing method according to claim 11 or 12.
16. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps in the video processing method according to any one of claims 1 to 10; or to implement the steps of the video processing method as claimed in claim 11 or 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111191080.XA CN113949891B (en) | 2021-10-13 | 2021-10-13 | Video processing method and device, server and client |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113949891A (en) | 2022-01-18
CN113949891B (en) | 2023-12-08
Family
ID=79330366
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202111191080.XA (Active) | Video processing method and device, server and client | 2021-10-13 | 2021-10-13
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113949891B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114827664B (en) * | 2022-04-27 | 2023-10-20 | 咪咕文化科技有限公司 | Multi-path live broadcast mixed stream method, server, terminal equipment, system and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107566911A (en) * | 2017-09-08 | 2018-01-09 | 广州华多网络科技有限公司 | A kind of live broadcasting method, device, system and electronic equipment |
CN107613310A (en) * | 2017-09-08 | 2018-01-19 | 广州华多网络科技有限公司 | A kind of live broadcasting method, device and electronic equipment |
CN108271056A (en) * | 2018-02-02 | 2018-07-10 | 优酷网络技术(北京)有限公司 | Video interaction method, subscription client, server and storage medium |
CN109218761A (en) * | 2018-08-07 | 2019-01-15 | 邓德雄 | Method and system for switching between live video and video |
WO2019100757A1 (en) * | 2017-11-23 | 2019-05-31 | 乐蜜有限公司 | Video generation method and device, and electronic apparatus |
CN111711859A (en) * | 2020-06-28 | 2020-09-25 | 北京奇艺世纪科技有限公司 | Video image processing method, system and terminal equipment |
CN111935442A (en) * | 2020-07-31 | 2020-11-13 | 北京字节跳动网络技术有限公司 | Information display method and device and electronic equipment |
CN112887746A (en) * | 2021-01-22 | 2021-06-01 | 维沃移动通信(深圳)有限公司 | Live broadcast interaction method and device |
CN113038287A (en) * | 2019-12-09 | 2021-06-25 | 上海幻电信息科技有限公司 | Method and device for realizing multi-user video live broadcast service and computer equipment |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20180041552A1 (en) * | 2016-08-02 | 2018-02-08 | Facebook, Inc. | Systems and methods for shared broadcasting |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |