WO2022001593A1 - Video generation method, apparatus, storage medium and computer device - Google Patents

Video generation method, apparatus, storage medium and computer device (视频生成方法、装置、存储介质及计算机设备)

Info

Publication number
WO2022001593A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
target
shooting
character
user
Prior art date
Application number
PCT/CN2021/098796
Other languages
English (en)
French (fr)
Inventor
张新磊
Original Assignee
腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司 (Tencent Technology (Shenzhen) Company Limited)
Publication of WO2022001593A1
Priority claimed by US17/983,071 (published as US20230066716A1)

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/63Control of cameras or camera modules by using electronic viewfinders
    • H04N23/631Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters
    • H04N23/632Graphical user interfaces [GUI] specially adapted for controlling image capture or setting capture parameters for displaying or modifying preview images prior to image capturing, e.g. variety of image resolutions or capturing parameters
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/20Analysis of motion
    • G06T7/254Analysis of motion involving subtraction of images
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G06T5/77
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/70Determining position or orientation of objects or cameras
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/16Image acquisition using multiple overlapping images; Image stitching
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/803Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of input or preprocessed data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20Movements or behaviour, e.g. gesture recognition
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/61Control of cameras or camera modules based on recognised objects
    • H04N23/611Control of cameras or camera modules based on recognised objects where the recognised objects include parts of the human body
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/63Control of cameras or camera modules by using electronic viewfinders
    • H04N23/633Control of cameras or camera modules by using electronic viewfinders for displaying additional information relating to control or operation of the camera
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N23/00Cameras or camera modules comprising electronic image sensors; Control thereof
    • H04N23/60Control of cameras or camera modules
    • H04N23/64Computer-aided capture of images, e.g. transfer from script file into camera, check of taken image quality, advice or proposal for image composition or decision on when to take image
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/76Television signal recording
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00Indexing scheme for image data processing or generation, in general
    • G06T2200/24Indexing scheme for image data processing or generation, in general involving graphical user interfaces [GUIs]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10016Video; Image sequence
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20036Morphological image processing
    • G06T2207/20044Skeletonization; Medial axis transform
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20092Interactive image processing based on input by user
    • G06T2207/20104Interactive definition of region of interest [ROI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/30Subject of image; Context of image processing
    • G06T2207/30196Human being; Person
    • G06T2207/30201Face

Definitions

  • the present application relates to the technical field of video processing, and in particular to video generation technology.
  • the video sharing platform also supports users in re-creating other people's videos. For example, after a user browses a favorite video on the video sharing platform, the user can shoot a co-production based on that video; that is, the user can combine his or her own video with the other person's video to obtain a co-shot video. When shooting a co-shot video, the shooting effect and the shooting cost have always been the user's concerns. Therefore, how to obtain a high-quality shooting effect in a video co-shooting scene while reducing the shooting cost has become an urgent problem to be solved by those skilled in the art.
  • Embodiments of the present application provide a video generation method, device, storage medium, and computer equipment, which can not only achieve high-quality shooting effects, but also reduce shooting costs.
  • the technical solution is as follows:
  • a video generation method, executed by an electronic device, the method comprising:
  • performing video shooting in response to a trigger operation for a video co-shooting option, and acquiring a currently captured second video, where the second video corresponds to a video clip including a target character in a first video; and
  • fusing the second video into the video content of the first video, based on identification of the target character and other characters in the first video, to obtain a co-shot video.
  • a video generation apparatus comprising:
  • a first processing module configured to perform video shooting in response to a trigger operation for the video co-shooting option
  • a video acquisition module used for acquiring the second video obtained by the current shooting; the second video corresponds to the video clip including the target character in the first video;
  • the second processing module is configured to fuse the second video into the video content of the first video based on the identification of the target character and other characters in the first video to obtain a co-shot video.
  • an electronic device, in another aspect, includes a processor and a memory, where the memory stores at least one piece of program code, and the at least one piece of program code is loaded and executed by the processor to implement the above video generation method.
  • a storage medium wherein at least one piece of program code is stored in the storage medium, and the at least one piece of program code is loaded and executed by a processor to implement the above-mentioned video generation method.
  • a computer program product or computer program, comprising computer program code stored in a computer-readable storage medium; a processor of a computer device reads the computer program code from the computer-readable storage medium and executes the computer program code, so that the computer device executes the above-described video generation method.
  • FIG. 1 is a schematic diagram of an implementation environment involved in a video generation method provided by an embodiment of the present application
  • FIG. 2 is a flowchart of a video generation method provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a user interface provided by an embodiment of the present application.
  • FIG. 4 is a flowchart of a video generation method provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 10 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 11 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 13 is a flowchart of a video generation method provided by an embodiment of the present application.
  • FIG. 14 is a flowchart of a video generation method provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of key points of a human body provided by an embodiment of the present application.
  • FIG. 17 is a schematic flowchart of detection and tracking of a moving target provided by an embodiment of the present application.
  • FIG. 19 is a schematic diagram of an overall execution flow of a video generation method provided by an embodiment of the present application.
  • FIG. 20 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 21 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 22 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 23 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 24 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 25 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 26 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 27 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 28 is a schematic diagram of another user interface provided by an embodiment of the present application.
  • FIG. 29 is a schematic structural diagram of a video generation apparatus provided by an embodiment of the present application.
  • FIG. 30 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 31 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the implementation environment may include: a terminal 101 and a server 102 . That is, the video generation method provided by the embodiment of the present application is jointly executed by the terminal 101 and the server 102 .
  • the terminal 101 is usually a mobile terminal.
  • the terminal 101 includes, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, and the like.
  • the server 102 may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communications, middleware services, domain name services, security services, CDN, and big data and artificial intelligence platforms.
  • the terminal and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in this application.
  • a video client is usually installed on the terminal 101, and the server 102 is configured to provide background services for the video client, so as to support the user to browse videos published by other users on the video sharing platform through the video client.
  • In another example, the implementation environment may also include only the terminal 101. That is, the video generation method provided by the embodiment of the present application can also be executed by the terminal 101 alone. In this case, the terminal 101 is usually required to have a strong computing and processing capability.
  • the video generation method provided by the embodiments of the present application may be applied in a video co-shooting scenario.
  • a short video usually refers to a video with a short playback duration, such as a video shorter than a certain duration threshold.
  • the duration threshold may be 30 seconds, 50 seconds, or 60 seconds, etc., which is not specifically limited in this embodiment of the present application.
  • the embodiment of the present application can output a prompt message based on the existing video screen content to provide the user with shooting guidance, so that the user can, at a low cost and according to the story content recorded in the video, shoot an interesting video that is highly integrated with the original video.
  • the implementation of this application provides users with shooting guidance, such as the character's facial orientation, facial expressions, body movements, the camera shooting method, the character dialogue, and the like.
  • the embodiments of the present application can guide users on, for example, the user's movement posture, facial expression state, and camera shooting mode, so as to help the user complete the video shooting in a more user-friendly way and reduce the user's cost of shooting the co-produced video.
  • the camera shooting mode includes, but is not limited to, the camera's viewfinder mode and the camera's movement mode.
  • the framing method of the camera includes, but is not limited to, horizontal framing or vertical framing; the movement method of the camera includes, but is not limited to, movements such as moving up and down.
  • the embodiment of the present application provides a short video co-shooting method based on scene fusion.
  • scene fusion means that the original video and the video shot by the user are not only related to each other in content, but the final co-shot video is also obtained by content fusion of the original video and the video shot by the user. That is, the synthesis processing inserts the video shot by the user into the original video to replace some video clips in the original video, and finally obtains one video; in other words, the original video and the video captured by the user are synthesized into one video to obtain the co-shot video.
  • each frame of video image of the co-shot video includes one video picture.
  • when the co-production video is presented, only one video is included in the screen, instead of two videos being shown on the same screen; that is, the short video co-production method based on scene fusion does not bluntly splice the two videos, and does not present two channels of video on the same screen in modes such as left-right split screen, top-bottom split screen, or large and small windows.
  • the video generation method provided in this embodiment of the present application can also be applied to other video co-production scenes, such as movie clips or TV series clips, which is not specifically limited in this embodiment of the present application.
  • FIG. 2 is a flowchart of a video generation method provided by an embodiment of the present application, and the execution subject is exemplarily the terminal 101 shown in FIG. 1. It should be understood that, in practical applications, the video generation method provided by the embodiment of the present application may also be executed by other electronic devices with video processing capabilities. Referring to FIG. 2, the method provided by the embodiment of the present application includes:
  • the terminal performs video shooting in response to a trigger operation for the video co-shooting option.
  • the terminal may display the video co-production option on the playback interface of the first video.
  • the first video is also referred to as the original video in the embodiments of the present application. That is, the video browsed and played by the user is referred to as the first video in this document.
  • the first video may be a short video posted to the video sharing platform by a registered user of the video sharing platform, and the short video may be an original video of that user, a video imitated by that user, or a short video clipped by that user from a TV drama, a movie, or any other type of video, which is not specifically limited in this embodiment of the present application.
  • the first video may also be other forms of video with a duration longer than that of the short video, which is also not specifically limited in this embodiment of the present application.
  • the method can be applied to any form of video containing human characters.
  • a video co-shooting option 301 may be displayed on the playback interface.
  • the video co-shooting option 301 can be laid out and displayed at an edge position of the playback interface, such as the left edge, right edge, upper edge or lower edge etc.
  • the video co-shooting option 301 is displayed on the right edge and lower position of the playback interface.
  • the video co-shooting option 301 can also be displayed in other positions, such as positions in the playback interface other than the edge positions, or in the display column of the video operation options corresponding to the playback interface; this embodiment of the present application does not impose any restriction on the display position of the video co-shooting option 301.
  • when a video co-production option 301 such as "I want to co-shoot" is displayed on the playing interface, it means that the user can interact with the currently playing first video to perform video co-production.
  • the trigger operation for the video co-shooting option may be a user clicking the video co-shooting option 301 shown in FIG. 3 , which is not specifically limited in this embodiment of the present application.
  • the second video corresponds to a video clip including the target character in the first video.
  • the second video currently captured by the terminal is also referred to herein as the video captured by the user.
  • the second video shot by the user may correspond to a video clip that includes the target character in the first video.
  • the target character may be a character that the user selects before shooting the second video and intends to play, and the target character may be any character present in the first video.
  • the terminal may also acquire a prompt message based on the recognition of the screen content of the first video; and during the video shooting process, display the prompt message on the shooting interface; wherein the prompt message is used to guide the shooting of the second video, That is, a guide prompt is provided for the user to shoot the second video.
  • the prompt message is obtained by analyzing the picture content of the first video. This analysis step can be performed either by the terminal or by the server.
  • the prompt message includes one or more of the camera shooting mode, the human body posture, and the character dialogue; optionally, by displaying the camera shooting mode, the user can be informed of how to faithfully restore the shooting process of the first video, so as to ensure that the captured second video has a high consistency with the original first video; the human body posture may include one or more of facial expressions, facial orientation, and body movements.
  • the character dialogue is popularly referred to as the character's lines.
  • the terminal may select a guide mode combining icons and text. That is, the terminal displays a prompt message on the shooting interface, which may include one or more of the following:
  • the terminal displays the prompt icon and prompt text of the camera shooting mode on the shooting interface.
  • the terminal displays the prompt icon and prompt text of the human body posture on the shooting interface.
  • the terminal displays character dialogue on the shooting interface.
  • the terminal may display only one of the prompt icon and the prompt text; that is, the terminal may display the prompt icon or the prompt text of the camera shooting mode on the shooting interface, and the terminal may also display the prompt icon or the prompt text of the human body posture on the shooting interface; the present application does not impose any limitation on the content of the prompt message displayed by the terminal.
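  • As an illustrative aside that is not part of the original disclosure, the prompt message described above can be viewed as a small structured record combining the camera shooting mode, the human body posture and the character dialogue. The sketch below, in Python with hypothetical field names, shows one way such a record and the icon/text display choice could be represented.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class PromptMessage:
    """One guidance message shown on the shooting interface (field names are illustrative)."""
    camera_mode: Optional[str] = None         # e.g. "screen advance", "pan right"
    facial_orientation: Optional[str] = None  # e.g. "facing straight ahead"
    facial_expression: Optional[str] = None   # e.g. "smile facing right"
    body_movement: Optional[str] = None       # e.g. "raise the right arm"
    dialogue: Optional[str] = None            # line the user should read out

def render(prompt: PromptMessage, show_icon: bool = True, show_text: bool = True) -> List[Tuple[str, str]]:
    """Return the UI elements to draw; icons, text, or both may be shown."""
    elements: List[Tuple[str, str]] = []
    for kind, value in (("camera", prompt.camera_mode),
                        ("orientation", prompt.facial_orientation),
                        ("expression", prompt.facial_expression),
                        ("movement", prompt.body_movement)):
        if value:
            if show_icon:
                elements.append(("icon", kind))   # placeholder for a prompt icon
            if show_text:
                elements.append(("text", value))  # the prompt text
    if prompt.dialogue:
        elements.append(("subtitle", prompt.dialogue))
    return elements
```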
  • the server may also perform the combining process, which is not specifically limited in this embodiment of the present application.
  • fusing the second video into the video content of the first video to obtain a co-produced video includes, but is not limited to: if the first video does not include a same-frame picture of the selected target character and other characters, using the second video to replace the video clip that includes the target character in the first video, that is, using the video frames included in the second video to replace the video frames including the target character in the first video; if the first video includes a same-frame picture of the target character and other characters, replacing the face image of the target character in the same-frame picture with the user's face image in the second video.
  • In other words, the face of the target character in the above-mentioned same-frame picture is changed, that is, the facial image of the target character in the same-frame picture is replaced with the face image of the user in the second video.
  • the co-shot video can present the following effects when playing: The video picture of the first video and the video picture of the second video are played linearly interspersed.
  • the picture in the same frame refers to a video picture that includes the target character and other characters at the same time.
  • For example, the first video includes character A, character B, and character C, and the user selects character A as the target character before shooting the second video. In this case, the pictures that include character A and character B at the same time, the pictures that include character A and character C at the same time, and the pictures that include character A, character B, and character C at the same time all belong to same-frame pictures of the target character and other characters.
  • In the embodiment of the present application, the terminal displays a video co-shooting option; the terminal may, in response to the user triggering the video co-shooting option, perform video shooting and obtain a currently captured second video, where the second video corresponds to a video clip that includes the target character in the first video; further, based on the identification of the target character and other characters in the first video, the second video is fused into the video content of the first video to obtain a co-shot video. That is, the co-production video is obtained by content fusion of the first video and the second video, which gives the co-production video a good content fit and allows users to be deeply integrated into the video production, improving the degree of video personalization.
  • the video generation method can not only achieve high-quality video production effects, but also significantly reduce the shooting cost.
  • FIG. 4 is a flowchart of a video generation method provided by an embodiment of the present application, and the execution subject may be the terminal 101 shown in FIG. 1 .
  • the first video includes N characters, where N is a positive integer and N ≥ 2. That is, the prerequisite for the implementation of the video co-production solution provided by the embodiments of the present application is that the original video includes at least two characters.
  • the method flow provided by the embodiment of the present application includes:
  • the terminal displays a video co-shooting option on a playback interface of the first video.
  • This step is similar to the above-mentioned step 201, and will not be repeated here.
  • the terminal displays N character options on the play interface in response to the user's triggering operation for the video co-production option.
  • the terminal confirms that the user starts using the video co-shooting function, and the triggering operation also activates the terminal to perform the step of performing face recognition in the first video .
  • a face recognition algorithm based on a convolutional neural network can be used for face recognition.
  • the terminal obtains the number of characters and character IDs included in the first video by performing face recognition in the first video.
  • the number of roles is the same as the number of role options.
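  • As a non-authoritative sketch of this step: the number of characters can be estimated by detecting faces on sampled frames and clustering their encodings. The example below uses the open-source face_recognition package as a stand-in for the CNN-based face recognition mentioned above; the sampling stride and the 0.6 distance threshold are illustrative assumptions.

```python
import cv2
import face_recognition  # stand-in for the CNN-based face recognizer described above

def count_characters(video_path: str, stride: int = 30, threshold: float = 0.6) -> int:
    """Sample frames, detect faces, and greedily cluster face encodings into characters."""
    capture = cv2.VideoCapture(video_path)
    characters = []  # one representative encoding per detected character
    frame_index = 0
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        if frame_index % stride == 0:
            rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
            boxes = face_recognition.face_locations(rgb)
            for encoding in face_recognition.face_encodings(rgb, boxes):
                if not characters:
                    characters.append(encoding)
                    continue
                distances = face_recognition.face_distance(characters, encoding)
                if min(distances) > threshold:
                    characters.append(encoding)  # unseen face -> new character
        frame_index += 1
    capture.release()
    return len(characters)  # N, used to build the N character options
```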
  • N character options 501 are shown in FIG. 5 .
  • the first video selected by the user and co-produced includes 2 characters, namely character 1 and character 2 .
  • the user can choose either of the two characters for replacement shooting. For example, after the user clicks the video co-production option, the terminal can pop up a window to prompt that there are two characters in the video that can participate in the shooting, and the user can select one of the characters to replace, that is, the user will act out the screen content of the selected character.
  • the character options of the character 1 and the character options of the character 2 in FIG. 5 may be presented with their corresponding character pictures respectively.
  • the character picture may be a frame of video image of character 1 in the first video and a frame of video image of character 2 in the first video, which are not specifically limited in this embodiment of the present application.
  • the terminal, in response to the user's triggering operation for a target role option among the N role options, selects M target video clips including the target role from the first video, and displays a preview screen of each target video clip on the playback interface.
  • M is a positive integer and M ≥ 1.
  • the triggering operation for the target role option may be the user's click operation on any one of the N role options, and the role corresponding to the role option selected by the user is referred to herein as the target role.
  • the terminal or server may filter out M video clips including role 1 from the first video as target video clips, Further, the terminal will display a preview image of each target video clip in the M target video clips on the playback interface, and the user can watch these target video clips at will.
  • a preview screen 601 of the four target video clips associated with character 1 is shown in FIG. 6 .
  • the preview images 601 of the four target video clips may be presented on the playback interface in a tiled manner or a list manner, and each preview image 601 may be the first frame, a key frame, or a randomly selected video frame of the corresponding target video clip, which is not specifically limited in this embodiment of the present application.
  • the terminal plays the designated target video clip in response to the user's triggering operation on the preview screen of the designated target video clip among the M target video clips.
  • the embodiment of the present application also supports slidably displaying the preview screens of the target video clips in response to the user's sliding operation on the preview screens of the target video clips.
  • the trigger operation on the preview screen of the specified target video clip may be a user's click operation on the preview screen of the specified target video clip.
  • the terminal starts the camera to shoot video; and obtains a prompt message based on the recognition of the screen content of the first video; during the video shooting process, the terminal displays the prompt message on the shooting interface.
  • the prompt message is used to guide the user to shoot the second video.
  • After the terminal starts the camera to shoot, the terminal presents, one by one and in sequence, the M target video clips that require the user to imitate the performance on the shooting interface, and analyzes the picture content of the target video clip that matches the current shooting progress, so as to obtain a prompt message adapted to the current shooting progress. That is, displaying a prompt message on the shooting interface during the video shooting process includes but is not limited to: performing screen content analysis on each target video clip related to the target character to obtain a prompt message corresponding to each target video clip; and, during the shooting of each target video clip, displaying the prompt message corresponding to that target video clip on the shooting interface.
  • displaying the prompt message corresponding to each target video clip on the shooting interface includes, but is not limited to, adopting the following method: in a display mode placed on the top layer, a video window is displayed floating on the shooting interface, where the video window is used to display the target video clip matching the current shooting progress, that is, the target video clip corresponding to the currently displayed prompt message.
  • the meaning of the display mode placed on the top layer is that it is displayed at the top of the page and is not blocked by any other layers.
  • the terminal can choose to display the target video clip that requires the user to imitate the performance in the upper left corner of the shooting interface, which not only achieves the purpose of prompting the user, but also does not excessively occupy the shooting interface.
  • FIG. 7 to FIG. 10 also show different types of prompt messages 701 displayed on the shooting interface.
  • For example, the terminal knows, by analyzing the content of the first video, that the user needs to face the front to shoot at this time, and then a corresponding prompt is given on the shooting interface to guide the user to shoot, so that the picture of the video shot by the user better matches the characters and picture logic in the original video.
  • the prompt message 701 presented on the photographing interface at this time includes: a prompt icon of the face facing and the prompt text "facing straight ahead".
  • the prompt message 701 may further include a camera shooting method.
  • the terminal will display, through the UI (User Interface), the prompt icon of the camera shooting mode shown in FIG. 8 (the arrow in FIG. 8) and the prompt text ("screen advance"), so as to inform the user how to control the lens.
  • the terminal can also display the UI of the character dialogue that matches the current shooting progress, so as to inform the user of the text content that needs to be read out when shooting.
  • Fig. 8 shows that the user needs to read out the character dialogue "We can really hold hands" while advancing the screen.
  • the prompt message 701 may also include body movements.
  • the terminal will also simultaneously display the UI display of body movements on the shooting interface, that is, the prompt icon and prompt text of the body movement.
  • the prompt icon may be "little man in motion”
  • the prompt text may be "raise the right arm”.
  • the character dialogue of "Really?” needs to be read out.
  • the prompt message 701 may also include facial expressions. That is, prompt icons and prompt texts of facial expressions can also be displayed on the shooting interface. For example, by analyzing the screen content of the original video, it can be seen that the current character is smiling to the right; the terminal will then also simultaneously display the UI display of facial expressions on the shooting interface, that is, the prompt icon and prompt text of the facial expression will be displayed on the shooting interface. As shown in FIG. 10, the prompt icon may be a "smiley face", and the prompt text may be "smile facing right".
  • the user can also be prompted for a countdown before each target video clip starts shooting. For example, a 10-second or 5-second or 3-second countdown can be performed before starting the shooting.
  • the prompt form of the countdown may be either a voice form or a graphic form, which is not specifically limited in this embodiment of the present application.
  • a trigger control may be displayed on the interface, and after detecting that the user actively triggers the control, the shooting of the current video clip is started.
  • the current shooting may also be triggered by the user through voice. That is, the terminal has a voice recognition function, and automatically starts shooting of the current video segment after recognizing that the voice issued by the user is an instruction to start shooting.
  • Based on the identification of the target character and other characters in the first video, the terminal fuses the second video into the video content of the first video to obtain a co-production video.
  • the embodiment of the present application provides a short video co-shooting method based on scene fusion.
  • during the shooting process, the terminal will collect a second video; for the first video and the currently captured second video, the following processing method is usually adopted: the first video and the second video are synthesized into one video to obtain a co-shot video, where each video frame of the co-shot video contains only one video picture.
  • In the embodiment of the present application, scene fusion means that the original first video and the second video shot by the user are not only related to each other in content, but the final co-shot video is also obtained by content fusion of the first video and the second video. That is, the synthesis processing of the video inserts the second video shot by the user into the original first video to replace some video clips in the first video, and finally obtains one video; in other words, the original first video and the second video shot by the user are synthesized into one video to obtain the co-shot video.
  • each frame of video image of the co-shot video includes one video picture.
  • if the M target video clips associated with the target character selected by the user do not include a same-frame picture of the target character and other characters, the terminal can directly replace the M target video clips with the second video; if the M target video clips associated with the target character selected by the user include a same-frame picture of the target character and other characters, the processing mode of the terminal is to use the first facial image in the second video to replace the second facial image of the target character in the same-frame picture, where the first facial image is the user's facial image captured by the camera when the user imitates the target character in the same-frame picture.
  • the processing method of the terminal at this time is to replace the facial image of the character in the original video with the facial image of the user, that is, to change the face, so as to stay consistent with the plot and the logic of the picture.
  • when the co-production video is presented, only one video is included in the picture, instead of two videos being shown on the same screen; that is, the video co-production method based on scene fusion in this application does not bluntly splice the two videos, and does not present two channels of video on the same screen in modes such as left-right split screen, top-bottom split screen, or large and small windows.
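  • For illustration only, the sketch below (using the moviepy 1.x package, which is not mentioned in the disclosure) shows the kind of timeline assembly this fusion step implies: segments of the first video in which only the target character appears are replaced by the user's clips, same-frame segments are handed to a face-swap step, everything else is kept, and the pieces are concatenated into a single co-shot video. The segment list, the user_clips mapping and the swap_face helper are hypothetical.

```python
from moviepy.editor import VideoFileClip, concatenate_videoclips

def assemble_coshot(first_video: str, user_clips: dict, segments: list, swap_face):
    """segments: list of (start, end, kind) with kind in {"keep", "replace", "same_frame"}.

    user_clips maps the index of a "replace" segment to the path of the user's recording;
    swap_face is a hypothetical callable that replaces the target character's face in a
    clip with the user's face (for example, a Deepfake-style model as described later).
    """
    source = VideoFileClip(first_video)
    pieces = []
    for index, (start, end, kind) in enumerate(segments):
        if kind == "replace":
            # the target character appears alone: substitute the user's recording
            pieces.append(VideoFileClip(user_clips[index]))
        elif kind == "same_frame":
            # the target character shares the frame with other characters: face swap
            pieces.append(swap_face(source.subclip(start, end)))
        else:
            pieces.append(source.subclip(start, end))  # untouched original content
    final = concatenate_videoclips(pieces)  # one video, one picture per frame
    final.write_videofile("coshot.mp4")
```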
  • After generating the co-production video, the terminal displays the preview screen, the playback option, the playback progress bar, and the video modification options of the co-production video; in response to the user's trigger operation for the playback option, the terminal plays the co-production video, and displays the playback progress of the co-production video through the animation of the playback progress bar.
  • the user can choose to watch the final co-production video, and choose whether to publish or modify the video.
  • the terminal will display a preview screen 1101 of the co-shot video, a playback option 1102 , a playback progress bar 1103 and a video modification option 1104 .
  • the video modification options 1104 may include multiple ones, and FIG. 11 only exemplarily shows four video modification options, namely modification option 1, modification option 2, modification option 3, and modification option 4. It can be understood that the number of video modification options may be more or less than the four shown in the figure, which is not specifically limited in this embodiment of the present application.
  • the preview screen 1101 of the co-shot video may be the first frame, a key frame or a randomly selected video frame in the co-shot video, which is not specifically limited in this embodiment of the present application.
  • the trigger operation for the play option may be the user's click operation on the play option 1102 .
  • the terminal performs modification processing on the co-shot video in response to the user's triggering operation for the video modification option.
  • the trigger operation for the video modification option may be the user's click operation on the video modification option 1104 .
  • the video modification options 1104 may include, but are not limited to: adjusting materials, adding text, adding stickers, adding filters, performing beauty, etc., which are not specifically limited in this embodiment of the present application.
  • the terminal can also display a release option 1105.
  • through the release option 1105, the user can upload and post the created co-shot video to a video sharing platform or personal profile for other users to browse or watch.
  • the terminal may display an error prompt message on the shooting interface, where the error prompt message is used to guide the user to re-shoot the video.
  • the prompt message in the form of voice may also be played, which is not specifically limited in this embodiment of the present application.
  • To sum up, the terminal may display the video co-shooting option on the playback interface of the video selected for playback by the user; after that, the terminal may perform video shooting in response to the user's triggering operation on the video co-shooting option; and during the video shooting process, the terminal will automatically display a prompt message on the shooting interface, that is, the prompt message will be presented in the user's shooting interface, so as to guide the user to complete the video shooting quickly and with high quality.
  • the currently shot video is fused into the content of the original video to generate a co-production video to achieve video co-production.
  • This video generation method can not only achieve high-quality shooting effects, but also significantly reduce the cost of shooting. For example, it can achieve a high level in lens presentation and character performance, and it also speeds up the completion of video shooting, saving time and labor costs.
  • the embodiment of the present application can output a prompt message that is beneficial for the user's shooting by analyzing the content of the video picture, thereby helping the user to quickly engage in the video creation process. That is, on the premise of analyzing the content of the video picture, the embodiment of the present application guides the user to shoot by showing a prompt message to the user, where the prompt message contains rich content, such as one or more of the camera shooting method, the human body posture, and the character dialogue.
  • In addition, in this kind of video co-production solution based on scene fusion, because the original first video and the second video shot by the user are synthesized into one video, the co-production video includes only one picture in its presentation, and the linear interspersed playback of the original first video and the second video shot by the user is achieved in chronological order, which ensures a seamless video creation effect and makes the video creation process more friendly.
  • In addition, this video co-production solution enables users to engage in the video creation process in a more natural and immersive way based on the content of the existing video images, so that, from the user's perspective, the final co-production video achieves better integration; that is, the co-production video is more consistent with the original video in terms of content presentation and character performance, avoiding rigid splicing between the two videos.
  • FIG. 12 shows several video pictures captured from the co-shot video, and these video pictures are sequentially ordered in chronological order from left to right.
  • the video picture 1201 and the video picture 1203 are from the original first video
  • the video picture 1204 and the video picture 1205 are from the second video shot by the user
  • the video picture 1206 is obtained from the first video by changing the face of the target character contained in the corresponding video picture, that is, by replacing the face image of the target character with the face image of the user.
  • step 403 "select M target video clips including the target character selected by the user from the first video", in a possible implementation manner, filter the first video including the target
  • the step of the target video segment of the character may be performed by the server or performed by the terminal itself, which is not specifically limited in this embodiment of the present application. Refer to Figure 13 for the method of performing video clip screening on the server, including the following steps:
  • the terminal uploads the role ID of the target role selected by the user to the server.
  • the character ID may be the character's name, the character's avatar, the character code (such as a character) agreed upon through negotiation between the terminal and the server, etc., which are not specifically limited in this embodiment of the present application.
  • the server determines the target time points at which the target character appears in the first video; marks the target time points with key frames to obtain video dot information; and returns the video dot information and the target time points to the terminal.
  • determining the target time points at which the target character appears in the first video can be achieved in the following manner: first, determine the video frames that include the face of the target character in the first video, and then obtain the time points corresponding to those video frames, so that the target time points at which the target character appears in the first video are obtained.
  • face recognition of the target character may be performed on each video frame included in the first video, so as to obtain the above-mentioned video frame including the face of the target character.
  • the face recognition of the target character can also be carried out at short time intervals; that is, the face recognition algorithm is applied at a number of relatively dense specified time points to determine whether the face of the target character is present at each specified time point, and a series of time points at which the face of the target character exists, that is, a list of time points, is output, indicating that the face of the target character appears at those time points of the first video.
  • the determined time points may be sorted in sequence, which is not specifically limited in this embodiment of the present application.
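  • A minimal sketch of this time-point scan, assuming the face_recognition package introduced earlier and a precomputed encoding of the target character's face (both assumptions, not part of the disclosure):

```python
import cv2
import face_recognition

def target_time_points(video_path: str, target_encoding, step_seconds: float = 0.5):
    """Sample the video at dense time points and keep those where the target face appears."""
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS) or 25.0
    total_frames = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    points = []
    t = 0.0
    while t * fps < total_frames:
        capture.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = capture.read()
        if not ok:
            break
        rgb = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
        encodings = face_recognition.face_encodings(rgb)
        if any(face_recognition.compare_faces(encodings, target_encoding, tolerance=0.6)):
            points.append(t)  # the target character's face is present at this time point
        t += step_seconds
    capture.release()
    return points  # ordered list of target time points
```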
  • the server may also dot the first video according to the target time points to obtain the video dot information.
  • video dots are key frame markers.
  • placing the cursor on the playback progress bar will display the next content of the video. That is, when the control cursor moves to a certain point on the playback progress bar, the video content played at the point will be automatically displayed.
  • Video dotting marks the key content points in the video so that users can quickly browse to the content they want to watch.
  • performing video dotting may be to mark the determined target time points with key frames, that is, to further determine the target time points corresponding to the key frames in these determined target time points.
  • the key frame usually refers to the frame where the key action or posture is located in the movement or posture change of the character.
  • When identifying a key frame, it may be determined by the degree of change between adjacent frames, which is not specifically limited in this embodiment of the present application.
  • the terminal cuts out M target video clips from the first video according to the video point information and the target time point.
  • When the terminal segments the target video clips associated with the target character from the first video, the manners include but are not limited to the following. For example, when segmenting the target video clips, the premise is that each segmented target video clip includes at least one video dot (a key frame). For another example, the target time points that appear between two video dots may be divided into the same target video clip; that is, the terminal can use the target time points corresponding to key frames as the basis for dividing the video clips, so that the target time points lying between the target time points corresponding to two key frames belong to the same target video clip, which is not specifically limited in this embodiment of the present application.
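  • The following is a sketch of the dotting and segmentation just described, under two stated assumptions: a key frame is detected simply as a large grayscale change between adjacent sampled frames (one possible reading of "degree of change between adjacent frames"), and target time points falling between two consecutive dots are grouped into one clip.

```python
import cv2
import numpy as np

def mark_keyframes(video_path: str, time_points: list, change_threshold: float = 30.0):
    """Video dotting: keep the time points whose frame differs strongly from the previous one."""
    capture = cv2.VideoCapture(video_path)
    dots, previous = [], None
    for t in time_points:
        capture.set(cv2.CAP_PROP_POS_MSEC, t * 1000)
        ok, frame = capture.read()
        if not ok:
            continue
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if previous is not None and np.mean(cv2.absdiff(gray, previous)) > change_threshold:
            dots.append(t)  # large inter-frame change -> treat this time point as a key frame
        previous = gray
    capture.release()
    return dots

def split_clips(time_points: list, dots: list):
    """Group target time points into clips bounded by consecutive video dots."""
    clips, current = [], []
    boundaries = iter(sorted(dots) + [float("inf")])
    limit = next(boundaries)
    for t in sorted(time_points):
        while t > limit:
            if current:
                clips.append((current[0], current[-1]))  # (start, end) of one target clip
                current = []
            limit = next(boundaries)
        current.append(t)
    if current:
        clips.append((current[0], current[-1]))
    return clips
```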
  • the method for performing target video clip screening for the terminal includes the following steps:
  • the terminal determines a target time point at which the target character appears in the first video.
  • the terminal performs key frame marking on the target time point to obtain video dot information.
  • the terminal cuts out M target video clips from the first video according to the obtained video dot information and the target time points.
  • For the implementation of steps 1401 to 1403, reference may be made to the above-mentioned steps 1301 to 1303.
  • step 405 "analyzing the screen content of each target video clip related to the target character"
  • this step can be executed by the server or the terminal itself.
  • This embodiment of the present application This is not specifically limited.
  • screen content analysis is performed on each target video segment related to the target character, including but not limited to the following steps:
  • human gestures may include one or more of facial expressions, facial orientations, and body movements.
  • the above step 1501 may further include:
  • For each target video clip, determine the human body key points of the target character in the target video clip by inputting the target video clip into a human body key point detection network.
  • the human body key point detection network can be based on the OpenPose algorithm, which is a deep learning algorithm based on a dual-branch multi-stage CNN (Convolutional Neural Networks) architecture and mainly detects human key points by means of image recognition.
  • the OpenPose algorithm is a human keypoint detection framework, which can detect up to 135 keypoints in total in images of bodies, fingers, and faces. And the detection speed is very fast, and the real-time detection effect can be achieved.
  • the video frames included in each target video clip can be input into the human body key point detection network; the network first obtains feature information through the VGG-19 backbone network, and then continuously optimizes it through 6 stages. Each stage has 2 branches, one of which is used to obtain the heatmaps of the coordinates of the human body key points, and the other is used to obtain the PAFs, the direction vectors from the starting point to the end point of a limb between human body key points. After that, the PAFs are converted into a bipartite graph, and the bipartite graph matching problem is solved by using the Hungarian algorithm, so as to obtain the human body key points of the characters in the picture.
  • the key points of the human body detected by the algorithm can be used to analyze the facial expression, facial orientation, body movements of the person, and even track the movement of the person's fingers.
  • Based on the detected key points, the human body posture can be estimated. FIG. 16 shows three different human body postures, namely a standing posture 1601 with hands on hips, a running posture 1602, and a standing posture 1603 with hands clasped in front of the chest.
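  • As a hedged illustration: the disclosure names OpenPose, but the sketch below substitutes MediaPipe Pose, a different and readily available key point detector, purely to show the shape of the per-frame key point output used for posture analysis.

```python
import cv2
import mediapipe as mp

def body_keypoints(video_path: str):
    """Yield (frame_index, [(x, y, visibility), ...]) human body key points per frame."""
    capture = cv2.VideoCapture(video_path)
    with mp.solutions.pose.Pose(static_image_mode=False) as pose:
        frame_index = 0
        while True:
            ok, frame = capture.read()
            if not ok:
                break
            result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
            if result.pose_landmarks:
                points = [(lm.x, lm.y, lm.visibility) for lm in result.pose_landmarks.landmark]
                yield frame_index, points  # normalized coordinates, one entry per landmark
            frame_index += 1
    capture.release()
```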
  • Connect the facial key points among the human body key points to obtain a facial framework model; and, according to the facial framework model, determine the facial expression and facial orientation of the target character in the target video clip.
  • That is, based on the relative positional relationship between the different parts of the face, namely the basic structure of the face, such as the basic position rules of the chin, mouth, nose, eyes, and eyebrows, the facial feature points are connected in turn to generate a facial framework model, and the facial framework model can reflect the user's facial expression and facial orientation.
  • Similarly, according to the relative positional relationship between different parts of the limbs, that is, the basic structure of human limbs, such as the basic position rules of the neck, shoulders, elbows, wrists, fingers, waist, knees, and ankles, the limb key points are connected in sequence to generate a limb structure model, and the limb structure model can reflect the user's limb movements, especially the precise movements of the user's fingers.
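  • Purely as a sketch of how such a facial framework model could be read out, the heuristics below estimate facial orientation and a smile from the relative positions of eye, nose and mouth key points; the thresholds and sign conventions are assumptions made for illustration, not the method of the disclosure.

```python
def facial_orientation(left_eye, right_eye, nose):
    """Rough yaw estimate from how far the nose sits from the midpoint of the eyes."""
    mid_x = (left_eye[0] + right_eye[0]) / 2.0
    eye_span = abs(right_eye[0] - left_eye[0]) or 1e-6
    offset = (nose[0] - mid_x) / eye_span  # normalized horizontal nose offset
    if offset > 0.15:
        return "facing right"
    if offset < -0.15:
        return "facing left"
    return "facing straight ahead"

def is_smiling(mouth_left, mouth_right, mouth_top, mouth_bottom):
    """Heuristic: mouth corners noticeably higher than the mouth centre suggest a smile."""
    centre_y = (mouth_top[1] + mouth_bottom[1]) / 2.0
    corner_y = (mouth_left[1] + mouth_right[1]) / 2.0
    return corner_y < centre_y - 0.02  # image y grows downward, so a smaller y is higher
```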
  • This embodiment of the present application analyzes information such as the facial expressions (such as joy, anger, sorrow, and happiness), facial orientation (such as facing straight ahead or to the right), and body movements (such as raising an arm or kicking) of the target character selected by the user in the first video as an interpretation of the content of the video picture, and this information is displayed to the user through the UI in the form of a prompt message, which intuitively and clearly guides the user to complete the shooting.
  • For each target video clip, obtain the movement direction change information and size change information of the target object in the target video clip, and determine the camera shooting method corresponding to the target video clip according to the movement direction change information and size change information of the target object in the target video clip.
  • Optionally, a detection and tracking algorithm based on grayscale images is used to detect and track the moving target (such as the person appearing in the video) in the video picture, so as to analyze and determine the movement direction trend and size change trend of the moving target in the video picture, and based on this, the camera shooting method of the video picture is deduced.
  • In this way, the determination of the camera shooting mode is assisted, and the camera shooting mode is displayed on the user's shooting interface through the UI, thereby realizing effective shooting guidance for the user.
  • the detection and tracking algorithm based on grayscale images works as follows: first, identify the outline of the target object in the video picture; after that, convert the multi-frame video picture images into grayscale images, and perform analysis and calculation on them to complete the detection and tracking of the target. As shown in FIG. 17, the general flow of the detection and tracking algorithm includes but is not limited to:
  • the MainWin class 1701 is used to initialize the camera, draw a graphical interface, read the next frame of color image from the camera and hand it over to the Process class 1702 for processing.
  • the Process class 1702 is used to convert the next frame of color image into a grayscale image and to compute the difference between the currently converted grayscale image and the previous frame of grayscale image; among them, because the simple frame-difference method often fails to reach the required detection accuracy, the horizontal and vertical projections of the difference image can be used to complete the detection. That is, the difference image is projected horizontally and vertically, and a horizontal threshold and a vertical threshold are calculated accordingly; the horizontal threshold and the vertical threshold are used to segment the target, the horizontal and vertical coordinates of the target are determined according to the two thresholds, and a rectangular tracking frame of the target is drawn according to the horizontal and vertical coordinates.
  • The Tracker class 1703 is used to track the target. It first analyzes whether the target is a newly appearing target or a target that already existed in the previous image frame and continues to move in the current image frame, and then performs the corresponding operation for each case. For example, if the target was detected previously, the target is marked as matched and appended to the end of its tracking chain; if the target has not been detected before, an empty chain is created for the new target. To support subsequent tracking, an empty chain is usually created for each newly appearing target. A compact sketch of this frame-difference detection flow is given below.
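  • The following sketch is a minimal illustration of the frame-difference detection described above, assuming OpenCV and NumPy are available; the threshold rule and the input file name are assumptions, and the per-target tracking chains are omitted.

```python
# Minimal sketch (not the patent's implementation) of grayscale frame-difference
# detection with horizontal/vertical projections.
import cv2
import numpy as np

def detect_moving_target(prev_gray, curr_gray):
    """Return an (x, y, w, h) box for the dominant moving region, or None."""
    diff = cv2.absdiff(curr_gray, prev_gray)          # inter-frame difference
    col_proj = diff.sum(axis=0).astype(np.float64)    # vertical projection (per column)
    row_proj = diff.sum(axis=1).astype(np.float64)    # horizontal projection (per row)
    # Simple adaptive thresholds derived from the projections (an assumption here).
    col_thr = col_proj.mean() + col_proj.std()
    row_thr = row_proj.mean() + row_proj.std()
    cols = np.where(col_proj > col_thr)[0]
    rows = np.where(row_proj > row_thr)[0]
    if cols.size == 0 or rows.size == 0:
        return None
    x, y = int(cols[0]), int(rows[0])
    w, h = int(cols[-1] - cols[0] + 1), int(rows[-1] - rows[0] + 1)
    return (x, y, w, h)

cap = cv2.VideoCapture("clip.mp4")   # hypothetical input path
ok, frame = cap.read()
prev = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY) if ok else None
while ok:
    ok, frame = cap.read()
    if not ok:
        break
    curr = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    box = detect_moving_target(prev, curr)
    if box is not None:
        x, y, w, h = box
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)  # tracking frame
    prev = curr
cap.release()
```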
  • The camera shooting mode corresponding to the target video clip can then be determined as follows: for example, if the grayscale image of the target object between two adjacent frames gradually becomes larger, the lens is pushing in; if the grayscale image of the target gradually moves toward the left of the picture, the corresponding lens movement is panning to the right.
  • The target object here may be the target character selected by the user, which is not specifically limited in this embodiment of the present application. A heuristic sketch of this inference is given below.
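  • As a hedged illustration only, the snippet below infers a coarse camera move from how a tracked bounding box changes between adjacent frames; the box format and the thresholds are assumptions, not values from the application.

```python
# Illustrative heuristic: infer a coarse camera move from box changes between frames.
def infer_camera_move(prev_box, curr_box, min_scale=1.05, min_shift=10):
    px, py, pw, ph = prev_box
    cx, cy, cw, ch = curr_box
    scale = (cw * ch) / max(pw * ph, 1)
    dx = (cx + cw / 2) - (px + pw / 2)
    if scale > min_scale:
        return "push in (dolly/zoom in)"      # target grows -> lens moves closer
    if scale < 1 / min_scale:
        return "pull out (dolly/zoom out)"    # target shrinks -> lens moves away
    if dx < -min_shift:
        return "pan right"                    # target drifts left -> camera pans right
    if dx > min_shift:
        return "pan left"
    return "static shot"

print(infer_camera_move((100, 80, 60, 120), (95, 78, 72, 140)))  # -> "push in (dolly/zoom in)"
```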
  • For each target video clip, the voice data of the target character in the target video clip is recognized to obtain the character dialogue of the target character in that clip. A minimal transcription sketch follows.
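  • The sketch below shows one possible way to obtain the dialogue text, assuming the clip's audio track has already been exported to a WAV file and that the speech_recognition package with its Google web recognizer is used; neither choice is prescribed by the application.

```python
# Sketch only: transcribe a clip's audio track to obtain the character dialogue.
import speech_recognition as sr

def transcribe_clip_audio(wav_path, language="zh-CN"):
    recognizer = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:    # audio extracted from the clip beforehand
        audio = recognizer.record(source)
    try:
        return recognizer.recognize_google(audio, language=language)
    except sr.UnknownValueError:
        return ""                             # no intelligible speech in this clip

print(transcribe_clip_audio("target_clip.wav"))   # hypothetical file name
```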
  • In addition, when performing the video synthesis processing, if the target character selected by the user appears in the same frame as other characters, a face-changing operation is also included.
  • Deepfake technology may be used to perform the face-swap operation.
  • The term Deepfake combines "deep machine learning" and "fake photo"; it is essentially a technical framework that applies deep learning models to image synthesis and replacement, and is a successful application of deep image generation models.
  • An Encoder-Decoder self-encoding and decoding architecture is used when building the model, and in the test phase an arbitrarily distorted face is restored. The whole process includes five steps: obtain a normal face photo → distort and transform the face photo → encode it into a vector with the Encoder → decode the vector with the Decoder → restore the normal face photo. A minimal training sketch of this idea follows.
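  • The toy PyTorch sketch below only illustrates the "distort → encode → decode → restore" training loop; the network sizes, the random-noise stand-in for geometric warping, and all hyperparameters are assumptions and do not reflect the model actually used.

```python
# Minimal autoencoder sketch (assumed PyTorch) of the five-step restoration idea.
import torch
import torch.nn as nn

class FaceAutoencoder(nn.Module):
    def __init__(self, dim=64 * 64 * 3, code=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Flatten(), nn.Linear(dim, code), nn.ReLU())
        self.decoder = nn.Sequential(nn.Linear(code, dim), nn.Sigmoid())

    def forward(self, x):
        return self.decoder(self.encoder(x)).view_as(x)

def random_warp(faces):
    # Placeholder "distortion": add noise; a real pipeline would warp the geometry.
    return (faces + 0.1 * torch.randn_like(faces)).clamp(0, 1)

model = FaceAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.L1Loss()

faces = torch.rand(8, 3, 64, 64)          # stand-in for normal face photos
for step in range(100):                   # training: restore the undistorted face
    restored = model(random_warp(faces))
    loss = loss_fn(restored, faces)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```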
  • The face-changing process of Deepfake technology is mainly divided into three stages: face localization, face conversion, and image stitching.
  • Face localization extracts the feature points of the original face, such as the left and right eyebrows, nose, mouth, and chin; these feature points roughly describe the organ distribution of the face.
  • The extraction can be performed directly through mainstream toolkits such as dlib and OpenCV, which generally use the classic HOG (Histogram of Oriented Gradient) face landmark algorithm, as sketched below.
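  • A short sketch of this landmark extraction with dlib's HOG-based detector is given below; the 68-point predictor weights must be obtained separately, and the image path is a placeholder.

```python
# Sketch of facial feature-point extraction with dlib's HOG face detector and a
# 68-point shape predictor. The model file path is an assumption.
import cv2
import dlib

detector = dlib.get_frontal_face_detector()                       # HOG-based detector
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.cvtColor(cv2.imread("face.jpg"), cv2.COLOR_BGR2RGB)     # hypothetical image
for rect in detector(img, 1):                                     # upsample once
    shape = predictor(img, rect)
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(shape.num_parts)]
    print(len(landmarks), "feature points: brows, eyes, nose, mouth, jawline")
```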
  • For face conversion, a generative model such as a GAN or VAE is used; its goal is to generate a B face with an A expression.
  • The final image stitching fuses the generated face into the background of the original image, so as to achieve the effect of changing only the face.
  • If the object to be processed is a video, the images need to be processed frame by frame, and the processed results are then re-spliced into a video, as sketched below.
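  • The following is a minimal sketch of that frame-by-frame pattern with OpenCV; the `swap_face` hook and the file names are hypothetical placeholders, and audio handling is omitted.

```python
# Sketch: read the source video, process each frame, re-splice into a new video.
import cv2

def process_video(src_path, dst_path, swap_face):
    cap = cv2.VideoCapture(src_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(dst_path, fourcc, fps, (w, h))
    ok, frame = cap.read()
    while ok:
        writer.write(swap_face(frame))     # per-frame face replacement hook
        ok, frame = cap.read()
    cap.release()
    writer.release()

process_video("original.mp4", "swapped.mp4", swap_face=lambda f: f)  # identity hook
```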
  • FIG. 18 shows the main architecture involved in the Deepfake technology.
  • The architecture mainly includes three parts, namely an encoder 1801, a generator 1802, and a discriminator 1803.
  • The encoder 1801 takes as input a video and the landmarks of that video (obtained by connecting the facial key points into lines), and outputs an N-dimensional vector.
  • The role of the encoder 1801 is to learn the information specific to a video (such as the invariance of the person's identity) while remaining invariant to pose; similar to a face recognition network, one video corresponds to one feature, the features of the face images within a video should stay close to the feature of the whole video, and the features of different videos should be far apart.
  • The generator 1802 is used to generate fake images based on the landmarks. It is worth noting that part of the input of the generator 1802 comes from the encoder 1801: the generator 1802 uses the specific face information learned by the encoder 1801 to complete the face according to the face shape given by the landmarks, thereby realizing the face-changing effect.
  • The discriminator 1803 includes two parts: an encoder network, which encodes the image into a vector, and an operation that multiplies a parameter W with that vector. A skeleton sketch of these three parts is given below.
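  • The skeleton below is only a shape-level sketch of the three parts of FIG. 18, assuming PyTorch; the layer choices, dimensions, and the flattened 68-landmark input are assumptions, and no training objective is shown.

```python
# Skeleton sketch of an identity encoder, a landmark-conditioned generator, and a
# discriminator that scores an image via an embedding multiplied by a parameter W.
import torch
import torch.nn as nn

class IdentityEncoder(nn.Module):          # video frames + landmarks -> N-dim vector
    def __init__(self, in_dim, n=512):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(in_dim, n), nn.ReLU())
    def forward(self, frames_and_landmarks):
        return self.net(frames_and_landmarks)

class Generator(nn.Module):                # landmarks + identity vector -> fake face image
    def __init__(self, lm_dim, n=512, out_dim=3 * 64 * 64):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(lm_dim + n, out_dim), nn.Sigmoid())
    def forward(self, landmarks, identity):
        return self.net(torch.cat([landmarks, identity], dim=1)).view(-1, 3, 64, 64)

class Discriminator(nn.Module):            # image -> vector, then score = vector @ W
    def __init__(self, img_dim=3 * 64 * 64, n=512):
        super().__init__()
        self.encode = nn.Sequential(nn.Flatten(), nn.Linear(img_dim, n), nn.ReLU())
        self.W = nn.Parameter(torch.randn(n, 1))
    def forward(self, image):
        return self.encode(image) @ self.W

frames_and_landmarks = torch.rand(4, 3 * 64 * 64 + 136)   # one frame + 68 (x, y) landmarks
landmarks = torch.rand(4, 136)
identity = IdentityEncoder(3 * 64 * 64 + 136)(frames_and_landmarks)
fake = Generator(lm_dim=136)(landmarks, identity)
score = Discriminator()(fake)                              # shape (4, 1)
```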
  • Through the above technologies, the embodiments of the present application can analyze and determine the human body posture, character dialogue, and camera shooting mode for the target character selected by the user in the first video, and display the corresponding prompt messages in the UI, thereby helping the user complete the video shooting in a more friendly way. This can significantly improve how faithfully the user's footage restores the original video, and thus enhances the realism of the content synthesis.
  • The overall execution process can be realized by relying on three parts, namely the user side, the terminal side, and the server side; around the user's operation flow, the corresponding technical capabilities are matched between the terminal side and the server side.
  • On the terminal side, the following processing may be included: face recognition, video clip generation and preview, UI element delivery, camera invocation, video synthesis, and so on.
  • On the server side, the following processing may be included: dotting the video timeline and analyzing the video content (such as face orientation, facial expressions, camera movement, and body movements).
  • Referring to FIG. 19, the method flow provided by the embodiment of the present application includes:
  • During playback of the original video, the user activates the video co-shooting function, and triggers the terminal to perform face recognition, by performing a trigger operation on the video co-shooting option displayed on the shooting interface.
  • The terminal performs face recognition in the original video, classifies the recognized faces according to character IDs, and presents the character IDs on the shooting interface for the user to select a character.
  • The user selects a character, and accordingly the terminal uploads the character ID of the target character selected by the user to the server.
  • According to the character ID uploaded by the terminal, the server analyzes and calculates the target time points at which the target character appears in the original video; it then performs video dotting processing according to these target time points and returns the target time points and the video dotting information to the terminal, so that the terminal can generate at least one target video clip associated with the target character and present the preview pictures of these target video clips to the user, allowing the user to preview the target video clips in which the selected target character appears. A minimal sketch of this clip-grouping step is given below.
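  • The snippet below is a hedged sketch of how detected face time points and keyframe ("dotting") marks could be grouped into target clips; the time values, gap threshold, and splitting rule are assumptions made for illustration.

```python
# Illustrative sketch: group time points at which the selected character's face was
# detected into contiguous target clips, splitting at gaps or after keyframe marks.
def build_target_clips(face_times, keyframe_times, max_gap=1.0):
    """face_times: sorted seconds at which the target face appears.
    keyframe_times: subset of face_times marked as keyframes (video dotting info)."""
    clips, start, prev = [], None, None
    for t in face_times:
        if start is None:
            start, prev = t, t
        elif t - prev > max_gap or prev in keyframe_times:
            clips.append((start, prev))
            start, prev = t, t
        else:
            prev = t
    if start is not None:
        clips.append((start, prev))
    return clips

print(build_target_clips([1.0, 1.5, 2.0, 5.0, 5.5, 6.0], keyframe_times={2.0, 6.0}))
# -> [(1.0, 2.0), (5.0, 6.0)]
```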
  • The server analyzes the picture content of the target video clips associated with the target character, obtains the human body posture, body movements, and camera shooting mode of the target character in each clip, and sends this information to the terminal; the terminal turns on the camera and presents this information to the user in the form of UI elements to guide the user's shooting.
  • The terminal performs content update processing on the original video based on the video shot by the user to obtain a co-shot video, and generates a preview picture of the co-shot video for the user to preview.
  • The terminal can display the video co-shooting option on the playback interface of the video watched by the user; after that, the terminal can perform video shooting in response to the user's trigger operation on the video co-shooting option. During the video shooting process, the terminal can automatically display a prompt message on the shooting interface, where the prompt message is used to guide the user's video shooting; that is, the prompt message is presented on the user's shooting interface to guide the user to complete the video shooting quickly and with high quality.
  • This video generation method can not only achieve a high-quality shooting effect but also significantly reduce the shooting cost: it can reach a high level in lens presentation and character performance, and at the same time speeds up the completion of the video shooting, saving time and labor costs.
  • In the video co-shooting scenario, the embodiment of the present application can output prompt messages that help the user shoot by analyzing the content of the video picture, thereby helping the user quickly engage in the video creation process. That is, on the premise of analyzing the content of the video picture, the embodiment of the present application guides the user's shooting by showing prompt messages to the user, where the prompt messages contain rich content, such as one or more of the camera shooting mode, human body posture, and character dialogue.
  • In this scene-fusion-based video co-shooting solution, the original video and the video shot by the user are synthesized into a single video, that is, the co-shot video presents only one picture at a time, and the original video and the user-shot video are played back in a linearly interspersed manner in chronological order. This ensures a seamless creative connection between the videos and makes the video creation process more friendly.
  • In other words, users can devote themselves to the video creation process in a more natural and immersive way, so that the final co-shot video has better coherence from the user's point of view; that is, the co-shot video is more consistent with the original video in terms of content presentation and character performance, avoiding a rigid splicing of the two videos. A simple timeline sketch of this one-track fusion is given below.
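  • The sketch below only illustrates the single-track timeline idea under simplified assumptions: segments are plain (start, end) tuples and the user-shot clips are looked up by segment index; it is not the actual synthesis pipeline.

```python
# Illustrative sketch of the "one-track" scene-fusion idea: walk the original
# segments in chronological order and swap in user-shot clips where replaced.
def build_coshot_timeline(original_segments, replaced, user_segments):
    """original_segments: list of (start, end) for the original video, in order.
    replaced: set of indices whose content the user re-shot.
    user_segments: dict index -> path of the user-shot clip for that segment."""
    timeline = []
    for i, (start, end) in enumerate(original_segments):
        if i in replaced:
            timeline.append(("user", user_segments[i], start, end))
        else:
            timeline.append(("original", None, start, end))
    return timeline      # a single linear track, never two pictures side by side

print(build_coshot_timeline([(0, 4), (4, 9), (9, 15)], {1}, {1: "user_clip_1.mp4"}))
```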
  • FIGS. 20 to 28 show product renderings of video co-shooting realized based on the video generation method provided by the embodiments of the present application.
  • the video generation method provided by the embodiments of the present application will now be described with reference to FIGS. 20 to 28 .
  • FIG. 20 shows a playback interface 2000 of the original video.
  • The playback interface 2000 displays a video co-shooting option labeled "I want to co-shoot".
  • When the user triggers this video co-shooting option, the user interface 2100 shown in FIG. 21 is displayed; two character options are displayed on the user interface 2100, namely character A and character B, and the user can select either of the two characters for replacement shooting.
  • For example, after the user clicks the video co-shooting option, the terminal can pop up a window to prompt that there are two characters in the video that can participate in the shooting, and the user can select one of the characters to replace, that is, the user will act out the picture content of the selected character.
  • As an example, the character option of character A and the character option of character B may each be presented with its corresponding character picture.
  • If the user selects one of the characters on the user interface 2100 shown in FIG. 21 (for example, character A), the terminal will display the preview picture of each of the four video clips that include character A on the playback interface 2200 shown in FIG. 22.
  • These four video clips are selected from the original video as clips that include character A, and the user can watch them at will.
  • For example, the preview pictures of the four video clips can be presented on the playback interface in a tiled manner or in a list manner, and the preview picture of each video clip can be its first frame, a key frame, or a randomly selected video frame, which is not specifically limited in this embodiment of the present application.
  • As shown in FIG. 23 to FIG. 26, during video shooting the terminal can display the video clip that the user needs to imitate in the upper left corner of the user interface, which not only serves to prompt the user but also does not occupy too much of the user interface.
  • In addition to the upper left corner, the video clip that the user needs to imitate may also be displayed in the upper right corner, lower left corner, or lower right corner of the user interface, which is not specifically limited in this embodiment of the present application.
  • FIG. 23 to FIG. 26 also show different types of prompt messages displayed on the user interface.
  • For FIG. 23, the terminal learns, by analyzing the picture content of the original video, that the user needs to face to the right at this moment, so a corresponding prompt message is displayed on the user interface 2300 to guide the user's shooting, which makes the video picture shot by the user better match the characters and picture logic of the original video.
  • As shown in FIG. 23, the prompt message presented on the user interface 2300 at this time includes a face-orientation prompt icon and the prompt text "face to the right".
  • the prompt message may further include the camera shooting mode.
  • For FIG. 24, analysis of the picture content of the original video shows that the current shot is a push-in shot, so the terminal will display the prompt icon of the camera shooting mode (the arrow in FIG. 24) and the prompt text ("push the picture in") on the user interface 2400 to inform the user how to operate the lens.
  • At the same time, the terminal can also display the character dialogue that matches the current shooting progress, so as to inform the user of the text that needs to be read out while shooting.
  • FIG. 24 shows that the user needs to read out the character dialogue "Let's take a photo together?" while pushing the picture in.
  • the prompt message may also include body movements.
  • For FIG. 25, analysis of the picture content of the original video shows that the current character's left arm is raised, so the terminal synchronously displays the body movement on the user interface 2500, that is, a prompt icon and prompt text for the body movement are shown on the user interface 2500.
  • As shown in FIG. 25, the prompt icon may be a "figure in motion" and the prompt text may be "Raise the left arm".
  • In addition, the character dialogue "Really?" needs to be read out while the user performs this body movement.
  • The prompt message may also include facial expressions, that is, a prompt icon and prompt text for the facial expression may also be displayed on the user interface 2600.
  • For FIG. 26, analysis of the picture content of the original video shows that the current character is smiling while facing to the right, so the terminal synchronously displays the facial expression on the user interface 2600, that is, the prompt icon and prompt text of the facial expression are shown on the user interface.
  • As shown in FIG. 26, the prompt icon may be a "smiley face" and the prompt text may be "smile facing to the left".
  • As shown in FIG. 27, after the co-shot video is generated, the terminal will display the preview picture, playback option, playback progress bar, and video modification options of the co-shot video on the user interface 2700.
  • There may be multiple video modification options; only five video modification options are exemplarily shown in FIG. 27, namely adjust material, text, sticker, filter, and beauty. It can be understood that the number of video modification options may be more or fewer than the five shown in the figure, which is not specifically limited in this embodiment of the present application.
  • FIG. 28 shows several video pictures captured in the co-shot video, and these video pictures are sequentially ordered in chronological order from left to right.
  • If these video pictures are numbered 1 to 7 from left to right, then video picture 1, video picture 3, and video picture 5 come from the original video, while video picture 2, video picture 4, and video picture 6 come from the video shot by the user; video picture 7 is obtained by swapping the face of the target character in the corresponding picture of the original video, that is, by replacing the face image of the target character with the face image of the user.
  • FIG. 29 is a schematic structural diagram of a video generation apparatus provided by an embodiment of the present application. Referring to Figure 29, the device includes:
  • a first processing module 2901 configured to perform video shooting in response to a trigger operation for the video co-shooting option
  • the video acquisition module 2902 is used to acquire the second video obtained by the current shooting; the second video corresponds to the video clip including the target character in the first video;
  • the second processing module 2903 is configured to fuse the second video into the video content of the first video based on the identification of the target character and other characters in the first video to obtain a co-shot video.
  • the device further includes:
  • a message obtaining module configured to obtain a prompt message based on the identification of the screen content of the first video; the prompt message is used to guide the shooting of the second video;
  • the first display module is used for displaying the prompt message on the shooting interface during the video shooting process.
  • In a possible implementation, the second processing module is configured to, if the first video does not include a same-frame picture containing both the target character and other characters, replace the video clip that includes the target character in the first video with the second video.
  • In a possible implementation, the second processing module is configured to, if the first video includes a same-frame picture containing both the target character and other characters, replace the face image of the target character in the same-frame picture with the user's face image from the second video.
  • the prompt message includes one or more of camera shooting methods, human body posture and character dialogue;
  • The first display module is configured to perform one or more of the following: displaying at least one of a prompt icon and prompt text of the camera shooting mode on the shooting interface; displaying at least one of a prompt icon and prompt text of the human body posture on the shooting interface, where the human body posture includes one or more of facial expressions, facial orientation, and body movements; and displaying the character dialogue on the shooting interface.
  • the first video includes N characters, where N is a positive integer and N ⁇ 2, and the apparatus further includes:
  • a second display module configured to display N character options on the playback interface of the first video before shooting the video in response to a trigger operation for the video co-shooting option
  • the third processing module is configured to, in response to a triggering operation for the target character option in the N character options, filter out M video clips including the target character in the first video as target video clips; wherein M is a positive integer.
  • the second display module is further configured to display a preview image of each of the target video clips on the playback interface
  • the third processing module is further configured to play the specified target video segment in response to a trigger operation on the preview screen of the specified target video segment.
  • the apparatus further includes:
  • The third display module is configured to display a floating video window on the shooting interface, where the video window is used to display the video clip, in the first video, that corresponds to the prompt message.
  • the apparatus further includes:
  • The fourth display module is used to display the preview picture, playback option, playback progress bar, and video modification options of the co-shot video after the co-shot video is generated;
  • a fourth processing module configured to play the co-shot video in response to a trigger operation for the playback option
  • the fourth display module is further configured to display the playback progress of the co-shot video through the playback progress bar;
  • a fifth processing module configured to perform modification processing on the co-shot video in response to a triggering operation for the video modification option.
  • The third processing module is configured to determine the target time points at which the target character appears in the first video, perform key frame marking on the target time points to obtain video dotting information, and segment the M target video clips from the first video according to the video dotting information and the target time points.
  • the message acquisition module is specifically configured to perform screen content analysis on each of the target video clips to obtain a prompt message corresponding to each of the target video clips;
  • the first display module is specifically configured to display a prompt message corresponding to each target video clip on the shooting interface during the shooting process of each target video clip.
  • The message acquisition module is configured to, for each target video clip, determine the human body key points of the target character in the target video clip through a human key point detection network; connect the facial key points among the human body key points according to the relative positional relationship between different parts of the face to obtain a facial framework model, and determine the facial expression and facial orientation of the target character in the target video clip according to the facial framework model; and connect the limb key points among the human body key points according to the relative positional relationship between different parts of the limbs to obtain a limb framework model, and determine the body movements of the target character in the target video clip according to the limb framework model.
  • The message acquisition module is specifically configured to acquire, for each target video clip, the movement direction change information and size change information of the target object in the target video clip, and to determine the camera shooting mode corresponding to the target video clip according to this movement direction change information and size change information.
  • the message acquisition module is specifically configured to, for each target video clip, identify the voice data of the target character in the target video clip, and obtain the target character Character dialogue in the target video clip.
  • The first display module is further configured to display an error prompt message on the shooting interface if the video picture currently shot by the camera does not match the currently displayed prompt message; the error prompt message is used to guide the user to re-shoot the video.
  • It should be noted that, when the video generation apparatus provided in the above embodiments generates a video, the division into the above functional modules is used only as an example for illustration; in practical applications, the above functions may be allocated to different functional modules as required, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • the video generating apparatus and the video generating method embodiments provided by the above embodiments belong to the same concept, and the specific implementation process thereof is detailed in the method embodiments, which will not be repeated here.
  • FIG. 30 shows a structural block diagram of an electronic device 3000 provided by an exemplary embodiment of the present application.
  • the electronic device 3000 can be used to execute the video generation method in the above method embodiments.
  • The device 3000 may be a portable mobile terminal, such as a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop, or a desktop computer.
  • Device 3000 may also be referred to by other names such as user equipment, portable terminal, laptop terminal, desktop terminal, and the like.
  • the device 3000 includes: a processor 3001 and a memory 3002 .
  • the processor 3001 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • The processor 3001 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 3001 may also include a main processor and a coprocessor.
  • The main processor is a processor used to process data in the awake state, also called a CPU (Central Processing Unit); the coprocessor is a low-power processor used to process data in the standby state.
  • In some embodiments, the processor 3001 may be integrated with a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content that needs to be displayed on the display screen.
  • In some embodiments, the processor 3001 may further include an AI (Artificial Intelligence) processor, where the AI processor is used to handle computing operations related to machine learning.
  • Memory 3002 may include one or more computer-readable storage media, which may be non-transitory. Memory 3002 may also include high-speed random access memory, as well as non-volatile memory, such as one or more disk storage devices, flash storage devices. In some embodiments, the non-transitory computer-readable storage medium in the memory 3002 is used to store at least one program code, and the at least one program code is used to be executed by the processor 3001 to implement the methods provided by the method embodiments in this application. Video generation method.
  • the device 3000 may also optionally include: a peripheral device interface 3003 and at least one peripheral device.
  • the processor 3001, the memory 3002 and the peripheral device interface 3003 may be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 3003 through a bus, a signal line or a circuit board.
  • The peripheral devices include at least one of a radio frequency circuit 3004, a touch display screen 3005, a camera 3006, an audio circuit 3007, a positioning component 3008, and a power supply 3009.
  • The electronic device 3100 may vary greatly due to different configurations or performance, and may include one or more processors (CPUs) 3101 and one or more memories 3102, where at least one piece of program code is stored in the memory 3102, and the at least one piece of program code is loaded and executed by the processor 3101 to implement the video generation method provided by the above method embodiments.
  • the electronic device may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the electronic device may also include other components for implementing device functions, which will not be repeated here.
  • In an exemplary embodiment, a computer-readable storage medium, such as a memory including program code, is also provided; the program code can be executed by a processor in the terminal to complete the video generation method in the foregoing embodiments.
  • For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, or the like.
  • In an exemplary embodiment, a computer program product or computer program is also provided, comprising computer program code stored in a computer-readable storage medium; the processor of the electronic device reads the computer program code from the computer-readable storage medium and executes it, so that the electronic device performs the video generation method in the above embodiments.

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • User Interface Of Digital Computer (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

提供一种视频生成方法、装置、存储介质及计算机设备,属于视频处理技术领域。所述方法包括:响应于针对视频合拍选项的触发操作,进行视频拍摄;获取当前拍摄得到的第二视频,该第二视频对应于第一视频中包括目标角色的视频片段;基于对第一视频中的目标角色及其他角色的识别,将第二视频融合至第一视频的视频内容中,获得合拍视频。该方法不但能够取得优质的拍摄效果,而且还能够降低拍摄成本。

Description

视频生成方法、装置、存储介质及计算机设备
本申请要求于2020年07月03日提交中国专利局、申请号为2020106368525、申请名称为“视频生成方法、装置、存储介质及计算机设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及视频处理技术领域,特别涉及视频生成技术。
背景技术
物质文明的快速发展,使得大众对精神文明的追求日益提高,随之市面上涌现出了众多的视频分享平台,比如短视频分享平台便是其中一种。目前,原创用户在拍摄或制作好视频后,可以将视频上传至视频分享平台。而其他用户除了可以通过视频分享平台观看该视频之外,还可以对该视频进行诸如点赞、评论或转发等操作。
另外,出于增添趣味性、提升视频产量等方面的考量,视频分享平台还支持用户对他人视频进行二度创作,比如用户在视频分享平台上浏览到喜爱的视频后,可以基于该视频制作合拍视频,即用户可以将自身拍摄的视频与他人视频融合起来得到合拍视频。而在拍摄合拍视频时,拍摄效果和拍摄成本一直是用户关心的问题。为此,如何在视频合拍场景中取得优质的拍摄效果,同时降低拍摄成本,便成为了时下本领域技术人员亟待解决的一个问题。
发明内容
本申请实施例提供了一种视频生成方法、装置、存储介质及计算机设备,不但能够取得优质的拍摄效果,而且还能够降低拍摄成本。所述技术方案如下:
一方面,提供了一种视频生成方法,由电子设备执行,所述方法包括:
响应于针对视频合拍选项的触发操作,进行视频拍摄;
获取当前拍摄得到的第二视频;所述第二视频对应于第一视频中包括目标角色的视频片段;
基于对所述第一视频中的目标角色及其他角色的识别,将所述第二视频融合至所述第一视频的视频内容中,获得合拍视频。
另一方面,提供了一种视频生成装置,所述装置包括:
第一处理模块,用于响应于针对视频合拍选项的触发操作,进行视频拍摄;
视频获取模块,用于获取当前拍摄得到的第二视频;所述第二视频对应于第一视频中包括目标角色的视频片段;
第二处理模块,用于基于对所述第一视频中的目标角色及其他角色的识别,将所述第二视频融合至所述第一视频的视频内容中,获得合拍视频。
另一方面,提供了一种电子设备,所述设备包括处理器和存储器,所述存储器中存储有至少一条程序代码,所述至少一条程序代码由所述处理器加载并执行以实现上述的视频生成方法。
另一方面,提供了一种存储介质,所述存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现上述的视频生成方法。
另一方面,提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机程序代码,该计算机程序代码存储在计算机可读存储介质中,计算机设备的处理器从计算机可读存储介质读取该计算机程序代码,处理器执行该计算机程序代码,使得该计算机设备执行上述的视频生成方法。
附图说明
图1是本申请实施例提供的一种视频生成方法涉及的实施环境的示意图;
图2是本申请实施例提供的一种视频生成方法的流程图;
图3是本申请实施例提供的一种用户界面的示意图;
图4是本申请实施例提供的一种视频生成方法的流程图;
图5是本申请实施例提供的另一种用户界面的示意图;
图6是本申请实施例提供的另一种用户界面的示意图;
图7是本申请实施例提供的另一种用户界面的示意图;
图8是本申请实施例提供的另一种用户界面的示意图;
图9是本申请实施例提供的另一种用户界面的示意图;
图10是本申请实施例提供的另一种用户界面的示意图;
图11是本申请实施例提供的另一种用户界面的示意图;
图12是本申请实施例提供的另一种用户界面的示意图;
图13是本申请实施例提供的一种视频生成方法的流程图;
图14是本申请实施例提供的一种视频生成方法的流程图;
图15是本申请实施例提供的一种视频生成方法的流程图;
图16是本申请实施例提供的一种人体关键点的示意图;
图17是本申请实施例提供的一种移动目标物的检测和跟踪的流程示意图;
图18是本申请实施例提供的一种Deepfake技术的架构图;
图19是本申请实施例提供的一种视频生成方法的整体执行流程的示意图;
图20是本申请实施例提供的另一种用户界面的示意图;
图21是本申请实施例提供的另一种用户界面的示意图;
图22是本申请实施例提供的另一种用户界面的示意图;
图23是本申请实施例提供的另一种用户界面的示意图;
图24是本申请实施例提供的另一种用户界面的示意图;
图25是本申请实施例提供的另一种用户界面的示意图;
图26是本申请实施例提供的另一种用户界面的示意图;
图27是本申请实施例提供的另一种用户界面的示意图;
图28是本申请实施例提供的另一种用户界面的示意图;
图29是本申请实施例提供的一种视频生成装置的结构示意图;
图30是本申请实施例提供的一种电子设备的结构示意图;
图31是本申请实施例提供的一种电子设备的结构示意图。
具体实施方式
首先对本申请实施例提供的视频生成方法涉及的实施环境进行介绍。
示例一,参见图1,该实施环境可以包括:终端101和服务器102。即,本申请实施例提供的视频生成方法由终端101和服务器102联合执行。
其中,终端101通常为移动式终端。在一种可能的实现方式中,终端101终端包括但不限于:智能手机、平板电脑、笔记本电脑、台式计算机等。
其中,服务器102可以是独立的物理服务器,也可以是多个物理服务器构成的服务器集群或者分布式系统,还可以是提供云服务、云数据库、云计算、云函数、云存储、网络服务、云通信、中间件服务、域名服务、安全服务、CDN、以及大数据和人工智能平台等基础云计算服务的云服务器。终端以及服务器可以通过有线或无线通信方式进行直接或间接地连接,本申请在此不做限制。
作为一个示例,终端101上通常安装有视频客户端,服务器102用于为该视频客户端提供后台服务,以支持用户通过该视频客户端浏览其他用户发布在视频分享平台上的视频。
示例二,该实施环境还可以仅包括终端101。即,本申请实施例提供的视频生成方法还可由终端101单独执行。针对该种情况,通常需要终端101具备强大的计算处理能力。
基于上述的实施环境,本申请实施例提供的视频生成方法可以应用在视频合拍场景下。
场景1、短视频的视频合拍场景
其中,短视频通常指代播放时长较短,比如小于某一时长阈值的视频。示例性地,该时长阈值可以为30秒或50秒或60秒等,本申请实施例对此不进行具体限定。
一方面,在视频合拍过程中,本申请实施例可以基于现有视频画面内容输出提示消息为用户提供拍摄引导,使得用户可以根据视频所记录的故事内容,自己低成本地拍摄出与原有视频画面融合度高、有趣味的视频。
即,本申请实施会基于对视频画面内容的理解与分析(比如分析摄像头拍摄方式、角色的人体姿态、识别角色对白等),为用户提供拍摄引导(比如角色的面部朝向、面部表情、肢体动作、摄像头拍摄方式、角色对白等)。这样当用户进行视频合拍时,可以根据系统输出的提示消息进行拍摄。也即,本申请实施例在视频合拍过程中,可以对诸如用户运动姿态、面部表情状态、摄像头拍摄方式等进行引导,从而更友好地帮助用户完成视频拍摄,降低用户进行视频合拍的拍摄成本,提升拍摄效率,同时提升拍摄效果,使得最终的合拍视频具有较好的内容还原度。
示例性地,摄像头拍摄方式包括但不限于:摄像头的取景方式、摄像头的运动方式。可选的,摄像头的取景方式包括但不限于水平取景或竖向取景等;可选的,摄像头的运动方式包括但不限于:推镜头、拉镜头、摇镜头、跟镜头、静止镜头、镜头左右上下移动等。
另一方面,本申请实施例提供的是一种基于场景融合的短视频合拍方式。示例性 地,场景融合的含义是,原始视频和用户拍摄的视频不但在内容上相互关联,而且最终得到的合拍视频是通过对原始视频和用户拍摄的视频进行内容上的融合得到的,即视频的合成处理是将用户拍摄的视频穿插到原始视频中,对原始视频中部分视频片段进行替换,最终得到的是一路视频,也即将原始视频和用户拍摄的视频合成处理为一路视频,得到合拍视频。其中,合拍视频的每帧视频图像中均包含一路视频画面。换言之,合拍视频在进行呈现时画面中仅包括一路视频,而非同一画面上包括两路视频,即该种基于场景融合的短视频合拍方式不是对两路视频进行生硬拼接,不是在同一画面上呈现诸如左右分屏、上下分屏或大小窗模式的两路视频。
场景2、其他视频的视频合拍场景
除了场景1中示出的短视频合拍场景之外,本申请实施例提供的视频生成方法还可以应用于其他视频的合拍场景下,比如电影片段或电视剧片段等,本申请实施例对此不进行具体限定。
图2是本申请实施例提供的一种视频生成方法的流程图,执行主体示例性的为图1中示出的终端101,应理解,在实际应用中,本申请实施例提供的视频生成方法还可以由其它具备视频处理能力的电子设备。参见图2,本申请实施例提供的方法包括:
201、终端响应于针对视频合拍选项的触发操作,进行视频拍摄。
可选的,终端可以在第一视频的播放界面上显示该视频合拍选项。其中,第一视频在本申请实施例中也被称之为原始视频。即本文将用户浏览并播放的视频称之为第一视频。示例性地,第一视频可以为视频分享平台的注册用户发布到视频分享平台的短视频,该短视频既可以为用户原创视频,也可以为用户模仿视频,还可以为用户在诸如电视剧、电影或任意类型的视频中截取的一小段视频,本申请实施例对此不进行具体限定。另外,第一视频除了短视频形式之外,还可以为时长大于短视频的其它形式视频,本申请实施例对此同样不进行具体限定。宽泛地来讲,任何形式的包含人物角色的视频均可应用于本方法。
如图3所示,在第一视频的播放过程中,可以在播放界面上显示一个视频合拍选项301。作为一个示例,为了避免该视频合拍选项301对呈现的视频画面过多遮挡,可以将该视频合拍选项301布局显示在播放界面的边缘位置处,比如播放界面的左边缘、右边缘、上边缘或下边缘等。在图3中,该视频合拍选项301显示在了播放界面的右边缘且靠下的位置。当然,在实际应用中,视频合拍选项301也可以显示其它位置,如播放界面中除边缘位置处外的其它位置,又如播放界面对应的视频操作选项显示栏中,本申请在此不对视频合拍选项301的显示位置做任何限定。
在终端播放第一视频过程中,如若播放界面上显示诸如“我要合拍”的视频合拍选项301,则代表用户可以与当前播放的第一视频进行视频合拍的互动。
示例性地,针对视频合拍选项的触发操作可以为用户对图3中示出的视频合拍选项301进行点击操作,本申请实施例对此不进行具体限定。
202、获取当前拍摄得到的第二视频;第二视频对应于第一视频中包括目标角色的视 频片段。
相对于原始视频,终端当前拍摄得到的第二视频在本文中也被称之为用户拍摄视频。通常情况下,用户拍摄的第二视频可以对应于第一视频中包括目标角色的视频片段,目标角色可以是用户在拍摄第二视频之前选择的、自己所要扮演的角色,目标角色可以是第一视频中存在的任意一个角色。
可选的,终端还可以基于对第一视频的画面内容的识别,获取提示消息;并在视频拍摄过程中,在拍摄界面上显示提示消息;其中,该提示消息用于指导拍摄第二视频,即为用户拍摄第二视频提供指导提示。
其中,提示消息是通过对第一视频进行画面内容分析得到。该分析步骤既可以由终端执行也可以由服务器执行。在一种可能的实现方式中,该提示消息包括:摄像头拍摄方式、人体姿态和角色对白中的一种或多种;可选的,通过显示摄像头拍摄方式,可以告知用户该如何真实地还原第一视频的拍摄过程,以保证所拍摄的第二视频与原始的第一视频具有较高的一致性;人体姿态可以包括面部表情、面部朝向和肢体动作中的一种或多种。而角色对白通俗来讲是指角色的台词。
作为一个示例,为了对用户进行更好地拍摄引导,终端在显示提示消息时,可以选择图标和文字结合的引导方式。即,终端在拍摄界面上显示提示消息,可以包括以下一项或多项:
终端在拍摄界面上显示摄像头拍摄方式的提示图标和提示文字。
终端在拍摄界面上显示人体姿态的提示图标和提示文字。
终端在拍摄界面上显示角色对白。
当然,在实际应用中,终端也可以仅显示提示图标和提示文字中的任意一种,即终端可以在拍摄界面上显示摄像头拍摄方式的提示图标或提示文字,终端也可以在拍摄界面上显示人体姿态的提示图标或提示文字,本申请在此不对终端显示的提示消息的内容做任何限定。
203、基于对第一视频中的目标角色及其他角色的识别,将第二视频融合至第一视频的视频内容中,获得合拍视频。
通过对原始视频中目标角色和其他角色的识别,对原始视频和用户拍摄视频进行合成处理,即可得到合拍视频,而合拍视频最终可为用户呈现一种视频合拍效果。其中,除了终端执行合成处理之外,也可以由服务器执行合成处理,本申请实施例对此不进行具体限定。
作为一个示例,将第二视频融合至第一视频的视频内容中获得合拍视频,包括但不限于:若第一视频中不包括被选中的目标角色和其他角色的同框画面,则利用第二视频替换第一视频中包括目标角色的视频片段;即,该种方式利用第二视频包括的视频帧来替换第一视频中包括目标角色的视频帧;若第一视频中包括目标角色和其他角色的同框画面,则利用第二视频中的用户面部图像替换同框画面中的目标角色的面部图像。即,该种方式对上述同框画面中的目标角色进行换脸,将上述同框画面中的目标角色面部头像替换成第二视频中的用户面部图像。可选的,合拍视频在播放时能够呈现如下效果: 第一视频的视频画面和第二视频的视频画面呈线性穿插播放。
需要说明的是,同框画面是指在同时包括目标角色和其它角色的视频画面,例如,假设第一视频中包括角色A、角色B和角色C,用户拍摄第二视频之前选择了角色A作为目标角色,那么第一视频中同时包括角色A和角色B的画面、同时包括角色A和角色C的画面、以及同时包括角色A、角色B和角色C的画面,均属于目标角色和其他角色的同框画面。
在本申请实施例提供的方法中,终端会显示一个视频合拍选项;终端可以响应于用户对该视频合拍选项的触发操作,进行视频拍摄,获得当前拍摄的第二视频,该第二视频对应于第一视频中包括目标角色的视频片段;进而,基于对第一视频中的目标角色及其他角色的识别,将第二视频融合至第一视频的视频内容中,获得合拍视频。即合拍视频是通过对第一视频和第二视频进行内容上的融合得到的,这使得合拍视频具有良好的内容契合度,用户能够深度融入到视频制作中,提高了视频个性化程度。该种视频生成方法不但能够取得优质视频制作效果,而且显著地降低了拍摄成本。
图4是本申请实施例提供的一种视频生成方法的流程图,执行主体示例性的可以为图1中示出的终端101。其中,第一视频中包括N个角色,N为正整数且N≥2。即,本申请实施例提供的视频合拍方案的实施前提条件是原始视频中包括至少两个角色。参见图4,本申请实施例提供的方法流程包括:
401、终端在第一视频的播放界面上显示视频合拍选项。
本步骤同上述步骤201类似,此处不再赘述。
402、终端响应于用户针对该视频合拍选项的触发操作,在播放界面上显示N个角色选项。
在本申请实施例中,在用户对该视频合拍选项执行了触发操作后,终端便确认用户启动使用视频合拍功能,而该触发操作还会激活终端执行在第一视频中进行人脸识别的步骤。示例性地,可以采用基于卷积神经网络的人脸识别算法进行人脸识别。终端通过在第一视频中进行人脸识别,得出第一视频中包括的角色数量以及角色ID。其中,角色数量与角色选项的个数一致。
图5中示出了N个角色选项501。由图5中示出的N个角色选项501可知,用户选中合拍的第一视频中包括2个角色,分别为角色1和角色2。用户可以选择这两个角色中的任意一个角色进行替换拍摄。比如,在用户点击视频合拍选项后,终端可以弹窗提示该视频中可参与拍摄的角色有两个,用户可以选择其中一个角色进行替换,即由用户来表演被选中角色的画面内容。
作为一个示例,图5中角色1的角色选项和角色2的角色选项,可分别用各自对应的角色图片来呈现。而该角色图片可以是角色1在第一视频中的一帧视频画面,以及角色2在第一视频中的一帧视频图像,本申请实施例对此不进行具体限定。
403、终端响应于用户针对N个角色选项中的目标角色选项的触发操作,从第一视频中筛选出包括目标角色的M个目标视频片段,并在播放界面上显示每个目标视频片段的 预览画面。
其中,M为正整数且M≥1。而针对目标角色选项的触发操作可以为用户对这N个角色选项中任意一个角色选项的点击操作,该被用户选中的角色选项对应的角色在本文中称之为目标角色。在本申请实施例中,如果用户选择图5示出的其中一个角色(比如选择角色1),则终端或服务器可以从第一视频中筛选出包括角色1的M个视频片段作为目标视频片段,进而终端会在播放界面上显示M个目标视频片段中每个目标视频片段的预览画面,用户可以随意观看这些目标视频片段。
图6中示出了与角色1相关的4个目标视频片段的预览画面601。示例性地,这4个目标视频片段的预览画面601可以通过平铺方式或列表方式呈现在播放界面上,而4个目标视频片段的预览画面601可以为每个目标视频片段的首帧或关键帧或随机选取的一个视频帧,本申请实施例对此不进行具体限定。
404、终端响应于用户针对M个目标视频片段中指定目标视频片段的预览画面的触发操作,播放指定目标视频片段。
需要说明的是,如果第一视频中与目标角色相关的目标视频片段个数较多,则本申请实施例还支持响应于用户针对各目标视频片段的预览画面的滑动操作,滑动展示每个目标视频片段的预览画面。作为一个示例,针对指定目标视频片段的预览画面的触发操作可以为用户对指定目标视频片段的预览画面的点击操作。
405、终端启动摄像头进行视频拍摄;并基于对第一视频的画面内容的识别,获取提示消息;在视频拍摄过程中,终端在拍摄界面上显示提示消息。
其中,该提示消息用于引导用户进行第二视频的拍摄。
在本申请实施例中,终端在启动摄像头进行拍摄之后,终端根据M个目标视频片段的先后顺序,将需要用户模仿表演的目标视频片段逐一在拍摄界面上呈现,并且会分析得出视频画面中的核心信息,以此得到与当前拍摄进度相适配的提示消息。即,在视频拍摄过程中,在拍摄界面上显示提示消息,包括但不限于:对与目标角色相关的每个目标视频片段进行画面内容分析,得到每个目标视频片段对应的提示消息;在每个目标视频片段的拍摄过程中,在拍摄界面上显示与每个目标视频片段对应的提示消息。
在一种可能的实现方式中,在拍摄界面上显示与每个目标视频片段对应的提示消息,包括但不限于采取以下方式:以置于顶层的显示方式,在拍摄界面上悬浮显示视频窗口;其中,该视频窗口用于显示与当前拍摄进度匹配的目标视频片段,即与当前显示的提示消息对应的目标视频片段。其中,置于顶层的显示方式的含义是,显示在页面最顶端,不被任何其他图层所遮挡。
如图7至图10所示,在视频拍摄过程中,终端可以选择将需要用户模仿表演的目标视频片段显示在拍摄界面的左上角,既达到对用户进行提示的目的,同时还不对拍摄界面进行过多占用。另外,除了左上角之外,还可以选择将需要用户模仿表演的视频片段显示在拍摄界面的右上角、左下角或右下角等位置,或者,终端也可以响应于用户对该视频窗口的拖拽操作,在拍摄界面中相应的位置处显示该视频窗口,本申请实施例对此不进行具体限定。
在本申请实施例中,图7至图10还示出了在拍摄界面上显示的不同类型的提示消息701。
针对图7,终端通过对第一视频进行画面内容分析得知用户此时需要面朝正前方进行拍摄,那么在拍摄界面上便会进行相应的提示,以引导用户拍摄,从而使得用户拍摄的视频画面与原始视频中的角色及画面逻辑有更好的匹配度。如图7所示,此时拍摄界面上呈现的提示消息701包括:面部朝向的提示图标和提示文字“面朝正前方”。
针对图8,为了确保用户可以真实地还原拍摄过程,以保证所拍摄的第二视频与原始的第一视频具有较高一致性,提示消息701还可以包括摄像头拍摄方式。如图8所示,通过对原始视频进行画面内容分析可知,当前为推镜头画面,那么终端在UI(User Interface,用户界面)展示上将会呈现图8中所示的摄像头拍摄方式的提示图标(图8中箭头)和提示文字(画面推进),从而告知用户该如何进行镜头的操控。另外,与此同时,终端还可以对与当前拍摄进度相匹配的角色对白进行UI展示,以告知用户在拍摄时需要读出的文字内容。其中,图8中示出了用户在将画面进行推进的同时,还需要读出“我们真的可以牵手”的角色对白。
针对图9,提示消息701还可以包括肢体动作。比如,通过对原始视频进行画面内容分析可知,当前角色的右侧胳膊抬起来了,那么终端也会同步在拍摄界面上进行肢体动作的UI展示,即在拍摄界面上展示肢体动作的提示图标和提示文字。如图9所示,该提示图标可以为“运动中的小人”,该提示文字可以为“抬起右侧胳膊”。另外,在用户执行这个肢体动作的同时还需要读出“真的吗”的角色对白。
针对图10,提示消息701还可以包括面部表情。即,拍摄界面上还可以展示面部表情的提示图标和提示文字。比如,通过对原始视频进行画面内容分析可知,当前角色面朝右侧微笑,那么终端也会同步在拍摄界面上进行面部表情的UI展示,即在拍摄界面上展示面部表情的提示图标和提示文字。如图10所示,该提示图标可以为“笑脸”,该提示文字可以为“面朝右侧微笑”。
在另一种可能的实现方式中,在视频拍摄过程中,为了方便用户熟悉角色对白和接下来要做的动作,以及避免用户错过每个目标视频片段(需要模仿的M个目标视频片段)刚开始的一两秒,在每个目标视频片段开始拍摄之前还可以先对用户进行倒计时提示。比如,可以在启动拍摄之前进行10秒或5秒或3秒的倒计时。可选的,倒计时的提示形式既可以是语音形式也可以是图文形式,本申请实施例对此不进行具体限定。可选的,在拍摄每个目标视频片段过程中,除了上述倒计时的提示方式之外,可以在界面上显示一个触发控件,检测到用户主动触发该控件后,再启动当前视频片段的拍摄。可选的,还可以由用户通过语音来触发当前拍摄。即,终端具有语音识别功能,在识别到用户发出的语音为启动拍摄指令后自动启动当前视频片段的拍摄。
406、终端基于对第一视频中的目标角色及其他角色的识别,将第二视频融合至第一视频的视频内容中,获得合拍视频。
本申请实施例提供的是一种基于场景融合的短视频合拍方式,在基于终端显示的提示消息的同时,终端会采集得到第二视频,而终端在对第一视频与当前拍摄得到的第二视 频进行合成处理时,通常采取以下处理方式:将第一视频与第二视频合成处理为一路视频,得到合拍视频;其中,合拍视频的每帧视频图像中均仅包含一路视频画面。
其中,场景融合的含义是,原始的第一视频和用户拍摄的第二视频不但在内容上相互关联,而且最终得到的合拍视频是通过对第一视频和第二视频进行内容上的融合得到的,即视频的合成处理是将用户拍摄的第二视频穿插到原始的第一视频中,实现的是对第一视频中部分视频片段的替换,最终得到的是一路视频,也即将原始的第一视频和用户拍摄的第二视频合成处理为一路视频,得到合拍视频。其中,合拍视频的每帧视频图像中均包含一路视频画面。
在一种可能的实现方式中,若与用户选中的目标角色关联的M个目标视频片段中不包括目标角色和其他角色的同框画面,那么终端可以直接利用第二视频替换M个目标视频片段;若用户选中的目标角色关联的M个目标视频片段中包括目标角色和其他角色的同框画面,则第二终端的处理方式为利用第二视频中的第一面部图像替换同框画面中目标角色的第二面部图像;其中,第一面部图像为用户模仿同框画面中的目标角色时,摄像头拍摄到的用户面部图像。
简言之,若用户所表演的角色需要与其他角色同框出现,那么此时终端的处理方式是:将原有视频中的人物面部图像替换为用户的面部图像,即换脸,以达到剧情和画面逻辑的一致性。
综上所述,合拍视频在呈现时画面中仅包括一路视频,而非同一画面上包括两路视频,即本申请中基于场景融合的视频合拍方式不是对两路视频进行生硬拼接,即不是在同一画面上呈现诸如左右分屏、上下分屏或大小窗模式的两路视频。
407、终端在生成合拍视频后,显示合拍视频的预览画面、播放选项、播放进度条以及视频修改选项;响应于用户针对播放选项的触发操作,播放合拍视频,并通过播放进度条动画显示合拍视频的播放进度。
其中,终端设备合成合拍视频完成后,用户可以选择观看最终的合拍视频,并选择是否进行发布或是修改视频。
如图11所示,在生成合拍视频后,终端会显示合拍视频的预览画面1101、播放选项1102、播放进度条1103以及视频修改选项1104。其中,视频修改选项1104可以包括多个,图11中仅示例性地示出了4个视频修改选项,分别为修改选项1、修改选项2、修改选项3和修改选项4。可以理解的是,视频修改选项的个数可以多于或者少于图示的4个,本申请实施例对此不进行具体限定。
在一种可能的实现方式中,合拍视频的预览画面1101可以为该合拍视频中的首帧、关键帧或随机选取的一个视频帧,本申请实施例对此不进行具体限定。
示例性地,针对播放选项的触发操作可以为用户对播放选项1102的点击操作。
408、终端响应于用户针对视频修改选项的触发操作,对合拍视频执行修改处理。
示例性地,针对视频修改选项的触发操作可以为用户对视频修改选项1104的点击操作。在一种可能的实现方式中,视频修改选项1104可以包括但不限于:调整素材、添加文字、添加贴纸、添加滤镜、进行美颜等,本申请实施例对此不进行具体限定。
另外,终端除了显示合拍视频的预览画面1101、播放选项1102、播放进度条1103以及视频修改选项1104之外,还可以显示发布选项1105,用户通过触发该发布选项1105,可以将制作好的合拍视频发布到视频分享平台或个人主页,以供其他用户浏览或观看。
另外,若摄像头当前拍摄到的视频画面与当前显示的提示消息不匹配,即若用户执行的相关操作或动作与当前显示的提示消息不符,则终端可以在拍摄界面上显示错误提示消息;其中,该错误提示消息用于引导用户重新进行视频拍摄。另外,除了显示文字或图标形式的提示消息之外,还可以播放语音形式的提示消息,本申请实施例对此不进行具体限定。
本申请实施例提供的方法至少具有以下有益效果:
终端可以在用户选中播放的视频的播放界面上显示视频合拍选项;之后,终端可以响应于用户对该视频合拍选项的触发操作,进行视频拍摄;而在视频拍摄过程中,终端会自动在拍摄界面上显示提示消息;即,该提示消息会呈现在用户的拍摄界面中,以此来引导用户快速且保质地完成视频拍摄。最终,基于对原始视频中的目标角色及其他角色的识别,将当前拍摄得到的视频融合至原始视频的内容中生成合拍视频,实现视频合拍,该种视频生成方法不但能够取得优质拍摄效果,还能显著降低拍摄成本。比如在镜头呈现上和人物表演上能够达到较高的水平,同时还加快了视频拍摄的完成速度,节约了时间成本和人力成本。
即,在视频合拍场景下,本申请实施例通过对视频画面内容进行分析,能够对外输出有利于用户拍摄的提示消息,进而帮助用户快速地投入到视频的创作过程。也即,本申请实施例以分析视频画面内容为前提,通过向用户展示提示消息来引导用户拍摄,其中,该提示消息包含的内容丰富,比如包含摄像头拍摄方式、人体姿态和角色对白中的一种或多种。
另外,该种基于场景融合的视频合拍方案,由于将原始的第一视频和用户拍摄的第二视频合成处理为一路视频,即在画面呈现上合拍视频仅包括一路画面,实现的是在时间顺序上将原始的第一视频和用户拍摄的第二视频的线性穿插播放,确保了视频的无缝衔接创作效果,该种视频创作过程更加友好。换言之,通过该种视频合拍方案,实现了在围绕现有视频画面内容的基础上,使得用户能够以更加自然、更加沉浸的方式投入到视频创作过程,使得最终的合拍视频从用户角度来看具有更好的融合性,也即合拍视频在内容呈现上和人物表演上与原始视频更为契合,避免了两路视频之间的生硬拼接。
示例性地,下面通过图12对“在时间顺序上原始的第一视频和用户拍摄的第二视频呈线性穿插播放”进行说明。其中,图12中示出了在合拍视频中截取到的几个视频画面,这几个视频画面从左到右是按照时间顺序依次排序的。在图12中,视频画面1201和视频画面1203来自于原始的第一视频,而视频画面1202、视频画面1204和视频画面1205来自于用户拍摄的第二视频,而视频画面1206是通过对第一视频中相应视频画面包含的目标角色进行换脸得到的,即将目标角色的面部图像替换为用户的面部图像。由于在合拍视频的播放过程中,图12中的几个视频画面以在时间顺序上由左到右顺次呈现的,由于原始视频画面和用户拍摄视频穿插播放,因此该种视频合拍方案实现了原始的第一视频和 用户拍摄的第二视频的基于场景融合。
在另一个实施例中,上述步骤403中“从第一视频中筛选出包括用户选中的目标角色的M个目标视频片段”,在一种可能的实现方式中,从第一视频中筛选包括目标角色的目标视频片段的步骤,既可以由服务器执行,也可以由终端自己执行,本申请实施例对此不进行具体限定。针对服务器执行视频片段筛选的方式,参见图13,包括如下步骤:
1301、终端将用户选中的目标角色的角色ID上传至服务器。
其中,角色ID可以为角色的姓名、角色的头像、终端和服务器协商一致的角色代号(比如字符)等,本申请实施例对此不进行具体限定。
1302、服务器在接收到目标角色的角色ID后,确定目标角色在第一视频中出现的目标时间点;对目标时间点进行关键帧标记得到视频打点信息;将该视频打点信息和目标时间点返回给终端。
示例性地,在确定目标角色在第一视频中出现的目标时间点时,可以采取下述方式实现:首先在第一视频中确定包括目标角色人脸的视频帧,之后并获取上述视频帧对应的时间点,即可得到目标角色在第一视频中出现的目标时间点。
其中,在第一视频中检测目标角色出现的目标时间点时,可以对第一视频中包括的每个视频帧分别进行目标角色人脸识别,进而得到上述包括目标角色人脸的视频帧。另外,为了提高效率,还可以间隔较短的一段时间进行一次目标角色人脸识别,即在多个较密集的指定时间点使用人脸识别算法,确定指定时间点是否存在目标角色人脸,并输出存在目标角色人脸的一系列时间点,即一组时间点列,即代表在第一视频的上述时间点出现了目标角色人脸。其中,确定出来的时间点可以按照先后顺序依次排序,本申请实施例对此不进行具体限定。
另外,在第一视频中确定出目标角色出现的目标时间点后,服务器还可以根据目标时间点对第一视频进行打点,进而得到视频打点信息。
简言之,视频打点即关键帧标记,是视频在播放时将光标放在播放进度条上会显现视频接下来的内容。即,当控制光标移动到播放进度条上的某个点上时,会自动显示出在该点上所播放的视频内容。视频打点通过将视频中的关键内容点标记出来,以方便用户快速浏览到其想看的内容。
基于以上描述可知,进行视频打点可以是对确定出来的目标时间点进行关键帧标记,即在这些确定出来的目标时间点中再进一步地确定关键帧所对应的目标时间点。其中,关键帧通常指代角色运动或姿态变化中关键动作或姿态所处的那一帧。示例性地,在识别关键帧时可以通过相邻帧之间的变化程度来确定,本申请实施例对此不进行具体限定。
1303、终端根据视频打点信息和目标时间点在第一视频中切分出M个目标视频片段。
在一种可能的实现方式中,终端在第一视频中进行与目标角色关联的目标视频片段的切分时,包括但不限于如下方式:比如,在切分目标视频片段时,可以将切分出来的每个目标视频片段中至少包括一个视频打点(一个关键帧)作为前提。又比如,还可以选 择将出现在两个视频打点之间的目标时间点划分在同一个目标视频片段内,即终端可以将关键帧对应的目标时间点作为视频片段的划分依据,也即出现在两个关键帧对应的目标时间点之间的那些目标时间点属于同一个目标视频片段,本申请实施例对此不进行具体限定。
另外,参见图14,针对终端执行目标视频片段筛选的方式,包括如下步骤:
1401、终端确定目标角色在第一视频中出现的目标时间点。
1402、终端对目标时间点进行关键帧标记,得到视频打点信息。
1403、终端根据得到的视频打点信息和目标时间点,在第一视频中切分出M个目标视频片段。
关于步骤1401至步骤1403的实施可以参考上述步骤1301至步骤1303。
在另一个实施例中,针对上述步骤405中的“对与目标角色相关的每个目标视频片段进行画面内容分析”,该步骤既可以由服务器执行,也可以由终端自己执行,本申请实施例对此不进行具体限定。在一种可能的实现方式中,参见图15,对与目标角色相关的每个目标视频片段进行画面内容分析,包括但不限于如下步骤:
1501、针对每个目标视频片段,利用人体姿态检测技术分析该目标视频片段中用户选中的目标角色的人体姿态。
如前文所述,人体姿态可以包括面部表情、面部朝向和肢体动作中的一种或多种。在一种可能的实现方式中,上述步骤1501可以进一步地包括:
1501-1、针对每个目标视频片段,通过人体关键点检测网络,根据该目标视频片段,确定该目标视频片段中目标角色的人体关键点。
示例性地,该人体关键点检测网络可以基于OpenPose算法,OpenPose算法是一种基于双分支多级CNN(Convolutional Neural Networks,卷积神经网络)的体系结构的深度学习算法,主要是通过图像识别的方法来检测人体关键点。换言之,OpenPose算法是一个人体关键点检测框架,它能够在图片中检测躯体、手指、面部总共多达135个关键点。并且检测速度很快,能够达到实时检测效果。
以OpenPose算法为例,可以将每个目标视频片段包括的视频帧输入人体关键点检测网络,而该人体关键点检测网络可以首先通过VGG-19的骨干网络得到特征信息,而后通过6个阶段不断优化,每个阶段有2个分支,其中一个分支用来得到人体关键点坐标的热图(heatmaps),另一个分支用来得到人体关键点之间肢体意义的起点指向终点的方向向量PAFs。之后将PAFs转化成二分图,并采用诸如匈牙利算法求解二分图匹配问题,从而得到图片中人物的人体关键点。
其中,利用该算法检测到的人体关键点可以实现分析人物的面部表情、面部朝向、肢体动作,甚至还可以跟踪人物手指的运动。示例性地,在进行人体姿态估计时,可以如图16所示,通过将检测到的人体关键点按照一定规则连接起来,实现估计人体姿态。其中,图16示出了三种不同的人体姿态,分别为双手叉腰的站立姿态1601、奔跑姿态1602和双手抱在胸前的站立姿态1603。
1501-2、按照面部不同部位之间的相对位置关系,将人体关键点中的面部关键点进 行连接,得到面部构架模型;根据面部架构模型,确定目标角色在目标视频片段中面部表情和面部朝向。
示例性地,该面部不同部位之间的相对位置关系,即是按照人脸的基本结构,比如下巴、嘴巴、鼻子、眼睛以及眉毛的基本位置规则,将面部特征点依次进行连接,生成面部构架模型,而该面部构架模型便能够反映出用户的面部表情和面部朝向。
1501-3、按照肢体不同部位之间的相对位置关系,将人体关键点中的肢体关键点进行连接,得到肢体构架模型;根据肢体架构模型,确定目标角色在目标视频片段中肢体动作。
示例性地,该肢体不同部位之间的相对位置关系,即按照人体肢体的基本结构,比如颈部、肩部、肘部、腕部、手指、腰部、膝部以及脚踝的基本位置规则,将肢体关键点依次进行连接,生成肢体构架模型,而该肢体构架模型可以反映出用户的肢体动作,尤其是用户手指的精确动作。
本申请实施例通过分析第一视频中用户选中的目标角色的面部表情(比如喜、怒、哀、乐等)、面部朝向(比如面朝正前方或右侧等)、肢体动作(比如抬胳膊、踢腿、等)等信息,作为对视频画面内容的解读,并将这些信息以提示消息的方式通过UI展示给用户,实现了直观且清晰地引导用户完成拍摄。
1502、针对每个目标视频片段,获取该目标视频片段中目标物的运动方向变化信息和大小变化信息;根据目标物在该目标视频片段中的运动方向变化信息和大小变化信息,确定该目标视频片段对应的摄像头拍摄方式。
本步骤通过基于灰度图像的检测和跟踪算法,来对视频画面中出现的移动目标物(比如视频画面中出现的人物)进行检测和跟踪,从而分析判断出移动目标物在视频画面中的运动方向趋势和大小变化趋势,并据此反推出该视频画面的摄像头拍摄方式。换言之,通过分析移动目标物在视频画面中的运动方向趋势和大小变化趋势,从而辅助判定出相应的视频画面中镜头是如何运动的。而通过此种方式辅助判定摄像头拍摄方式,并通过UI展示在用户的拍摄界面中,实现了对用户进行有效的拍摄引导。
简单来说,基于灰度图像的检测和跟踪算法,即是:首先识别视频画面中的目标物轮廓;之后,将多帧视频画面图像转换为灰色图像,并通过对相邻帧的灰色图像进行分析计算,来完成目标物的检测与跟踪。示例性地,参见图17,该检测和跟踪算法的大体流程包括但不限于:
首先,定义MainWin类1701、Process类1702、Tracker类1703。其中,MainWin类1701用于执行摄像头初始化,绘制图形界面,从摄像头中读取下一帧彩色图像交给Process类1702进行处理。其中,Process类1702用于执行将下一帧彩色图像图转换成灰度图像,并将当前转换后的灰色图像与上一帧灰度图像相差;其中,由于简单的帧差法往往难以以达到检测精度,因此可以选择采用相差后图像的水平和垂直投影完成检测。即,对相差后图像分别进行水平和垂直投影,并据此计算出一个水平方向阈值和一个垂直方向阈值,该水平方向阈值和该垂直方向阈值用于分割目标物;并根据该水平方向阈值和该垂直方向阈值确定目标物的水平坐标和垂直坐标,并根据该水平坐标和该垂直坐 标绘制出目标物的矩形跟踪框。而Tracker类1703用于执行对目标物的跟踪,首先分析目标物是否为新出现的目标,或者,是在之前的图像帧中已经存在并且在当前的图像帧中继续移动的目标物(Target),然后分别对不同的分析结果执行相应的操作。比如,如果该目标物为之前检测到的目标物,则将该目标物标志为已匹配并加入到链尾,如果该目标物之前未检测到,则为新出现的该目标物创建一个空链。其中,为了后续过程的跟踪,通常会为每个新出现的目标物均创建一条空链。
另外,举例来说,根据目标物在每个目标视频片段中的运动方向变化信息和大小变化信息,确定目标视频片段对应的摄像头拍摄方式,可以为:比如,相邻的两帧之间目标物的灰度图像在逐渐变大,则说明此时是推镜头运动;又比如,若当前目标物的灰度图像逐渐向画面左侧移动,则说明此时对应的镜头运动为向右摇镜头。另外,此处的目标物可以是用户所选中的目标角色,本申请实施例对此不进行具体限定。
1503、针对每个目标视频片段,对目标角色在该目标视频片段中的语音数据进行识别,得到目标角色在该目标视频片段中的角色对白。
在本申请实施例中,还可以通过语音识别技术,针对每个目标角色出现的目标视频片段,识别其中是否包括与目标角色相关的角色对白,如果存在与目标角色相关的角色对白,则会在拍摄界面上进行UI展示,以告知用户在拍摄时所需要读出的文字内容。
另外,在执行视频合成处理时,如果用户选中的目标角色与其他角色同框了,则还包括一个换脸的操作。在一种可能的实现方式中,执行换脸操作可以采用Deepfake技术。
其中,Deepfake技术由“deep machine learning”(深度机器学习)和“fake photo”(假照片)组合而成,本质是一种深度学习模型在图像合成、替换领域的技术框架,属于深度图像生成模型的一次成功应用。在构建模型时使用了Encoder-Decoder自编解码架构,在测试阶段通过将任意扭曲的人脸进行还原,整个过程包含了:获取正常人脸照片→扭曲变换人脸照片→Encode编码向量→Decoder解码向量→还原正常人脸照片五个步骤。
总体上,Deepfake技术的换脸过程主要分为:人脸定位、人脸转换和图像拼接。其中,人脸定位即是抽取原人脸的特征点,例如左右眉毛、鼻子、嘴和下巴等。这些特征点大致描述了人脸的器官分布。示例性地,可以直接通过dlib和OpenCV等主流工具包直接进行抽取,这些工作包一般采用了经典的HOG(Histogram of Oriented Gradient,方向梯度直方图)的脸部标记算法。针对人脸转换,即是采用GAN或VAE等生成模型,它的目标是生成拥有A表情的B脸。最后的图像拼接则是将人脸融合到原图的背景,从而达到只改变人脸的效果。另外,如果处理的对象是视频,那么还需要一帧帧地处理图像,然后再将处理后的结果重新拼接成视频。
其中,图18示出了Deepfake技术涉及的主要架构,如图18所示,该架构主要包括三部分,分别为编码器1801、生成器1802和判别器1803。针对编码器1801,输入视频和该视频的landmarks(对人脸关键点连接成线得到),输出一个N维向量。编码器1801的作用是学习到一个视频的特有信息(比如这个人的身份不变性),同时希望具有姿态的不变性。可以认为和人脸识别网络一样,一个视频对应一个特征,视频中的人脸图像的 特征应该和整个视频的特征距离不大;而不同视频间的特征距离差很大。生成器1802用于基于landmarks生成假图像。值得关注的是,生成器1802的一部分输入来自于编码器1801。比如,生成器1802根据landmarks给出的脸型,利用编码器1801学习到的特定的人脸信息按照给定的脸型补全,从而实现换脸的效果。针对判别器1803,包括两个部分,其中一部分是编码器网络,将图像编码为向量;另外还包括一个将参数W和向量相乘的操作。
本申请实施例通过上述几种技术,可以实现对第一视频中用户所选中的目标角色的人体姿态、角色对白和摄像头的摄像头拍摄方式进行分析判定,从而通过对提示消息进行UI展示,实现更友好地帮助用户完成视频拍摄,可以显著增强用户拍摄视频对原始视频的还原度,从而提升内容合成的真实感。
下面对本申请实施例提供的视频生成方法的整体执行流程进行描述。
以服务器执行视频片段筛选、对原始视频进行画面内容分析为例,则整体执行流程可以依托三个部分来实现,即:用户侧、终端侧和服务器侧。其中,围绕用户操作流程,会在终端侧与服务器侧之间产生相应的技术能力匹配。针对终端侧,可以包括如下处理:面部识别、视频片段生成预览、UI元素下发、摄像头调用、视频合成等;针对服务器侧可以包括如下处理:对视频时间进行打点、分析视频内容(如:面部朝向、面部表情、镜头运动和肢体动作等)等。
参见图19,本申请实施例提供的方法流程包括:
1901、原始视频的播放过程中,用户通过在终端上执行针对拍摄界面上显示的视频合拍选项的触发操作,启动视频合拍功能并激活终端执行人脸识别。相应地,终端在原始视频中进行人脸识别并将识别到的人脸按照角色ID进行分类,以及,将角色ID呈现在拍摄界面上,以供用户进行角色选择。
1902、用户进行角色选择,相应地,终端将用户选中的目标角色的角色ID上传到服务器。
1903、服务器根据终端上传的角色ID,分析运算出原始视频中目标角色出现的目标时间点;以及,根据该角色ID出现的目标时间点执行视频打点处理,并将该角色ID出现的目标时间点和视频打点信息返回给终端,以供终端生成与目标角色关联的至少一个目标视频片段,并将这些目标视频片段的预览画面呈现给用户,以供用户预览其选中的目标角色出现的目标视频片段。
1904、服务器对与目标角色关联的目标视频片段进行画面内容分析,得到目标角色在视频片段中的人体姿态、肢体动作和摄像头的摄像头拍摄方式,并将这些信息下发给终端;终端开启摄像头并将这些信息以UI元素的形式呈现给用户,以引导用户拍摄。
1905、终端基于用户拍摄视频对原始视频进行内容更新处理,得到合拍视频,并生成合拍视频的预览画面,以供用户预览合拍视频。
1906、用户在预览完成后,可以进行诸如视频发布等操作。
本申请实施例提供的方法,终端可以在用户观看的视频的播放界面上显示视频合拍选项;之后,终端可以响应于用户对该视频合拍选项的触发操作,进行视频拍摄;在视 频拍摄过程中,终端可以自动在拍摄界面上显示提示消息,其中,该提示消息用于引导用户进行视频拍摄;即,提示消息会呈现在用户的拍摄界面中,以此来引导用户快速且保质地完成视频拍摄。最终,通过对原始视频中的目标角色及其他角色进行识别,并将当前拍摄得到的视频融合至原始视频的视频内容中,实现视频合拍,该种视频生成方法不但能够取得优质拍摄效果,还可以显著降低拍摄成本。在镜头呈现上和人物表演上能够达到较高的水平,同时还加快了视频拍摄的完成速度,节约了时间成本和人力成本。
即,在视频合拍场景下,本申请实施例通过对视频画面内容进行分析,能够对外输出有利于用户拍摄的提示消息,进而帮助用户快速地投入到视频的创作过程。也即,本申请实施例以分析视频画面内容为前提,通过向用户展示提示消息来引导用户拍摄,其中,该提示消息包含的内容丰富,比如包含摄像头拍摄方式、人体姿态和角色对白中的一种或多种。
另外,该种基于场景融合的视频合拍方案,由于将原始视频和用户拍摄的视频合成处理为一路视频,即在画面呈现上合拍视频仅包括一路画面,实现的是时间顺序上原始视频和用户拍摄视频的线性穿插播放,确保了视频的无缝衔接创作效果,该种视频创作过程更加友好。换言之,通过该种视频合拍方案,在围绕现有视频画面内容的基础上,用户能够以更加自然、更加沉浸的方式投入到视频创作过程,使得最终的合拍视频从用户角度看来具有更好的融合性,也即合拍视频在内容呈现上和人物表演上与原始视频更为契合,避免了两路视频之间的生硬拼接。
作为一个示例,图20至28示出了基于本申请实施例提供的视频生成方法实现的视频合拍的产品效果图。现结合图20至28对本申请实施例提供的视频生成方法进行描述。
图20示出了原始视频的播放界面2000,在该播放界面2000上显示有一个“我要合拍”的视频合拍选项,当用户触发该视频合拍选项后,便会显示图21所示的用户界面2100,该用户界面2100上显示了两个角色选项,分别为角色A和角色B,用户可以选择这两个角色中的任意一个角色进行替换拍摄。比如,在用户点击视频合拍选项后,终端可以弹窗提示该视频中可参与拍摄的角色有两个,用户可以选择其中一个角色进行替换,即由用户来表演被选中角色的画面内容。作为一个示例,角色A的角色选项和角色B的角色选项,可分别用各自对应的角色图片来呈现。
如果用户选择图21示出的用户界面2100选择了其中一个角色(比如选择角色A),则终端会在图22呈现的播放界面2200上显示包括角色A的4个视频片段各自的预览画面。其中,这4个视频片段是从原始视频中筛选出来的包括角色A的视频片段,而用户可以随意观看这些视频片段。示例性地,这4个视频片段的预览画面可以平铺方式或列表方式呈现在播放界面上,而4个视频片段的预览画面可以为每个视频片段的首帧或关键帧或随机选取的一个视频帧,本申请实施例对此不进行具体限定。
如图23至图26所示,在视频拍摄过程中,终端可以将需要用户模仿表演的视频片段显示在用户界面的左上角,既达到对用户进行提示的目的,同时还不对用户界面进行过多占用。另外,除了左上角之外,还可以将需要用户模仿表演的视频片段显示在用户界面的右上角、左下角或右下角等位置,本申请实施例对此不进行具体限定。
在本申请实施例中,图23至图26还示出了在用户界面上显示不同类型提示消息。
针对图23,终端通过对原始视频进行画面内容分析得知用户此时需要面朝右方进行拍摄,那么在用户界面2300上便会显示相应的提示消息,以引导用户拍摄,从而使得用户拍摄的视频画面与原始视频中的角色及画面逻辑有更好的匹配度。如图23所示,此时用户界面2300上呈现的提示消息包括:面部朝向的提示图标和提示文字“面朝右方”。
针对图24,为了确保用户可以真实地还原拍摄过程,以保持与原始视频的较高一致性,提示消息还可以包括摄像头拍摄方式。如图24所示,通过对原始视频进行画面内容分析可知,当前为推镜头画面,那么终端在用户界面2400上将会呈现摄像头拍摄方式的提示图标(图24中箭头)和提示文字(画面推进),从而告知用户该如何进行镜头的操控。另外,与此同时,终端还可以对与当前拍摄进度相匹配的角色对白进行展示,以告知用户在拍摄时需要读出的文字内容。其中,图24中示出了用户在将画面进行推进的同时,还需要读出“我们一起拍合照?”的角色对白。
针对图25,提示消息还可以包括肢体动作。比如,通过对原始视频进行画面内容分析可知,当前角色的左侧胳膊抬起来了,那么终端也会同步在用户界面2500上进行肢体动作的展示,即在用户界面2500上展示肢体动作的提示图标和提示文字。如图25所示,该提示图标可以为“运动中的小人”,该提示文字可以为“抬起左侧胳膊”。另外,在用户执行这个肢体动作的同时还需要读出“真的吗?”的角色对白。
针对图26,提示消息还可以包括面部表情。即,用户界面2600上还可以展示面部表情的提示图标和提示文字。比如,通过对原始视频进行画面内容分析可知,当前角色面朝右侧微笑,那么终端也会同步在用户界面2600上进行面部表情的展示,即在用户界面上展示面部表情的提示图标和提示文字。如图26所示,该提示图标可以为“笑脸”,该提示文字可以为“面朝左侧微笑”。
如图27所示,在生成合拍视频后,终端会在用户界面2700上显示合拍视频的预览画面播放选项、播放进度条以及视频修改选项。其中,视频修改选项可以包括多个,图11中仅示例性地示出了5个视频修改选项,分别为调整素材、文字、贴纸、滤镜和美颜。可以理解的是,视频修改选项的个数可以多于或者少于图示的5个,本申请实施例对此不进行具体限定。
示例性地,下面通过图28对“在时间顺序上原始视频和用户拍摄视频呈线性穿插播放”进行说明。其中,图28中示出了在合拍视频中截取到的几个视频画面,这几个视频画面从左到右是按照时间顺序依次排序的。在图28中,按照从左至右的顺序对这几个视频画面进行排序1至7,则视频画面1、视频画面3和视频画面5来自于原始视频,而视频画面2、视频画面4和视频画面6来自于用户拍摄视频,而视频画面7是通过对原始视频中相应视频画面包含的目标角色进行换脸得到,即将目标角色的面部图像替换为用户的面部图像。由于在合拍视频的播放过程中,图28中的几个视频画面以在时间顺序上由左到右顺次呈现的,由于原始视频画面和用户拍摄视频穿插播放,因此该种视频合拍方案实现了原始视频和用户拍摄视频的基于场景融合。
图29是本申请实施例提供的一种视频生成装置的结构示意图。参见图29,该装置包 括:
第一处理模块2901,用于响应于针对视频合拍选项的触发操作,进行视频拍摄;
视频获取模块2902,用于获取当前拍摄得到的第二视频;所述第二视频对应于第一视频中包括目标角色的视频片段;
第二处理模块2903,用于基于对所述第一视频中的目标角色及其他角色的识别,将所述第二视频融合至所述第一视频的视频内容中,获得合拍视频。
在一种可能的实现方式中,该装置还包括:
消息获取模块,用于基于对所述第一视频的画面内容的识别,获取提示消息;所述提示消息用于指导拍摄所述第二视频;
第一显示模块,用于在视频拍摄过程中,在拍摄界面上显示所述提示消息。
在一种可能的实现方式中,所述第二处理模块,用于若所述第一视频中不包括目标角色和其他角色的同框画面,则利用所述第二视频替换所述第一视频中包括所述目标角色的视频片段。
在一种可能的实现方式中,所述第二处理模块,用于若所述第一视频中包括所述目标角色和其他角色的同框画面,则利用所述第二视频中的用户面部图像替换所述同框画面中的目标角色面部图像。
在一种可能的实现方式中,所述提示消息包括摄像头拍摄方式、人体姿态和角色对白中的一种或多种;所述第一显示模块,用于执行以下一项或多项:在所述拍摄界面上显示所述摄像头拍摄方式的提示图标和提示文字中的至少一种;在所述拍摄界面上显示所述人体姿态的提示图标和提示文字中的至少一种;其中,所述人体姿态包括面部表情、面部朝向和肢体动作中的一种或多种;在所述拍摄界面上显示所述角色对白。
在一种可能的实现方式中,所述第一视频中包括N个角色,N为正整数且N≥2,所述装置还包括:
第二显示模块,用于响应于针对所述视频合拍选项的触发操作,在进行视频拍摄之前,在所述第一视频的播放界面上显示N个角色选项;
第三处理模块,用于响应于针对所述N个角色选项中的目标角色选项的触发操作,在所述第一视频中筛选出包括目标角色的M个视频片段作为目标视频片段;其中,M为正整数。
在一种可能的实现方式中,所述第二显示模块,还用于在所述播放界面上显示每个所述目标视频片段的预览画面;
所述第三处理模块,还用于响应于针对指定目标视频片段的预览画面的触发操作,播放所述指定目标视频片段。
在一种可能的实现方式中,所述装置还包括:
第三显示模块,用于在所述拍摄界面上悬浮显示视频窗口;其中,所述视频窗口用于显示所述第一视频中所述提示消息对应的视频片段。
在一种可能的实现方式中,所述装置还包括:
第四显示模块,用于在生成所述合拍视频后,显示所述合拍视频的预览画面、播放 选项、播放进度条以及视频修改选项;
第四处理模块,用于响应于针对所述播放选项的触发操作,播放所述合拍视频;
所述第四显示模块,还用于通过所述播放进度条显示所述合拍视频的播放进度;
第五处理模块,用于响应于针对所述视频修改选项的触发操作,对所述合拍视频执行修改处理。
在一种可能的实现方式中,所述第三处理模块,用于确定所述目标角色在所述第一视频中出现的目标时间点;对所述目标时间点进行关键帧标记得到视频打点信息;根据所述视频打点信息和所述目标时间点,在所述第一视频中切分出所述M个目标视频片段。
在一种可能的实现方式中,所述消息获取模块,具体用于对每个所述目标视频片段进行画面内容分析,得到每个所述目标视频片段对应的提示消息;
所述第一显示模块,具体用于在每个所述目标视频片段的拍摄过程中,在所述拍摄界面上显示与每个所述目标视频片段对应的提示消息。
在一种可能的实现方式中,所述消息获取模块,用于针对每个所述目标视频片段,通过人体关键点检测网络,根据所述目标视频片段,确定所述目标视频片段中所述目标角色的人体关键点;按照面部不同部位之间的相对位置关系,将所述人体关键点中的面部关键点进行连接,得到面部构架模型;根据所述面部架构模型,确定所述目标角色在所述目标视频片段中面部表情和面部朝向;按照肢体不同部位之间的相对位置关系,将所述人体关键点中的肢体关键点进行连接,得到肢体构架模型;根据所述肢体架构模型,确定所述目标角色在所述目标视频片段中肢体动作。
在一种可能的实现方式中,所述消息获取模块,具体用于针对每个所述目标视频片段,获取所述目标视频片段中目标物的运动方向变化信息和大小变化信息;根据所述目标物在所述目标视频片段中的运动方向变化信息和大小变化信息,确定所述目标视频片段对应的摄像头拍摄方式。
在一种可能的实现方式中,所述消息获取模块,具体用于针对每个所述目标视频片段,对所述目标角色在所述目标视频片段中的语音数据进行识别,得到所述目标角色在所述目标视频片段中的角色对白。
在一种可能的实现方式中,所述第一显示模块,还用于若所述摄像头当前拍摄的视频画面与当前显示的所述提示消息不匹配,则在所述拍摄界面上显示错误提示消息;其中,所述错误提示消息用于引导用户重新进行视频拍摄。
上述所有可选技术方案,可以采用任意结合形成本公开的可选实施例,在此不再一一赘述。
需要说明的是:上述实施例提供的视频生成装置在生成视频时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的视频生成装置与视频生成方法实施例属于同一构思,其具体实现过程详见方法实施例,这里不再赘述。
图30示出了本申请一个示例性实施例提供的电子设备3000的结构框图。该电子设备3000可用于执行上述方法实施例中的视频生成方法。
该设备3000可以是便携式移动终端,比如:智能手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。设备3000还可能被称为用户设备、便携式终端、膝上型终端、台式终端等其他名称。
通常,设备3000包括有:处理器3001和存储器3002。
处理器3001可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器3001可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器3001也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器3001可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器3001还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器3002可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器3002还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器3002中的非暂态的计算机可读存储介质用于存储至少一个程序代码,该至少一个程序代码用于被处理器3001所执行以实现本申请中方法实施例提供的视频生成方法。
在一些实施例中,设备3000还可选包括有:外围设备接口3003和至少一个外围设备。处理器3001、存储器3002和外围设备接口3003之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口3003相连。具体地,外围设备包括:射频电路3004、触摸显示屏3005、摄像头3006、音频电路3007、定位组件3008和电源3009中的至少一种。
图31是本申请实施例提供的一种电子设备的结构示意图,该电子设备3100可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(Central Processing Units,CPU)3101和一个或一个以上的存储器3102,其中,所述存储器3102中存储有至少一条程序代码,所述至少一条程序代码由所述处理器3101加载并执行以实现上述各个方法实施例提供的视频生成方法。当然,该电子设备还可以具有有线或无线网络接口、键盘以及输入输出接口等部件,以便进行输入输出,该电子设备还可以包括其他用于实现设备功能的部件,在此不做赘述。
在示例性实施例中,还提供了一种计算机可读存储介质,例如包括程序代码的存储器,上述程序代码可由终端中的处理器执行以完成上述实施例中的视频生成方法。例 如,所述计算机可读存储介质可以是只读存储器(Read-Only Memory,ROM)、随机存取存储器(Random Access Memory,RAM)、光盘只读存储器(Compact Disc Read-Only Memory,CD-ROM)、磁带、软盘和光数据存储设备等。
在示例性实施例中,还提供了一种计算机程序产品或计算机程序,该计算机程序产品或计算机程序包括计算机程序代码,该计算机程序代码存储在计算机可读存储介质中,电子设备的处理器从计算机可读存储介质读取该计算机程序代码,处理器执行该计算机程序代码,使得该电子设备执行上述实施例中的视频生成方法。

Claims (19)

  1. 一种视频生成方法,由电子设备执行,所述方法包括:
    响应于针对视频合拍选项的触发操作,进行视频拍摄;
    获取当前拍摄得到的第二视频;所述第二视频对应于第一视频中包括目标角色的视频片段;
    基于对所述第一视频中的目标角色及其他角色的识别,将所述第二视频融合至所述第一视频的视频内容中,获得合拍视频。
  2. 根据权利要求1所述的方法,所述方法还包括:
    基于对所述第一视频的画面内容的识别,获取提示消息;所述提示消息用于指导拍摄所述第二视频;
    在视频拍摄过程中,在拍摄界面上显示所述提示消息。
  3. 根据权利要求1所述的方法,所述将所述第二视频融合至所述第一视频的视频内容中,获得合拍视频,包括:
    若所述第一视频中不包括所述目标角色和其他角色的同框画面,则利用所述第二视频替换所述第一视频中包括所述目标角色的视频片段。
  4. 根据权利要求1所述的方法,所述将所述第二视频融合至所述第一视频的视频内容中,获得合拍视频,包括:
    若所述第一视频中包括所述目标角色和其他角色的同框画面,则利用所述第二视频中的用户面部图像替换所述同框画面中的所述目标角色的面部图像。
  5. 根据权利要求2所述的方法,所述提示消息包括摄像头拍摄方式、人体姿态和角色对白中的一种或多种;所述在拍摄界面上显示所述提示消息,包括:
    在所述拍摄界面上显示所述摄像头拍摄方式的提示图标和提示文字中的至少一种;
    在所述拍摄界面上显示所述人体姿态的提示图标和提示文字中的至少一种;所述人体姿态包括面部表情、面部朝向和肢体动作中的一种或多种;
    在所述拍摄界面上显示所述角色对白。
  6. 根据权利要求1或2所述的方法,所述第一视频中包括N个角色,所述N为大于或者等于2的整数,所述N个角色包括所述目标角色;所述方法还包括:
    响应于针对所述视频合拍选项的触发操作,在进行视频拍摄之前,在所述第一视频的播放界面上显示所述N个角色各自对应的角色选项;
    响应于针对所述目标角色对应的角色选项的触发操作,从所述第一视频中筛选出包括所述目标角色的M个视频片段作为目标视频片段;所述M为正整数。
  7. 根据权利要求6所述的方法,所述方法还包括:
    在所述播放界面上显示每个所述目标视频片段的预览画面;
    响应于针对指定目标视频片段的触发操作,播放所述指定目标视频片段。
  8. 根据权利要求2所述的方法,所述方法还包括:
    在所述拍摄界面上悬浮显示视频窗口;所述视频窗口用于显示所述第一视频中与所述提示消息对应的视频片段。
  9. 根据权利要求1所述的方法,所述方法还包括:
    在获得所述合拍视频后,显示所述合拍视频的预览画面、播放选项、播放进度条以及视频修改选项;
    响应于针对所述播放选项的触发操作,播放所述合拍视频,并通过所述播放进度条显示所述合拍视频的播放进度;
    响应于针对所述视频修改选项的触发操作,对所述合拍视频进行修改处理。
  10. 根据权利要求6所述的方法,所述从所述第一视频中筛选出包括所述目标角色的M个视频片段作为目标视频片段,包括:
    确定所述目标角色在所述第一视频中出现的目标时间点;
    对所述目标时间点进行关键帧标记,得到视频打点信息;
    根据所述视频打点信息和所述目标时间点,在所述第一视频中切分出所述M个目标视频片段。
  11. 根据权利要求2所述的方法,所述基于对所述第一视频的画面内容的识别,获取提示消息,包括:
    对所述第一视频中每个包括所述目标角色的目标视频片段进行画面内容分析,得到每个所述目标视频片段对应的提示消息;
    所述在视频拍摄过程中,在拍摄界面上显示所述提示消息,包括
    在每个所述目标视频片段的拍摄过程中,在所述拍摄界面上显示每个所述目标视频片段对应的提示消息。
  12. 根据权利要求11所述的方法,所述对所述第一视频中每个包括所述目标角色的目标视频片段进行画面内容分析,包括:
    针对每个所述目标视频片段,通过人体关键点检测网络,确定所述目标视频片段中所述目标角色的人体关键点;
    按照面部不同部位之间的相对位置关系,将所述人体关键点中的面部关键点进行连接,得到面部构架模型;根据所述面部架构模型,确定所述目标角色在所述目标视频片段中面部表情和面部朝向;
    按照肢体不同部位之间的相对位置关系,将所述人体关键点中的肢体关键点进行连接,得到肢体构架模型;根据所述肢体架构模型,确定所述目标角色在所述目标视频片段中肢体动作。
  13. 根据权利要求11所述的方法,所述对所述第一视频中每个包括所述目标角色的目标视频片段进行画面内容分析,包括:
    针对每个所述目标视频片段,获取所述目标视频片段中目标物的运动方向变化信息和大小变化信息;根据所述目标物在所述目标视频片段中的运动方向变化信息和大小变化信息,确定所述目标视频片段对应的摄像头拍摄方式。
  14. 根据权利要求11所述的方法,所述对所述第一视频中每个包括所述目标角色的目标视频片段进行画面内容分析,包括:
    针对每个所述目标视频片段,对所述目标角色在所述目标视频片段中的语音数据进行 识别,得到所述目标角色在所述目标视频片段中的角色对白。
  15. 根据权利要求2所述的方法,所述方法还包括:
    若所述摄像头当前拍摄的视频画面与当前显示的所述提示消息不匹配,则在所述拍摄界面上显示错误提示消息。
  16. 一种视频生成装置,所述装置包括:
    第一处理模块,用于响应于针对视频合拍选项的触发操作,进行视频拍摄;
    视频获取模块,用于获取当前拍摄得到的第二视频;所述第二视频对应于第一视频中包括目标角色的视频片段;
    第二处理模块,用于基于对所述第一视频中的目标角色及其他角色的识别,将所述第二视频融合至所述第一视频的视频内容中,获得合拍视频。
  17. 一种电子设备,所述设备包括处理器和存储器,所述存储器中存储有至少一条程序代码,所述至少一条程序代码由所述处理器加载并执行以实现如权利要求1至15中任一项权利要求所述的视频生成方法。
  18. 一种存储介质,所述存储介质中存储有至少一条程序代码,所述至少一条程序代码由处理器加载并执行以实现如权利要求1至15中任一项权利要求所述的视频生成方法。
  19. 一种计算机程序产品,包括指令,当其在计算机上运行时,使得计算机执行权利要求1至15中任一项权利要求所述的视频生成方法。
PCT/CN2021/098796 2020-07-03 2021-06-08 视频生成方法、装置、存储介质及计算机设备 WO2022001593A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US17/983,071 US20230066716A1 (en) 2020-07-03 2022-11-08 Video generation method and apparatus, storage medium, and computer device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010636852.5A CN111726536B (zh) 2020-07-03 2020-07-03 视频生成方法、装置、存储介质及计算机设备
CN202010636852.5 2020-07-03

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/983,071 Continuation US20230066716A1 (en) 2020-07-03 2022-11-08 Video generation method and apparatus, storage medium, and computer device

Publications (1)

Publication Number Publication Date
WO2022001593A1 (zh)

Family

ID=72571653

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/098796 WO2022001593A1 (zh) 2020-07-03 2021-06-08 视频生成方法、装置、存储介质及计算机设备

Country Status (3)

Country Link
US (1) US20230066716A1 (zh)
CN (1) CN111726536B (zh)
WO (1) WO2022001593A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023142432A1 (zh) * 2022-01-27 2023-08-03 腾讯科技(深圳)有限公司 基于增强现实的数据处理方法、装置、设备、存储介质及计算机程序产品
US11875556B2 (en) * 2020-06-12 2024-01-16 Beijing Bytedance Network Technology Co., Ltd. Video co-shooting method, apparatus, electronic device and computer-readable medium

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111726536B (zh) * 2020-07-03 2024-01-05 腾讯科技(深圳)有限公司 视频生成方法、装置、存储介质及计算机设备
CN114390190B (zh) * 2020-10-22 2023-10-31 海信视像科技股份有限公司 显示设备及监测应用启动摄像头的方法
CN112423112B (zh) * 2020-11-16 2023-03-21 北京意匠文枢科技有限公司 一种发布视频信息的方法与设备
CN112422831A (zh) * 2020-11-20 2021-02-26 广州太平洋电脑信息咨询有限公司 视频生成方法、装置、计算机设备和存储介质
CN112464786B (zh) * 2020-11-24 2023-10-31 泰康保险集团股份有限公司 一种视频的检测方法及装置
CN114816599B (zh) * 2021-01-22 2024-02-27 北京字跳网络技术有限公司 图像显示方法、装置、设备及介质
CN114915722B (zh) * 2021-02-09 2023-08-22 华为技术有限公司 处理视频的方法和装置
CN113114925B (zh) 2021-03-09 2022-08-26 北京达佳互联信息技术有限公司 一种视频拍摄方法、装置、电子设备及存储介质
CN113362434A (zh) * 2021-05-31 2021-09-07 北京达佳互联信息技术有限公司 一种图像处理方法、装置、电子设备及存储介质
CN115442538A (zh) * 2021-06-04 2022-12-06 北京字跳网络技术有限公司 一种视频生成方法、装置、设备及存储介质
CN117652148A (zh) * 2021-09-08 2024-03-05 深圳市大疆创新科技有限公司 拍摄方法、拍摄系统及存储介质
CN113783997B (zh) * 2021-09-13 2022-08-23 北京字跳网络技术有限公司 一种视频发布方法、装置、电子设备及存储介质
CN113946254B (zh) * 2021-11-01 2023-10-20 北京字跳网络技术有限公司 内容显示方法、装置、设备及介质
CN114500851A (zh) * 2022-02-23 2022-05-13 广州博冠信息科技有限公司 视频录制方法及装置、存储介质、电子设备
CN116631042B (zh) * 2023-07-25 2023-10-13 数据空间研究院 表情图像生成、表情识别模型、方法、系统和存储器

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017063133A1 (zh) * 2015-10-13 2017-04-20 华为技术有限公司 一种拍摄方法和移动设备
CN108989691A (zh) * 2018-10-19 2018-12-11 北京微播视界科技有限公司 视频拍摄方法、装置、电子设备及计算机可读存储介质
CN109005352A (zh) * 2018-09-05 2018-12-14 传线网络科技(上海)有限公司 合拍视频的方法及装置
CN109982130A (zh) * 2019-04-11 2019-07-05 北京字节跳动网络技术有限公司 一种视频拍摄方法、装置、电子设备及存储介质
CN110121094A (zh) * 2019-06-20 2019-08-13 广州酷狗计算机科技有限公司 视频合拍模板的显示方法、装置、设备及存储介质
CN111726536A (zh) * 2020-07-03 2020-09-29 腾讯科技(深圳)有限公司 视频生成方法、装置、存储介质及计算机设备

Family Cites Families (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
FR2775813B1 (fr) * 1998-03-06 2000-06-02 Symah Vision Procede et dispositif de remplacement de panneaux cibles dans une sequence video
CN101076111B (zh) * 2006-11-15 2010-09-01 腾讯科技(深圳)有限公司 一种获取视频流中关键帧段定位信息的方法
US20100031149A1 (en) * 2008-07-01 2010-02-04 Yoostar Entertainment Group, Inc. Content preparation systems and methods for interactive video systems
JP5252068B2 (ja) * 2011-12-22 2013-07-31 カシオ計算機株式会社 合成画像出力装置および合成画像出力処理プログラム
CN103942751B (zh) * 2014-04-28 2017-06-06 中央民族大学 一种视频关键帧提取方法
EP3015146A1 (en) * 2014-10-28 2016-05-04 Thomson Licensing Method for generating a target trajectory of a camera embarked on a drone and corresponding system
TWI592021B (zh) * 2015-02-04 2017-07-11 騰訊科技(深圳)有限公司 生成視頻的方法、裝置及終端
CN106021496A (zh) * 2016-05-19 2016-10-12 海信集团有限公司 视频搜索方法及视频搜索装置
CN106067960A (zh) * 2016-06-20 2016-11-02 努比亚技术有限公司 一种处理视频数据的移动终端和方法
CN106686306B (zh) * 2016-12-22 2019-12-03 西安工业大学 一种目标跟踪装置和跟踪方法
CN107333071A (zh) * 2017-06-30 2017-11-07 北京金山安全软件有限公司 视频处理方法、装置、电子设备及存储介质
CN110502954B (zh) * 2018-05-17 2023-06-16 杭州海康威视数字技术股份有限公司 视频分析的方法和装置
CN110913244A (zh) * 2018-09-18 2020-03-24 传线网络科技(上海)有限公司 视频处理方法及装置、电子设备和存储介质
CN110264396B (zh) * 2019-06-27 2022-11-18 杨骥 视频人脸替换方法、系统及计算机可读存储介质
CN110264668A (zh) * 2019-07-10 2019-09-20 四川长虹电器股份有限公司 基于机器视觉技术的多策略老人看护方法
CN110290425B (zh) * 2019-07-29 2023-04-07 腾讯科技(深圳)有限公司 一种视频处理方法、装置及存储介质
CN110505513A (zh) * 2019-08-15 2019-11-26 咪咕视讯科技有限公司 一种视频截图方法、装置、电子设备及存储介质
CN110536075B (zh) * 2019-09-20 2023-02-21 上海掌门科技有限公司 视频生成方法和装置
CN110855893A (zh) * 2019-11-28 2020-02-28 维沃移动通信有限公司 一种视频拍摄的方法及电子设备

Also Published As

Publication number Publication date
US20230066716A1 (en) 2023-03-02
CN111726536B (zh) 2024-01-05
CN111726536A (zh) 2020-09-29

Similar Documents

Publication Publication Date Title
WO2022001593A1 (zh) 视频生成方法、装置、存储介质及计算机设备
US20200357180A1 (en) Augmented reality apparatus and method
US9626788B2 (en) Systems and methods for creating animations using human faces
US11615592B2 (en) Side-by-side character animation from realtime 3D body motion capture
JP2021192222A (ja) 動画インタラクティブ方法と装置、電子デバイス、コンピュータ可読記憶媒体、及び、コンピュータプログラム
US11734894B2 (en) Real-time motion transfer for prosthetic limbs
TWI752502B (zh) 一種分鏡效果的實現方法、電子設備及電腦可讀儲存介質
CN112822542A (zh) 视频合成方法、装置、计算机设备和存储介质
CN112199016B (zh) 图像处理方法、装置、电子设备及计算机可读存储介质
CN111638784B (zh) 人脸表情互动方法、互动装置以及计算机存储介质
US11582519B1 (en) Person replacement utilizing deferred neural rendering
CN114363689B (zh) 直播控制方法、装置、存储介质及电子设备
US11581020B1 (en) Facial synchronization utilizing deferred neural rendering
CN111491187A (zh) 视频的推荐方法、装置、设备及存储介质
US20080122867A1 (en) Method for displaying expressional image
US11581018B2 (en) Systems and methods for mixing different videos
CN116017082A (zh) 一种信息处理方法和电子设备
US20230334790A1 (en) Interactive reality computing experience using optical lenticular multi-perspective simulation
US20230334792A1 (en) Interactive reality computing experience using optical lenticular multi-perspective simulation
WO2024051467A1 (zh) 图像处理方法、装置、电子设备及存储介质
US20230334791A1 (en) Interactive reality computing experience using multi-layer projections to create an illusion of depth
WO2023130715A1 (zh) 一种数据处理方法、装置、电子设备、计算机可读存储介质及计算机程序产品
US20230291959A1 (en) Comment information display
WO2024039885A1 (en) Interactive reality computing experience using optical lenticular multi-perspective simulation
WO2024039887A1 (en) Interactive reality computing experience using optical lenticular multi-perspective simulation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21831541

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 09/06/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21831541

Country of ref document: EP

Kind code of ref document: A1