WO2021109678A1 - Video generation method and apparatus, electronic device, and storage medium - Google Patents

Video generation method and apparatus, electronic device, and storage medium

Info

Publication number
WO2021109678A1
Authority
WO
WIPO (PCT)
Prior art keywords
video
scene
information
person
text
Prior art date
Application number
PCT/CN2020/116452
Other languages
English (en)
French (fr)
Inventor
刘炫鹏
刘云峰
刘致远
文博
Original Assignee
深圳追一科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳追一科技有限公司
Publication of WO2021109678A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval of video data
    • G06F 16/78: Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/7867: Retrieval using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G06F 16/783: Retrieval using metadata automatically derived from the content
    • G06F 16/7837: Retrieval using objects detected or recognised in the video content
    • G06F 16/784: Retrieval where the detected or recognised objects are people
    • G06F 16/7844: Retrieval using original textual content or text extracted from visual content or transcript of audio data
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16: Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168: Feature extraction; Face representation

Definitions

  • This application relates to the technical field of electronic equipment, and more specifically, to a video generation method, device, electronic equipment, and storage medium.
  • The audio approach lets users obtain the information in a text without having to look at it, but it is monotonous and dull; it is difficult for users to grasp the specific content of the text together with the environment and scene it describes, which degrades the user's experience of acquiring information.
  • a video generation method, device, electronic device, and storage medium are provided.
  • an embodiment of the present application provides a video generation method, and the method includes:
  • an embodiment of the present application provides a video generation device, the device includes:
  • the information input module is used to obtain the interactive information input by the user
  • a scene video acquisition module configured to acquire a scene video according to the interaction information, and the scene video includes a character to be matched
  • the face acquisition module is used to acquire the user's face information and extract the corresponding facial features as the target facial features;
  • a video generation module configured to replace the facial features of the person to be matched in the scene video with the target facial feature to generate the video to be played;
  • the output module is used to output the to-be-played video.
  • an embodiment of the present application provides an electronic device, and the electronic device includes:
  • one or more processors;
  • a memory electrically connected to the one or more processors;
  • one or more application programs, wherein the one or more application programs are stored in the memory and configured to be executed by the one or more processors, and the one or more application programs are configured to implement the following operations:
  • the embodiment of the present application provides a computer-readable storage medium with program code stored in the computer-readable storage medium, and when the program code is invoked and executed by a processor, the following operations are implemented:
  • Fig. 1 shows a flowchart of a video generation method provided by an embodiment of the present application.
  • Fig. 2 shows a schematic diagram of replacing the facial features of a person to be matched provided by an embodiment of the present application.
  • Fig. 3 shows a flowchart of a video generation method provided by another embodiment of the present application.
  • Fig. 4 shows a schematic flow chart of generating a scene video according to video text information according to an embodiment of the present application.
  • Fig. 5 shows a flowchart of a video generation method provided by another embodiment of the present application.
  • Fig. 6 shows a flowchart of a video generation method provided by another embodiment of the present application.
  • Fig. 7 shows a functional block diagram of a video generation device provided by an embodiment of the present application.
  • FIG. 8 shows a structural block diagram of an electronic device provided by an embodiment of the present application for executing the video generation method according to the embodiment of the present application.
  • FIG. 9 shows a schematic diagram of a storage medium for storing or carrying program code for implementing the video generation method according to the embodiment of the present application provided by an embodiment of the present application.
  • the inventor proposes the video generation method, device, electronic device, and storage medium in the embodiments of the present application. While displaying information content through the video, the electronic device reproduces the user's face on a certain character in the video to enhance the user's sense of substitution, thereby enhancing the user's experience.
  • an embodiment of the present application provides a video generation method, which can be applied to electronic devices.
  • the electronic device can be various electronic devices with a display screen, a shooting camera, an audio output function and support for data input, including but not limited to smart phones, tablet computers, laptop portable computers, desktop computers, and wearables.
  • The data input may be voice input based on a voice module of the electronic device, character input based on a character input module, and so on.
  • the specific method may include:
  • Operation S110 Obtain interactive information input by the user.
  • the interactive information input by the user can be acquired through various information input modules integrated in the electronic device or various information input devices connected to the electronic device.
  • the interaction information includes, but is not limited to, various types of information such as voice information, text information, image information, and action information.
  • Voice information can include speech audio, such as Chinese or English audio, and non-speech audio, such as music.
  • Text information can include textual content, such as Chinese or English text, and non-textual content, such as special symbols and character emoticons.
  • Image information can include static image information, such as static pictures and photos, and dynamic image information, such as animated pictures and video images.
  • Action information can include user action information, for example, user gestures, body movements, and facial movements, as well as terminal action information, such as the position, posture, and motion state of the terminal device (for example, shaking and rotation).
  • electronic devices can collect information through different types of information input modules.
  • electronic devices can collect user voice information through audio input devices such as microphones, text information input by users through touch screens or physical buttons, image information through cameras, and motion information through optical sensors and gravity sensors.
  • The same request can correspond to different types of interactive information. For example, when a user wants to input the request "I want to listen to Aladdin's story", the user can input the corresponding audio by voice, upload pictures related to Aladdin, or input the corresponding text. It is understandable that, for the same request, only one type of interactive information may be input, or multiple types may be input at the same time, so that the user's intention is clearer and easier for the electronic device to recognize.
  • The electronic device obtains different types of interactive information in a variety of ways, so that the user's various interaction methods can all be responded to freely; interaction is no longer limited to traditional, mechanical human-computer interaction means, and multi-modal human-machine interaction that suits more interaction scenarios is realized.
  • Operation S120 Obtain a scene video according to the interaction information, and the scene video includes a character to be matched.
  • After obtaining the interactive information input by the user, the electronic device can perform semantic understanding on the interactive information and obtain its semantic information, so as to realize an accurate understanding of the user's interactive information.
  • the scene video may be video information related to the interactive information acquired by the electronic device in response to the interactive information input by the user.
  • the electronic device may search for videos related to the semantic information according to the semantic information.
  • For example, if the interactive information input by the user is "I want to hear Aladdin's story", the scene video corresponding to the interactive information may be a film or television work about Aladdin.
  • the electronic device may search for the video text information related to the semantic information according to the semantic information. For example, if the interactive information input by the user is "I want to hear Aladdin's story", the electronic device searches for story text related to Aladdin, and generates a corresponding scene video based on the story text.
  • The electronic device can cut the acquired video text information according to scenes to obtain multiple scene texts, perform semantic understanding on each scene text, acquire the characters, places, and events in each scene text, and convert the scene text into voice information.
  • When the electronic device generates the sub-scene video corresponding to a scene text, it can generate a video picture of the character performing the event at the place according to the character, place, and event, and synthesize the voice information with the video picture to obtain the sub-scene video corresponding to that scene text. If one sub-scene video is generated, the electronic device uses it as the scene video; if multiple sub-scene videos are generated, the electronic device splices them into the scene video.
  • Operation S130 Obtain the facial information of the user and extract the corresponding facial features as the target facial features.
  • the electronic device obtains the user's facial information, and extracts facial features based on the user's facial information.
  • the face information may be a face image or a video including a face.
  • The facial features may be a set of feature points used to describe all or part of the shape of the face, recording the position information and depth information in space of each feature point on the person's face; by acquiring the facial features, part or all of the face image can be reconstructed.
  • the electronic device may input the acquired facial image or facial video into the feature extraction model to obtain facial features.
  • facial features can be features of five sense organs, for example, features of eyebrows, eyes, nose, mouth, and ears.
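  • As an illustration only, this extraction step might be sketched as follows in Python; the FaceLandmarkModel interface is a hypothetical placeholder for whatever keypoint detector an implementation actually uses, and is not part of this application.

```python
import cv2
import numpy as np

class FaceLandmarkModel:
    """Hypothetical feature extraction model: returns an (N, 3) array holding
    the x, y position and depth of each facial feature point (eyebrows, eyes,
    nose, mouth, ears)."""
    def predict(self, image: np.ndarray) -> np.ndarray:
        raise NotImplementedError

def extract_target_face_features(face_image_path: str,
                                 model: FaceLandmarkModel) -> np.ndarray:
    """Read the user's face image and extract the target facial features."""
    image = cv2.imread(face_image_path)
    if image is None:
        raise ValueError(f"could not read face image: {face_image_path}")
    # The returned feature set records position and depth of each feature
    # point, from which part or all of the face can later be reconstructed.
    return model.predict(image)
```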
  • the electronic device obtains the user's facial information, which may be a facial image of the user collected by a camera device of the electronic device, or a facial image provided by the user.
  • If the face image is collected by the camera device, the electronic device may activate its camera to collect the face image after obtaining the interactive information input by the user.
  • The electronic device extracts the facial features based on the facial information: the facial features may be extracted from the acquired face image or video on the electronic device side and used as the target facial features, or the face image or video acquired through the network or otherwise may be sent to a server, and the server extracts the facial features as the target facial features.
  • the target face feature is defined as the face feature extracted according to the acquired face information.
  • Operation S140 replacing the facial features of the person to be matched in the scene video with the target facial features to generate a video to be played.
  • After the electronic device obtains the scene video corresponding to the interaction information and the target facial features, it can replace the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played.
  • the person to be matched is the person who needs to be replaced in the acquired scene video.
  • the electronic device can replace the facial feature points of the person designated by the user.
  • the electronic device can perform semantic understanding of the scene video, acquire the protagonist in the entire scene video, and replace the protagonist's facial features. The electronic device reproduces the target face feature on the face of the person to be matched in the scene video to obtain the video to be played.
  • When the electronic device replaces the facial features of the person to be matched in the scene video, since the scene video can be split into multiple frames of images, it can process the scene video frame by frame and detect, for each frame, whether the person to be matched is present. If the person to be matched is present in a frame, the facial features of the person to be matched are located to determine the replacement area, and the replacement area is replaced with the target facial features. Therefore, wherever the person to be matched appears in the scene video, that person's facial features are replaced with the target facial features, while the other characters and scenes can be left unprocessed and retain the original scene-video image.
  • Specifically, the electronic device can locate the facial features of the person to be matched, obtain the area to be replaced, and replace the facial features in that area with the target facial features.
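  • A rough sketch of this frame-by-frame replacement, using OpenCV for video I/O; detect_person_to_match and blend_target_face are assumed helper functions standing in for the detection and rendering steps, not the application's actual algorithms.

```python
import cv2
import numpy as np

def detect_person_to_match(frame: np.ndarray):
    """Hypothetical detector: returns the (x, y, w, h) replacement area of the
    person to be matched in this frame, or None if that person is absent."""
    raise NotImplementedError

def blend_target_face(frame: np.ndarray, region, target_face_features) -> np.ndarray:
    """Hypothetical renderer: redraws only the replacement area using the
    target facial features and returns the modified frame."""
    raise NotImplementedError

def generate_video_to_play(scene_video_path: str, output_path: str,
                           target_face_features) -> None:
    cap = cv2.VideoCapture(scene_video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    size = (int(cap.get(cv2.CAP_PROP_FRAME_WIDTH)),
            int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT)))
    writer = cv2.VideoWriter(output_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, size)
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        region = detect_person_to_match(frame)
        if region is not None:
            # Only the located area changes; other characters and the
            # background keep the original scene-video image.
            frame = blend_target_face(frame, region, target_face_features)
        writer.write(frame)
    cap.release()
    writer.release()
```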
  • Fig. 2 shows a schematic diagram of facial feature replacement, where 141 is the person to be matched in the scene video, 142 is the replacement area obtained after locating the facial features of the person to be matched, 143 is the acquired target facial features, and 144 is the person after the facial features of the person to be matched have been replaced with the target facial features.
  • Operation S150 output the to-be-played video.
  • Outputting the video to be played can mean playing it on the electronic device, combining sound and picture content to present vivid video content to the user and reproducing the user's facial features on the person to be matched in the video, so as to enhance the user's sense of substitution for the video content.
  • the interactive information can be identified locally on the electronic device, and the scene video can be obtained according to the interactive information.
  • the electronic device collects facial information, extracts corresponding target facial features, and replaces the facial features of the person to be matched in the scene video to obtain the video to be played.
  • When the electronic device establishes a communication connection with a server, after obtaining the interactive information input by the user, the electronic device can also forward the interactive information to the server, and the server obtains the corresponding scene video through semantic understanding of the interactive information.
  • The electronic device sends the acquired facial information to the server, and the server extracts the facial features to obtain the target facial features, replaces the facial features of the person to be matched in the scene video with the target facial features to obtain the video to be played, and sends the video to be played to the electronic device for playing. This can reduce the local computing and storage pressure on the electronic device.
  • The execution order of operation S120 and operation S130 is not limited. They can be performed at the same time after the interactive information is obtained; alternatively, after the interactive information input by the user is obtained, operation S130 (obtaining the facial information of the user and extracting the target facial features) may be performed first, or operation S120 (obtaining the scene video according to the interaction information) may be performed first. In actual execution, the order can be set as required and is not specifically limited here.
  • The electronic device obtains the interactive information input by the user; obtains the scene video according to the interactive information, the scene video including the person to be matched; obtains the user's facial information and extracts the corresponding facial features as the target facial features; replaces the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played; and outputs the video to be played.
  • In this way, the information is displayed vividly to the user by combining voice and image, and the user's face is reproduced on a video character at the same time; the interaction is more intuitive, and the user's sense of substitution in the information is enhanced, thereby improving the user's experience of obtaining information.
  • Referring to FIG. 3, another embodiment of the present application provides a video generation method. Based on the foregoing embodiments, this embodiment focuses on the process of generating the scene video based on video text information.
  • the method may include:
  • Operation S210 Obtain interactive information input by the user.
  • For the specific description of operation S210, reference may be made to operation S110 in the previous embodiment, which will not be repeated in this embodiment.
  • Operation S220 Perform semantic understanding on the interactive information, and obtain semantic information of the interactive information.
  • the electronic device may input the interactive information into a recognition model corresponding to the type of the interactive information, and recognize the interactive information based on the recognition model to obtain corresponding semantic information.
  • If the interactive information is voice information, the electronic device can recognize it based on a speech recognition model and obtain corresponding semantic information. If the interactive information is text information, the electronic device can recognize it based on a text recognition model and obtain corresponding semantic information. If the interactive information is image information, the electronic device can recognize it based on an image recognition model and obtain corresponding semantic information. If the interaction information is motion information, the electronic device can recognize it based on a body language recognition model, a terminal gesture recognition model, or a gesture recognition model, and obtain corresponding semantic information.
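  • A minimal dispatch sketch for choosing a recognition model by interaction type; the recognizer functions are placeholders assumed to map raw interaction data to semantic information.

```python
from typing import Any, Callable, Dict

# Placeholder recognizers; in practice these would wrap speech, text, image
# and gesture/posture recognition models respectively.
def recognize_speech(audio: Any) -> str: ...
def recognize_text(text: str) -> str: ...
def recognize_image(image: Any) -> str: ...
def recognize_action(motion: Any) -> str: ...

RECOGNIZERS: Dict[str, Callable[[Any], str]] = {
    "voice": recognize_speech,
    "text": recognize_text,
    "image": recognize_image,
    "action": recognize_action,
}

def semantic_understanding(interaction_type: str, data: Any) -> str:
    """Route the interactive information to the recognition model matching its type."""
    recognizer = RECOGNIZERS.get(interaction_type)
    if recognizer is None:
        raise ValueError(f"unsupported interaction type: {interaction_type}")
    return recognizer(data)
```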
  • Operation S230 searching for related video text information according to semantic information.
  • After the electronic device obtains the semantic information corresponding to the interactive information, it can understand the user's real intention and realize a more accurate search; the relevant video text information can be searched according to the semantic information. It is understandable that the video text information refers to the text information describing the entire video content. For example, if the video is Aladdin, the text information describing the entire video content is the story "Aladdin and the Magic Lamp".
  • the electronic device obtains semantic information through semantic understanding of interactive information, and can search for relevant video text information on the network according to the semantic information.
  • For example, the interactive information input by the user is "listen to Aladdin's story". Through semantic understanding, the electronic device knows that the user wants to listen to Aladdin's story and can search for the related video text information, namely the story text of "Aladdin and the Magic Lamp".
  • the electronic device may establish a text database in advance, and the text database stores multiple labeled video text information, where the labeled content may be scenes, characters, paragraphs, and so on.
  • the electronic device can search the corresponding video text information in the database according to the semantic information. It is understandable that the electronic device can mark the video text information according to actual needs, which is not limited here.
  • Operation S240 generate a scene video according to the video text information.
  • After the electronic device obtains the video text information, it can generate the corresponding scene video according to the video text information. Specifically, the following operations can be included; refer to the method flowchart shown in FIG. 4.
  • Operation S241 cutting the video text information according to scenes to obtain at least one piece of scene text.
  • the electronic device can cut the video text information according to the scenes to obtain the corresponding scene text.
  • When the electronic device cuts the video text information, the video text information may be manually annotated in advance, where the annotated content may be scene information, character information, time information, and so on. The video text information can be annotated according to actual needs, which is not limited here. After the annotation is completed, the electronic device can store the annotated video text information in a database, and the annotated video text information can later be obtained by querying the database. The electronic device cuts the video text information according to the annotation information in it to obtain one or more pieces of scene text. If the video text information involves one scene, the electronic device obtains one piece of scene text; if multiple scenes are involved, the electronic device obtains multiple pieces of scene text.
  • For example, the annotated video text information obtained by the electronic device includes two scenes, one of which is a street and the other a house; the electronic device cuts the video text information to obtain two pieces of scene text.
  • the electronic device may also add position information of the scene text in the video text information to the scene text, so as to determine the sequence of occurrence of the scene.
  • Alternatively, the electronic device may cut the video text information by inputting it into a first deep learning model. It is understandable that the first deep learning model can be trained on a large amount of data to cut video text information according to scenes, so that at least one piece of scene text is obtained after the cut.
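  • One possible way to sketch the annotation-based cut, assuming scene boundaries are marked in the stored video text with a tag such as [scene: ...]; the tag format is an assumption for illustration only.

```python
import re
from typing import List, Tuple

SCENE_TAG = re.compile(r"\[scene:\s*(?P<name>[^\]]+)\]")

def cut_video_text(video_text: str) -> List[Tuple[int, str, str]]:
    """Split annotated video text into (position, scene_name, scene_text) pieces.

    The position index preserves the order in which the scenes occur, so the
    sub-scene videos can later be spliced back in sequence."""
    matches = list(SCENE_TAG.finditer(video_text))
    pieces = []
    for i, match in enumerate(matches):
        start = match.end()
        end = matches[i + 1].start() if i + 1 < len(matches) else len(video_text)
        pieces.append((i, match.group("name").strip(), video_text[start:end].strip()))
    return pieces

# Example with two annotated scenes, one on a street and one in a house.
print(cut_video_text("[scene: street] Aladdin wanders the market. [scene: house] He returns home."))
```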
  • Operation S242 perform semantic understanding on at least one piece of scene text, and respectively generate sub-scene videos corresponding to each piece of scene text.
  • If one piece of scene text is obtained, the electronic device performs semantic understanding on that piece of scene text and generates the corresponding sub-scene video; if multiple pieces of scene text are obtained, the electronic device performs semantic understanding on each piece separately and generates the sub-scene video corresponding to each piece.
  • Specifically, the electronic device can perform semantic understanding on the scene text and extract semantic features from it, the semantic features including characters, places, and events; convert the scene text into voice information; and, based on the semantic features and the voice information, generate a sub-scene video in which the characters perform the event at the place.
  • The audio in the sub-scene video is obtained by converting the scene text into audio information; the picture content in the sub-scene video is obtained according to the characters, events, places, and other information in the semantic features.
  • The electronic device may establish an image database in advance and add a corresponding tag to each image in it; it can then obtain the image corresponding to a character according to the character, the action corresponding to an event according to the event, and the scene corresponding to a place according to the place, superimpose and synthesize the acquired images, and thereby obtain the picture content of the character performing the event at the place.
  • Alternatively, the electronic device may search the Internet for the corresponding picture content according to the character, event, and place, and superimpose and synthesize that content to obtain the picture content of the character performing the event at the place.
  • For example, the scene text reads "Aladdin came to the tunnel entrance; because the top step was too far from the ground, he asked the magician to give him a hand".
  • the electronic device performs semantic understanding of the scene text and extracts the corresponding semantic features.
  • The semantic features include the characters Aladdin and the magician, the place is the tunnel entrance, and the event is that Aladdin asks the magician to pull him up.
  • The electronic device can obtain images of the characters Aladdin and the magician, the action of reaching out to pull someone up, and the scene of the tunnel entrance, superimpose and synthesize these pictures, and generate the picture content of Aladdin asking the magician to pull him up at the tunnel entrance. The electronic device then converts the scene text into voice information, synthesizes the picture content with the voice information, and generates the sub-scene video.
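  • A coarse sketch of assembling one sub-scene video from the extracted semantic features; text_to_speech, render_picture, and combine_audio_video are assumed helpers standing in for the speech synthesis, image retrieval and compositing, and audio/video muxing steps.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SemanticFeatures:
    characters: List[str]   # e.g. ["Aladdin", "the magician"]
    place: str              # e.g. "tunnel entrance"
    event: str              # e.g. "Aladdin asks the magician to pull him up"

# Assumed helpers, standing in for a TTS engine, an image-database lookup
# plus compositing step, and an audio/video muxer respectively.
def text_to_speech(scene_text: str) -> str: ...              # returns an audio file path
def render_picture(features: SemanticFeatures) -> str: ...   # returns a silent video path
def combine_audio_video(video_path: str, audio_path: str, out_path: str) -> str: ...

def generate_sub_scene_video(scene_text: str, features: SemanticFeatures,
                             out_path: str) -> str:
    """Convert the scene text to speech, render the picture content, and mux them."""
    audio_path = text_to_speech(scene_text)
    picture_path = render_picture(features)
    return combine_audio_video(picture_path, audio_path, out_path)
```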
  • When the electronic device converts the scene text into voice information, if the user's face information has already been obtained, it can recognize the face information and identify the gender, age, and other attributes of the person, so as to match the timbre of the voice information to that person. For example, if the recognized face is female and about 10 years old, the voice information can be given a sweet tone close to the user's identity image, so that the user has a better sense of substitution when hearing it.
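  • The timbre-matching idea could be sketched as a simple mapping from recognised face attributes to a voice preset; the attribute values and preset names below are illustrative assumptions.

```python
def choose_voice_timbre(gender: str, age: int) -> str:
    """Pick a TTS voice preset close to the user's identity image."""
    if gender == "female" and age <= 12:
        return "sweet_child"      # e.g. the 10-year-old girl in the example above
    if gender == "female":
        return "female_adult"
    if age <= 12:
        return "male_child"
    return "male_adult"
```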
  • Operation S243 if one sub-scene video is generated, use one sub-scene video as the scene video.
  • If the electronic device obtains one piece of scene text after cutting the video text information, it generates the sub-scene video corresponding to that piece of scene text and uses the sub-scene video as the scene video.
  • Operation S244 if multiple sub-scene videos are generated, synthesize the multiple sub-scene videos into a scene video.
  • If the electronic device obtains multiple pieces of scene text after cutting the video text information, multiple corresponding sub-scene videos are generated, one for each piece of scene text. The electronic device synthesizes the multiple sub-scene videos into the scene video according to their order of occurrence in the video text information.
  • When generating a sub-scene video, the electronic device may add to it the position information of the corresponding scene text in the video text information, where the position information may be the paragraph information of the scene text in the video text information. For example, if the scene text is the 12th paragraph of the video text information, the electronic device may add a position label marking paragraph 12 when generating the corresponding sub-scene video. Alternatively, if the corresponding paragraph information was annotated when the video text information was annotated, the paragraph information of the scene text can be obtained as a position label and added to the sub-scene video.
  • The electronic device may synthesize the multiple sub-scene videos into the scene video by acquiring the position label in each sub-scene video and splicing the sub-scene videos according to the order of the position labels. For example, the electronic device generates three sub-scene videos: the position label of the first sub-scene video is paragraph 1, the position label of the second sub-scene video is paragraph 12, and the position label of the third sub-scene video is paragraph 6. Ordered by position label, the first, third, and second sub-scene videos are spliced in that order to obtain the scene video.
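  • A sketch of this splicing step, assuming the moviepy library (1.x API) is available; the sub-scene videos are sorted by their position labels before concatenation, so paragraphs 1, 6, and 12 end up in document order.

```python
from typing import List, Tuple
from moviepy.editor import VideoFileClip, concatenate_videoclips  # assumes moviepy 1.x

def splice_scene_video(labeled_clips: List[Tuple[int, str]], out_path: str) -> None:
    """labeled_clips holds (position_label, sub_scene_video_path) pairs; sorting
    by the position label restores the order the scene texts had in the video
    text information."""
    ordered = sorted(labeled_clips, key=lambda item: item[0])
    clips = [VideoFileClip(path) for _, path in ordered]
    concatenate_videoclips(clips).write_videofile(out_path)

# Example from the text: sub-scene videos labeled paragraph 1, 12, and 6.
splice_scene_video([(1, "sub1.mp4"), (12, "sub2.mp4"), (6, "sub3.mp4")], "scene.mp4")
```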
  • the scene video generated according to the video text information may include multiple characters, and one of the characters may be the character to be matched, and the facial features of the character to be matched are replaced.
  • Operation S250 Obtain the face information of the user and extract the corresponding face feature as the target face feature.
  • Operation S260 replacing the facial features of the person to be matched in the scene video with the target facial features to generate a video to be played.
  • Operation S270 output the video to be played.
  • the embodiment of the present application proposes a video generation method.
  • An electronic device obtains video text information according to the interactive information, cuts the video text information according to scenes to obtain at least one piece of scene text, performs semantic understanding on each piece of scene text, and generates the corresponding sub-scene video for each piece. If one sub-scene video is generated, it is used as the scene video; if multiple sub-scene videos are generated, they are combined into the scene video.
  • the video text information can be converted into the corresponding scene video to show the user vivid information content.
  • Referring to FIG. 5, another embodiment of the present application provides a video generation method. Based on the foregoing embodiments, this embodiment focuses on the process of obtaining the scene video based on the interactive information.
  • the method may include:
  • Operation S310 Obtain interaction information input by the user.
  • Operation S320 perform semantic understanding on the interactive information, and obtain semantic information of the interactive information.
  • Operation S330 searching for a related video file as a scene video according to the semantic information.
  • After the electronic device obtains the semantic information corresponding to the interactive information, it can directly search for a related video file as the scene video based on the semantic information.
  • For example, if the user's interactive information is "how to make braised pork", the user wants to know how to make braised pork; the electronic device then searches for video tutorials related to making braised pork and uses a searched video tutorial as the scene video.
  • When the electronic device searches for related video tutorials, it may obtain multiple results; according to the play count and comment count of each video, the video tutorial with the highest play count or comment count may be used as the scene video. It is understandable that how the scene video is selected from the searched video tutorials can be set according to actual needs, which is not limited here.
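  • One possible selection rule, sketched with a hypothetical search-result structure; ranking by play count with comment count as a tie-breaker is merely one way of setting the rule according to actual needs.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SearchResult:
    url: str
    play_count: int
    comment_count: int

def pick_scene_video(results: List[SearchResult]) -> SearchResult:
    """Use the most-played (then most-commented) video tutorial as the scene video."""
    return max(results, key=lambda r: (r.play_count, r.comment_count))
```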
  • When the electronic device searches according to the semantic information, it can search in a dedicated database or over the network; this can be set according to actual needs and is not limited here.
  • Operation S340 Obtain the facial information of the user and extract the corresponding facial features as the target facial features.
  • Operation S350 replacing the facial features of the person to be matched in the scene video with the target facial features to generate a video to be played.
  • Operation S360 output the video to be played.
  • the embodiment of the application proposes a video generation method.
  • The electronic device obtains the interactive information input by the user; performs semantic understanding on the interactive information to obtain its semantic information; searches for a related video file as the scene video according to the semantic information; obtains the user's face information and extracts the corresponding facial features as the target facial features; replaces the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played; and outputs the video to be played.
  • Related videos can thus be searched using the semantic information, and the information is displayed vividly to the user through video; the user's sense of substitution is enhanced, thereby improving the user's experience of obtaining information.
  • Referring to FIG. 6, another embodiment of the present application provides a video generation method. Based on the foregoing embodiment, this embodiment focuses on the process of determining the person to be matched in the scene video.
  • the specific method may include:
  • Operation S410 Obtain interaction information input by the user.
  • Operation S420 Obtain a scene video according to the interactive information.
  • Operation S430 Determine the person to be matched in the scene video.
  • The scene video acquired by the electronic device according to the interactive information may include multiple characters. Among the multiple characters, one character can be selected as the character to be matched, and that character's facial features are replaced.
  • For example, if the acquired scene video is a video related to Aladdin, semantic understanding of the scene video can be performed to learn that the protagonist of the scene video is Aladdin; Aladdin can then be used as the character to be matched.
  • the number and duration of appearance of each character in the scene video can be counted, and the character with the most appearance times is taken as the protagonist of the scene video.
  • For example, the characters appearing in the scene video include character A, character B, and character C. Character A appears twice, for 50 s the first time and 10 s the second time; character B appears once, for 10 s; character C appears once, for 1 s. Combining the number and duration of each character's appearances, character A is determined to be the protagonist of the scene video and can be used as the character to be matched.
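  • A sketch of this protagonist heuristic, assuming per-frame character identities are already available from a recognition step; here the statistic is simplified to total on-screen time, which combines appearance count and duration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

def pick_person_to_match(frame_characters: Iterable[List[str]], fps: float) -> str:
    """frame_characters yields, for each frame, the characters recognised in it;
    the character with the longest total on-screen time is taken as the protagonist."""
    seconds_on_screen: Dict[str, float] = defaultdict(float)
    for characters in frame_characters:
        for character in characters:
            seconds_on_screen[character] += 1.0 / fps
    # e.g. {"A": 60.0, "B": 10.0, "C": 1.0} -> character A is the protagonist
    return max(seconds_on_screen, key=seconds_on_screen.get)
```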
  • Alternatively, the characters appearing in the scene video may be obtained and displayed to instruct the user to select a specified person from the displayed characters; the specified person selected by the user is obtained and used as the person to be matched in the scene video.
  • Operation S440 Obtain the face information of the user and extract the corresponding face feature as the target face feature.
  • Operation S450 replacing the facial features of the person to be matched in the scene video with the target facial features to generate a video to be played.
  • The electronic device can perform semantic understanding on the acquired scene video, obtain the protagonist of the entire scene video, and use the protagonist as the person to be matched; the facial features of the person to be matched are replaced with the target facial features to generate the video to be played.
  • For example, if the scene video obtained by the electronic device is a video related to Aladdin, the scene video can be semantically understood; Aladdin can be regarded as the character to be matched, and Aladdin's facial features are replaced with the target facial features to generate the video to be played.
  • When the electronic device performs semantic understanding on the scene video, it can count the number and duration of appearances of each character in the scene video and use the character with the most appearances as the protagonist of the scene video.
  • For example, the characters appearing include character A, character B, and character C. Character A appears twice, for 50 s the first time and 10 s the second time; character B appears once, for 10 s; character C appears once, for 1 s.
  • the character A can be used as the character to be matched in the scene video, and the facial feature of the character A is replaced with the target facial feature to generate the to-be-played video.
  • Alternatively, the electronic device may obtain and display the characters appearing in the scene video to instruct the user to select a specified person from the displayed characters, obtain the specified person selected by the user, and use the specified person as the person to be matched in the scene video; the facial features of the person to be matched are then replaced with the target facial features to generate the video to be played.
  • Operation S460 output the to-be-played video.
  • FIG. 7 shows a video generation device 500 provided by an embodiment of the present application, which is applied to electronic equipment.
  • The video generation device 500 includes an information input module 510, a scene video acquisition module 520, a face acquisition module 530, a video generation module 540, and an output module 550.
  • The information input module 510 is used to obtain the interactive information input by the user; the scene video acquisition module 520 is used to obtain the scene video according to the interactive information, the scene video including the character to be matched; the face acquisition module 530 is used to obtain the user's face information and extract the corresponding facial features as the target facial features; the video generation module 540 is used to replace the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played; and the output module 550 is used to output the video to be played.
  • The scene video acquisition module 520 also includes: an understanding unit, used to perform semantic understanding on the interactive information and obtain semantic information of the interactive information; and a video generation unit, used to search for relevant video text information according to the semantic information and generate a scene video according to the video text information.
  • The video generation unit also includes: a cutting subunit, for cutting the video text information according to scenes to obtain at least one piece of scene text; a generating subunit, for performing semantic understanding on the at least one piece of scene text and respectively generating the sub-scene video corresponding to each piece of scene text; and a synthesis subunit, for using the sub-scene video as the scene video if one sub-scene video is generated, and synthesizing the multiple sub-scene videos into the scene video if multiple sub-scene videos are generated.
  • The generating subunit is also used to extract semantic features from the scene text, the semantic features including characters, places, and events; convert the scene text into voice information; and, according to the semantic features and voice information, generate a sub-scene video in which the characters perform the event at the place.
  • the scene video acquisition module 520 is also used to perform semantic understanding of the interaction information, and obtain semantic information of the interaction information; and search for a related video file as a scene video according to the semantic information.
  • The video generation module 540 also includes: a determination unit, used to perform semantic understanding on the scene video, obtain the protagonist of the entire scene video, and use the protagonist as the character to be matched in the scene video; and a replacement unit, used to replace the facial features of the character to be matched with the target facial features.
  • The video generation module 540 may also include: a display unit, for displaying all the characters in the scene video to instruct the user to select a specified person from them, obtaining the specified person selected by the user, and using the specified person as the person to be matched in the scene video; and the replacement unit, used to replace the facial features of the person to be matched with the target facial features.
  • The electronic device obtains the interactive information input by the user; obtains the scene video according to the interactive information, the scene video including the person to be matched; obtains the user's facial information and extracts the corresponding facial features as the target facial features; replaces the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played; and outputs the video to be played. The information is thus displayed vividly to the user through video, and the facial features of a specific person in the video are replaced with the target facial features, which enhances the user's sense of substitution and thereby improves the user's experience of obtaining information.
  • The coupling, direct coupling, or communication connection between the displayed or discussed modules may be through some interfaces, and the indirect coupling or communication connection between devices or modules may be electrical, mechanical, or in other forms.
  • the functional modules in the various embodiments of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware or software functional modules.
  • the electronic device 600 may be an electronic device capable of running application programs, such as a smart phone, a tablet computer, or an e-book.
  • The electronic device 600 in this application may include one or more of the following components: a processor 610, a memory 620, and one or more application programs, where the one or more application programs may be stored in the memory 620 and configured to be executed by the one or more processors 610, the one or more programs being configured to implement the following operations: obtain the interactive information input by the user; obtain a scene video according to the interactive information, the scene video including the person to be matched; obtain the user's face information and extract the corresponding facial features as the target facial features; replace the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played; and output the video to be played.
  • obtaining the scene video according to the interactive information includes: performing semantic understanding of the interactive information to obtain the semantic information of the interactive information; searching for relevant video text information according to the semantic information; and generating the scene video according to the video text information.
  • Generating a scene video according to the video text information includes: cutting the video text information according to scenes to obtain at least one piece of scene text; performing semantic understanding on the at least one piece of scene text, and respectively generating the sub-scene video corresponding to each piece of scene text; if one sub-scene video is generated, using the one sub-scene video as the scene video; and if multiple sub-scene videos are generated, combining the multiple sub-scene videos into the scene video.
  • Performing semantic understanding on the at least one piece of scene text and respectively generating the sub-scene videos corresponding to each piece of scene text includes: extracting semantic features from the scene text, the semantic features including characters, places, and events; converting the scene text into voice information; and, according to the semantic features and voice information, generating a sub-scene video in which the characters perform the event at the place.
  • acquiring the scene video according to the interactive information includes: understanding the semantics of the interactive information to obtain the semantic information of the interactive information; and searching for a related video file as the scene video according to the semantic information.
  • Replacing the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played includes: performing semantic understanding on the scene video, obtaining the protagonist of the entire scene video, and using the protagonist as the person to be matched in the scene video; and replacing the facial features of the person to be matched with the target facial features to generate the video to be played.
  • Alternatively, replacing the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played includes: displaying all the characters in the scene video to instruct the user to select a specified person from them; obtaining the specified person selected by the user, and using the specified person as the person to be matched in the scene video; and replacing the facial features of the person to be matched with the target facial features to generate the video to be played.
  • the processor 610 may include one or more processing cores.
  • The processor 610 uses various interfaces and lines to connect the various parts of the entire electronic device 600, and performs various functions of the electronic device 600 and processes data by running or executing the instructions, programs, code sets, or instruction sets stored in the memory 620 and calling the data stored in the memory 620.
  • The processor 610 may be implemented in at least one hardware form of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA).
  • the processor 610 may integrate one or a combination of a central processing unit (CPU), a graphics processing unit (GPU), a modem, and the like.
  • the CPU mainly processes the operating system, user interface, and application programs; the GPU is used for rendering and drawing of display content; the modem is used for processing wireless communication. It can be understood that the above-mentioned modem may not be integrated into the processor 610, but may be implemented by a communication chip alone.
  • The memory 620 may include random access memory (RAM) or read-only memory (ROM).
  • the memory 620 may be used to store instructions, programs, codes, code sets, or instruction sets.
  • The memory 620 may include a storage program area and a storage data area, where the storage program area may store instructions for implementing the operating system, instructions for implementing at least one function (such as a touch function, a sound playback function, or an image playback function), instructions used to implement the following various method embodiments, and so on.
  • the storage data area can also store data created by the electronic device 600 during use (such as phone book, audio and video data, chat record data), and the like.
  • FIG. 9 shows a structural block diagram of a computer-readable storage medium provided by an embodiment of the present application.
  • The computer-readable storage medium 700 stores program code, and when the program code is invoked and executed by a processor, the following operations are realized: acquiring interactive information input by a user; acquiring a scene video according to the interactive information, the scene video including a person to be matched; acquiring the user's face information and extracting the corresponding facial features as target facial features; replacing the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played; and outputting the video to be played.
  • Obtaining the scene video according to the interactive information includes: performing semantic understanding on the interactive information to obtain the semantic information of the interactive information; searching for relevant video text information according to the semantic information; and generating a scene video according to the video text information.
  • Generating a scene video according to the video text information includes: cutting the video text information according to scenes to obtain at least one piece of scene text; performing semantic understanding on the at least one piece of scene text, and respectively generating the sub-scene video corresponding to each piece of scene text; if one sub-scene video is generated, using the one sub-scene video as the scene video; and if multiple sub-scene videos are generated, combining the multiple sub-scene videos into the scene video.
  • Performing semantic understanding on the at least one piece of scene text to respectively generate the sub-scene videos corresponding to each piece of scene text includes: extracting semantic features from the scene text, the semantic features including characters, places, and events; converting the scene text into voice information; and generating a sub-scene video in which the character performs the event at the place according to the semantic features and the voice information.
  • the obtaining the scene video according to the interaction information includes: performing semantic understanding of the interaction information to obtain semantic information of the interaction information; and searching for a related video file as the scene video according to the semantic information .
  • Replacing the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played includes: semantically understanding the scene video, obtaining the protagonist of the entire scene video, and using the protagonist as the person to be matched in the scene video; and replacing the facial features of the person to be matched with the target facial features to generate the video to be played.
  • Alternatively, replacing the facial features of the person to be matched in the scene video with the target facial features to generate the video to be played includes: displaying all the characters in the scene video to instruct the user to select a designated person from them; obtaining the designated person selected by the user, and using the designated person as the person to be matched in the scene video; and replacing the facial features of the person to be matched with the target facial features to generate the video to be played.
  • the computer-readable storage medium 700 may be an electronic memory such as flash memory, EEPROM (Electrically Erasable Programmable Read Only Memory), EPROM, hard disk, or ROM.
  • the computer-readable storage medium 700 includes a non-transitory computer-readable storage medium.
  • The computer-readable storage medium 700 has storage space for program code 710 that performs each operation of the method embodiments of this application. The program code can be read from, or written into, one or more computer program products.
  • The program code 710 may, for example, be compressed in a suitable form.
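The sub-scene generation step summarized in the list above (extract person, place and event from the scene text, synthesize speech, compose the picture content) can be sketched roughly as follows. This is a minimal illustration only: SemanticFeatures, extract_semantic_features, text_to_speech and scene_text_to_sub_scene are invented placeholder names, and their stubbed bodies stand in for trained semantic-understanding and speech-synthesis models rather than any implementation defined by this application.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class SemanticFeatures:
    person: str   # who appears in the scene text
    place: str    # where the scene happens
    event: str    # what the person does there

@dataclass
class SubSceneVideo:
    audio: bytes       # voice information synthesized from the scene text
    frames: List[str]  # composed picture content (described as strings in this sketch)

def extract_semantic_features(scene_text: str) -> SemanticFeatures:
    # Stand-in for the semantic-understanding step (e.g. NER plus event extraction).
    return SemanticFeatures(person="Aladdin",
                            place="tunnel entrance",
                            event="asks the magician to pull him up")

def text_to_speech(scene_text: str) -> bytes:
    # Stand-in for a speech-synthesis engine; a real one returns audio samples.
    return scene_text.encode("utf-8")

def scene_text_to_sub_scene(scene_text: str) -> SubSceneVideo:
    features = extract_semantic_features(scene_text)
    audio = text_to_speech(scene_text)
    # Compose picture content in which `person` performs `event` at `place`.
    frames = [f"{features.person} at {features.place}: {features.event}"]
    return SubSceneVideo(audio=audio, frames=frames)
```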

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Library & Information Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A video generation method, comprising: acquiring interaction information input by a user; acquiring a scene video according to the interaction information, the scene video including a person to be matched; acquiring the user's face information and extracting the corresponding face features as target face features; replacing the face features of the person to be matched in the scene video with the target face features to generate a video to be played; and outputting the video to be played.

Description

视频生成方法、装置、电子设备及存储介质
相关申请的交叉引用
本申请要求于2019年12月04日提交中国专利局、申请号为201911228480.6、发明名称为“视频生成方法、装置、电子设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及电子设备技术领域,更具体地,涉及一种视频生成方法、装置、电子设备及存储介质。
背景技术
随着科技的发展,人们的生活日益丰富,人们获取文本中的信息的方式也越来越多,越来越方便。相比于之前仅能通过阅读的方式来获取文本中的信息,现在还可以通过音频的方式来实现。
然而，通过音频的方式虽然可以方便用户在不用看着文本的情况下也能获取到文本信息，但较为枯燥、无趣，用户难以了解文本内容与环境场景相融合的具体信息，从而降低了用户获取信息的体验感。
发明内容
根据本申请的各种实施例,提供一种视频生成方法、装置、电子设备及存储介质。
第一方面,本申请实施例提供了一种视频生成方法,所述方法包括:
获取用户输入的交互信息;
根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
输出所述待播放视频。
第二方面,本申请实施例提供了一种视频生成装置,所述装置包括:
信息输入模块,用于获取用户输入的交互信息;
场景视频获取模块,用于根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
人脸获取模块,用于获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
视频生成模块,用于以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
输出模块,用于输出所述待播放视频。
第三方面,本申请实施例提供了一种电子设备,所述电子设备包括:
一个或多个处理器;
存储器,与所述一个或多个处理器电连接;
一个或多个应用程序,其中所述一个或多个应用程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个应用程序配置用于实现以下操作:
获取用户输入的交互信息;
根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
输出所述待播放视频。
第四方面,本申请实施列提供一种计算机可读存储介质,所述计算机可读取存储介质中存储有程序代码,所述程序代码被处理器调用执行时,实现以下操作:
获取用户输入的交互信息;
根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
输出所述待播放视频。
本发明的一个或多个实施例的细节在下面的附图和描述中提出。本发明的其它特征、目的和优点将从说明书、附图以及权利要求书变得明显。
附图说明
为了更好地描述和说明这里公开的那些发明的实施例和/或示例,可以参考一幅或多幅附图。用于描述附图的附加细节或示例不应当被认为是对所公开的发明、目前描述的实施例和/或示例以及目前理解的这些发明的最佳模式中的任何一者的范围的限制。
图1示出了本申请一个实施例提供的视频生成方法的流程图。
图2示出了本申请一个实施例提供的对待匹配人物的脸部特征进行替换的示意图。
图3示出了本申请另一个实施例提供的视频生成方法的流程图。
图4示出了本申请一个实施例提供的根据视频文本信息生成场景视频的流程示意图。
图5示出了本申请另一个实施例提供的视频生成方法的流程图。
图6示出了本申请另一个实施例提供的视频生成方法的流程图。
图7示出了本申请一个实施例提供的视频生成装置的功能模块图。
图8示出了本申请一个实施例提供的用于执行根据本申请实施例的视频生成方法的电子设备的结构框图。
图9示出了本申请一个实施例提供的用于保存或者携带实现根据本申请实施例的视频生成方法的程序代码的存储介质的示意图。
具体实施方式
为了使本技术领域的人员更好地理解本申请方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述。
随着社会的进步和科技的发展，人们获取信息和知识的途径越来越多，例如，阅读文本、听取音频或观看视频都可以获取到各种信息。然而通过阅读文本或听取音频的方式较为单调，用户在阅读文本或听取音频的时间较长时，通常会感到枯燥，从而导致用户的体验较差。视频具有较好的表现方式，可以通过声音和画面为用户提供信息，然而，由于画面中的人物不是用户本身，产生的代入感较弱，从而导致用户的体验较差。
发明人在研究中发现,电子设备在通过视频获取信息时,可以将用户的脸复现在视频中的某个人物上,以增强用户的代入感,更好的获取视频中的信息,从而增强用户的体验。
由此，发明人提出了本申请实施例中的视频生成方法、装置、电子设备及存储介质。电子设备在通过视频展示信息内容的同时，将用户的脸复现在视频的某个人物上，以增强用户的代入感，从而提升用户的体验。
下面将对本申请实施例进行详细的说明。
请参阅图1，本申请实施例提供了一种视频生成方法，可应用于电子设备。其中，电子设备可以是具有显示屏、具有拍摄相机、具有音频输出功能且支持数据输入的各种电子设备，包括但不限于智能手机、平板电脑、膝上型便携计算机、台式计算机和可穿戴式电子设备等。具体的，数据输入可以是基于电子设备上具有的语音模块输入语音、字符输入模块输入字符等，具体的该方法可以包括：
操作S110:获取用户输入的交互信息。
本实施例中,可通过电子设备中集成的多种信息输入模块或与电子设备连接的多种信息输入装置获取用户输入的交互信息。
在一些实施方式中,交互信息包括但不限于语音信息、文本信息、图像信息、动作信息等各种类型的信息。其中,语音信息可以包括语音类的音频信息,例如汉语,英语音频等,以及非语言类的音频信息,例如音乐音频等;文本信息可以包括文字类的文本信息,例如中文、英文等,以及非文字类的文本信息,例如特殊符号,字符表情等;图像信息可以包括静态图像信息,例如静态图片、照片等,以及动态图像信息,例如动态图片、视频图像等;动作信息可以包括用户动作信息,例如用户手势、身体动作、表情动作等,以及终端动作信息,例如终端设备的位置、姿态和摇动、旋转等运动状态等。
可以理解的是,对应于不同种类的交互信息,电子设备可以通过不同类型的信息输入模块进行信息采集。例如,电子设备可通过麦克风等音频输入设备采集用户的语音信息,通过触摸屏或物理按键采集用户输入的文本信息,通过摄像头采集图像信息,通过光学传感器、重力传感器等采集动作信息等。
对于同一个请求，可以对应不同类型的交互信息。例如，用户想要输入“我想听阿拉丁的故事”的请求时，可以通过语音输入的方式输入对应的音频，也可以上传与阿拉丁相关的图片或输入对应的文本信息。可以理解的是，对应于同一个请求，可以仅输入一种类型的交互信息，也可以同时输入多种类型的交互信息，使用户的意图更加明确，更易被电子设备识别。
本实施例中,电子设备通过多种方式来获取不同种类的交互信息,使得用户的多种交互方式可以自由得到响应,不再局限于传统机械式的人机交互手段,实现了人机之间的多态交互,满足更多的交互场景。
操作S120:根据交互信息获取场景视频,场景视频中包括待匹配人物。
电子设备在获取用户输入的交互信息后,可以对交互信息进行语义理解,获取交互信息的语义信息,以实现精准的理解用户的交互信息。
场景视频,可以是电子设备针对用户输入的交互信息,获取的与交互信息相关的视频信息。
作为一种实施方式，电子设备可以根据语义信息，搜索与语义信息相关的视频。例如，用户输入的交互信息为“我想听阿拉丁的故事”，与该交互信息对应的场景视频可以是与阿拉丁对应的影视作品等。
作为另一种实施方式,电子设备可以根据语义信息,搜索与语义信息相关的视频文本信息。例如,用户输入的交互信息为“我想听阿拉丁的故事”,则电子设备搜索与阿拉丁相关的故事文本,根据故事文本生成对应的场景视频。
具体的,电子设备可以对获取到的视频文本信息按照场景进行切割,获得多段场景文本;基于每一段场景文本进行语义理解,获取每段场景文本中的人物,地点和事件,并将场景文本转换为语音信息。电子设备在生成与场景文本对应的子场景视频时,则可以根据人物,地点和事件,生成人物在地点执行事件的视频画面,将语音信息与视频画面合成,则可以得到与场景文本对应的子场景视频。若生成一个子场景视频,则电子设备将一个子场景视频作为场景视频;若生成多个子场景视频,则电子设备对多个子场景视频进行拼接合成为场景视频。
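A minimal Python sketch of the flow in the preceding paragraph: cut the retrieved video text by scene, generate one sub-scene video per scene text, and splice the results in order. All names here (split_by_scene, build_sub_scene, concatenate, generate_scene_video) are illustrative placeholders rather than an API defined by this application, and the trivial paragraph-based cutter merely stands in for the annotation- or model-based cutting described later.

```python
from typing import List

def split_by_scene(video_text: str) -> List[str]:
    # Placeholder scene cutter: treat each blank-line-separated paragraph as one scene.
    return [p.strip() for p in video_text.split("\n\n") if p.strip()]

def build_sub_scene(scene_text: str) -> str:
    # Placeholder for generating one sub-scene video from one scene text
    # (semantic features + synthesized speech + composed picture content).
    return f"<sub-scene video for: {scene_text[:20]}...>"

def concatenate(sub_videos: List[str]) -> str:
    # Placeholder for splicing sub-scene videos in order.
    return " + ".join(sub_videos)

def generate_scene_video(video_text: str) -> str:
    scene_texts = split_by_scene(video_text)            # yields at least one scene text
    sub_videos = [build_sub_scene(t) for t in scene_texts]
    if len(sub_videos) == 1:                            # single scene: use it directly
        return sub_videos[0]
    return concatenate(sub_videos)                      # multiple scenes: splice them
```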
操作S130:获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征。
电子设备获取用户的人脸信息,并根据用户的人脸信息提取人脸特征。其中,人脸信息可以是人脸图像,或是一段包括人脸的视频。本申请实施例中,人脸特征可以是用于描述人脸全部或部分形态的特征点集合,其记载有人脸上各个特征点在空间中的位置信息和深度信息,通过获取人脸特征即可重建人脸局部或全部的图像。在一些实施方式中,电子设备可以将获取的人脸图像或人脸视频,输入特征提取模型中,以获得人脸特征。其中,可以理解的是人脸特征可以是五官特征,例如,眉毛,眼部,鼻部,嘴部,耳部的特征。
其中，电子设备获取用户的人脸信息，可以是通过电子设备的摄像装置采集的用户的人脸图像，也可以是用户所提供的人脸图像。通过摄像装置采集人脸图像时，可以是在电子设备获取到用户输入的交互信息后，启动电子设备的摄像装置采集人脸图像。电子设备根据人脸信息提取人脸特征，可以是将获取到的人脸图像或视频在电子设备端提取人脸特征作为目标人脸特征；也可以是通过网络等将获取到的人脸图像或视频发送给服务器，由服务器提取人脸特征作为目标人脸特征。定义目标人脸特征为根据获取到的人脸信息提取到的人脸特征。
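The description above treats a face feature as a set of feature points carrying spatial position and depth. A small sketch of that data shape follows; FaceFeaturePoint, TargetFaceFeatures and extract_target_face_features are assumed names, and the hard-coded points only illustrate what a real landmark-extraction model (run on the device or on a server) would return.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class FaceFeaturePoint:
    name: str     # e.g. "left_eye_corner", "nose_tip", "mouth_left"
    x: float      # position of the point in the image plane
    y: float
    depth: float  # depth information for the point

@dataclass
class TargetFaceFeatures:
    points: List[FaceFeaturePoint]

def extract_target_face_features(face_image) -> TargetFaceFeatures:
    # Stand-in for the feature-extraction model; `face_image` is unused in this stub.
    # A real model would return dozens of landmark points describing the facial features.
    return TargetFaceFeatures(points=[
        FaceFeaturePoint("left_eye_corner", 120.0, 88.0, 4.1),
        FaceFeaturePoint("right_eye_corner", 176.0, 87.0, 4.0),
        FaceFeaturePoint("nose_tip", 148.0, 120.0, 2.5),
        FaceFeaturePoint("mouth_left", 128.0, 152.0, 3.3),
    ])
```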
操作S140:以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频。
电子设备在获取到与交互信息对应的场景视频,以及目标人脸特征后,可以将目标人脸特征替换场景视频中的待匹配人物的脸部特征生成待播放视频。
其中,待匹配人物为获取的场景视频中需要替换的人物。在一些实施方式中,电子设备可以对用户指定的人物进行脸部特征点的替换。在另一些实施方式中,电子设备可以对场景视频进行语义理解,获取整个场景视频中的主角,对主角的脸部特征进行替换。电子设备将目标人脸特征复现在场景视频中待匹配人物的脸上,得到待播放视频。
电子设备对场景视频中的待匹配人物的脸部特征进行替换时,由于场景视频可以拆分为多帧图像,则可以对场景视频中的每一帧图像进行处理,分别检测每一帧图像中是否存在待匹配人物;若在某一帧图像中存在待匹配人物,则对待匹配人物的脸部特征进行定位确定替换区,将替换区替换为目标人脸特征。由此,若场景视频中存在待匹配人物的画面,待匹配人物的脸部特征都会被替换为目标人脸特征,而场景视频中的其他人物和场景可以不做处理,保持在场景视频中原有的图像。
在以目标人脸特征替换待匹配人物的脸部特征时,电子设备可以对待匹配人物的脸部特征进行定位,获得待替换区,并将待替换区中的脸部特征替换为目标人脸特征。请参阅图2,示出了脸部特征替换的示意图。其中141为场景视频中的待匹配人物,142为对待匹配人物的脸部特征进行定位后得到的替换区,143为获取的目标人脸特征,144为将待匹配人物的脸部特征替换为目标人脸特征后的人物。
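A simplified OpenCV sketch of the frame-by-frame processing described above: each frame is scanned, only the region of the person to be matched is overwritten, and all other persons and background pixels pass through unchanged. It assumes the scene video frames and the target face are ordinary BGR images; is_person_to_match is a placeholder identity check, and pasting a resized face crop is only a crude stand-in for genuine facial-feature replacement.

```python
import cv2

def is_person_to_match(face_crop) -> bool:
    # Placeholder: a real system would compare the crop against the person
    # to be matched, e.g. with a face-recognition embedding.
    return True

def replace_face_in_video(scene_path: str, target_face, out_path: str) -> None:
    """Frame-by-frame sketch: detect faces, keep only the person to be matched,
    and overwrite that region with the (resized) target face."""
    cap = cv2.VideoCapture(scene_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    w = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    h = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    writer = cv2.VideoWriter(out_path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        for (x, y, fw, fh) in detector.detectMultiScale(gray, 1.1, 5):
            face_crop = frame[y:y + fh, x:x + fw]
            if not is_person_to_match(face_crop):   # other persons stay untouched
                continue
            frame[y:y + fh, x:x + fw] = cv2.resize(target_face, (fw, fh))
        writer.write(frame)                         # frames without the person pass through

    cap.release()
    writer.release()
```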
操作S150:输出待播放视频。
对待播放视频进行输出,可以是在电子设备上播放待播放视频,结合声音和画面内容,给用户呈现活灵活现的视频内容,并且待播放视频中将用户的脸部特征复现在待播放视频的人物身上,提升了用户对视频内容的代入感。
作为一种实施方式,电子设备获取交互信息后,可以在电子设备本地对交互信息进行识别,并根据交互信息获取场景视频。电子设备采集人脸信息,提取对应的目标人脸特征,对场景视频中的待匹配人物进行脸部特征的替换,以得到待播放视频。
作为一种实施方式，在电子设备与服务器建立通信连接的状态下，电子设备获取到用户输入的交互信息后，还可以将交互信息转发至服务器，由服务器通过对交互信息进行语义理解获取对应的场景视频；电子设备将获取到的人脸信息发送给服务器，由服务器对人脸信息进行人脸特征的提取获得目标人脸特征，并将场景视频中的待匹配人物的脸部特征替换为目标人脸特征，得到待播放视频，将待播放视频发送给电子设备进行播放。从而可以减小电子设备的本地运算和存储压力。
可以理解的是，操作S120和操作S130的前后顺序并不做限定，可以是在获取到交互信息后，同时进行操作S120和操作S130，也可以是在获取到用户输入的交互信息后，先执行操作S130获取用户的人脸信息并提取目标人脸特征，也可以是先执行操作S120，根据交互信息获取场景视频。在实际的执行过程中，可以根据需要进行设置，在此不做具体的限定。
本申请实施例提出的视频生成方法,电子设备获取用户输入的交互信息;根据交互信息获取场景视频,场景视频中包括待匹配人物;获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频;输出待播放视频。从而将信息通过语音和画面相结合的方法,活灵活现的展现在用户面前,同时将用户的脸复现在视频的人物上,交互更直观,增强用户对信息的代入感,从而提升了用户获取信息的体验。
请参阅图3,本申请另一实施例提供了一种视频生成方法,本实施例在前述实施例的基础上,重点描述了根据视频文本信息生成场景视频的过程,该方法可以包括:
操作S210:获取用户输入的交互信息。
本实施例中,操作S210的具体描述可以参考上一实施例中的操作S110,本实施例对此不再赘述。
操作S220:对交互信息进行语义理解,获取交互信息的语义信息。
本实施例中,针对交互信息的不同类型,电子设备可以将交互信息输入与交互信息类型对应的识别模型中,并基于识别模型对该交互信息进行识别,获取对应的语义信息。
作为一种实施方式,若用户输入的交互信息为语音信息,则电子设备可以基于语音识别模型对交互信息进行识别,获取对应的语义信息。若交互信息为文本信息,则电子设备可以基于文字识别模型对交互信息进行识别,获取对应的语义信息。若交互信息为图像信息,则电子设备可以基于图像识别模型对交互信息进行识别,获取对应的语义信息。若交互信息为动作信息,则电子设备可以基于肢体语言识别模型、终端姿态识别模型或手势识别模型来对交互信息进行识别,获取对应的语义信息。
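One way to express the modality dispatch in the paragraph above: route each kind of interaction information to its matching recognition model. The recognizer stubs and their return value are invented for illustration; only the dispatch structure is the point here.

```python
def _stub_recognizer(label: str):
    # Stand-in factory for a trained recognition model of the given modality.
    def recognize(data):
        return {"modality": label, "intent": "listen_to_story", "topic": "Aladdin"}
    return recognize

RECOGNIZERS = {
    "speech": _stub_recognizer("speech"),   # speech-recognition model
    "text":   _stub_recognizer("text"),     # text-recognition model
    "image":  _stub_recognizer("image"),    # image-recognition model
    "action": _stub_recognizer("action"),   # body-language / gesture / device-posture model
}

def semantic_info(kind: str, data) -> dict:
    """Dispatch interaction information to the recognizer matching its type."""
    if kind not in RECOGNIZERS:
        raise ValueError(f"unsupported interaction type: {kind}")
    return RECOGNIZERS[kind](data)
```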
操作S230:根据语义信息搜索相关的视频文本信息。
电子设备在获取到交互信息对应的语义信息后,可以了解到用户的真正的意图,实现更加精准的搜索,根据语义信息搜索相关的视频文本信息,可以理解的是,视频文本信息可以是指描述整个视频内容的文本信息。例如,视频为阿拉丁,那么描述整个视频内容的文本信息则为故事《阿拉丁与神灯》。
作为一种实施方式，电子设备通过对交互信息的语义理解获取语义信息，可以根据语义信息在网络上搜索相关的视频文本信息。例如，用户输入的交互信息为“听阿拉丁故事”。电子设备通过语义理解可以知道用户是想要听取阿拉丁的故事，则可以搜索与阿拉丁相关的视频文本信息，即为《阿拉丁与神灯》的故事文本。
作为一种实施方式,电子设备可以预先建立文本数据库,文本数据库中存储着多个标注后的视频文本信息,其中,标注的内容可以是场景,人物,段落等。电子设备在获取到语义信息后,则可以根据语义信息在数据库中搜索对应的视频文本信息。可以理解的是,电子设备可根据实际的需求进行视频文本信息的标注,在此不做限定。
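A toy version of the pre-built annotated text database mentioned above: each entry stores video text together with its annotations (characters, scenes), and retrieval matches the keywords obtained from semantic understanding against those annotations. The schema and field names are assumptions for illustration.

```python
from typing import Dict, List, Optional

TEXT_DB: List[Dict] = [
    {"title": "Aladdin and the Magic Lamp",
     "characters": ["Aladdin", "magician", "genie"],
     "scenes": ["street", "tunnel entrance", "palace"],
     "text": "..."},
    {"title": "Cinderella",
     "characters": ["Cinderella", "prince"],
     "scenes": ["house", "ball"],
     "text": "..."},
]

def search_video_text(semantic_keywords: List[str]) -> Optional[Dict]:
    """Return the first annotated entry whose characters or scenes overlap the keywords."""
    for entry in TEXT_DB:
        tags = {t.lower() for t in entry["characters"] + entry["scenes"]}
        if any(k.lower() in tags for k in semantic_keywords):
            return entry
    return None

# e.g. search_video_text(["Aladdin"]) returns the first entry above
```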
操作S240:根据视频文本信息生成场景视频。
电子设备在获取到视频文本信息后,则可以依据视频文本信息生成对应的场景视频,具体的,可以包括以下操作,可参阅图4所示出的方法流程图。
操作S241:对视频文本信息按照场景进行切割,获得至少一段场景文本。
通常,视频文本信息中涉及多个场景,则电子设备可以将视频文本信息按照场景进行切割,获取对应的场景文本。
作为一种实施方式，电子设备对视频文本信息进行切割，可以是预先对视频文本信息进行人工标注，其中，标注的内容可以是场景信息，人物信息，时间信息等。电子设备可根据实际的需求进行人工标注，在此不做限定。电子设备在标注完成后，可以将标注后的视频文本信息存储在数据库中，则后续可以通过查询数据库获取标注后的视频文本信息。电子设备根据视频文本信息中的标注信息，对视频文本信息进行切割，获得一段或多段场景文本。若视频文本信息是一个场景，则电子设备获得一段场景文本，若涉及多个场景，则电子设备获得多段场景文本。
例如,电子设备获取的标注后的视频文本信息中包括两个场景,其中一个场景为街道,另一个为屋内。电子设备对该视频文本信息进行切割,获取到两段场景文本。进一步的,电子设备还可以为场景文本添加场景文本在视频文本信息中的位置信息,以便于确定场景的发生顺序。
作为一种实施方式,电子设备对视频文本信息进行切割,可以是将视频文本信息输入第一深度学习模型中进行切割。可以理解的是,第一深度学习模型,可以通过大量的数据进行训练,以实现对视频文本信息按照场景进行切割,从而获取到视频文本信息按照场景切割后的至少一个场景文本。
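A sketch of annotation-based cutting, assuming scene boundaries were marked during manual annotation with a tag such as 【场景：…】 (the marker format is invented for this example). Where no annotation exists, the same interface could instead be backed by the first deep-learning model mentioned above.

```python
import re
from typing import List

SCENE_MARK = re.compile(r"【场景：[^】]*】")  # hypothetical annotation marker

def cut_by_annotation(video_text: str) -> List[str]:
    """Split annotated video text into scene texts at each scene marker."""
    pieces = [p.strip() for p in SCENE_MARK.split(video_text)]
    return [p for p in pieces if p]  # yields at least one scene text

# A trained boundary-prediction model could replace cut_by_annotation for
# unannotated text while keeping the same "text in, scene texts out" interface.
```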
操作S242:对至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频。
电子设备对视频文本信息按照场景进行切割后,可以获取到至少一段场景文本。其中,若切割后获得一段场景文本,则电子设备对该一段场景文本进行语义理解,生成对应一段场景文本的子场景视频;若获取到多个场景文本,则电子设备分别对每一段场景文本进行语义理解,生成分别对应每一段场景文本的子场景视频。
具体的,电子设备可以是对场景文本进行语义理解,从场景文本中提取语义特征,语义特征包括人物,地点,事件;将场景文本转换为语音信息;根据语义特征和语音信息,生成以人物在地点执行事件的子场景视频。
其中,子场景视频中的音频可以由场景文本转换成的音频信息;子场景视频中的画面内容可以根据语义特征中的人物,事件,地点等信息获取到。
作为一种实施方式,电子设备可以预先建立图像数据库,并为图像数据库的中每个图像添加对应的标签,则可以根据人物获取与该人物对应的图像信息,根据事件获取与该事件对应的动作,根据地点获取与该地点对应的场景,将获取的图像进行叠加合成,则可以得到以人物在地点执行事件的画面内容。
作为一种实施方式，电子设备可以是根据人物，事件，地点，在网络上搜索对应的画面内容，并将画面内容进行叠加合成，得到以人物在地点执行事件的画面内容。
例如,场景文本为“阿拉丁来到地道口,因为最上面的一级台阶离地面跨度太大,迈不上去,便请求魔法师拉他一把”。电子设备对场景文本进行语义理解,提取对应的语义特征,其中语义特征中包括人物阿拉丁和魔法师,地点为地道口,事件为阿拉丁请求魔法师拉他。
则电子设备可以获取阿拉丁和魔法师的人物形象,伸手请求拉他一把的动作,以及地道口的场景,将画面进行合成叠加,生成阿拉丁在地道口请求魔法师拉他一把的画面内容。电子设备将场景文本转换为语音信息,将画面内容和语音信息进行合成,生成子场景视频。
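The composition step in the Aladdin example could be sketched as a lookup into a tagged image database followed by back-to-front layering. The tag names and the list-of-layers result are stand-ins; a real implementation would blend actual image layers into frames and pair them with the synthesized speech.

```python
from typing import Dict, List

IMAGE_DB: Dict[str, str] = {
    "place:tunnel_entrance": "tunnel_entrance.png",
    "person:Aladdin": "aladdin.png",
    "person:magician": "magician.png",
    "action:reach_out_for_help": "reach_out.png",
}

def compose_picture(persons: List[str], action: str, place: str) -> List[str]:
    """Collect tagged assets and stack them back-to-front:
    background (place) first, then the persons, then the action pose."""
    layers = [IMAGE_DB[f"place:{place}"]]
    layers += [IMAGE_DB[f"person:{p}"] for p in persons]
    layers.append(IMAGE_DB[f"action:{action}"])
    return layers  # stand-in for the blended frame

# compose_picture(["Aladdin", "magician"], "reach_out_for_help", "tunnel_entrance")
```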
作为一种实施方式,电子设备将场景文本转换为语音信息时,若已经获取到用户的人脸信息,则可以对用户的人脸信息进行识别,识别人脸信息中人物的性别,年龄等信息,将语音信息的音色与人物进行匹配。例如,电子设备识别的人脸信息为女,年龄10岁,则可以将语音信息的音色处理为甜美型,以贴近用户的身份形象,使得用户在听到语音信息时,产生更好的代入感。
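The timbre-matching idea above reduces to a small lookup keyed by the gender and age recognized from the user's face information; the preset names and age thresholds below are assumptions for illustration only.

```python
def pick_voice_timbre(gender: str, age: int) -> str:
    """Choose a TTS voice preset that matches the recognized user."""
    if age < 14:
        return "sweet_child_female" if gender == "female" else "bright_child_male"
    if age < 40:
        return "warm_adult_female" if gender == "female" else "calm_adult_male"
    return "gentle_senior_female" if gender == "female" else "deep_senior_male"

# e.g. a recognized 10-year-old girl -> "sweet_child_female"
```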
操作S243:若生成一个子场景视频,将一个子场景视频作为场景视频。
若电子设备对视频文本信息进行切割后,获得一段场景文本,则对应该一段场景文本生成一个子场景视频,将一个子场景视频作为场景视频。
操作S244:若生成多个子场景视频,将多个子场景视频合成为场景视频。
若电子设备对视频文本信息进行切割后,获得多段场景文本,则根据每一段场景文本生成对应的多个子场景视频。电子设备将多个子场景视频按照视频文本信息的发生顺序,将多个子场景视频合成为场景视频。
作为一种实施方式,电子设备可以在生成子场景视频时,在子场景视频中添加对应的场景文本在视频文本信息中的位置信息,其中,位置信息可以是场景文本在视频文本信息中所在的段落信息。例如,场景文本在视频文本信息中的段落为第12段,则电子设备可以在生成与场景文本对应的子场景视频时,添加标注位置标注为第12段。
可以理解的是，可以在人工对场景文本进行标注时，同时标注对应的段落信息。在通过场景文本生成对应的子场景视频时，则可以获取场景文本的段落信息作为位置标注，添加进子场景视频中。
电子设备将多个子场景视频合成为场景视频,可以是获取每个子场景视频中的位置标注,按照位置标注的先后顺序对子场景视频进行拼接合成得到场景视频。例如,电子设备生成了三个子场景视频,分别为第一子场景视频,第二子场景视频,第三子场景视频。其中,第一子场景视频中的位置标注为第1段,第二子场景视频中的位置标注为第12段,第三子场景视频中的位置标注为第6段,则可以通过位置标注,确定各个子场景视频的发生顺序为第一子场景视频,第三子场景视频,第二子场景视频,则可以按照该顺序将三个子场景视频进行拼接得到场景视频。
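The splicing rule above amounts to sorting sub-scene videos by the paragraph index carried as their position annotation. A minimal sketch (the SubScene tuple is illustrative), reproducing the first/third/second ordering from the example in the preceding paragraph:

```python
from typing import List, NamedTuple

class SubScene(NamedTuple):
    paragraph: int   # position annotation: paragraph index in the video text
    clip: str        # the sub-scene video itself (a name stands in here)

def splice_in_order(sub_scenes: List[SubScene]) -> List[str]:
    """Concatenate sub-scene videos following their position annotations."""
    return [s.clip for s in sorted(sub_scenes, key=lambda s: s.paragraph)]

clips = [SubScene(1, "first"), SubScene(12, "second"), SubScene(6, "third")]
assert splice_in_order(clips) == ["first", "third", "second"]
```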
可以理解的是,根据视频文本信息生成的场景视频中,可以包括多个人物,其中一个人物可以是待匹配人物,以对待匹配的人物的脸部特征进行替换。
操作S250:获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征。
操作S260:以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频。
操作S270:输出待播放视频。
操作S250至操作S270可参照前述实施例对应部分,在此不再赘述。
本申请实施例提出视频生成方法,电子设备通过交互信息获取视频文本信息,将视频文本信息按照场景进行切割,获得至少一段场景文本;对至少一段场景文本进行语义理解,分别生成对应每一段场景的子场景视频;若生成一个子场景视频,将子场景视频作为场景视频;若生成多个子场景视频,将多个子场景视频合成为场景视频。可以将视频文本信息转换为对应的场景视频,以给用户展示活灵活现的信息内容。
请参阅图5,本申请另一实施例提供了一种视频生成方法,本实施例在前述实施例的基础上,重点描述了根据交互信息获取场景视频的过程,该方法可以包括:
操作S310:获取用户输入的交互信息。
操作S320:对交互信息进行语义理解,获取交互信息的语义信息。
操作S310至操作S320可参照前述实施例部分,在此不再赘述。
操作S330:根据语义信息搜索相关的视频文件作为场景视频。
电子设备获取到交互信息对应的语义信息后，则可以直接根据语义信息搜索相关的视频文件作为场景视频。例如，用户的交互信息为“怎么做红烧肉”，通过语义理解可以获知用户是想知道怎么做红烧肉，则搜索与做红烧肉相关的视频教程，将搜索到的视频教程作为场景视频。
电子设备在搜索相关的视频教程时,可能获取到多个视频教程,则可以根据视频的播放量以及评论量,将播放量或评论量最高的视频教程作为场景视频。可以理解的是,如何根据从搜索到的视频教程中选取场景视频可以根据实际的需求进行设置,在此不做限定。
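Choosing one scene video from several search hits by popularity, as described above, can be as simple as a keyed max; the result fields (plays, comments) are assumed, and any other ranking rule could be substituted.

```python
from typing import Dict, List, Optional

def pick_scene_video(results: List[Dict]) -> Optional[Dict]:
    """Pick the tutorial with the highest play count, breaking ties by comment count."""
    if not results:
        return None
    return max(results, key=lambda r: (r.get("plays", 0), r.get("comments", 0)))

videos = [
    {"title": "红烧肉教程 A", "plays": 120_000, "comments": 800},
    {"title": "红烧肉教程 B", "plays": 95_000, "comments": 2_100},
]
assert pick_scene_video(videos)["title"] == "红烧肉教程 A"
```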
可以理解的是,电子设备在根据语义信息进行搜索时,可以是在专门的数据库中进行搜索,也可以是通过网络进行网络查找,可根据实际的需求进行设置,在此不做限定。
操作S340:获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征。
操作S350:以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频。
操作S360:输出待播放视频。
操作S340至操作S360可参照前述实施例对应部分,在此不再赘述。
本申请实施例提出视频生成方法，电子设备通过获取用户输入的交互信息；对交互信息进行语义理解，获取交互信息的语义信息，根据语义信息搜索相关的视频文件作为场景视频，获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征；以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频；输出待播放视频。可以根据语义信息搜索相关的视频，从而通过视频的方式将信息活灵活现地展示给用户，通过对视频中待匹配人物的脸部特征进行替换，增强用户的代入感，从而提升用户获取信息的使用体验。
请参阅图6,本申请又一实施例提供了一种视频生成方法,本实施例在前述实施例的基础上,重点描述了确定场景视频中待匹配人物的过程,具体的该方法可以包括:
操作S410:获取用户输入的交互信息。
操作S420:根据交互信息获取场景视频。
操作S430:确定场景视频中的待匹配人物。
电子设备根据交互信息获取到的场景视频中，可以包括多个人物。在多个人物中，可以选择一个人物作为待匹配人物，进行脸部特征的替换。
作为一种实施方式,可以是对获取到的场景视频进行语义理解,获取整个场景视频中的主角,将所述主角作为待匹配人物,进行后续的脸部特征的替换。例如,获取到的场景视频为阿拉丁相关的视频,则可以对所述场景视频进行语义理解,获知所述场景视频中的主角为阿拉丁,则可以将阿拉丁作为待匹配人物。
具体的,在对所述场景视频进行语义理解时,可以对场景视频中每个人物出现的次数以及时长进行统计,将出现次数最多的人物作为所述场景视频的主角。例如,在一段场景视频中,出现的人物有人物A,人物B和人物C,其中,人物A出现2次,第一次出现的时长为50s,第二次出现的时长为10s;人物B出现一次,时长为10s;人物C出现1次,出现的时长为1s,结合每个人物出现的次数及时长,则可以确定人物A为该场景视频的主角。那么,人物A则可以作为所述场景视频的待匹配人物。
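The protagonist rule in the preceding paragraph (count appearances, break ties by total on-screen time) can be sketched as below, assuming a prior person-tracking step yields (person, duration) records; the sample data reproduces the A/B/C example above.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

def pick_protagonist(appearances: List[Tuple[str, float]]) -> str:
    """appearances: one (person, duration-in-seconds) record per on-screen occurrence.
    The protagonist is the person with the most occurrences, ties broken by total time."""
    count: Dict[str, int] = defaultdict(int)
    total: Dict[str, float] = defaultdict(float)
    for person, seconds in appearances:
        count[person] += 1
        total[person] += seconds
    return max(count, key=lambda p: (count[p], total[p]))

# Example from this paragraph: A appears twice (50s + 10s), B once (10s), C once (1s).
records = [("A", 50.0), ("A", 10.0), ("B", 10.0), ("C", 1.0)]
assert pick_protagonist(records) == "A"
```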
作为一种实施方式,可以是获取所述场景视频中所出现的人物,显示在所述场景视频中出现的人物,以指示用户从所显示的人物中选取指定人物,获取用户所选取的指定人物,以所述指定人物作为所述场景视频中的待匹配人物。
操作S440:获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征。
操作S450:以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频。
作为一种实施方式,电子设备可以对获取到的场景视频进行语义理解,获取整个场景视频中的主角,将主角作为待匹配人物;将待匹配人物的脸部特征替换为目标人脸特征生成待播放视频。例如,电子设备获取到的场景视频为阿拉丁相关的视频,则可以对场景视频进行语义理解,获知场景视频中的主角为阿拉丁,则可以将阿拉丁作为待匹配人物;将待匹配人物的脸部特征替换为目标人脸特征生成待播放视频。
具体的,电子设备在对场景视频进行语义理解时,可以对场景视频中每个人物出现的次数以及时长进行统计,将出现次数最多的人物作为场景视频的主角。例如,在一段场景视频中,出现的人物有人物A,人物B和人物C,其中,人物A出现2次,第一次出现的时长为50s,第二次出现的时长为10s;人物B出现一次,时长为10s;人物C出现1次,出现的时长为1s,结合每个人物出现的次数及时长,则可以确定人物A为该场景视频的主角。那么,人物A则可以作为场景视频的待匹配人物,将人物A的脸部特征替换为目标人脸特征生成待播放视频。
作为一种实施方式，电子设备可以获取场景视频中所出现的人物，显示场景视频中出现的人物，以指示用户从所显示的人物中选取指定人物，获取用户所选取的指定人物，以指定人物作为场景视频中的待匹配人物；将待匹配人物的脸部特征替换为目标人脸特征生成待播放视频。
操作S460:输出待播放视频。
操作S440至操作S460可参照前述实施例对应部分,在此不再赘述。
请参阅图7,其示出了本申请实施例提供的一种视频生成装置500,应用于电子设备,视频生成装置500包括信息输入模块510,场景视频获取模块520,人脸获取模块530,视频生成模块540以及输出模块550。
信息输入模块510，用于获取用户输入的交互信息；场景视频获取模块520，用于根据交互信息获取场景视频，场景视频中包括待匹配人物；人脸获取模块530，用于获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征；视频生成模块540，用于以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频；输出模块550，用于输出待播放视频。
场景视频获取模块520还包括:理解单元,用于对交互信息进行语义理解,获取交互信息的语义信息;视频生成单元,用于根据语义信息搜索相关的视频文本信息;根据视频文本信息生成场景视频。
视频生成单元还包括:切割子单元,用于对视频文本信息按照场景进行切割,获得至少一段场景文本;生成子单元,用于对至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频;合成子单元,用于若生成一个子场景视频,将一个子场景视频作为场景视频;若生成多个子场景视频,将多个子场景视频合成为场景视频。
生成子单元还用于从场景文本中提取语义特征，语义特征包括人物，地点，事件；将场景文本转换为语音信息；根据语义特征和语音信息，生成以人物在地点执行事件的子场景视频。
场景视频获取模块520还用于对交互信息进行语义理解,获取交互信息的语义信息;根据语义信息搜索相关的视频文件作为场景视频。
视频生成模块540还包括：确定单元，用于对场景视频进行语义理解，获取整个场景视频的主角，将主角作为场景视频中的待匹配人物；替换单元，用于将待匹配人物的脸部特征替换为目标人脸特征。
视频生成模块540还包括:显示单元,用于显示场景视频中的所有人物,以指示用户从所有人物中选取指定人物;获取用户所选取的指定人物,以指定人物作为场景视频中的待匹配人物;替换单元,用于将待匹配人物的脸部特征替换为目标人脸特征。
需要说明的是,所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述装置和模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
综上，电子设备通过获取用户输入的交互信息；根据交互信息获取场景视频，场景视频中包括待匹配人物；获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征；以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频；输出待播放视频。从而通过视频的方式将信息活灵活现地展示给用户，并将视频中的特定人物的脸部特征替换为目标人脸特征，增强用户的代入感，从而提升用户获取信息的使用体验。
在本申请所提供的几个实施例中,所显示或讨论的模块相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或模块的间接耦合或通信连接,可以是电性,机械或其它的形式。
另外,在本申请各个实施例中的各功能模块可以集成在一个处理模块中,也可以是各个模块单独物理存在,也可以两个或两个以上模块集成在一个模块中。上述集成的模块既可以采用硬件的形式实现,也可以采用软件功能模块的形式实现。
请参考图8,其示出了本申请实施例提供的一种电子设备的结构框图。该电子设备600可以是智能手机、平板电脑、电子书等能够运行应用程序的电子设备。本申请中的电子设备600可以包括一个或多个如下部件:处理器610、存储器620,以及一个或多个应用程序, 其中一个或多个应用程序可以被存储在存储器620中并被配置为由一个或多个处理器610执行,一个或多个程序配置用于实现以下操作:获取用户输入的交互信息;根据交互信息获取场景视频,场景视频中包括待匹配人物;获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频;及输出待播放视频。
进一步地,根据交互信息获取场景视频,包括:对交互信息进行语义理解,获取交互信息的语义信息;根据语义信息搜索相关的视频文本信息;及根据视频文本信息生成场景视频。
进一步地,根据视频文本信息生成场景视频,包括:对视频文本信息按照场景进行切割,获得至少一段场景文本;对至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频;若生成一个子场景视频,将一个子场景视频作为场景视频;及若生成多个子场景视频,将多个子场景视频合成为场景视频。
进一步地,对至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频,包括:从场景文本中提取语义特征,语义特征包括人物,地点,事件;将场景文本转换为语音信息;及根据语义特征和语音信息,生成以人物在地点执行事件的子场景视频。
进一步地,根据交互信息获取场景视频,包括:对交互信息进行语义理解,获取交互信息的语义信息;及根据语义信息搜索相关的视频文件作为场景视频。
进一步地,以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频,包括:对场景视频进行语义理解,获取整个场景视频的主角,将主角作为场景视频中的待匹配人物;及将待匹配人物的脸部特征替换为目标人脸特征生成待播放视频。
进一步地,以目标人脸特征替换场景视频中待匹配人物的脸部特征生成待播放视频,包括:显示场景视频中的所有人物,以指示用户从所有人物中选取指定人物;获取用户所选取的指定人物,以指定人物作为场景视频中的待匹配人物;及将待匹配人物的脸部特征替换为目标人脸特征生成待播放视频。
处理器610可以包括一个或者多个处理核。处理器610利用各种接口和线路连接整个电子设备600内的各个部分,通过运行或执行存储在存储器620内的指令、程序、代码集或指令集,以及调用存储在存储器620内的数据,执行电子设备600的各种功能和处理数据。可选地,处理器610可以采用数字信号处理(Digital Signal Processing,DSP)、现场可编程门阵列(Field-Programmable Gate Array,FPGA)、可编程逻辑阵列(Programmable Logic Array,PLA)中的至少一种硬件形式来实现。处理器610可集成中央处理器(Central Processing Unit,CPU)、图像处理器(Graphics Processing Unit,GPU)和调制解调器等中的一种或几种的组合。其中,CPU主要处理操作系统、用户界面和应用程序等;GPU用于负责显示内容的渲染和绘制;调制解调器用于处理无线通信。可以理解的是,上述调制解调器也可以不集成到处理器610中,单独通过一块通信芯片进行实现。
存储器620可以包括随机存储器(Random Access Memory,RAM),也可以包括只读存储器(Read-Only Memory)。存储器620可用于存储指令、程序、代码、代码集或指令集。存储器620可包括存储程序区和存储数据区,其中,存储程序区可存储用于实现操作系统的指令、用于实现至少一个功能的指令(比如触控功能、声音播放功能、图像播放功能等)、用于实现下述各个方法实施例的指令等。存储数据区还可以存储电子设备600在使用中所创建的数据(比如电话本、音视频数据、聊天记录数据)等。
请参考图9，其示出了本申请实施例提供的一种计算机可读存储介质的结构框图。该计算机可读存储介质700中存储有程序代码，程序代码可被处理器调用执行时，实现以下操作：获取用户输入的交互信息；根据所述交互信息获取场景视频，所述场景视频中包括待匹配人物；获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征；以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频；及输出所述待播放视频。
进一步地,所述根据所述交互信息获取场景视频,包括:对所述交互信息进行语义理解,获取所述交互信息的语义信息;根据所述语义信息搜索相关的视频文本信息;及根据所述视频文本信息生成场景视频。
进一步地,所述根据视频文本信息生成场景视频,包括:对所述视频文本信息按照场景进行切割,获得至少一段场景文本;对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频;若生成一个子场景视频,将所述一个子场景视频作为所述场景视频;及若生成多个子场景视频,将所述多个子场景视频合成为所述场景视频。
进一步地,所述对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频,包括:从所述场景文本中提取语义特征,所述语义特征包括人物,地点,事件;将所述场景文本转换为语音信息;及根据所述语义特征和所述语音信息,生成以所述人物在所述地点执行所述事件的子场景视频。
进一步地,所述根据所述交互信息获取场景视频,包括:对所述交互信息进行语义理解,获取所述交互信息的语义信息;及根据所述语义信息搜索相关的视频文件作为所述场景视频。
进一步地,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:对所述场景视频进行语义理解,获取整个场景视频的主角,将所述主角作为所述场景视频中的待匹配人物;及将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
进一步地,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:显示所述场景视频中的所有人物,以指示用户从所述所有人物中选取指定人物;获取用户所选取的指定人物,以所述指定人物作为所述场景视频中的待匹配人物;及将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
计算机可读存储介质700可以是诸如闪存、EEPROM(电可擦除可编程只读存储器)、EPROM、硬盘或者ROM之类的电子存储器。可选地,计算机可读存储介质700包括非瞬时性计算机可读介质(non-transitory computer-readable storage medium)。计算机可读存储介质700具有用于执行根据本申请方法实施例中各操作的程序代码710的存储空间。这些程序代码可以从一个或者多个计算机程序产品中读出或者写入到这一个或者多个计算机程序产品中。程序代码710可以例如以适当形式进行压缩。
最后应说明的是:以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不驱使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (28)

  1. 一种视频生成方法,所述方法包括:
    获取用户输入的交互信息;
    根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
    获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
    以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
    输出所述待播放视频。
  2. 根据权利要求1所述的方法,其特征在于,所述根据所述交互信息获取场景视频,包括:
    对所述交互信息进行语义理解,获取所述交互信息的语义信息;
    根据所述语义信息搜索相关的视频文本信息;及
    根据所述视频文本信息生成场景视频。
  3. 根据权利要求2所述的方法,其特征在于,所述根据视频文本信息生成场景视频,包括:
    对所述视频文本信息按照场景进行切割,获得至少一段场景文本;
    对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频;
    若生成一个子场景视频,将所述一个子场景视频作为所述场景视频;及
    若生成多个子场景视频,将所述多个子场景视频合成为所述场景视频。
  4. 根据权利要求3所述的方法,其特征在于,所述对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频,包括:
    从所述场景文本中提取语义特征,所述语义特征包括人物,地点,事件;
    将所述场景文本转换为语音信息;及
    根据所述语义特征和所述语音信息,生成以所述人物在所述地点执行所述事件的子场景视频。
  5. 根据权利要求1所述的方法,其特征在于,所述根据所述交互信息获取场景视频,包括:
    对所述交互信息进行语义理解,获取所述交互信息的语义信息;及
    根据所述语义信息搜索相关的视频文件作为所述场景视频。
  6. 根据权利要求1-5任一项所述的方法,其特征在于,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:
    对所述场景视频进行语义理解,获取整个场景视频的主角,将所述主角作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
  7. 根据权利要求1-5任一项所述的方法,其特征在于,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:
    显示所述场景视频中的所有人物,以指示用户从所述所有人物中选取指定人物;
    获取用户所选取的指定人物,以所述指定人物作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
  8. 一种视频生成装置,所述装置包括:
    信息输入模块,用于获取用户输入的交互信息;
    场景视频获取模块,用于根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
    人脸获取模块,用于获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
    视频生成模块,用于以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
    输出模块,用于输出所述待播放视频。
  9. 根据权利要求8所述的装置,其特征在于,所述场景视频获取模块还用于对所述交互信息进行语义理解,获取所述交互信息的语义信息;
    根据所述语义信息搜索相关的视频文本信息;及
    根据所述视频文本信息生成场景视频。
  10. 根据权利要求9所述的装置,其特征在于,所述场景视频获取模块还用于对所述视频文本信息按照场景进行切割,获得至少一段场景文本;
    对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频;
    若生成一个子场景视频,将所述一个子场景视频作为所述场景视频;及
    若生成多个子场景视频,将所述多个子场景视频合成为所述场景视频。
  11. 根据权利要求10所述的装置,其特征在于,所述场景视频获取模块还用于从所述场景文本中提取语义特征,所述语义特征包括人物,地点,事件;
    将所述场景文本转换为语音信息;及
    根据所述语义特征和所述语音信息,生成以所述人物在所述地点执行所述事件的子场景视频。
  12. 根据权利要求8所述的装置,其特征在于,所述场景视频获取模块还用于对所述交互信息进行语义理解,获取所述交互信息的语义信息;及
    根据所述语义信息搜索相关的视频文件作为所述场景视频。
  13. 根据权利要求8-12任一项所述的装置,其特征在于,所述视频生成模块还用于对所述场景视频进行语义理解,获取整个场景视频的主角,将所述主角作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
  14. 根据权利要求8-12任一项所述的装置,其特征在于,所述视频生成模块还用于显示所述场景视频中的所有人物,以指示用户从所述所有人物中选取指定人物;
    获取用户所选取的指定人物,以所述指定人物作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
  15. 一种电子设备,所述电子设备包括:
    一个或多个处理器;
    存储器,与所述一个或多个处理器电连接;
    一个或多个应用程序,其中所述一个或多个应用程序被存储在所述存储器中并被配置为由所述一个或多个处理器执行,所述一个或多个应用程序配置用于实现以下操作:
    获取用户输入的交互信息;
    根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
    获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
    以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
    输出所述待播放视频。
  16. 根据权利要求15所述的电子设备,其特征在于,所述根据所述交互信息获取场景视频,包括:
    对所述交互信息进行语义理解,获取所述交互信息的语义信息;
    根据所述语义信息搜索相关的视频文本信息;及
    根据所述视频文本信息生成场景视频。
  17. 根据权利要求16所述的电子设备,其特征在于,所述根据视频文本信息生成场景视频,包括:
    对所述视频文本信息按照场景进行切割,获得至少一段场景文本;
    对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频;
    若生成一个子场景视频,将所述一个子场景视频作为所述场景视频;及
    若生成多个子场景视频,将所述多个子场景视频合成为所述场景视频。
  18. 根据权利要求17所述的电子设备,其特征在于,所述对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频,包括:
    从所述场景文本中提取语义特征,所述语义特征包括人物,地点,事件;
    将所述场景文本转换为语音信息;及
    根据所述语义特征和所述语音信息,生成以所述人物在所述地点执行所述事件的子场景视频。
  19. 根据权利要求15所述的电子设备,其特征在于,所述根据所述交互信息获取场景视频,包括:
    对所述交互信息进行语义理解,获取所述交互信息的语义信息;及
    根据所述语义信息搜索相关的视频文件作为所述场景视频。
  20. 根据权利要求15-19任一项所述的电子设备,其特征在于,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:
    对所述场景视频进行语义理解,获取整个场景视频的主角,将所述主角作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
  21. 根据权利要求15-19任一项所述的电子设备,其特征在于,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:
    显示所述场景视频中的所有人物,以指示用户从所述所有人物中选取指定人物;
    获取用户所选取的指定人物,以所述指定人物作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
  22. 一种计算机可读取存储介质,所述计算机可读取存储介质中存储有程序代码,所述程序代码被处理器调用执行时,实现以下操作:
    获取用户输入的交互信息;
    根据所述交互信息获取场景视频,所述场景视频中包括待匹配人物;
    获取用户的人脸信息并提取对应的人脸特征作为目标人脸特征;
    以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频;及
    输出所述待播放视频。
  23. 根据权利要求22所述的计算机可读取存储介质,其特征在于,所述根据所述交互信息获取场景视频,包括:
    对所述交互信息进行语义理解,获取所述交互信息的语义信息;
    根据所述语义信息搜索相关的视频文本信息;及
    根据所述视频文本信息生成场景视频。
  24. 根据权利要求23所述的计算机可读取存储介质,其特征在于,所述根据视频文本信息生成场景视频,包括:
    对所述视频文本信息按照场景进行切割,获得至少一段场景文本;
    对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频;
    若生成一个子场景视频,将所述一个子场景视频作为所述场景视频;及
    若生成多个子场景视频,将所述多个子场景视频合成为所述场景视频。
  25. 根据权利要求24所述的计算机可读取存储介质,其特征在于,所述对所述至少一段场景文本进行语义理解,分别生成对应每一段场景文本的子场景视频,包括:
    从所述场景文本中提取语义特征,所述语义特征包括人物,地点,事件;
    将所述场景文本转换为语音信息;及
    根据所述语义特征和所述语音信息，生成以所述人物在所述地点执行所述事件的子场景视频。
  26. 根据权利要求22所述的计算机可读取存储介质,其特征在于,所述根据所述交互信息获取场景视频,包括:
    对所述交互信息进行语义理解,获取所述交互信息的语义信息;及
    根据所述语义信息搜索相关的视频文件作为所述场景视频。
  27. 根据权利要求22-26任一项所述的计算机可读取存储介质,其特征在于,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:
    对所述场景视频进行语义理解,获取整个场景视频的主角,将所述主角作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
  28. 根据权利要求22-26任一项所述的计算机可读取存储介质,其特征在于,所述以所述目标人脸特征替换所述场景视频中待匹配人物的脸部特征生成待播放视频,包括:
    显示所述场景视频中的所有人物,以指示用户从所述所有人物中选取指定人物;
    获取用户所选取的指定人物,以所述指定人物作为所述场景视频中的待匹配人物;及
    将所述待匹配人物的脸部特征替换为所述目标人脸特征生成待播放视频。
PCT/CN2020/116452 2019-12-04 2020-09-21 视频生成方法、装置、电子设备及存储介质 WO2021109678A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911228480.6 2019-12-04
CN201911228480.6A CN110968736B (zh) 2019-12-04 2019-12-04 视频生成方法、装置、电子设备及存储介质

Publications (1)

Publication Number Publication Date
WO2021109678A1 true WO2021109678A1 (zh) 2021-06-10

Family

ID=70032959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/116452 WO2021109678A1 (zh) 2019-12-04 2020-09-21 视频生成方法、装置、电子设备及存储介质

Country Status (2)

Country Link
CN (1) CN110968736B (zh)
WO (1) WO2021109678A1 (zh)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709548A (zh) * 2021-08-09 2021-11-26 北京达佳互联信息技术有限公司 基于图像的多媒体数据合成方法、装置、设备及存储介质
CN114220051A (zh) * 2021-12-10 2022-03-22 马上消费金融股份有限公司 视频处理方法、应用程序的测试方法及电子设备
CN114445896A (zh) * 2022-01-28 2022-05-06 北京百度网讯科技有限公司 视频中人物陈述内容可置信度的评估方法及装置
CN114968523A (zh) * 2022-05-24 2022-08-30 北京新唐思创教育科技有限公司 不同场景间的人物传送方法、装置、电子设备及存储介质
CN117635784A (zh) * 2023-12-19 2024-03-01 世优(北京)科技有限公司 三维数字人脸部动画自动生成系统

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110968736B (zh) * 2019-12-04 2021-02-02 深圳追一科技有限公司 视频生成方法、装置、电子设备及存储介质
CN111831854A (zh) * 2020-06-03 2020-10-27 北京百度网讯科技有限公司 视频标签的生成方法、装置、电子设备和存储介质
CN112004163A (zh) * 2020-08-31 2020-11-27 北京市商汤科技开发有限公司 视频生成方法及装置、电子设备和存储介质
CN112533069A (zh) * 2020-11-25 2021-03-19 拉扎斯网络科技(上海)有限公司 一种针对合成多媒体数据的处理方法及装置
CN113965802A (zh) * 2021-10-22 2022-01-21 深圳市兆驰股份有限公司 沉浸式视频交互方法、装置、设备和存储介质
CN114222077A (zh) * 2021-12-14 2022-03-22 惠州视维新技术有限公司 视频处理方法、装置、存储介质及电子设备
CN114827752B (zh) * 2022-04-25 2023-07-25 中国平安人寿保险股份有限公司 视频生成方法、视频生成系统、电子设备及存储介质
CN116389853B (zh) * 2023-03-29 2024-02-06 阿里巴巴(中国)有限公司 视频生成方法

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807393A (zh) * 2010-03-12 2010-08-18 青岛海信电器股份有限公司 Ktv系统及其实现方法、电视机
CN105118082A (zh) * 2015-07-30 2015-12-02 科大讯飞股份有限公司 个性化视频生成方法及系统
US20170193280A1 (en) * 2015-09-22 2017-07-06 Tenor, Inc. Automated effects generation for animated content
CN110266994A (zh) * 2019-06-26 2019-09-20 广东小天才科技有限公司 一种视频通话方法、视频通话装置及终端
CN110286756A (zh) * 2019-06-13 2019-09-27 深圳追一科技有限公司 视频处理方法、装置、系统、终端设备及存储介质
CN110968736A (zh) * 2019-12-04 2020-04-07 深圳追一科技有限公司 视频生成方法、装置、电子设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157407A1 (en) * 2007-12-12 2009-06-18 Nokia Corporation Methods, Apparatuses, and Computer Program Products for Semantic Media Conversion From Source Files to Audio/Video Files
CN102750366B (zh) * 2012-06-18 2015-05-27 海信集团有限公司 基于自然交互输入的视频搜索系统及方法
US10332311B2 (en) * 2014-09-29 2019-06-25 Amazon Technologies, Inc. Virtual world generation engine
CN108111779A (zh) * 2017-11-21 2018-06-01 深圳市朗形数字科技有限公司 一种视频处理的方法及终端设备
CN109819313B (zh) * 2019-01-10 2021-01-08 腾讯科技(深圳)有限公司 视频处理方法、装置及存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101807393A (zh) * 2010-03-12 2010-08-18 青岛海信电器股份有限公司 Ktv系统及其实现方法、电视机
CN105118082A (zh) * 2015-07-30 2015-12-02 科大讯飞股份有限公司 个性化视频生成方法及系统
US20170193280A1 (en) * 2015-09-22 2017-07-06 Tenor, Inc. Automated effects generation for animated content
CN110286756A (zh) * 2019-06-13 2019-09-27 深圳追一科技有限公司 视频处理方法、装置、系统、终端设备及存储介质
CN110266994A (zh) * 2019-06-26 2019-09-20 广东小天才科技有限公司 一种视频通话方法、视频通话装置及终端
CN110968736A (zh) * 2019-12-04 2020-04-07 深圳追一科技有限公司 视频生成方法、装置、电子设备及存储介质

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113709548A (zh) * 2021-08-09 2021-11-26 北京达佳互联信息技术有限公司 基于图像的多媒体数据合成方法、装置、设备及存储介质
CN113709548B (zh) * 2021-08-09 2023-08-25 北京达佳互联信息技术有限公司 基于图像的多媒体数据合成方法、装置、设备及存储介质
CN114220051A (zh) * 2021-12-10 2022-03-22 马上消费金融股份有限公司 视频处理方法、应用程序的测试方法及电子设备
CN114220051B (zh) * 2021-12-10 2023-07-28 马上消费金融股份有限公司 视频处理方法、应用程序的测试方法及电子设备
CN114445896A (zh) * 2022-01-28 2022-05-06 北京百度网讯科技有限公司 视频中人物陈述内容可置信度的评估方法及装置
CN114445896B (zh) * 2022-01-28 2024-04-05 北京百度网讯科技有限公司 视频中人物陈述内容可置信度的评估方法及装置
CN114968523A (zh) * 2022-05-24 2022-08-30 北京新唐思创教育科技有限公司 不同场景间的人物传送方法、装置、电子设备及存储介质
CN117635784A (zh) * 2023-12-19 2024-03-01 世优(北京)科技有限公司 三维数字人脸部动画自动生成系统
CN117635784B (zh) * 2023-12-19 2024-04-19 世优(北京)科技有限公司 三维数字人脸部动画自动生成系统

Also Published As

Publication number Publication date
CN110968736B (zh) 2021-02-02
CN110968736A (zh) 2020-04-07

Similar Documents

Publication Publication Date Title
WO2021109678A1 (zh) 视频生成方法、装置、电子设备及存储介质
WO2020063319A1 (zh) 动态表情生成方法、计算机可读存储介质和计算机设备
US20240107127A1 (en) Video display method and apparatus, video processing method, apparatus, and system, device, and medium
EP3352438A1 (en) User terminal device for recommending response message and method therefor
EP3195601B1 (en) Method of providing visual sound image and electronic device implementing the same
WO2021083125A1 (zh) 通话控制方法及相关产品
US20120276504A1 (en) Talking Teacher Visualization for Language Learning
JP2011217197A (ja) 電子機器、再生制御システム、再生制御方法及びプログラム
KR101123370B1 (ko) 휴대단말용 객체기반 콘텐츠 제공방법 및 장치
JP2016038601A (ja) Cgキャラクタ対話装置及びcgキャラクタ対話プログラム
JP2019101754A (ja) 要約装置及びその制御方法、要約システム、プログラム
JP2014146066A (ja) 文書データ生成装置、文書データ生成方法及びプログラム
WO2019085625A1 (zh) 表情图片推荐方法及设备
WO2018177134A1 (zh) 用户生成内容处理方法、存储介质和终端
US9697632B2 (en) Information processing apparatus, information processing method, and program
JP2012178028A (ja) アルバム作成装置、アルバム作成装置の制御方法、及びプログラム
CN113391745A (zh) 网络课程的重点内容处理方法、装置、设备及存储介质
WO2023160288A1 (zh) 会议纪要生成方法、装置、电子设备和可读存储介质
JP2008083672A (ja) 表情影像を表示する方法
US11532111B1 (en) Systems and methods for generating comic books from video and images
JP2017045374A (ja) 情報処理装置及びプログラム
KR102281298B1 (ko) 인공지능 기반 동영상 합성을 위한 시스템 및 방법
JP2019101751A (ja) 情報提示装置、情報提示システム、情報提示方法およびプログラム
CN113709521A (zh) 一种根据视频内容自动匹配背景的系统
WO2021062757A1 (zh) 同声传译方法、装置、服务器和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20896402

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 091122)

122 Ep: pct application non-entry in european phase

Ref document number: 20896402

Country of ref document: EP

Kind code of ref document: A1