WO2021244457A1 - Video generation method and related apparatus - Google Patents

Video generation method and related apparatus (一种视频生成方法及相关装置)

Info

Publication number
WO2021244457A1
WO2021244457A1; PCT/CN2021/097047; CN2021097047W
Authority
WO
WIPO (PCT)
Prior art keywords
video
pictures
picture
information
keywords
Prior art date
Application number
PCT/CN2021/097047
Other languages
English (en)
French (fr)
Inventor
邵滨
岳俊
钱莉
许松岑
黄雪妍
刘亚娇
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP21818681.5A (EP4149109A4)
Publication of WO2021244457A1
Priority to US18/070,689 (US20230089566A1)

Classifications

    • H04N21/80 Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • H04N5/265 Studio circuits for mixing, switching-over or special effects; Mixing
    • G06F18/214 Pattern recognition; Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06Q10/10 Administration; Management; Office automation; Time management
    • G06Q50/01 ICT specially adapted for implementation of business processes of specific business sectors; Social networking
    • G06T7/0002 Image analysis; Inspection of images, e.g. flaw detection
    • G06V10/44 Extraction of image or video features; Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V20/30 Scenes; Scene-specific elements in albums, collections or shared content, e.g. social network photos or video
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G06V40/172 Human faces, e.g. facial parts, sketches or expressions; Classification, e.g. identification
    • H04L51/10 User-to-user messaging in packet-switching networks characterised by the inclusion of specific contents; Multimedia information
    • H04L51/52 User-to-user messaging in packet-switching networks for supporting social networking services
    • H04L51/56 Unified messaging, e.g. interactions between e-mail, instant messaging or converged IP messaging [CPM]
    • H04L51/222 Monitoring or handling of messages using geographical location information, e.g. messages transmitted or received in proximity of a certain spot or area
    • G06T2207/30168 Indexing scheme for image analysis or image enhancement; Image quality inspection
    • G06T2207/30201 Indexing scheme for image analysis or image enhancement; Human being, person; Face
    • Y02A90/10 Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a video generation method and related devices.
  • Status sharing is widely used in today's new-media society: it lets others understand the user and promotes communication between people, for example WeChat location sharing, text status sharing, and Douyin video sharing.
  • Rich status sharing can promote the healthy development of social platforms and enhance the user's experience of life and friendship.
  • However, sharing only a single piece of geographic location information, a piece of text, or a picture conveys little information to others and cannot satisfy both visual and auditory needs. To meet users' visual and auditory needs, a video can be shot before sharing.
  • Shooting a video before sharing it is inconvenient: it takes the user time to shoot manually, and the quality and content of the captured video are easily limited by the user's shooting skill and shooting conditions. Directly synthesizing a video from pictures, on the other hand, is limited to a slideshow-style switching display and lacks rich content.
  • the embodiments of the present application provide a video generation method and related devices, which can generate videos based on text and pictures, so that users can share their life status in real time.
  • In a first aspect, an embodiment of the present application provides a video generation method, which may include: receiving a video generation instruction, and obtaining text information and picture information in response to the video generation instruction, where the text information includes one or more keywords and the picture information includes N pictures, N being a positive integer greater than or equal to 1; obtaining, according to the one or more keywords, the image features in each of the N pictures that correspond to the one or more keywords; and inputting the one or more keywords and the image features of the N pictures into a target generator network to generate a target video, where the target video includes M pictures generated based on the image features of the N pictures and corresponding to the one or more keywords, M being a positive integer greater than 1.
  • the electronic device can automatically generate a video based on the text information and the picture information, so that the user can share their life status in real time.
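  • As a rough, non-authoritative illustration of this flow (and not the implementation disclosed in this application), the Python sketch below wires a keyword encoder, a down-sampling picture encoder, and a frame generator together; all module names, dimensions, and the pooling over the N pictures are assumptions made only for the example, and the networks are untrained.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only; the application does not fix any of these.
TEXT_DIM, IMG_DIM, FRAME_SIZE, NUM_FRAMES = 64, 128, 32, 16  # NUM_FRAMES plays the role of M

class KeywordEncoder(nn.Module):
    """Maps keyword token ids to a single text feature vector."""
    def __init__(self, vocab_size=1000, dim=TEXT_DIM):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)

    def forward(self, keyword_ids):                  # (num_keywords,)
        return self.embed(keyword_ids).mean(dim=0)   # (TEXT_DIM,)

class PictureEncoder(nn.Module):
    """Down-sampling conv network: N pictures -> one pooled image feature."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, IMG_DIM, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())

    def forward(self, pictures):                     # (N, 3, H, W)
        return self.net(pictures).mean(dim=0)        # (IMG_DIM,)

class FrameGenerator(nn.Module):
    """Generates M frames from the joint text + image conditioning vector."""
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(TEXT_DIM + IMG_DIM, NUM_FRAMES * 3 * FRAME_SIZE * FRAME_SIZE)

    def forward(self, cond):                         # (TEXT_DIM + IMG_DIM,)
        frames = torch.tanh(self.fc(cond))
        return frames.view(NUM_FRAMES, 3, FRAME_SIZE, FRAME_SIZE)  # (M, 3, H, W)

def generate_video(keyword_ids, pictures):
    """keywords + N pictures -> M generated frames (untrained toy modules)."""
    text_feat = KeywordEncoder()(keyword_ids)
    img_feat = PictureEncoder()(pictures)
    return FrameGenerator()(torch.cat([text_feat, img_feat]))

video = generate_video(torch.tensor([3, 17, 42]), torch.randn(2, 3, 64, 64))
print(video.shape)  # torch.Size([16, 3, 32, 32])
```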
  • After the electronic device receives the video generation instruction, it can obtain the text information and the picture information in response to that instruction, where the text information includes one or more keywords and the picture information includes N pictures.
  • The text information can be used to describe the video content of the generated video (for example, the one or more keywords may include people, time, places, events, or actions), and the picture information can be used to extract or generate each frame of the video. Therefore, the image features corresponding to the one or more keywords can be extracted from each of the N pictures according to the keywords, and the one or more keywords and the image features of the N pictures are then input into the target generator network to generate the target video.
  • Using text and pictures together to generate a video allows the generated video to adjust the input picture information according to the input text information, which greatly enriches the video content. It avoids the situation on existing terminal devices where a video produced by directly stacking multiple pictures is confined to a slideshow-style switching display that lacks rich content, and it better meets users' needs.
  • The acquiring of text information in response to the video generation instruction may include: in response to the video generation instruction, obtaining the text information from one or more of text input information, voice input information, user preference information, user physiological data information, and current environment information, where the current environment information includes one or more of current weather information, current time information, and current geographic location information.
  • In response to the video generation instruction, the electronic device can obtain information the user entered explicitly (text input or voice input), obtain current environment information through the sensors on the device, or use user preference information extracted from historical interaction information, extract the text information from these sources, and generate the target video together with the acquired picture information.
  • The generated video can thereby reflect the user's current state; for example, the weather environment in the generated video can match the weather environment the user is currently in.
  • The multi-modal information may include text, preference information, environment information, and so on.
  • The user can even rely solely on the current environment information obtained by the sensors, or on the preferences extracted from historical interaction information, as the input text information and generate the target video together with the input picture information; a sketch of how such keywords might be assembled is shown below.
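  • A minimal sketch of how such multi-modal inputs might be merged into a keyword list; the `EnvironmentInfo` fields and the `collect_keywords` helper are hypothetical names introduced only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class EnvironmentInfo:
    weather: Optional[str] = None      # e.g. from a weather service
    time_of_day: Optional[str] = None  # e.g. "evening"
    location: Optional[str] = None     # e.g. from GPS reverse geocoding

def collect_keywords(typed_text: str = "",
                     spoken_text: str = "",
                     preferences: List[str] = (),
                     env: EnvironmentInfo = EnvironmentInfo()) -> List[str]:
    """Merge explicit user input with sensor- and preference-derived terms.

    If the user typed or said nothing, the environment and preference
    terms alone can still serve as the text information.
    """
    keywords: List[str] = []
    for source in (typed_text, spoken_text):
        keywords.extend(word for word in source.split() if word)
    keywords.extend(preferences)
    keywords.extend(value for value in (env.weather, env.time_of_day, env.location) if value)
    return list(dict.fromkeys(keywords))  # de-duplicate, keep order

print(collect_keywords(typed_text="playing football",
                       preferences=["cats"],
                       env=EnvironmentInfo(weather="sunny", location="Forbidden City")))
# ['playing', 'football', 'cats', 'sunny', 'Forbidden City']
```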
  • The obtaining of picture information in response to the video generation instruction may include: in response to the video generation instruction, obtaining, from a plurality of pre-stored pictures, pictures corresponding to at least one of the one or more keywords.
  • In this way, the electronic device can obtain picture information related to the text information. For example, pictures of the corresponding place can be obtained according to the current geographic location information or location information input by the user and used to generate the target video; when a user visits the Forbidden City, pictures related to the Forbidden City can be obtained and used to synthesize the target video, so that the user can share his or her life status in real time.
  • Pictures of the corresponding person can also be obtained according to person information input by the user and used to generate the target video, meeting the user's needs. For example, if the user enters "Xiao Ming is playing football in the playground", at least one picture related to the keywords "Xiao Ming", "playground", and "football" can be obtained and used to synthesize the target video.
  • The video generation instruction may include a face recognition request; the obtaining of picture information in response to the video generation instruction then includes: in response to the video generation instruction, performing face recognition to obtain a face recognition result, and obtaining, from a plurality of pre-stored pictures according to the face recognition result, at least one picture that matches the face recognition result.
  • When the electronic device obtains picture information, it may first obtain a face recognition result through face recognition and then, according to that result, directly retrieve pictures containing the recognised user from the pre-stored pictures, making it convenient to generate a video about the user's status and share that status in time. For example, after user A is identified through face recognition, user A's pictures can be obtained from the pre-stored pictures and a video about user A can be generated without the user having to filter pictures manually, which simplifies operation and improves the user experience; a minimal retrieval sketch is shown below.
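  • A minimal retrieval sketch, assuming face embeddings produced by whatever face recognition model is available on the device and a cosine-similarity match; the embedding size and the 0.8 threshold are illustrative assumptions.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def pictures_matching_face(query_embedding: np.ndarray,
                           gallery: dict,           # picture path -> face embedding
                           threshold: float = 0.8) -> list:
    """Return the pre-stored pictures whose face embedding matches the recognised user."""
    return [path for path, embedding in gallery.items()
            if cosine(query_embedding, embedding) >= threshold]

# Toy example with 4-dimensional "embeddings".
gallery = {"a.jpg": np.array([1.0, 0.0, 0.0, 0.0]),
           "b.jpg": np.array([0.0, 1.0, 0.0, 0.0])}
print(pictures_matching_face(np.array([0.9, 0.1, 0.0, 0.0]), gallery))  # ['a.jpg']
```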
  • The video generation instruction may include at least one picture tag, each picture tag corresponding to at least one picture among a plurality of pre-stored pictures; the obtaining of picture information in response to the video generation instruction then includes: in response to the video generation instruction and according to the at least one picture tag, obtaining, from the plurality of pre-stored pictures, at least one picture corresponding to each of the at least one picture tag.
  • When the electronic device obtains picture information, it can thus obtain the corresponding pictures through the picture tags carried in the video generation instruction and use them to generate the target video.
  • In this way the user can directly filter out the pictures he or she is interested in or needs for the video, meeting the user's viewing needs. For example, the user can select the picture tag "cat": after multiple pictures of the cat are obtained, a dynamic video with the cat as the protagonist is generated together with the text information. As another example, the user can select the picture tag "childhood Xiao Ming": after multiple pictures of Xiao Ming's childhood are obtained, a dynamic video about Xiao Ming's childhood is generated together with the text information. A tag-lookup sketch follows.
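  • A minimal tag-lookup sketch, assuming the pre-stored pictures carry a simple picture-to-tags mapping; this data structure is an assumption for the example, not part of the disclosure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

def build_tag_index(picture_tags: Dict[str, Iterable[str]]) -> Dict[str, Set[str]]:
    """Invert a picture -> tags mapping into a tag -> pictures index."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for picture, tags in picture_tags.items():
        for tag in tags:
            index[tag].add(picture)
    return index

def pictures_for_tags(index: Dict[str, Set[str]], requested: List[str]) -> Set[str]:
    """Union of all pictures carrying any of the requested tags."""
    found: Set[str] = set()
    for tag in requested:
        found |= index.get(tag, set())
    return found

index = build_tag_index({"img1.jpg": ["cat"], "img2.jpg": ["cat", "garden"],
                         "img3.jpg": ["childhood"]})
print(pictures_for_tags(index, ["cat"]))  # {'img1.jpg', 'img2.jpg'}
```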
  • the picture quality of each of the acquired N pictures is greater than a preset threshold.
  • Before a picture is selected, its picture quality is scored; when a picture's quality score is lower than the preset threshold, the picture is not used to generate the video.
  • Because only pictures whose quality exceeds the preset threshold are used, the target video finally generated from the pictures has high picture quality, which satisfies the user's viewing experience.
  • The method may further include: performing picture quality scoring on the acquired N pictures to obtain a picture quality scoring result for each of the N pictures; performing picture quality enhancement processing on the pictures whose scoring result is less than the preset threshold; and updating the N pictures with the quality-enhanced pictures.
  • After the picture information is obtained, all of the obtained pictures are scored for picture quality; when a picture's quality is poor, quality enhancement can be applied to it so that the quality of the video generated from the pictures improves and the user's viewing experience is satisfied. One possible scoring-and-enhancement sketch is given below.
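  • One possible scoring-and-enhancement sketch, assuming OpenCV is available, using the variance of the Laplacian (a sharpness measure) as a stand-in quality score and unsharp masking as a stand-in enhancement; the application does not prescribe either technique or the threshold value.

```python
import cv2
import numpy as np

def quality_score(img_bgr: np.ndarray) -> float:
    """Stand-in quality score: variance of the Laplacian (higher = sharper)."""
    gray = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2GRAY)
    return float(cv2.Laplacian(gray, cv2.CV_64F).var())

def enhance(img_bgr: np.ndarray) -> np.ndarray:
    """Stand-in quality enhancement: unsharp masking."""
    blurred = cv2.GaussianBlur(img_bgr, (0, 0), sigmaX=3)
    return cv2.addWeighted(img_bgr, 1.5, blurred, -0.5, 0)

def score_and_enhance(pictures, threshold=100.0):
    """Keep pictures scoring above the threshold; enhance the others before use."""
    return [img if quality_score(img) >= threshold else enhance(img) for img in pictures]
```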
  • The inputting of the one or more keywords and the image features of the N pictures into the target generator network to generate the target video may include: extracting, for each of the one or more keywords, the first spatial variable corresponding to that keyword in a vector space; extracting the second spatial variables corresponding to the image features of the N pictures in the vector space; and inputting the first spatial variables and the second spatial variables into the target generator network to generate the target video.
  • That is, the first spatial variables corresponding to the text information and the second spatial variables for the image features of each picture in the picture information may be extracted separately, where a first spatial variable may be a word vector representing the text information in the latent vector space and a second spatial variable may be a vector representing the image features of a picture in the latent vector space; extracting these spatial vectors helps the generator network generate the target video better.
  • The first spatial variable corresponding to each of the one or more keywords in the vector space may be extracted through a Word2vec model, and the image features of the N pictures (the second spatial variables) may be extracted through a down-sampling convolutional network, as sketched below.
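  • A minimal sketch of the first spatial variables, assuming gensim (version 4 or later) for the Word2vec lookup and a toy training corpus; the second spatial variables would come from a down-sampling convolutional encoder such as the PictureEncoder shown in the earlier sketch.

```python
import numpy as np
from gensim.models import Word2Vec  # gensim >= 4 assumed

# Toy corpus standing in for the keyword vocabulary; in practice the model
# would be trained (or pre-trained) on a much larger text corpus.
corpus = [["xiao", "ming", "playing", "football", "playground"],
          ["cat", "sleeping", "garden"]]
w2v = Word2Vec(sentences=corpus, vector_size=32, window=2, min_count=1, epochs=50)

def first_spatial_variables(keywords):
    """One latent-space word vector per keyword (the 'first spatial variables')."""
    return np.stack([w2v.wv[word] for word in keywords if word in w2v.wv])

text_vectors = first_spatial_variables(["playground", "football"])
print(text_vectors.shape)  # (2, 32)
# The 'second spatial variables' (per-picture image features) would be produced by
# a down-sampling convolutional network; both sets of vectors are then fed to the
# target generator network.
```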
  • To implement the embodiments of this application, the generator network and the discriminator network are first trained with sample data. The method may therefore further include: obtaining sample text information, sample picture information, and a real video data set, and constructing a video-based generator network and a discriminator network; inputting the sample text information and the sample picture information into the generator network to generate a sample video; using the sample video and the real video data set as the input of the discriminator network to obtain a discrimination loss result, where the discrimination loss result is 1 when the input video belongs to the real video data set; and training the generator network according to the discrimination loss result to obtain the target generator network.
  • In other words, the electronic device first generates a video with the generator network from the sample data and then feeds the pair <generated video, real video> into the discriminator; the discriminator judges the source of its input, outputting 0 (false) when the input comes from the generated video and 1 (true) otherwise.
  • Through this adversarial training, the content of the generated video is progressively regularized, and both the realism and the quality of the generated video improve, which is conducive to video sharing. A minimal training-loop sketch follows.
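  • A minimal adversarial-training sketch consistent with the loss described above (real video labelled 1, generated video labelled 0); the toy generator and discriminator below are stand-ins whose internal structure is an assumption made only for the example.

```python
import torch
import torch.nn as nn

class ToyGenerator(nn.Module):
    """Stand-in generator: maps a conditioning vector to a flat 'video'."""
    def __init__(self, cond_dim=8, video_dim=48):
        super().__init__()
        self.fc = nn.Linear(cond_dim, video_dim)

    def forward(self, cond):
        return torch.tanh(self.fc(cond))

class ToyDiscriminator(nn.Module):
    """Stand-in discriminator: probability that its input video is real."""
    def __init__(self, video_dim=48):
        super().__init__()
        self.fc = nn.Linear(video_dim, 1)

    def forward(self, video):
        return torch.sigmoid(self.fc(video))

G, D = ToyGenerator(), ToyDiscriminator()
opt_G = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_D = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()
real_label, fake_label = torch.ones(1), torch.zeros(1)

cond = torch.randn(8)         # stands in for the <sample text, sample picture> features
real_video = torch.randn(48)  # a sample drawn from the real video data set

for step in range(100):
    # Discriminator update: real videos -> 1, generated videos -> 0.
    opt_D.zero_grad()
    fake_video = G(cond)
    loss_D = bce(D(real_video), real_label) + bce(D(fake_video.detach()), fake_label)
    loss_D.backward()
    opt_D.step()

    # Generator update: try to make the discriminator output 1 for generated videos.
    opt_G.zero_grad()
    loss_G = bce(D(G(cond)), real_label)
    loss_G.backward()
    opt_G.step()
```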
  • In a second aspect, an embodiment of the present application provides a video generation device, including:
  • a receiving response unit, configured to receive a video generation instruction and obtain text information and picture information in response to the video generation instruction, where the text information includes one or more keywords, the picture information includes N pictures, and N is a positive integer greater than or equal to 1;
  • an extraction unit, configured to obtain, according to the one or more keywords, the image features in each of the N pictures that correspond to the one or more keywords; and
  • a generating unit, configured to input the one or more keywords and the image features of the N pictures into a target generator network to generate a target video, where the target video includes M pictures generated based on the image features of the N pictures and corresponding to the one or more keywords, M being a positive integer greater than 1.
  • The receiving response unit may be specifically configured to: in response to the video generation instruction, obtain the text information from one or more of text input information, voice input information, user preference information, user physiological data information, and current environment information.
  • The receiving response unit may be specifically configured to: in response to the video generation instruction, obtain, from a plurality of pre-stored pictures, pictures corresponding to at least one of the one or more keywords.
  • The video generation instruction may include a face recognition request; the receiving response unit is then specifically configured to: in response to the video generation instruction, perform face recognition to obtain a face recognition result, and obtain, from a plurality of pre-stored pictures according to the face recognition result, at least one picture that matches the face recognition result.
  • The video generation instruction may include at least one picture tag, each picture tag corresponding to at least one picture among a plurality of pre-stored pictures; the receiving response unit is then specifically configured to: in response to the video generation instruction and according to the at least one picture tag, obtain, from the plurality of pre-stored pictures, at least one picture corresponding to each of the at least one picture tag.
  • the picture quality of each of the acquired N pictures is greater than a preset threshold.
  • the device further includes: a scoring unit, configured to score the acquired N pictures on picture quality, and obtain a picture quality scoring result corresponding to each of the N pictures;
  • the enhancement unit is configured to perform picture quality enhancement processing on pictures whose picture quality scoring result is less than a preset threshold, and update the pictures whose picture quality has been enhanced to the N pictures.
  • The generating unit may be specifically configured to: extract the first spatial variable corresponding to each of the one or more keywords in the vector space; extract the second spatial variables corresponding to the image features of the N pictures in the vector space; and input the first spatial variables and the second spatial variables into the target generator network to generate the target video.
  • The device may further include a training unit configured to: obtain sample text information, sample picture information, and a real video data set, and construct a video-based generator network and a discriminator network; input the sample text information and the sample picture information into the generator network to generate a sample video; use the sample video and the real video data set as the input of the discriminator network to obtain a discrimination loss result, where the discrimination loss result is 1 when the input video belongs to the real video data set; and train the generator network according to the discrimination loss result to obtain the target generator network.
  • An embodiment of the present application further provides an electronic device. The terminal device includes a processor, and the processor is configured to support the terminal device in implementing the corresponding functions of the video generation method provided in the first aspect.
  • the terminal device may also include a memory, which is used for coupling with the processor, and stores the necessary program instructions and data of the terminal device.
  • the terminal device may also include a communication interface for the network device to communicate with other devices or a communication network.
  • An embodiment of the present application further provides a computer storage medium for storing the computer software instructions used by the video generation device provided in the second aspect above; the storage medium includes a program designed to execute the above aspects.
  • An embodiment of the present application further provides a computer program that includes instructions; when the computer program is executed by a computer, the computer can perform the processes performed by the video generation device in the second aspect.
  • The present application further provides a chip system that includes a processor for supporting a terminal device in implementing the functions involved in the first aspect, for example, generating or processing the information involved in the video generation method described above.
  • the chip system further includes a memory, and the memory is used to store program instructions and data necessary for the data sending device.
  • the chip system can be composed of chips, or include chips and other discrete devices.
  • FIG. 1A is a schematic structural diagram of an electronic device 100 provided by an embodiment of the present application.
  • FIG. 1B is a software structure block diagram of an electronic device 100 provided by an embodiment of the present application.
  • FIG. 2A is a schematic diagram of a set of user interfaces for receiving video generation instructions provided by an embodiment of the present application.
  • FIG. 2B is a schematic diagram of a set of user interfaces for obtaining picture information provided by an embodiment of the present application.
  • FIG. 2C is a schematic diagram of picture quality scoring provided by an embodiment of the present application.
  • FIG. 2D is a schematic diagram of a user interface for displaying text information provided by an embodiment of the present application.
  • FIG. 2E is a set of user interfaces for sharing a generated video with friends, provided by an embodiment of the present application.
  • FIG. 2F is a schematic diagram of a generator network training process provided by an embodiment of the present application.
  • FIG. 2G is a schematic diagram of a video generation process provided by an embodiment of the present application.
  • FIG. 2H is a set of user interfaces for obtaining text information provided by an embodiment of the present application.
  • FIG. 2I is a set of user interfaces for obtaining picture information based on keywords provided by an embodiment of the present application.
  • FIG. 2J is a user interface for generating a video provided by an embodiment of the present application.
  • FIG. 2K is a schematic diagram of a video generation process based on user preferences provided by an embodiment of the present application.
  • FIG. 2L is another set of user interfaces for obtaining text information provided by an embodiment of the present application.
  • FIG. 2M is another user interface for video generation provided by an embodiment of the present application.
  • FIG. 3A is a schematic structural diagram of a video generation device provided by an embodiment of the present application.
  • FIG. 3B is a schematic flowchart of a video generation method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of another video generation device provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of another video generation device provided by an embodiment of the present application.
  • The term "component" used in this specification is used to denote a computer-related entity: hardware, firmware, a combination of hardware and software, software, or software in execution.
  • the component may be, but is not limited to, a process, a processor, an object, an executable file, a thread of execution, a program, and/or a computer running on a processor.
  • the application running on the computing device and the computing device can be components.
  • One or more components may reside in processes and/or threads of execution, and components may be located on one computer and/or distributed between two or more computers.
  • these components can be executed from various computer readable media having various data structures stored thereon.
  • For example, a component may communicate through local and/or remote processes based on a signal having one or more data packets (for example, data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems through signals).
  • A recurrent neural network (RNN) is a type of neural network that takes sequence data as input, recurses along the evolution direction of the sequence, and has all of its nodes (recurrent units) connected in a chain.
  • Recurrent neural networks have memory, parameter sharing, and Turing completeness, so they have certain advantages when learning the nonlinear characteristics of a sequence.
  • Recurrent neural networks are applied in Natural Language Processing (NLP), for example in speech recognition, language modeling, and machine translation, and are also used in various kinds of time series forecasting.
  • A recurrent neural network constructed by introducing a convolutional neural network (CNN) can handle computer vision problems involving sequence input.
  • Gesture recognition aims to recognize human physical movements or "gestures”, which can be based on recognizing human movements as input forms. Gesture recognition is also classified as a non-contact user interface. Unlike touch screen devices, devices with non-contact user interfaces can be controlled without touch. The device can have one or more sensors or cameras to monitor the user's movement. When it detects the movement corresponding to the command, it will respond with the appropriate output. For example, waving your hand in a specific pattern in front of the device may tell it to launch a specific application.
  • The Word2vec model is a group of related models used to generate word vectors. These models are shallow, two-layer neural networks trained to reconstruct the linguistic contexts of words: the network takes words as input and learns to guess the words in adjacent positions. Under the bag-of-words assumption in word2vec, the order of the words is not important. After training, the word2vec model can be used to map each word to a vector, which can be used to represent relationships between words.
  • FIG. 1A is a schematic structural diagram of an electronic device 100 provided by an embodiment of the present application.
  • The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and so on.
  • the sensor module 180 can include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light Sensor 180L, bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100.
  • the electronic device 100 may include more or fewer components than shown, or combine certain components, or split certain components, or arrange different components.
  • the illustrated components can be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
  • For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), among others.
  • the different processing units may be independent devices or integrated in one or more processors.
  • the controller may be the nerve center and command center of the electronic device 100.
  • the controller can generate operation control signals according to the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • a memory may also be provided in the processor 110 to store instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory can store instructions or data that have just been used or recycled by the processor 110. If the processor 110 needs to use the instruction or data again, it can be directly called from the memory. Repeated accesses are avoided, the waiting time of the processor 110 is reduced, and the efficiency of the system is improved.
  • In the embodiments of the present application, the processor may: receive a video generation instruction, and obtain text information and picture information in response to the video generation instruction, where the text information includes one or more keywords and the picture information includes N pictures; obtain, according to the one or more keywords, the image features in each of the N pictures that correspond to the one or more keywords; and input the one or more keywords and the image features of the N pictures into the target generator network to generate a target video, where the target video includes M pictures generated based on the image features of the N pictures and corresponding to the one or more keywords.
  • the processor 110 may include one or more interfaces.
  • The interface may include an inter-integrated circuit (I2C) interface, an inter-integrated circuit sound (I2S) interface, a pulse code modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a mobile industry processor interface (MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (SIM) interface, and/or a universal serial bus (USB) interface, among others.
  • the interface connection relationship between the modules illustrated in the embodiment of the present application is merely a schematic description, and does not constitute a structural limitation of the electronic device 100.
  • the electronic device 100 may also adopt different interface connection modes in the foregoing embodiments, or a combination of multiple interface connection modes.
  • the charging management module 140 is used to receive charging input from the charger.
  • the charger can be a wireless charger or a wired charger.
  • the charging management module 140 may receive the charging input of the wired charger through the USB interface 130.
  • the charging management module 140 may receive the wireless charging input through the wireless charging coil of the electronic device 100. While the charging management module 140 charges the battery 142, it can also supply power to the electronic device through the power management module 141.
  • the power management module 141 is used to connect the battery 142, the charging management module 140 and the processor 110.
  • the power management module 141 receives input from the battery 142 and/or the charging management module 140, and supplies power to the processor 110, the internal memory 121, the external memory, the display screen 194, the camera 193, and the wireless communication module 160.
  • the power management module 141 can also be used to monitor parameters such as battery capacity, battery cycle times, and battery health status (leakage, impedance).
  • the power management module 141 may also be provided in the processor 110.
  • the power management module 141 and the charging management module 140 may also be provided in the same device.
  • the wireless communication function of the electronic device 100 can be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, and the baseband processor.
  • the antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the electronic device 100 can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • antenna 1 can be multiplexed as a diversity antenna of a wireless local area network.
  • the antenna can be used in combination with a tuning switch.
  • the mobile communication module 150 may provide a wireless communication solution including 2G/3G/4G/5G and the like applied to the electronic device 100.
  • the mobile communication module 150 may include at least one filter, a switch, a power amplifier, a low noise amplifier (LNA), and the like.
  • the mobile communication module 150 can receive electromagnetic waves by the antenna 1, and perform processing such as filtering, amplifying and transmitting the received electromagnetic waves to the modem processor for demodulation.
  • the mobile communication module 150 can also amplify the signal modulated by the modem processor, and convert it into electromagnetic wave radiation via the antenna 1.
  • at least part of the functional modules of the mobile communication module 150 may be provided in the processor 110.
  • at least part of the functional modules of the mobile communication module 150 and at least part of the modules of the processor 110 may be provided in the same device.
  • the modem processor may include a modulator and a demodulator.
  • the modem processor may be an independent device.
  • the modem processor may be independent of the processor 110 and be provided in the same device as the mobile communication module 150 or other functional modules.
  • The wireless communication module 160 can provide wireless communication solutions applied to the electronic device 100, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), the global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR), and the like.
  • the wireless communication module 160 may be one or more devices integrating at least one communication processing module.
  • the wireless communication module 160 receives electromagnetic waves via the antenna 2, frequency modulates and filters the electromagnetic wave signals, and sends the processed signals to the processor 110.
  • the wireless communication module 160 may also receive a signal to be sent from the processor 110, perform frequency modulation, amplify it, and convert it into electromagnetic waves to radiate through the antenna 2.
  • the antenna 1 of the electronic device 100 is coupled with the mobile communication module 150, and the antenna 2 is coupled with the wireless communication module 160, so that the electronic device 100 can communicate with the network and other devices through wireless communication technology.
  • The wireless communication technology may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), BT, GNSS, WLAN, NFC, FM, and/or IR technology, etc.
  • The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
  • the electronic device 100 implements a display function through a GPU, a display screen 194, an application processor, and the like.
  • the GPU is a microprocessor for image processing, connected to the display 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations and is used for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, and the like.
  • the display screen 194 includes a display panel.
  • The display panel may adopt a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), and so on.
  • the electronic device 100 may include one or N display screens 194, and N is a positive integer greater than one.
  • the electronic device 100 can realize a shooting function through an ISP, a camera 193, a video codec, a GPU, a display screen 194, and an application processor.
  • the ISP is used to process the data fed back from the camera 193. For example, when taking a picture, the shutter is opened, the light is transmitted to the photosensitive element of the camera through the lens, the light signal is converted into an electrical signal, and the photosensitive element of the camera transfers the electrical signal to the ISP for processing and is converted into an image visible to the naked eye.
  • ISP can also optimize the image noise, brightness, and skin color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene.
  • the ISP may be provided in the camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and is projected to the photosensitive element.
  • the photosensitive element may be a charge coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transfers the electrical signal to the ISP to convert it into a digital image signal.
  • ISP outputs digital image signals to DSP for processing.
  • DSP converts digital image signals into standard RGB, YUV and other formats of image signals.
  • the electronic device 100 may include one or N cameras 193, and N is a positive integer greater than one.
  • Digital signal processors are used to process digital signals. In addition to digital image signals, they can also process other digital signals. For example, when the electronic device 100 selects a frequency point, the digital signal processor is used to perform Fourier transform on the energy of the frequency point.
  • Video codecs are used to compress or decompress digital video.
  • the electronic device 100 may support one or more video codecs. In this way, the electronic device 100 can play or record videos in multiple encoding formats, such as: moving picture experts group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
  • NPU is a neural-network processing unit (Neural-network Processing Unit). By drawing on the structure of biological neural networks, such as the transfer mode between human brain neurons, it can quickly process input information, and it can also continuously self-learn.
  • the NPU can realize applications such as intelligent cognition of the electronic device 100, such as image recognition, face recognition, voice recognition, text understanding, and so on.
  • the external memory interface 120 may be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device 100.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to realize the data storage function. For example, save music, video and other files in an external memory card.
  • the internal memory 121 may be used to store computer executable program code, where the executable program code includes instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device 100 by running instructions stored in the internal memory 121.
  • the internal memory 121 may include a storage program area and a storage data area.
  • the storage program area can store an operating system, at least one application program (such as a sound playback function, an image playback function, etc.) required by at least one function.
  • the data storage area can store data (such as audio data, phone book, etc.) created during the use of the electronic device 100.
  • the internal memory 121 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash storage (UFS), and the like.
  • the electronic device 100 can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, and the application processor. For example, music playback, recording, etc.
  • the audio module 170 is used to convert digital audio information into an analog audio signal for output, and is also used to convert an analog audio input into a digital audio signal.
  • the audio module 170 can also be used to encode and decode audio signals.
  • the audio module 170 may be provided in the processor 110, or part of the functional modules of the audio module 170 may be provided in the processor 110.
  • the speaker 170A also called “speaker” is used to convert audio electrical signals into sound signals.
  • the electronic device 100 can listen to music through the speaker 170A, or listen to a hands-free call.
  • the receiver 170B also called a "handset" is used to convert audio electrical signals into sound signals.
  • the electronic device 100 answers a call or voice message, it can receive the voice by bringing the receiver 170B close to the human ear.
  • The microphone 170C, also called a "mic", is used to convert sound signals into electrical signals.
  • The user can speak with his or her mouth close to the microphone 170C to input a sound signal into the microphone 170C.
  • the electronic device 100 may be provided with at least one microphone 170C. In other embodiments, the electronic device 100 may be provided with two microphones 170C, which can implement noise reduction functions in addition to collecting sound signals. In other embodiments, the electronic device 100 may also be provided with three, four or more microphones 170C to collect sound signals, reduce noise, identify sound sources, and realize directional recording functions.
  • the earphone interface 170D is used to connect wired earphones.
  • The earphone interface 170D may be a USB interface 130, a 3.5 mm open mobile terminal platform (OMTP) standard interface, or a cellular telecommunications industry association of the USA (CTIA) standard interface.
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A may be provided on the display screen 194.
  • the capacitive pressure sensor may include at least two parallel plates with conductive materials.
  • the electronic device 100 determines the intensity of the pressure according to the change in capacitance.
  • the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A.
  • the electronic device 100 may also calculate the touched position according to the detection signal of the pressure sensor 180A.
  • Touch operations that act on the same touch position but with different intensities can correspond to different operation instructions. For example, when a touch operation whose intensity is less than a first pressure threshold acts on the short message application icon, an instruction to view the short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the icon, an instruction to create a new short message is executed. A small sketch of this mapping follows.
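  • Expressed as a tiny sketch (the threshold value and function name are illustrative, not taken from the application):

```python
FIRST_PRESSURE_THRESHOLD = 0.5  # illustrative value; the real threshold is device-specific

def instruction_for_sms_icon_touch(pressure: float) -> str:
    """Map the touch pressure on the short message icon to an instruction."""
    if pressure < FIRST_PRESSURE_THRESHOLD:
        return "view_short_message"
    return "create_new_short_message"
```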
  • the gyro sensor 180B may be used to determine the movement posture of the electronic device 100.
  • the air pressure sensor 180C is used to measure air pressure.
  • the magnetic sensor 180D includes a Hall sensor.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device 100 in various directions (generally three axes). When the electronic device 100 is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of electronic devices, and be used in applications such as horizontal and vertical screen switching, pedometers and so on.
  • Distance sensor 180F used to measure distance.
  • the proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector such as a photodiode.
  • the ambient light sensor 180L is used to sense the brightness of the ambient light.
  • the fingerprint sensor 180H is used to collect fingerprints.
  • the electronic device 100 can use the collected fingerprint characteristics to implement fingerprint unlocking, access application locks, fingerprint photographs, fingerprint answering calls, and so on.
  • the temperature sensor 180J is used to detect temperature.
  • the electronic device 100 uses the temperature detected by the temperature sensor 180J to execute a temperature processing strategy.
  • Touch sensor 180K also called “touch panel”.
  • the touch sensor 180K may be provided on the display screen 194, and the touch screen is composed of the touch sensor 180K and the display screen 194, which is also called a “touch screen”.
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • the visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the electronic device 100, which is different from the position of the display screen 194.
  • the bone conduction sensor 180M can acquire vibration signals.
  • the bone conduction sensor 180M can acquire the vibration signal of the vibrating bone mass of the human voice.
  • the bone conduction sensor 180M can also contact the human pulse and receive the blood pressure pulse signal.
  • the bone conduction sensor 180M may also be provided in the earphone, combined with the bone conduction earphone.
  • the audio module 170 can parse the voice signal based on the vibration signal of the vibrating bone block of the voice obtained by the bone conduction sensor 180M, and realize the voice function.
  • the application processor may analyze the heart rate information based on the blood pressure beating signal obtained by the bone conduction sensor 180M, and realize the heart rate detection function.
  • the button 190 includes a power-on button, a volume button, and so on.
  • the button 190 may be a mechanical button. It can also be a touch button.
  • the electronic device 100 may receive key input, and generate key signal input related to user settings and function control of the electronic device 100.
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for incoming call vibration notification, and can also be used for touch vibration feedback.
  • touch operations that act on different applications can correspond to different vibration feedback effects.
  • for touch operations acting on different areas of the display screen 194, the motor 191 can also produce different vibration feedback effects.
  • different application scenarios (for example: time reminders, receiving information, alarm clocks, games, etc.) can also correspond to different vibration feedback effects.
  • the touch vibration feedback effect can also support customization.
  • the indicator 192 may be an indicator light, which may be used to indicate the charging status, power change, or to indicate messages, missed calls, notifications, and so on.
  • the SIM card interface 195 is used to connect to the SIM card.
  • the SIM card can be inserted into the SIM card interface 195 or pulled out from the SIM card interface 195 to achieve contact and separation with the electronic device 100.
  • the electronic device 100 may support 1 or N SIM card interfaces, and N is a positive integer greater than 1.
  • the SIM card interface 195 can support Nano SIM cards, Micro SIM cards, SIM cards, etc.
  • the same SIM card interface 195 can insert multiple cards at the same time. The types of the multiple cards can be the same or different.
  • the SIM card interface 195 can also be compatible with different types of SIM cards.
  • the SIM card interface 195 can also be compatible with external memory cards.
  • the electronic device 100 interacts with the network through the SIM card to implement functions such as call and data communication.
  • the electronic device 100 adopts an eSIM, that is, an embedded SIM card.
  • the eSIM card can be embedded in the electronic device 100 and cannot be separated from the electronic device 100.
  • the software system of the electronic device 100 may adopt a layered architecture, an event-driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture.
  • the embodiment of the present application takes an Android system with a layered architecture as an example to illustrate the software structure of the electronic device 100. Please refer to FIG. 1B.
  • FIG. 1B is a software structure block diagram of an electronic device 100 according to an embodiment of the present application.
  • the layered architecture divides the software into several layers, and each layer has a clear role and division of labor. The layers communicate with each other through software interfaces.
  • the Android system is divided into four layers, from top to bottom, the application layer, the application framework layer, the Android runtime and system library, and the kernel layer.
  • the application layer can include a series of application packages.
  • the application package may include applications such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, etc.
  • the application framework layer provides an application programming interface (application programming interface, API) and a programming framework for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, a content provider, a view system, a phone manager, a resource manager, a notification manager, and so on.
  • the window manager is used to manage window programs.
  • the window manager can obtain the size of the display screen, determine whether there is a status bar, lock the screen, take a screenshot, etc.
  • the content provider is used to store and retrieve data and make these data accessible to applications.
  • the data may include video, image, audio, phone calls made and received, browsing history and bookmarks, phone book, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, and so on.
  • the view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface that includes a short message notification icon may include a view that displays text and a view that displays pictures.
  • the phone manager is used to provide the communication function of the electronic device 100. For example, the management of the call status (including connecting, hanging up, etc.).
  • the resource manager provides various resources for the application, such as localized strings, icons, pictures, layout files, video files, and so on.
  • the notification manager enables the application to display notification information in the status bar, which can be used to convey notification-type messages, and it can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, and so on.
  • notifications managed by the notification manager can also appear in the status bar at the top of the system in the form of a chart or scroll-bar text, such as notifications of applications running in the background, or appear on the screen in the form of a dialog window.
  • for example, text information is prompted in the status bar, a prompt tone is sounded, the electronic device vibrates, the indicator light flashes, and so on.
  • Android Runtime includes core libraries and virtual machines. Android runtime is responsible for the scheduling and management of the Android system.
  • the core library consists of two parts: one part is the functions that the Java language needs to call, and the other part is the core library of Android.
  • the application layer and the application framework layer run in a virtual machine.
  • the virtual machine executes the java files of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules. For example: surface manager (surface manager), media library (Media Libraries), three-dimensional graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides a combination of 2D and 3D layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as still image files.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to realize 3D graphics drawing, image rendering, synthesis, and layer processing.
  • the 2D graphics engine is a graphics engine for 2D drawing.
  • the kernel layer is the layer between hardware and software.
  • the kernel layer contains at least display driver, camera driver, audio driver, and sensor driver.
  • the software system shown in Figure 1B involves the presentation of applications that use sharing capabilities (such as gallery, file manager), instant sharing modules that provide sharing capabilities, and print service and print spooler that provide printing capabilities.
  • the application framework layer provides the printing framework, WLAN service, and Bluetooth service, while the kernel and bottom layers provide WLAN and Bluetooth capabilities and basic communication protocols.
  • when a touch operation is received by the touch sensor 180K, the corresponding hardware interrupt is sent to the kernel layer.
  • the kernel layer processes the touch operation into the original input event (including touch coordinates, time stamp of the touch operation, etc.).
  • the original input events are stored in the kernel layer.
  • the application framework layer obtains the original input event from the kernel layer and identifies the control corresponding to the input event. Taking a touch tap operation whose corresponding control is the camera application icon as an example, the camera application calls the interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures still images or videos through the 3D camera module 193.
  • Application scenario 1: Video generation based on face recognition
  • the current environment information is obtained through the sensor, such as: the current weather information is cloudy, the temperature is 26 degrees, the current time is 10:40 in the morning, the current geographic location is the playground of xx school, and the current exercise state is running.
  • the electronic device 100 constructs the general environment of the generated video according to the current surrounding environment information, uses the result of face recognition as the protagonist of the video, and combines it with pictures selected on the terminal device to generate a status video of the user running on the playground under cloudy conditions at about 26 degrees, which can then be shared.
  • the electronic device 100 may receive a video generation instruction, perform face recognition in response to the video generation instruction to obtain pictures of the person corresponding to the recognized face, and, also in response to the video generation instruction, use the sensor 180 of the electronic device 100 to obtain the current environment information and the user's physiological data, convert them into text information, and input them together with the obtained picture information into the target generator network to obtain the target video.
  • in addition, the detected current environment information can act on the background of the generated video.
  • FIG. 2A is a schematic diagram of a set of user interfaces for receiving video generation instructions according to an embodiment of the present application.
  • the electronic device 100 can detect a user's touch operation through the touch sensor 180K (for example, the touch sensor 180K recognizes a pull-down operation made by the user in the status bar of the window display area 201), and, in response to the touch operation, displays the user interface shown in (2) in FIG. 2A.
  • when the touch sensor 180K detects a touch operation on the instant share control 203 in the status bar, the electronic device can obtain a video generation instruction.
  • FIG. 2B is a schematic diagram of a set of user interfaces for obtaining picture information provided by an embodiment of the present application.
  • as shown in (1) in FIG. 2B, when the touch sensor 180K detects the user's touch operation on the face recognition control 204, the face recognition program can be started to perform face recognition. That is, in response to the face recognition request, the electronic device 100 performs face recognition and obtains a face recognition result.
  • the electronic device 100 can perform face recognition in response to the video generation instruction and obtain a face recognition result; according to the face recognition result, from a plurality of pre-stored pictures, Obtain at least one picture that matches the face recognition result.
  • the electronic device 100 obtains, according to the result of face recognition, at least one picture of a person matching the result from a plurality of pictures pre-stored in the first terminal, and uses it as the picture information. For example, the electronic device 100 obtains two pictures of people according to the result of face recognition.
  • alternatively, the face image captured during face recognition can be used directly to generate the corresponding person picture as the picture information.
  • a preset number of pictures may also be selected from the multiple pictures of people according to the shooting time, the picture quality, or the picture size. For example, if there are one hundred pictures of people among the plurality of pre-stored pictures, the electronic device selects, according to the shooting time, the five pictures of people closest to the current time as the picture information input to the generator network.
  • the picture quality of each of the acquired N pictures is greater than a preset threshold.
  • before obtaining the picture information, it is necessary to score the picture quality of the pictures to be selected; when the picture quality score of a picture is lower than a preset threshold, that picture is not used to generate the video.
  • using only pictures whose picture quality is greater than the preset threshold therefore ensures that the final target video generated from the pictures has higher picture quality, which satisfies the user's viewing experience.
  • the method further includes: performing picture quality scoring on the acquired N pictures to obtain a picture quality scoring result corresponding to each of the N pictures; performing picture quality enhancement processing on the pictures whose picture quality scoring result is less than the preset threshold; and updating the quality-enhanced pictures into the N pictures.
  • the user's terminal device or cloud generally stores a certain amount of user pictures, and high-quality static pictures can be automatically selected through aesthetic evaluation. For example: when the quality of two pictures is very different, one of the images is of high quality and sharpness, while the other image is relatively fuzzy, and the specific details of the image cannot be captured, which is not conducive to the generation of real-time video.
  • the existing image scoring network can be used to input two pictures into the image scoring network to obtain the picture quality scores of the two pictures.
  • the picture is input into the generator network as a static picture to improve the video quality when generating a video from the picture, which is conducive to the generation of the video and satisfies the user's viewing experience.
  • FIG. 2C is a schematic diagram of a picture quality scoring provided by an embodiment of the present application.
  • two pictures, picture A and picture B are obtained, and the obtained two pictures are input into the picture quality scoring model respectively, and the picture quality scoring result corresponding to each picture is obtained.
  • the pictures whose picture quality score is greater than the preset threshold are added to the picture information as video pictures, as in the sketch below.
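  • A minimal illustrative sketch of such a quality gate follows (Python). The embodiments describe an image scoring network; here a simple sharpness heuristic (variance of the Laplacian) merely stands in for it, and the file names and threshold are assumptions.

```python
import cv2  # OpenCV for image loading and the Laplacian operator


def quality_score(path: str) -> float:
    """Score a picture's sharpness; a stand-in for the image scoring network."""
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        return 0.0
    # Variance of the Laplacian: low values indicate blur (e.g. camera shake).
    return float(cv2.Laplacian(img, cv2.CV_64F).var())


def select_video_pictures(paths, threshold=100.0):
    """Keep only pictures whose quality score exceeds the preset threshold."""
    return [p for p in paths if quality_score(p) > threshold]


# Illustrative usage with two candidate pictures (picture A and picture B).
picture_info = select_video_pictures(["picture_a.jpg", "picture_b.jpg"])
```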
  • the electronic device 100 may also recognize the user's gesture operation in front of the screen of the electronic device through an infrared sensor or a gesture sensor, etc., which is not limited in the embodiment of the present application.
  • the electronic device 100 may obtain text information in response to the video generation instruction, and obtain the text information from one or more of text input information, voice input information, user preference information, user physiological data information, and current environment information.
  • the current environment information includes one or more of current weather information, current time information, and current geographic location information.
  • the text information is text obtained by keyword extraction from one or more of text input information, voice input information, user preference information, user physiological data information, and current environment information, wherein the obtained one or more keywords can include people, time, places, events or actions, etc., which are used to indicate the video content of the generated video.
  • the electronic device 100 may obtain text input information and voice input information through user input, and, through keyword extraction, obtain information about one or more of a person, time, place, and event from the text input information and voice input information.
  • the electronic device 100 can also obtain the user's preference information through the user's historical browsing records, historical input system information, and the like, and then obtain, from the user's preference information, text information about what the user is interested in or what the user browses or searches for most frequently;
  • the electronic device 100 can also obtain user physiological data information or current environment information through the sensor 180, and then obtain text information about one or more of the person, time, place, event, sports state, and mental state through keyword extraction ;
  • the current environment information includes one or more of current weather information, current time information, current geographic location information, and current motion state.
  • Figure 2D is a schematic diagram of a user interface for displaying text information provided by an embodiment of the present application.
  • the electronic device 100 can obtain current geographic location information through a GPS positioning system, further obtain the current weather information corresponding to the current location based on the current time information and current geographic location information, and also obtain physiological data of the user. That is, the electronic device 100 can obtain text information about the time, weather, location, and sports status according to the current weather information (cloudy), the temperature (26 degrees), the current time (10:40 in the morning), the current geographic location (xx sports field), and the current exercise status (running), as sketched below.
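  • The following is a minimal sketch, in Python, of how sensor-derived environment information of the kind just described could be turned into the keywords used as text information; the function name, field names, and values are illustrative assumptions, not the interface claimed in this application.

```python
from datetime import datetime


def environment_to_keywords(weather: str, temperature_c: int,
                            location: str, motion_state: str,
                            now: datetime) -> list:
    """Convert sensor-derived environment information into keywords
    describing time, weather, place, and motion state."""
    period = "morning" if now.hour < 12 else "afternoon"
    return [
        f"{now.strftime('%H:%M')} in the {period}",
        f"{weather}, {temperature_c} degrees",
        location,
        motion_state,
    ]


# Illustrative values mirroring the example above.
keywords = environment_to_keywords("cloudy", 26, "xx sports field", "running",
                                   datetime(2021, 6, 1, 10, 40))
```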
  • the electronic device 100 uses text information extracted from multi-modal information to assist the generation of the video, so that the generated video can reflect the current user status (for example, the weather environment in the generated video is the same as the weather environment at the user's current location), where the multi-modal information can include text, preference information, environment information, and so on.
  • when the user does not perform manual or voice input, the electronic device can rely only on the current environment information obtained by the sensor or on the preferences extracted from historical interaction information as the input text information, and generate the target video together with the input picture information.
  • the text information can be obtained first and then the image information; the image information can be obtained first and then the text information; the text information and the image information can also be obtained at the same time.
  • the electronic device 100 may also extract text information about one or more of a person, time, place, event, etc. from the obtained picture information through image recognition, which is not limited in the embodiment of the present application.
  • the electronic device 100 extracts, according to the one or more keywords, the image features corresponding to the one or more keywords in each of the N pictures, and inputs the one or more keywords and the image features of the N pictures into the target generator network to generate a target video.
  • the target video includes M pictures.
  • the M pictures are pictures generated based on the image features of the N pictures and corresponding to the one or more keywords, where M is a positive integer greater than 1. That is, the obtained text information and picture information are input into the target generator network to obtain the target video, and the target video is used to describe the text information.
  • this use of text and pictures to jointly generate a video allows the generated video to adjust the input picture information according to the input text information, which greatly enriches the video content and avoids the situation on existing terminal equipment where a video generated by directly stacking multiple pictures is limited to a slide-show style of display and lacks richness of content.
  • the electronic device 100 obtains a picture of user A drinking milk tea and text information about walking on the playground.
  • the electronic device 100 can extract the image features of user A in the above pictures according to the text information, generate M pictures through the generator network, and synthesize them into a target video of user A walking on the playground.
  • the first spatial variable corresponding to the text information and the second space vector of the image feature of each picture in the picture information may be extracted respectively, where the first spatial variable may be the word vector identifying the text information in the latent vector space, and the second space vector of each picture may be a vector identifying the image feature of the picture in the latent vector space; extracting these space vectors helps the generator network to better generate the target video.
  • a down-sampled convolutional network is used for the input image to extract the latent space of the image
  • the Word2vec model is applied to the input text information to extract the latent space of the text
  • the latent spaces of the image and the text are used as the input of the video generator network to generate the video, as sketched below.
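  • Below is a hedged PyTorch sketch of this latent-space extraction: a down-sampling convolutional encoder produces the image latent, and an embedding lookup stands in for the Word2vec text model. All layer sizes, the embedding table, and the concatenation into a single generator input are illustrative assumptions, not the claimed implementation.

```python
import torch
import torch.nn as nn


class ImageEncoder(nn.Module):
    """Down-sampling convolutional network mapping a picture to a latent vector."""
    def __init__(self, latent_dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, latent_dim)

    def forward(self, img):                      # img: (B, 3, H, W)
        return self.fc(self.conv(img).flatten(1))


# An embedding table stands in for the Word2vec model applied to the keywords.
text_embedding = nn.Embedding(num_embeddings=10000, embedding_dim=128)

image_latent = ImageEncoder()(torch.randn(1, 3, 256, 256))              # image latent (second space vector)
word_latent = text_embedding(torch.tensor([[12, 57, 301]])).mean(dim=1)  # text latent (first space variable)
generator_input = torch.cat([image_latent, word_latent], dim=-1)         # fed to the video generator network
```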
  • target audio information matching at least one of the one or more keywords can also be obtained, and the target audio information is added to the target video to obtain a video with sound, which meets the user's combined visual and auditory needs.
  • FIG. 2E is a set of user interfaces, provided by an embodiment of the present application, for sharing the generated video with friends.
  • the electronic device 100 generates a target video through the generator network by using the text information about time, weather, location, and movement status obtained in FIG. 2D together with the picture information obtained in FIG. 2B.
  • the video can be viewed or shared.
  • the user interface may be a status video sharing interface provided by a chat tool. Not limited to this, the user interface may also be an interface for status video sharing provided by other applications, and other applications may be social software or the like.
  • before generating the target video, the electronic device also needs to train the target generator network. That is, it obtains sample text information, sample picture information, and a real video data set, and constructs a discriminator network and a generator network for video generation; it inputs the sample text information and the sample picture information into the generator network to obtain a sample video; the sample video and the real video data set are used as the input of the discriminator network to obtain a discrimination loss result, and the generator network is trained according to the discrimination loss result to obtain the target generator network, where the discrimination loss result is 1 (true) when the sample video is judged to belong to the real video data set. The generator network and the discriminator network are trained through sample data. Please refer to Figure 2F.
  • Figure 2F is a schematic diagram of a generator network training process provided by an embodiment of the present application.
  • as shown in Figure 2F, <generated video, real video> is input to the discriminator, where the real video is video obtained in the real world; the discriminator judges the source of its input, and if the input comes from the generated video it is judged as false (0), otherwise it is judged as true (1).
  • such repeated adversarial training can further regularize the content of the generated videos, gradually improve their authenticity and quality, and is conducive to sharing real-time status videos; a minimal sketch of this training loop follows.
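  • The following PyTorch sketch illustrates one possible form of this adversarial training loop. The tiny Generator and Discriminator modules, the latent size, and the random tensors standing in for real video data are placeholders for illustration, not the networks actually described in this application.

```python
import torch
import torch.nn as nn

# Minimal stand-ins; the described system uses an RNN-based generator and a video discriminator.
class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(256, 8 * 3 * 64 * 64)          # latent -> 8 frames of 64x64 RGB
    def forward(self, latent):
        return self.net(latent).view(-1, 8, 3, 64, 64)

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(8 * 3 * 64 * 64, 1), nn.Sigmoid())
    def forward(self, video):
        return self.net(video)

generator, discriminator = Generator(), Discriminator()
bce = nn.BCELoss()
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)

latent = torch.randn(4, 256)                  # <picture information, text information> latent (placeholder)
real_video = torch.randn(4, 8, 3, 64, 64)     # real-world video batch (placeholder data)

fake_video = generator(latent)                # sample video produced by the generator
ones, zeros = torch.ones(4, 1), torch.zeros(4, 1)

# Discriminator step: real videos are labelled 1 (true), generated ones 0 (false).
d_loss = bce(discriminator(real_video), ones) + bce(discriminator(fake_video.detach()), zeros)
d_opt.zero_grad(); d_loss.backward(); d_opt.step()

# Generator step: push the discriminator towards judging the sample video as real.
g_loss = bce(discriminator(fake_video), ones)
g_opt.zero_grad(); g_loss.backward(); g_opt.step()
```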
  • Application scenario 2: Video generation based on user input.
  • User A has some old photos on hand, but unfortunately some things were not done at the time, and he wants to relive that scene; he can describe the picture of that moment through voice input or text input. For example, user A has an old photo of his grandson on hand and wants to see how his grandson plays football, so user A says to the terminal device: "Grandson is playing football carefree on the green grass." The status video generation system then automatically extracts the keywords "green grass", "playing football", and "grandson", and, based on the photos of the grandson on user A's terminal device, generates a video that meets user A's needs.
  • the electronic device 100 may receive the sent video generation instruction, obtain picture information in response to the video generation instruction, and, at the same time, in response to the video generation instruction, obtain the current environment information through the sensor 180 of the electronic device 100, convert it into text information, and input it together with the obtained picture information into the target generator network to obtain the target video. In addition, the detected current environment information can act on the background of the generated video.
  • FIG. 2G is a schematic diagram of a video generation process provided by an embodiment of the present application.
  • the text information is obtained according to the user’s text input information and voice input information, such as: the character is a grandson, the location is a green grass, and the event is football.
  • the electronic device 100 constructs a video based on the user’s voice input.
  • the electronic device obtains a picture corresponding to at least one of the one or more keywords from a plurality of pre-stored pictures. That is, the electronic device can obtain picture information related to the text information according to the text information. For example, when a user visits the Forbidden City, he can obtain pictures related to the Forbidden City, which can be used to synthesize the target video, so that the user can share his life status in real time.
  • when obtaining the picture information, pictures of the corresponding person can also be obtained according to the person information input by the user and used to generate the target video, which meets the needs of the user. For example: the user enters "Xiao Ming is playing football in the playground", and at least one picture relevant to the keywords "Xiao Ming", "playground", and "football" can be obtained and used to synthesize the target video, as sketched below.
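  • A minimal sketch of such keyword-to-picture matching is shown below (Python); the tag-annotated gallery structure and the example tags are assumptions used only for illustration.

```python
def pictures_for_keywords(gallery, keywords):
    """Return paths of pre-stored pictures whose tags match at least one keyword.

    gallery maps a picture path to the set of tags attached to it
    (for example, tags produced by on-device image recognition)."""
    return [path for path, tags in gallery.items()
            if any(keyword in tags for keyword in keywords)]


# Illustrative gallery and the keywords extracted from
# "Xiao Ming is playing football in the playground".
gallery = {
    "IMG_001.jpg": {"Xiao Ming", "playground"},
    "IMG_002.jpg": {"cat", "sofa"},
    "IMG_003.jpg": {"football", "grass"},
}
picture_info = pictures_for_keywords(gallery, ["Xiao Ming", "playground", "football"])
```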
  • FIG. 2H is a set of user interfaces for obtaining text information provided by an embodiment of the present application.
  • the electronic device 100 can respond to this command and switch to the interface shown in (2) in Figure 2H.
  • the touch sensor 180K detects the user's touch operation on the voice input control
  • the electronic device 100 can respond to the touch operation to receive the user's voice input: "Grandson is playing football carefree on the green grass.”
  • Voice input can obtain text information about people, places, and events, and obtain keywords such as “grandson", "grass”, “playing football”, etc. according to the text information. Please refer to FIG. 2I.
  • FIG. 2I is a set of user interfaces for obtaining picture information based on keywords according to an embodiment of the present application.
  • the keywords "grandson" and "grass" are obtained according to the text information, and further, as shown in (2) in FIG. 2I, at least one picture corresponding to the keywords is obtained from a plurality of pre-stored pictures according to the two keywords.
  • the video generation instruction includes at least one picture tag, and each picture tag of the at least one picture tag corresponds to at least one picture of a plurality of pre-stored pictures; the obtaining picture information in response to the video generation instruction includes: in response to the video generation instruction, acquiring, from a plurality of pre-stored pictures according to the at least one picture tag, at least one picture corresponding to each picture tag of the at least one picture tag.
  • the electronic device when it obtains picture information, it can obtain at least one corresponding picture through at least one picture tag carried in the video generation instruction, which is used to generate the target video.
  • the user can directly select the pictures that he is interested in or needs for the video, which meets the user's viewing needs. For example: the user can select the picture tag "cat", and after obtaining multiple pictures of the cat, a dynamic video with the cat as the protagonist is generated together with the text information; for another example, the user can also select the picture tag "childhood Xiao Ming", and after obtaining multiple pictures of Xiao Ming's childhood, a dynamic video about Xiao Ming's childhood is generated together with the text information.
  • the text information is obtained from one or more of text input information, voice input information, user preference information, user physiological data information, and current environment information; as shown in FIG. 2I, one or more keywords are obtained according to the user's voice input information, for example: the person is a grandson, the location is a green lawn, and the event is playing football.
  • the electronic device 100 uses a generator network as the core module for video generation, where the generator network may adopt an RNN (recurrent neural network) so as to take into account the semantic information of preceding and following frames and thereby improve the inter-frame stability of the generated video.
  • the generator network is a part of the generative adversarial network.
  • the generator samples from a noise distribution and uses the samples as input, while the discriminator judges the source of the input data.
  • such a game between the two can effectively promote the progress of both networks in the adversarial network; a minimal sketch of an RNN-based generator of this kind follows.
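  • The sketch below shows, under assumed sizes, how an RNN-based generator could produce frames sequentially so that the recurrent state carries context between frames; it is illustrative only and not the claimed network.

```python
import torch
import torch.nn as nn


class RNNVideoGenerator(nn.Module):
    """Generates M frames sequentially; the recurrent state carries the semantic
    context of earlier frames to stabilise the video between frames."""
    def __init__(self, latent_dim=256, hidden_dim=512, frame_shape=(3, 64, 64)):
        super().__init__()
        self.frame_shape = frame_shape
        self.rnn = nn.GRU(latent_dim, hidden_dim, batch_first=True)
        self.to_frame = nn.Linear(hidden_dim, frame_shape[0] * frame_shape[1] * frame_shape[2])

    def forward(self, latent, num_frames=16):
        # Repeat the <text, image> latent for every frame step.
        steps = latent.unsqueeze(1).repeat(1, num_frames, 1)
        hidden, _ = self.rnn(steps)
        frames = torch.tanh(self.to_frame(hidden))
        return frames.view(latent.size(0), num_frames, *self.frame_shape)


video = RNNVideoGenerator()(torch.randn(1, 256))   # (1, 16, 3, 64, 64) target video tensor
```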
  • FIG. 2J is a user interface for generating a video provided by an embodiment of the present application. As shown in FIG. 2J, after the electronic device 100 inputs the obtained one or more keywords and the image features corresponding to the keywords into the generator network, a video of the grandson playing football on the grass is generated. The video can be viewed or shared in the interface shown in FIG. 2J.
  • Application scenario 3: Video generation based on user preferences
  • the electronic device 100 can obtain the user's behavior information on the terminal device, extract effective keywords, and perform video generation. For example, for a user who likes traveling very much, often mentions wanting to travel to "Bali" in chats with friends, or frequently searches for information about "Bali tourism" in the browser, the electronic device 100 can generate, based on this keyword information, a scene of the user traveling in Bali.
  • FIG. 2K is a schematic diagram of a video generation process based on user preferences provided in an embodiment of the present application.
  • the electronic device 100 may receive the sent video generation instruction and obtain text information and picture information from the user's historical preference information and current environment information in response to the video generation instruction, or, at the same time, in response to the video generation instruction, obtain the preference information input by the user, convert it into text information, and input it together with the obtained picture information into the target generator network to obtain the target video.
  • the detected current environment information can act on the background of the generated video.
  • the electronic device 100 can use the user's preferences as the input information of the video.
  • FIG. 2L is another set of user interfaces for obtaining text information provided by an embodiment of the present application.
  • the electronic device 100 can detect the touch operation of the user through the touch sensor 180K (for example, the touch sensor 180K recognizes the user's click operation on the status video in the window display area 202), and responds to This touch operation can cause the electronic device to start generating video.
  • the electronic device 100 can obtain user preference information and current environment information after responding to the click operation.
  • the text information is obtained from the above information, that is, the one or more keywords are extracted, and the picture corresponding to at least one keyword is obtained according to the one or more keywords. For example: pictures about Bali are obtained according to the keyword "Bali", and the weather information of the user's current location is obtained according to the current time, so that the weather environment of Bali in the generated video is consistent with the current weather environment at the user's location.
  • the electronic device obtains the user's preference information and, without receiving other input from the user, extracts one or more keywords from the user's preference information and the current environment information to obtain the text information for the video input. For example, as shown in (2) of Figure 2L, the electronic device obtains the information that the user likes to play in Bali, and combines the current time and the weather at the user's current location to obtain at least one keyword related to time, location, and person.
  • the electronic device 100 inputs the obtained text information and the image information into the target generator network to obtain a target video, where the target video is used to describe the text information.
  • the first space vector of each picture in the picture information is extracted according to the text information, and the first space vector of each picture is used to identify an image feature in the picture corresponding to the text information.
  • FIG. 2M is another user interface for video generation provided by an embodiment of the present application.
  • the electronic device generates a video of the user playing in Bali based on the obtained text information and picture information.
  • the user Lisa can send the video to her friend Emmy, which enriches the form of communication between friends; a video of Lisa playing in Bali together with her friends can even be generated, making up for the regret of not being able to travel with them.
  • an embodiment of the present application provides a video generation apparatus applied to the above-mentioned electronic equipment in FIG. 1A.
  • FIG. 3A is a video generation apparatus provided by an embodiment of the present application.
  • the video generating device may include three modules: an input module, an offline module, and an online module.
  • the input module includes: a static image acquisition submodule, a sensor information acquisition submodule, a preference information acquisition submodule, and a user input acquisition submodule;
  • the offline module includes: a video generation submodule and a video optimization submodule.
  • the terminal device mentioned in the following embodiments is equivalent to the electronic device 100 in this application.
  • the input module is to provide the original input for the generator network that generates the video.
  • these inputs assist the generator network in completing the generation of the video. Clear input condition information and high-quality static pictures are all conducive to generating better real-time status videos; users can add the status video elements they want to generate according to their own subjective wishes to enrich the video content, and these can be provided as text or voice.
  • Static image acquisition sub-module: generally speaking, there are many photos to choose from on the user's terminal device. When the user wants to generate a real-time status video that tends to contain the activity of a person, the terminal device automatically selects photos that contain the user; methods such as image quality evaluation can also be used to select high-quality images. For example, some pictures are blurred because the camera shook when they were taken, or the light was dim, leading to poor results; such pictures should be filtered out and not used as input for video generation.
  • Sensor information acquisition sub-module: there are many sensor elements on the terminal device.
  • the GPS position sensor can obtain the user's geographic location information
  • the temperature sensor can obtain the temperature information around the user
  • the air pressure sensor can obtain the user's relative altitude.
  • sensors all of which have a good function of providing real-time information around the user.
  • Preference information acquisition sub-module: acquires the user's historical interaction information on the terminal device, and extracts the user's historical preference information based on that interaction information. For example: chat records and search records on social software; these terminal-device applications can yield a large amount of user interaction information. Alternatively, user search information can be collected from the browser to extract user interest information, as sketched below.
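  • A minimal sketch of this kind of preference extraction is shown below (Python): it simply counts frequent terms in interaction records. The stop-word list and example records are illustrative assumptions; a real sub-module would draw on richer signals.

```python
import re
from collections import Counter

STOPWORDS = {"the", "to", "a", "i", "in", "want", "really", "best"}


def preference_keywords(history, top_k=3):
    """Extract the most frequent terms from the user's chat and search records."""
    words = []
    for record in history:
        words += [w for w in re.findall(r"[a-z]+", record.lower()) if w not in STOPWORDS]
    return [w for w, _ in Counter(words).most_common(top_k)]


# Illustrative interaction records hinting at an interest in Bali tourism.
history = ["I really want to travel to Bali", "Bali tourism tips", "best beaches in Bali"]
print(preference_keywords(history))   # e.g. ['bali', 'travel', 'tourism'] (order may vary for ties)
```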
  • User input acquisition sub-module: the form of user input can be voice or text. If it is voice, the voice assistant function of the mobile phone can be used to extract keywords, convert the keywords into text, and store them until they are later combined with text input to form the final input.
  • the user can input some keywords of the real-time status to be generated in the terminal device to describe the scene of the status video to be generated. For example: time, person, place, event, etc.
  • the offline module is mainly used for model training, using a generative adversarial network (GAN) for video generation, and optimizing the generator network.
  • the video generation sub-module is mainly composed of the generator in the generative adversarial network;
  • the video optimization sub-module is mainly composed of the discriminator network, which makes the generated video more realistic, and the result of video optimization can be fed back to the video generation sub-module to train the generator network model.
  • Video generation sub-module: the generator network (Generator) in the video generation module can be implemented by an RNN, which has a good ability to memorize context information.
  • the generator network is a fully convolutional network, which is composed of multiple layers of convolution and multiple layers of upsampling, where the input can be composed of <picture information, text information>.
  • the generator can generate a rich variety of videos. Through the input constraints, generation can be further regularized: samples produced by the generator are sent to the discriminator network together with video data from the real world in order to improve the quality of the generated video, which is conducive to generating real-time status videos.
  • Video optimization sub-module consists of a discriminator network (Discriminator).
  • the discriminator network receives the data results from the video generation module and the video data collected in the real world for confrontation training. Its main purpose is to make the generated video more realistic, and to avoid the generated video being too smooth or the patch effect is more obvious.
  • the input of the discriminator is <generated video, real video>, where the discriminator judges through the two input videos whether the generated video can be regarded as a real video; that is, when the difference between the generated video and the real video is small, the discriminator determines that the generated video is a real video and the discrimination loss result is 1; when the generated video differs significantly from the real video, the discriminator determines that the generated video is not a real video, the discrimination loss result is 0, the quality of the video generated by the generator network is poor, and further training is needed. Therefore, when the video generated by the generator network can be judged to be a real video, the training and optimization of the generator network is considered successful. Through such an adversarial training method, the authenticity of the generated video is gradually improved.
  • the online module uses the generator model trained by the offline module to share the user's real-time status on the terminal device.
  • the video generation sub-module also needs <static image, sensor information, user input information> as input, while the video optimization sub-module is not needed at this time, which reduces the amount of model parameters and reduces the power consumption of the mobile phone.
  • alternatively, only the video generation sub-module trained by the offline module may need to be deployed to complete real-time status sharing on the terminal device. This application does not specifically limit this.
  • Figure 3B is a schematic flow chart of a video generation method provided by an embodiment of the present application.
  • the method can be applied to the electronic device described in Figure 1A above.
  • the method flow shown in FIG. 3B includes step S301 to step S307.
  • the description will be made from the side of the video generating device with reference to FIG. 3B.
  • the method may include the following steps S301-S307.
  • Step S301: Receive a video generation instruction.
  • the video generation device receives a video generation instruction
  • the video generation device may receive the user's video generation instruction through touch operation recognition, gesture operation recognition, voice control recognition, and the like.
  • Step S302: Acquire text information and picture information in response to the video generation instruction.
  • the video generation device acquires text information and picture information in response to a video generation instruction. After the video generation device receives the video generation instruction, it obtains text information and picture information in response to the instruction.
  • the text information is used to describe the content of the subsequently generated video
  • the picture information includes N pictures
  • the N pictures are used by the video generating device to generate M pictures in the video according to the text information and the N pictures.
  • each of the M pictures is a picture generated based on the image features of the N pictures and corresponding to the one or more keywords, and M is a positive integer greater than 1.
  • the acquiring text information in response to the video generation instruction includes: in response to the video generation instruction, obtaining the text information from one or more of text input information, voice input information, user preference information, user physiological data information, and current environment information, where the current environment information includes one or more of current weather information, current time information, and current geographic location information.
  • the electronic device can obtain user-specific input information (text input information, voice input information); or use sensors on the electronic device to obtain current environment information; or extract text information from user preference information extracted from historical interaction information. For example, when the user does not have or is unable to perform manual or voice input, the user can only rely on the current environment information obtained by the sensor or the preferences extracted from the historical interaction information as the input text information, and generate the target video together with the input picture information.
  • the obtaining picture information in response to the video generation instruction includes: in response to the video generation instruction, obtaining, from a plurality of pre-stored pictures, a picture corresponding to at least one keyword in the one or more keywords.
  • the electronic device can obtain picture information related to the text information according to the text information. For example: when obtaining picture information, the picture of the corresponding location can be obtained according to the current geographic location information or the location information input by the user, and used to generate the target video. For example, when a user visits the Forbidden City, he can obtain pictures related to the Forbidden City, which can be used to synthesize the target video, so that the user can share his life status in real time.
  • pictures of the corresponding person can also be obtained according to the person information input by the user and used to generate the target video, which meets the needs of the user. For example: the user enters "Xiao Ming is playing football in the playground", and at least one picture relevant to the keywords "Xiao Ming", "playground", and "football" can be obtained and used to synthesize the target video.
  • the video generation instruction includes a face recognition request; the obtaining picture information in response to the video generation instruction includes: in response to the video generation instruction, performing face recognition and obtaining a face recognition result; and, according to the face recognition result, obtaining at least one picture that matches the face recognition result from a plurality of pre-stored pictures.
  • the electronic device may first obtain a face recognition result through face recognition, and then directly obtain pictures containing the user from the pre-stored pictures according to the face recognition result. This makes it convenient to directly generate a video about the user's status and share the user's current status in time. For example: after user A is identified through face recognition, user A's pictures can be obtained from multiple pre-stored pictures, so that a video of user A can be generated without the user manually filtering pictures, which is convenient for user operation and improves the user experience; one possible form of this filtering is sketched below.
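  • Below is a hedged sketch of filtering pre-stored pictures by a face recognition result using embedding similarity. The face_embedding function here is a deterministic placeholder, not a real face recognition model; in practice the embedding would come from the device's face recognition system, and the similarity threshold is an assumption.

```python
import numpy as np


def face_embedding(image_path: str) -> np.ndarray:
    """Placeholder: in practice this comes from the device's face recognition model."""
    rng = np.random.default_rng(abs(hash(image_path)) % (2 ** 32))
    v = rng.normal(size=128)
    return v / np.linalg.norm(v)


def pictures_matching_face(reference_path, gallery_paths, similarity_threshold=0.6):
    """Keep pre-stored pictures whose face embedding is close to the recognised face."""
    ref = face_embedding(reference_path)
    return [p for p in gallery_paths
            if float(np.dot(ref, face_embedding(p))) >= similarity_threshold]


# Illustrative usage: with the placeholder embedding, only the identical picture matches;
# with a real model, the reference would be the face captured during recognition.
matches = pictures_matching_face("IMG_001.jpg", ["IMG_001.jpg", "IMG_002.jpg"])
```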
  • the video generation instruction includes at least one picture tag, and each picture tag in the at least one picture tag corresponds to at least one picture among a plurality of pre-stored pictures; the obtaining picture information in response to the video generation instruction includes: in response to the video generation instruction, obtaining, from a plurality of pre-stored pictures according to the at least one picture tag, at least one picture corresponding to each of the at least one picture tag.
  • the user can select the picture tag as "cat”, after obtaining multiple pictures of the cat, together with the text information, generate a dynamic video with the cat as the protagonist; for another example: the user can also select the picture tag as "childhood Xiaoming" , After obtaining multiple pictures of Xiao Ming’s childhood, together with the text information, a dynamic video about Xiao Ming’s childhood is generated.
  • the electronic device obtains the picture information, it may obtain at least one corresponding picture through the at least one picture tag carried in the video generation instruction for the generated target video.
  • when a user wants to generate an interesting video from a few pictures, he can directly select the pictures that he is interested in or needs for the video, which meets the user's viewing needs.
  • the picture quality of each of the acquired N pictures is greater than a preset threshold. Before obtaining the picture information, it is necessary to score the picture quality of the picture to be selected. When the picture quality score is lower than a preset threshold, the picture is not used to generate a video. The use of pictures with picture quality greater than the preset threshold to generate a video can therefore ensure that the final target video generated from the picture has a higher picture quality, which satisfies the user's viewing experience.
  • the method further includes: performing picture quality scoring on the acquired N pictures to obtain a picture quality scoring result corresponding to each of the N pictures; performing picture quality enhancement processing on the pictures whose picture quality scoring result is less than the preset threshold; and updating the quality-enhanced pictures into the N pictures.
  • the picture quality can be enhanced for the picture to improve the video quality when the video is generated from the picture and meet the user's viewing experience .
  • Step S303: Obtain image features corresponding to one or more keywords in each of the N pictures according to the one or more keywords.
  • the video generating device extracts image features corresponding to one or more keywords in each of the N pictures according to one or more keywords. For example, if the text information includes the keyword football, the video generating device needs to extract the image features of the football in each of the N pictures, so that the video generating device generates a video based on the image features of the football.
  • Step S304: Extract the first spatial variable corresponding, in the vector space, to each keyword in the one or more keywords.
  • the video generating device may separately extract the first spatial variable corresponding to each keyword in the vector space in the one or more keywords.
  • the first space variable is the word vector of the keyword in the vector space.
  • Step S305: Extract the second spatial variables corresponding to the image features of the N pictures in the vector space.
  • the video generation device may extract image features corresponding to one or more keywords in each of the N pictures, and respectively correspond to the second spatial variables in the vector space.
  • the second space variable is the vector of the image feature in the vector space, and is used to represent the image feature.
  • Step S306: Input the first spatial variable and the second spatial variable into the target generator network to generate a target video.
  • the video generating device inputs the first spatial variable and the second spatial variable into the target generator network to generate the target video. That is, the one or more keywords and the image characteristics of the N pictures are input into the target generator network to generate a target video.
  • the inputting the one or more keywords and the image features of the N pictures into the target generator network to generate the target video includes: extracting the first spatial variable corresponding, in the vector space, to each keyword in the one or more keywords; extracting the second spatial variable corresponding, in the vector space, to the image features of the N pictures; and inputting the first spatial variable and the second spatial variable into the target generator network to generate the target video.
  • the first spatial variable corresponding to the text information and the second space vector of the image feature of each picture in the picture information may be extracted first, where the first spatial variable may be the word vector identifying the text information in the latent vector space, and the second space vector of each picture may be a vector identifying the image feature of the picture in the latent vector space; extracting these space vectors helps the generator network to better generate the target video.
  • the method further includes: obtaining sample text information, sample picture information, and a real video data set, and constructing a discriminator network and a generator network for video generation; inputting the sample text information and the sample picture information into the generator network to generate a sample video; using the sample video and the real video data set as the input of the discriminator network to obtain a discrimination loss result, where the discrimination loss result is 1 when the sample video belongs to the real video data set; and training the generator network according to the discrimination loss result to obtain the target generator network.
  • the video generation device needs to train the generator network and the discriminator network through sample data.
  • the discriminator judges the source of the input, and if it comes from the generated video, it is judged as false 0, otherwise it is judged as true 1.
  • the content of the generated video can be further standardized, and the authenticity of the generated video and the quality of the generated video can be gradually improved.
  • Step S307: Receive a video sharing instruction, and send the target video to the target device in response to the video sharing instruction.
  • the video generating device may receive a video sharing instruction, and send the target video to the target device in response to the video sharing instruction. After the target video is generated, the user can also share the video to Moments. By sharing the generated target video to the target terminal device, the friendly development of social platforms can be promoted and the user's life and friendship experience can be improved.
  • for step S301 to step S307 in this embodiment of the present application, reference may also be made to the related descriptions of the embodiments in FIG. 2A to FIG. 2M; details are not repeated here.
  • the electronic device can generate a video based on text information and picture information so that users can share their life status in real time.
  • the electronic device receives the video generation instruction, it can obtain text information and picture information in response to the video generation instruction.
  • the text information includes one or more keywords
  • the picture information includes N pictures
  • the text information can be used to describe the video content of the generated video (for example, the one or more keywords may include people, time, places, events or actions, etc.)
  • the picture information may be used to extract or generate the video picture of each frame. Therefore, the image features corresponding to the one or more keywords in each of the N pictures can be obtained according to the one or more keywords, and then the one or more keywords and the image features of the N pictures are input into the target generator network to generate a target video, where the target video may include M pictures generated based on the image features of the N pictures and corresponding to the one or more keywords.
  • this use of text and pictures to jointly generate a video allows the generated video to adjust the input picture information according to the input text information, which greatly enriches the video content, avoids the situation on existing terminal equipment where videos generated by directly stacking multiple pictures are limited to a slide-show style of switching display and lack richness of content, and also meets the user's visual and auditory needs.
  • FIG. 4 is a schematic structural diagram of another video generating device provided by an embodiment of the present application.
  • the video generating device 10 may include a receiving response unit 401, an extracting unit 402, and a generating unit 403, and may also include a scoring unit 404, an enhancement unit 405, and a training unit 406. Among them, the detailed description of each unit is as follows.
  • the receiving response unit 401 is configured to receive a video generation instruction and obtain text information and picture information in response to the video generation instruction.
  • the text information includes one or more keywords
  • the picture information includes N pictures, where N is a positive integer greater than or equal to 1;
  • the extraction unit 402 is configured to obtain image features corresponding to the one or more keywords in each of the N pictures according to the one or more keywords;
  • the generating unit 403 is configured to input the one or more keywords and the image features of the N pictures into the target generator network to generate a target video; the target video includes M pictures, the M pictures are pictures generated based on the image features of the N pictures and corresponding to the one or more keywords, and M is a positive integer greater than 1.
  • the receiving response unit 401 is specifically configured to: in response to the video generation instruction, obtain the text information from one or more of text input information, voice input information, user preference information, user physiological data information, and current environment information, where the current environment information includes one or more of current weather information, current time information, and current geographic location information.
  • the receiving response unit 401 is specifically configured to: in response to the video generation instruction, obtain, from a plurality of pre-stored pictures, a picture corresponding to at least one of the one or more keywords.
  • the video generation instruction includes a face recognition request; the receiving response unit 401 is specifically configured to: in response to the video generation instruction, perform face recognition and obtain a face recognition result; and according to the face recognition result, acquire, from a plurality of pre-stored pictures, at least one picture that matches the face recognition result.
  • the video generation instruction includes at least one picture tag, and each picture tag in the at least one picture tag corresponds to at least one picture among a plurality of pre-stored pictures; the receiving response unit 401 is specifically configured to: in response to the video generation instruction, obtain, from the plurality of pre-stored pictures according to the at least one picture tag, at least one picture corresponding to each of the at least one picture tag.
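A minimal sketch of such tag-based retrieval from a pre-stored gallery is given below; the gallery structure, file names, and tag names are purely illustrative assumptions:

```python
def pictures_for_tags(gallery, requested_tags):
    """Return every pre-stored picture whose tags overlap the picture tags carried by the instruction."""
    return [entry["picture"] for entry in gallery
            if set(requested_tags) & set(entry["tags"])]

gallery = [
    {"picture": "cat_01.jpg", "tags": {"cat"}},
    {"picture": "xiaoming_child.jpg", "tags": {"Xiao Ming", "childhood"}},
]
print(pictures_for_tags(gallery, ["cat"]))  # -> ['cat_01.jpg']
```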
  • the picture quality of each of the acquired N pictures is greater than a preset threshold.
  • the device further includes: a scoring unit 404, configured to score the acquired N pictures on picture quality and obtain a picture quality scoring result corresponding to each of the N pictures; and an enhancement unit 405, configured to perform picture quality enhancement processing on pictures whose picture quality scoring result is less than a preset threshold, and update the quality-enhanced pictures into the N pictures.
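One way such a quality gate might look, using the variance of the Laplacian as a stand-in sharpness score and denoising as a stand-in enhancement (both are illustrative choices, not specified by this application):

```python
import cv2
import numpy as np

def quality_score(picture_bgr: np.ndarray) -> float:
    """Stand-in scoring model: variance of the Laplacian (higher means sharper)."""
    gray = cv2.cvtColor(picture_bgr, cv2.COLOR_BGR2GRAY)
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def score_and_enhance(pictures, threshold=100.0):
    """Enhance pictures scoring below the preset threshold; keep the updated set of N pictures."""
    updated = []
    for picture in pictures:
        if quality_score(picture) < threshold:
            picture = cv2.fastNlMeansDenoisingColored(picture, None, 10, 10, 7, 21)
        updated.append(picture)
    return updated
```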
  • the generating unit 403 is specifically configured to: extract a first spatial variable corresponding, in the vector space, to each of the one or more keywords; extract second spatial variables corresponding, in the vector space, to the image features of the N pictures; and input the first spatial variable and the second spatial variable into the target generator network to generate the target video.
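Since the description elsewhere mentions a Word2vec model for the keyword variables and a down-sampling convolutional network for the image variables, a toy sketch of extracting both latent spaces might look as follows; the vector sizes, the tiny corpus, and the encoder layout are assumptions made only for illustration:

```python
import numpy as np
import torch
import torch.nn as nn
from gensim.models import Word2Vec

# First spatial variables: a word vector per keyword, from a toy Word2vec model fit on a tiny corpus.
corpus = [["grandson", "plays", "football", "on", "green", "grass"]]
w2v = Word2Vec(sentences=corpus, vector_size=64, min_count=1)
keywords = ["grandson", "grass", "football"]
text_latents = torch.tensor(np.stack([w2v.wv[k] for k in keywords]))   # shape (3, 64)

# Second spatial variables: a small down-sampling CNN encodes each picture's image features.
class ImageEncoder(nn.Module):
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, latent_dim),
        )

    def forward(self, x):
        return self.net(x)

pictures = torch.rand(2, 3, 64, 64)                 # N = 2 dummy pictures
image_latents = ImageEncoder()(pictures)            # shape (2, 64)

# Both sets of spatial variables would then be fed together into the target generator network.
```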
  • the device further includes: a training unit 406, where the training unit 406 is configured to: obtain sample text information, sample picture information, and a real video data set, and construct a discriminator network and a generator network for video generation; input the sample text information and the sample picture information into the generator network to generate a sample video; use the sample video and the real video data set as the input of the discriminator network to obtain a discrimination loss result, where, when the sample video belongs to the real video data set, the discrimination loss result is 1; and train the generator network according to the discrimination loss result to obtain the target generator network.
  • for the functions of each functional unit in the video generating device 10 described in this embodiment of the present application, reference may be made to the related descriptions of step S301 to step S307 in the method embodiment described in FIG. 3B; details are not repeated here.
  • FIG. 5 is a schematic structural diagram of another video generation device provided by an embodiment of the present application.
  • the device 20 includes at least one processor 501, at least one memory 502, and at least one communication interface 503.
  • the device may also include general components such as antennas, which will not be described in detail here.
  • the processor 501 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits used to control the execution of the programs of the above solutions.
  • the communication interface 503 is used to communicate with other devices or communication networks, such as Ethernet, wireless access network (RAN), core network, and wireless local area networks (WLAN).
  • the memory 502 can be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions; it can also be an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a compact disc read-only memory (Compact Disc Read-Only Memory, CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, and the like), a magnetic disk storage medium or another magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and is connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 502 is used to store application program codes for executing the above solutions, and the processor 501 controls the execution.
  • the processor 501 is configured to execute the application program code stored in the memory 502.
  • the code stored in the memory 502 can execute the above video generation method, for example: receiving a video generation instruction, and obtaining text information and picture information in response to the video generation instruction, where the text information includes one or more keywords, the picture information includes N pictures, and N is a positive integer greater than or equal to 1; obtaining, according to the one or more keywords, the image features corresponding to the one or more keywords in each of the N pictures; and inputting the one or more keywords and the image features of the N pictures into the target generator network to generate a target video, where the target video includes M pictures, the M pictures are pictures generated based on the image features of the N pictures and corresponding to the one or more keywords, and M is a positive integer greater than 1.
  • for the functions of each functional unit in the video generating device 20 described in this embodiment of the present application, reference may be made to the related descriptions of step S301 to step S307 in the method embodiment described in FIG. 3B; details are not repeated here.
  • the disclosed device may be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division into the above-mentioned units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical or other forms.
  • the units described above as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.
  • if the aforementioned integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium.
  • the technical solution of the present application essentially, or the part that contributes to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, a server, or a network device, and specifically may be a processor in a computer device) to execute all or part of the steps of the above methods of the various embodiments of the present application.
  • the aforementioned storage media may include various media that can store program code, such as a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, abbreviation: ROM), or a random access memory (Random Access Memory, abbreviation: RAM).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Business, Economics & Management (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Computation (AREA)
  • Strategic Management (AREA)
  • Data Mining & Analysis (AREA)
  • Quality & Reliability (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Artificial Intelligence (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Operations Research (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Processing Or Creating Images (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

本申请提供了一种视频生成方法及相关设备,可应用于人工智能领域中的图像处理、视频生成领域,其中,一种视频生成方法包括:接收视频生成指令,并响应于视频生成指令获取文本信息和图片信息,文本信息包括一个或多个关键字,图片信息包括N张图片;根据一个或多个关键字获取N张图片的每张图片中与一个或多个关键字对应的图像特征;将一个或多个关键字和N张图片的图像特征输入目标生成器网络中,生成目标视频,目标视频包括M张图片,M张图片为基于N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片。实施本申请实施例,保证了视频内容的丰富性的前提下,自动生成视频。

Description

一种视频生成方法及相关装置
本申请要求于2020年5月30日提交中国专利局、申请号为202010480675.6、申请名称为“一种视频生成方法及相关装置”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能领域,尤其涉及一种视频生成方法及相关装置。
背景技术
状态分享是当今新媒体社会内许多用户都会使用的方式,通过状态分享可以让他人了解自己,促进人与人之间的交流。例如:微信的地理位置状态分享、说说状态分享,抖音的视频分享等,丰富的状态分享能促进社交平台友好发展,提升用户生活交友体验。
然而,在社交平台上,单一的地理位置信息、文字或图片的分享使得用户间所能获取的信息较少,无法在视觉和听觉上同时满足需求。因此,为了满足用户在视觉上和听觉上的需求,可以拍摄视频后再分享。然而,拍摄视频后再分享,给用户带来一定的不便,需要耗费用户一定的时间进行手动摄影,而且拍摄的视频质量和内容,也很容易受到用户拍摄技术和拍摄条件的限制。如果直接利用图片合成视频,只局限于以幻灯片的形式切换式放映,缺少内容的丰富性。
因此,如何在保证视频内容的丰富性的前提下,自动生成视频,是亟待解决的问题。
发明内容
本申请实施例提供一种视频生成方法及相关装置,能够根据文本和图片生成视频,以便用户可以实时分享自己的生活状态。
第一方面,本申请实施例提供了一种视频生成方法,可包括:接收视频生成指令,并响应于所述视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片,N为大于或等于1的正整数;根据所述一个或多个关键字获取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征;将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。
实施本申请实施例,电子设备可以根据文本信息和图片信息自动生成视频,以便用户可以实时分享自己的生活状态。当电子设备接收视频生成指令后,可以响应于该视频生成指令获取文本信息和图片信息,其中,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片。由于所述文本信息可以用于描述生成视频的视频内容(如:所述一个或多个关键字可以包括人物、时间、地点、事件或动作等等),所述图片信息可以用于提取或生成每一帧的视频图片,因此,可以根据一个或多个关键字提取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征,再将所述一个或多个关键字和所述N张图片 的图像特征输入目标生成器网络中,生成目标视频。这种利用文本和图像共同生成视频,使得生成的视频可以根据输入的文本信息对输入的图片信息进行调整,大大的丰富了视频内容,避免了现有终端设备上的由多张图片直接堆叠生成的视频,只局限于以幻灯片的形式切换式放映,缺少内容的丰富性,同时也满足了用户需求。
在一种可能实现的方式中,所述响应于所述视频生成指令获取文本信息,包括:响应于所述视频生成指令,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个,获取所述文本信息,其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息中的一个或多个。实施本申请实施例,电子设备可以响应于所述视频生成指令获取用户特定输入的信息(文本输入信息、语音输入信息)、或者利用该电子设备上的传感器获取当前环境信息、又或者从历史交互信息中提取的用户偏好信息中提取文本信息,与获取的图片信息一起生成目标视频,这种利用多模态的信息提取所述文本信息以辅助视频的生成,使得所生成的视频可以反应出当前用户状态(如:生成的视频中的天气环境和所述用户当前所处的天气环境相同),其中,多模态信息可以包括文字、偏好信息、环境信息等等。例如,当用户没有或无法进行手动或语音输入时,还可以只依赖传感器获取的当前环境信息或者从历史交互信息中提取的偏好作为输入的文本信息,与输入的图片信息一起生成目标视频。
在一种可能实现的方式中,所述响应于所述视频生成指令获取图片信息,包括:响应于所述视频生成指令,从预先存储的多张图片中,获取与所述一个或多个关键字中至少一个关键字对应的图片。实施本申请实施例,电子设备可以根据文本信息,获取与文本信息相关的图片信息。例如:获取图片信息时,可以根据当前的地理位置信息或者用户输入的地点信息获取对应地点的图片,用于生成目标视频。例如:用户在故宫游览时,可以获得故宫相关的图片,用于合成目标视频,方便用户实时分享自己的生活状态。获取图片信息时,还可以根据用户输入的人物信息获取对应地点的图片,用于生成目标视频,满足了用户需求。例如:用户输入“小明在操场踢足球”,可以获得关键字“小明”、“操场”、“足球”的至少一张相关图片,用于合成目标视频。
在一种可能实现的方式中,所述视频生成指令包括人脸识别请求;所述响应于所述视频生成指令获取图片信息,包括:响应于所述视频生成指令,进行人脸识别并获得人脸识别结果;根据所述人脸识别结果,从预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张图片。实施本申请实施例,电子设备在获取图片信息时,可以首先通过人脸识别获得人脸识别结果,进而根据该人脸识别结果直接从预先存储的图片中获取包含有用户的图片。方便直接生成有关于用户的状态视频,及时分享用户的当前状态。例如:通过人脸识别,识别出是用户A后,可以从预先存储的多张图片中,获取用户A的图片,这种不需要用户筛选图片,就可以生成包含有用户A的视频,方便了用户操作,提升了用户体验。
在一种可能实现的方式中,所述视频生成指令包括至少一个图片标签,所述至少一个图片标签中每一个图片标签与预先存储的多张图片中的至少一张图片对应;所述响应于所述视频生成指令获取图片信息,包括:响应于所述视频生成指令,根据所述至少一个图片标签,从预先存储的多张图片中,获取与所述至少一个图片标签中每一个图片标签对应的 至少一张图片。实施本申请实施例,电子设备在获取图片信息时,可以通过视频生成指令中携带的至少一个图片标签,获取对应的至少一个图片,用于生成的目标视频。当用户想将几张图片,生成有趣的视频时,可以直接筛选出用户感兴趣或者需要的图片生成视频,满足了用户的观看需求。例如:用户可以选择图片标签为“猫”,在获取猫的多张图片后,与文本信息一起生成一段以猫为主角的动态视频;又例如:用户还可以选择图片标签为“童年的小明”,在获取到小明小时候的多张图片后,与文本信息一起生成一段有关小明童年的动态视频。
在一种可能实现的方式中,所述获取的所述N张图片中每张图片的图片质量均大于预设阈值。实施本申请实施例,在获取图片信息前,需要对即将选择的图片进行图片质量评分,当图片质量评分低于预设阈值时,则不使用该图片生成视频。而用图片质量均大于预设阈值的图片生成视频,可因此保证最终由此图片生成的目标视频的画质较高,满足了用户的观看体验。
在一种可能实现的方式中,所述方法还包括:将获取的所述N张图片进行图片质量评分,获得所述N张图片中每张图片对应的图片质量评分结果;将所述图片质量评分结果小于预设阈值的图片进行图片质量增强处理,并将图片质量增强后的图片更新至所述N张图片中。实施本申请实施例,在获取图片信息后,需要对获取的所有图片进行图片质量评分,当图片质量较差时可以对该图片进行图片质量增强,以提升在通过该图片生成视频时的视频质量,满足用户的观看体验。
在一种可能实现的方式中,所述将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,包括:提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量;提取所述N张图片的图像特征分别在向量空间上对应的第二空间变量;将所述第一空间变量和所述第二空间变量输入所述目标生成器网络中,生成所述目标视频。实施本申请实施例,可以首先分别提取文本信息对应的所述第一空间变量以及图片信息中每张图片的图像特征的第二空间向量,其中,所述第一空间变量可以是在隐向量空间中标识所述文本信息的词向量,所述每张图片的第二空间向量可以是在隐向量空间中标识该图片的图像特征的向量,通过提取空间向量有利于生成器网络更好地生成目标视频。如:通过所述Word2vec模型提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量,通过下采样的卷积网络提取所述N张图片的图像特征分别在向量空间上对应的所述第二空间变量。
在一种可能实现的方式中,所述方法还包括:获取样本文本信息、样本图片信息以及真实视频数据集,并构建判别器网络和基于视频生成的生成器网络;将所述样本文本信息和所述样本图片信息输入所述生成器网络中,生成样本视频;将所述样本视频和所述真实视频数据集作为所述判别器网络的输入,获得判别损失结果,其中,在所述样本视频属于所述真实视频数据集时,所述判别损失结果为1;根据所述判别损失结果,训练所述生成器网络获得所述目标生成器网络。实施本申请实施例,需要通过样本数据进行生成器网络和判别器网络的训练。其中,电子设备首先根据样本数据通过所述生成器网络生成视频,再将<生成的视频,真实的视频>输入到判别器中,判别器对输入进行判断来源,如果来源于生成的视频则判断为假0,否则判断为真1,通过这样多次重复的对抗训练方式,能进一 步的规范生成视频的内容,逐步的提升生成视频的真实性以及提升生成视频的质量,有利于视频的共享。
第二方面,本申请实施例提供了一种视频生成装置,包括:
接收响应单元,用于接收视频生成指令,并响应于所述视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片,N为大于或等于1的正整数;
提取单元,用于根据所述一个或多个关键字获取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征;
生成单元,用于将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。
在一种可能实现的方式中,所述接收响应单元,具体用于:响应于所述视频生成指令,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个,获取所述文本信息,其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息中的一个或多个。
在一种可能实现的方式中,所述接收响应单元,具体用于:响应于所述视频生成指令,从预先存储的多张图片中,获取与所述一个或多个关键字中至少一个关键字对应的图片。
在一种可能实现的方式中,所述视频生成指令包括人脸识别请求;所述接收响应单元,具体用于:响应于所述视频生成指令,进行人脸识别并获得人脸识别结果;根据所述人脸识别结果,从预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张图片。
在一种可能实现的方式中,所述视频生成指令包括至少一个图片标签,所述至少一个图片标签中每一个图片标签与预先存储的多张图片中的至少一张图片对应;所述接收响应单元,具体用于:响应于所述视频生成指令,根据所述至少一个图片标签,从预先存储的多张图片中,获取与所述至少一个图片标签中每一个图片标签对应的至少一张图片。
在一种可能实现的方式中,所述获取的所述N张图片中每张图片的图片质量均大于预设阈值。
在一种可能实现的方式中,所述装置还包括:评分单元,用于将获取的所述N张图片进行图片质量评分,获得所述N张图片中每张图片对应的图片质量评分结果;增强单元,用于将所述图片质量评分结果小于预设阈值的图片进行图片质量增强处理,并将图片质量增强后的图片更新至所述N张图片中。
在一种可能实现的方式中,所述生成单元,具体用于:提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量;提取所述N张图片的图像特征分别在向量空间上对应的第二空间变量;将所述第一空间变量和所述第二空间变量输入所述目标生成器网络中,生成所述目标视频。
在一种可能实现的方式中,所述装置还包括:训练单元,所述训练单元,用于:获取样本文本信息、样本图片信息以及真实视频数据集,并构建判别器网络和基于视频生成的生成器网络;将所述样本文本信息和所述样本图片信息输入所述生成器网络中,生成样本 视频;将所述样本视频和所述真实视频数据集作为所述判别器网络的输入,获得判别损失结果,其中,在所述样本视频属于所述真实视频数据集时,所述判别损失结果为1;根据所述判别损失结果,训练所述生成器网络获得所述目标生成器网络。
第三方面,本申请实施例提供一种电子设备,该终端设备中包括处理器,处理器被配置为支持该终端设备实现第一方面提供的视频生成方法中相应的功能。该终端设备还可以包括存储器,存储器用于与处理器耦合,其保存该终端设备必要的程序指令和数据。该终端设备还可以包括通信接口,用于该网络设备与其他设备或通信网络通信。
第四方面,本申请实施例提供一种计算机存储介质,用于储存为上述第二方面提供的一种视频生成装置所用的计算机软件指令,其包含用于执行上述方面所设计的程序。
第五方面,本申请实施例提供了一种计算机程序,该计算机程序包括指令,当该计算机程序被计算机执行时,使得计算机可以执行上述第二方面中的视频生成装置所执行的流程。
第六方面,本申请提供了一种芯片系统,该芯片系统包括处理器,用于支持终端设备实现上述第一方面中所涉及的功能,例如,生成或处理上述视频生成方法中所涉及的信息。在一种可能的设计中,所述芯片系统还包括存储器,所述存储器,用于保存数据发送设备必要的程序指令和数据。该芯片系统,可以由芯片构成,也可以包含芯片和其他分立器件。
附图说明
为了更清楚地说明本申请实施例或背景技术中的技术方案,下面将对本申请实施例或背景技术中所需要使用的附图进行说明。
图1A是本申请实施例提供的一种电子设备100的结构示意图。
图1B是本申请实施例提供的一种电子设备100的软件结构框图。
图2A是本申请实施例提供的一组接收视频生成指令的用户界面示意图。
图2B是本申请实施例提供的一组获取图片信息的用户界面示意图。
图2C是本申请实施例提供的一种图片质量评分示意图。
图2D是本申请实施例提供的一种显示文本信息的用户界面示意图。
图2E是本申请实施例提供的一组视频生成后分享至好友的用户界面。
图2F是本申请实施例提供的一种生成器网络训练流程示意图。
图2G是本申请实施例提供的一种视频生成的流程示意图。
图2H是本申请实施例提供的一组获取文本信息的用户界面。
图2I是本申请实施例提供的一组根据关键字获取图片信息的用户界面。
图2J是本申请实施例提供的一种生成视频的用户界面。
图2K是本申请实施例提供的一种基于用户偏好的视频生成流程示意图。
图2L是本申请实施例提供的另一组获取文本信息的用户界面。
图2M是本申请实施例提供的另一种视频生成的用户界面。
图3A是本申请实施例提供的一种视频生成装置的结构示意图。
图3B是本申请实施例提供的一种视频生成方法的流程示意图。
图4是本申请实施例提供的另一种视频生成装置的结构示意图。
图5是本申请实施例提供的又一种视频生成装置的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例进行描述。
本申请的说明书和权利要求书及所述附图中的术语“第一”和“第二”等是用于区别不同对象,而不是用于描述特定顺序。此外,术语“包括”和“具有”以及它们任何变形,意图在于覆盖不排他的包含。例如包含了一系列步骤或单元的过程、方法、系统、产品或设备没有限定于已列出的步骤或单元,而是可选地还包括没有列出的步骤或单元,或可选地还包括对于这些过程、方法、产品或设备固有的其它步骤或单元。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
在本说明书中使用的术语“部件”、“模块”、“系统”等用于表示计算机相关的实体、硬件、固件、硬件和软件的组合、软件、或执行中的软件。例如,部件可以是但不限于,在处理器上运行的进程、处理器、对象、可执行文件、执行线程、程序和/或计算机。通过图示,在计算设备上运行的应用和计算设备都可以是部件。一个或多个部件可驻留在进程和/或执行线程中,部件可位于一个计算机上和/或分布在2个或更多个计算机之间。此外,这些部件可从在上面存储有各种数据结构的各种计算机可读介质执行。部件可例如根据具有一个或多个数据分组(例如来自与本地系统、分布式系统和/或网络间的另一部件交互的二个部件的数据,例如通过信号与其它系统交互的互联网)的信号通过本地和/或远程进程来通信。
首先,对本申请中的部分用语进行解释说明,以便于本领域技术人员理解。
(1)循环神经网络(Recurrent Neural Networks,RNN):是一类以序列(sequence)数据为输入,在序列的演进方向进行递归(recursion)且所有节点(循环单元)按链式连接的递归神经网络(recursive neural network)。循环神经网络具有记忆性、参数共享并且图灵完备(Turing completeness),因此在对序列的非线性特征进行学习时具有一定优势。循环神经网络在自然语言处理(Natural Language Processing,NLP),例如语音识别、语言建模、机器翻译等领域有应用,也被用于各类时间序列预报。引入了卷积神经网络(Convoutional Neural Network,CNN)构筑的循环神经网络可以处理包含序列输入的计算机视觉问题。
(3)手势识别,手势识别旨在识别人类的物理运动或“手势”,可以基于将人类运动识别为输入形式。手势识别也被分类为一种非接触式用户界面,与触摸屏设备不同,具有非接触式用户界面的设备无需触摸即可控制,该设备可以具有一个或多个传感器或摄像头,可监控用户的移动,当它检测到与命令相对应的移动时,它会以适当的输出响应。例如,在设备前面以特定模式挥动您的手可能会告诉它启动特定的应用程序。
(4)Word2vec模型,为一群用来产生词向量的相关模型。这些模型为浅层双层的神经网络,用来训练以重新建构语言学之词文本。网络以词表现,并且需猜测相邻位置的输 入词,在word2vec中词袋模型假设下,词的顺序是不重要的。训练完成之后,word2vec模型可用来映射每个词到一个向量,可用来表示词对词之间的关系。
接下来,介绍本申请以下实施例中提供的示例性电子设备。
请参考附图1A,图1A是本申请实施例提供的一种电子设备100的结构示意图,其中,电子设备100可以包括处理器110,外部存储器接口120,内部存储器121,通用串行总线(universal serial bus,USB)接口130,充电管理模块140,电源管理模块141,电池142,天线1,天线2,移动通信模块150,无线通信模块160,音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,传感器模块180,按键190,马达191,指示器192,摄像头193,显示屏194,以及用户标识模块(subscriber identification module,SIM)卡接口195等。其中传感器模块180可以包括压力传感器180A,陀螺仪传感器180B,气压传感器180C,磁传感器180D,加速度传感器180E,距离传感器180F,接近光传感器180G,指纹传感器180H,温度传感器180J,触摸传感器180K,环境光传感器180L,骨传导传感器180M等。
可以理解的是,本申请实施例示意的结构并不构成对电子设备100的具体限定。在本申请另一些实施例中,电子设备100可以包括比图示更多或更少的部件,或者组合某些部件,或者拆分某些部件,或者不同的部件布置。图示的部件可以以硬件,软件或软件和硬件的组合实现。
处理器110可以包括一个或多个处理单元,例如:处理器110可以包括应用处理器(application processor,AP),调制解调处理器,图形处理器(graphics processing unit,GPU),图像信号处理器(image signal processor,ISP),控制器,存储器,视频编解码器,数字信号处理器(digital signal processor,DSP),基带处理器,和/或神经网络处理器(neural-network processing unit,NPU)等。其中,不同的处理单元可以是独立的器件,也可以集成在一个或多个处理器中。
其中,控制器可以是电子设备100的神经中枢和指挥中心。控制器可以根据指令操作码和时序信号,产生操作控制信号,完成取指令和执行指令的控制。
处理器110中还可以设置存储器,用于存储指令和数据。在一些实施例中,处理器110中的存储器为高速缓冲存储器。该存储器可以保存处理器110刚用过或循环使用的指令或数据。如果处理器110需要再次使用该指令或数据,可从所述存储器中直接调用。避免了重复存取,减少了处理器110的等待时间,因而提高了系统的效率。例如:在本申请中处理器可以:接收视频生成指令,并响应于所述视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片;根据所述一个或多个关键字获取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征;将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片。
在一些实施例中,处理器110可以包括一个或多个接口。接口可以包括集成电路(inter-integrated circuit,I2C)接口,集成电路内置音频(inter-integrated circuit sound,I2S)接 口,脉冲编码调制(pulse code modulation,PCM)接口,通用异步收发传输器(universal asynchronous receiver/transmitter,UART)接口,移动产业处理器接口(mobile industry processor interface,MIPI),通用输入输出(general-purpose input/output,GPIO)接口,用户标识模块(subscriber identity module,SIM)接口,和/或通用串行总线(universal serial bus,USB)接口等。
可以理解的是,本申请实施例示意的各模块间的接口连接关系,只是示意性说明,并不构成对电子设备100的结构限定。在本申请另一些实施例中,电子设备100也可以采用上述实施例中不同的接口连接方式,或多种接口连接方式的组合。
充电管理模块140用于从充电器接收充电输入。其中,充电器可以是无线充电器,也可以是有线充电器。在一些有线充电的实施例中,充电管理模块140可以通过USB接口130接收有线充电器的充电输入。在一些无线充电的实施例中,充电管理模块140可以通过电子设备100的无线充电线圈接收无线充电输入。充电管理模块140为电池142充电的同时,还可以通过电源管理模块141为电子设备供电。
电源管理模块141用于连接电池142,充电管理模块140与处理器110。电源管理模块141接收电池142和/或充电管理模块140的输入,为处理器110,内部存储器121,外部存储器,显示屏194,摄像头193,和无线通信模块160等供电。电源管理模块141还可以用于监测电池容量,电池循环次数,电池健康状态(漏电,阻抗)等参数。在其他一些实施例中,电源管理模块141也可以设置于处理器110中。在另一些实施例中,电源管理模块141和充电管理模块140也可以设置于同一个器件中。
电子设备100的无线通信功能可以通过天线1,天线2,移动通信模块150,无线通信模块160,调制解调处理器以及基带处理器等实现。
天线1和天线2用于发射和接收电磁波信号。电子设备100中的每个天线可用于覆盖单个或多个通信频带。不同的天线还可以复用,以提高天线的利用率。例如:可以将天线1复用为无线局域网的分集天线。在另外一些实施例中,天线可以和调谐开关结合使用。
移动通信模块150可以提供应用在电子设备100上的包括2G/3G/4G/5G等无线通信的解决方案。移动通信模块150可以包括至少一个滤波器,开关,功率放大器,低噪声放大器(low noise amplifier,LNA)等。移动通信模块150可以由天线1接收电磁波,并对接收的电磁波进行滤波,放大等处理,传送至调制解调处理器进行解调。移动通信模块150还可以对经调制解调处理器调制后的信号放大,经天线1转为电磁波辐射出去。在一些实施例中,移动通信模块150的至少部分功能模块可以被设置于处理器110中。在一些实施例中,移动通信模块150的至少部分功能模块可以与处理器110的至少部分模块被设置在同一个器件中。
调制解调处理器可以包括调制器和解调器。在一些实施例中,调制解调处理器可以是独立的器件。在另一些实施例中,调制解调处理器可以独立于处理器110,与移动通信模块150或其他功能模块设置在同一个器件中。
无线通信模块160可以提供应用在电子设备100上的包括无线局域网(wireless local area networks,WLAN)(如无线保真(wireless fidelity,Wi-Fi)网络),蓝牙(bluetooth,BT),全球导航卫星系统(global navigation satellite system,GNSS),调频(frequency modulation, FM),近距离无线通信技术(near field communication,NFC),红外技术(infrared,IR)等无线通信的解决方案。无线通信模块160可以是集成至少一个通信处理模块的一个或多个器件。无线通信模块160经由天线2接收电磁波,将电磁波信号调频以及滤波处理,将处理后的信号发送到处理器110。无线通信模块160还可以从处理器110接收待发送的信号,对其进行调频,放大,经天线2转为电磁波辐射出去。
在一些实施例中,电子设备100的天线1和移动通信模块150耦合,天线2和无线通信模块160耦合,使得电子设备100可以通过无线通信技术与网络以及其他设备通信。所述无线通信技术可以包括全球移动通讯系统(global system for mobile communications,GSM),通用分组无线服务(general packet radio service,GPRS),码分多址接入(code division multiple access,CDMA),宽带码分多址(wideband code division multiple access,WCDMA),时分码分多址(time-division code division multiple access,TD-SCDMA),长期演进(long term evolution,LTE),BT,GNSS,WLAN,NFC,FM,和/或IR技术等。所述GNSS可以包括全球卫星定位系统(global positioning system,GPS),全球导航卫星系统(global navigation satellite system,GLONASS),北斗卫星导航系统(beidou navigation satellite system,BDS),准天顶卫星系统(quasi-zenith satellite system,QZSS)和/或星基增强系统(satellite based augmentation systems,SBAS)。
电子设备100通过GPU,显示屏194,以及应用处理器等实现显示功能。GPU为图像处理的微处理器,连接显示屏194和应用处理器。GPU用于执行数学和几何计算,用于图形渲染。处理器110可包括一个或多个GPU,其执行程序指令以生成或改变显示信息。
显示屏194用于显示图像,视频等。显示屏194包括显示面板。显示面板可以采用液晶显示屏(liquid crystal display,LCD),有机发光二极管(organic light-emitting diode,OLED),有源矩阵有机发光二极体或主动矩阵有机发光二极体(active-matrix organic light emitting diode的,AMOLED),柔性发光二极管(flex light-emitting diode,FLED),Miniled,MicroLed,Micro-oLed,量子点发光二极管(quantum dot light emitting diodes,QLED)等。在一些实施例中,电子设备100可以包括1个或N个显示屏194,N为大于1的正整数。
电子设备100可以通过ISP,摄像头193,视频编解码器,GPU,显示屏194以及应用处理器等实现拍摄功能。
ISP用于处理摄像头193反馈的数据。例如,拍照时,打开快门,光线通过镜头被传递到摄像头感光元件上,光信号转换为电信号,摄像头感光元件将所述电信号传递给ISP处理,转化为肉眼可见的图像。ISP还可以对图像的噪点,亮度,肤色进行算法优化。ISP还可以对拍摄场景的曝光,色温等参数优化。在一些实施例中,ISP可以设置在摄像头193中。
摄像头193用于捕获静态图像或视频。物体通过镜头生成光学图像投射到感光元件。感光元件可以是电荷耦合器件(charge coupled device,CCD)或互补金属氧化物半导体(complementary metal-oxide-semiconductor,CMOS)光电晶体管。感光元件把光信号转换成电信号,之后将电信号传递给ISP转换成数字图像信号。ISP将数字图像信号输出到DSP加工处理。DSP将数字图像信号转换成标准的RGB,YUV等格式的图像信号。在一些实施例中,电子设备100可以包括1个或N个摄像头193,N为大于1的正整数。
数字信号处理器用于处理数字信号,除了可以处理数字图像信号,还可以处理其他数字信号。例如,当电子设备100在频点选择时,数字信号处理器用于对频点能量进行傅里叶变换等。
视频编解码器用于对数字视频压缩或解压缩。电子设备100可以支持一种或多种视频编解码器。这样,电子设备100可以播放或录制多种编码格式的视频,例如:动态图像专家组(moving picture experts group,MPEG)1,MPEG2,MPEG3,MPEG4等。
NPU为神经网络计算处理单元(Neural-network Processing Unit),通过借鉴生物神经网络结构,例如借鉴人脑神经元之间传递模式,对输入信息快速处理,还可以不断的自学习。通过NPU可以实现电子设备100的智能认知等应用,例如:图像识别,人脸识别,语音识别,文本理解等。
外部存储器接口120可以用于连接外部存储卡,例如Micro SD卡,实现扩展电子设备100的存储能力。外部存储卡通过外部存储器接口120与处理器110通信,实现数据存储功能。例如将音乐,视频等文件保存在外部存储卡中。
内部存储器121可以用于存储计算机可执行程序代码,所述可执行程序代码包括指令。处理器110通过运行存储在内部存储器121的指令,从而执行电子设备100的各种功能应用以及数据处理。内部存储器121可以包括存储程序区和存储数据区。其中,存储程序区可存储操作系统,至少一个功能所需的应用程序(比如声音播放功能,图像播放功能等)等。存储数据区可存储电子设备100使用过程中所创建的数据(比如音频数据,电话本等)等。此外,内部存储器121可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件,闪存器件,通用闪存存储器(universal flash storage,UFS)等。
电子设备100可以通过音频模块170,扬声器170A,受话器170B,麦克风170C,耳机接口170D,以及应用处理器等实现音频功能。例如音乐播放,录音等。
音频模块170用于将数字音频信息转换成模拟音频信号输出,也用于将模拟音频输入转换为数字音频信号。音频模块170还可以用于对音频信号编码和解码。在一些实施例中,音频模块170可以设置于处理器110中,或将音频模块170的部分功能模块设置于处理器110中。
扬声器170A,也称“喇叭”,用于将音频电信号转换为声音信号。电子设备100可以通过扬声器170A收听音乐,或收听免提通话。
受话器170B,也称“听筒”,用于将音频电信号转换成声音信号。当电子设备100接听电话或语音信息时,可以通过将受话器170B靠近人耳接听语音。
麦克风170C,也称“话筒”,“传声器”,用于将声音信号转换为电信号。当拨打电话或发送语音信息时,用户可以通过人嘴靠近麦克风170C发声,将声音信号输入到麦克风170C。电子设备100可以设置至少一个麦克风170C。在另一些实施例中,电子设备100可以设置两个麦克风170C,除了采集声音信号,还可以实现降噪功能。在另一些实施例中,电子设备100还可以设置三个,四个或更多麦克风170C,实现采集声音信号,降噪,还可以识别声音来源,实现定向录音功能等。
耳机接口170D用于连接有线耳机。耳机接口170D可以是USB接口130,也可以是3.5mm的开放移动电子设备平台(open mobile terminal platform,OMTP)标准接口,美国蜂 窝电信工业协会(cellular telecommunications industry association of the USA,CTIA)标准接口。
压力传感器180A用于感受压力信号,可以将压力信号转换成电信号。在一些实施例中,压力传感器180A可以设置于显示屏194。压力传感器180A的种类很多,如电阻式压力传感器,电感式压力传感器,电容式压力传感器等。电容式压力传感器可以是包括至少两个具有导电材料的平行板。当有力作用于压力传感器180A,电极之间的电容改变。电子设备100根据电容的变化确定压力的强度。当有触控操作作用于显示屏194,电子设备100根据压力传感器180A检测所述触控操作强度。电子设备100也可以根据压力传感器180A的检测信号计算触摸的位置。在一些实施例中,作用于相同触摸位置,但不同触控操作强度的触控操作,可以对应不同的操作指令。例如:当有触控操作强度小于第一压力阈值的触控操作作用于短消息应用图标时,执行查看短消息的指令。当有触控操作强度大于或等于第一压力阈值的触控操作作用于短消息应用图标时,执行新建短消息的指令。
陀螺仪传感器180B可以用于确定电子设备100的运动姿态。
气压传感器180C用于测量气压。
磁传感器180D包括霍尔传感器。
加速度传感器180E可检测电子设备100在各个方向上(一般为三轴)加速度的大小。当电子设备100静止时可检测出重力的大小及方向。还可以用于识别电子设备姿态,应用于横竖屏切换,计步器等应用。
距离传感器180F,用于测量距离。
接近光传感器180G可以包括例如发光二极管(LED)和光检测器,例如光电二极管。
环境光传感器180L用于感知环境光亮度。
指纹传感器180H用于采集指纹。电子设备100可以利用采集的指纹特性实现指纹解锁,访问应用锁,指纹拍照,指纹接听来电等。
温度传感器180J用于检测温度。在一些实施例中,电子设备100利用温度传感器180J检测的温度,执行温度处理策略。
触摸传感器180K,也称“触控面板”。触摸传感器180K可以设置于显示屏194,由触摸传感器180K与显示屏194组成触摸屏,也称“触控屏”。触摸传感器180K用于检测作用于其上或附近的触控操作。触摸传感器可以将检测到的触控操作传递给应用处理器,以确定触摸事件类型。可以通过显示屏194提供与触控操作相关的视觉输出。在另一些实施例中,触摸传感器180K也可以设置于电子设备100的表面,与显示屏194所处的位置不同。
骨传导传感器180M可以获取振动信号。在一些实施例中,骨传导传感器180M可以获取人体声部振动骨块的振动信号。骨传导传感器180M也可以接触人体脉搏,接收血压跳动信号。在一些实施例中,骨传导传感器180M也可以设置于耳机中,结合成骨传导耳机。音频模块170可以基于所述骨传导传感器180M获取的声部振动骨块的振动信号,解析出语音信号,实现语音功能。应用处理器可以基于所述骨传导传感器180M获取的血压跳动信号解析心率信息,实现心率检测功能。
按键190包括开机键,音量键等。按键190可以是机械按键。也可以是触摸式按键。电子设备100可以接收按键输入,产生与电子设备100的用户设置以及功能控制有关的键 信号输入。
马达191可以产生振动提示。马达191可以用于来电振动提示,也可以用于触摸振动反馈。例如,作用于不同应用(例如拍照,音频播放等)的触控操作,可以对应不同的振动反馈效果。作用于显示屏194不同区域的触控操作,马达191也可对应不同的振动反馈效果。不同的应用场景(例如:时间提醒,接收信息,闹钟,游戏等)也可以对应不同的振动反馈效果。触摸振动反馈效果还可以支持自定义。
指示器192可以是指示灯,可以用于指示充电状态,电量变化,也可以用于指示消息,未接来电,通知等。
SIM卡接口195用于连接SIM卡。SIM卡可以通过插入SIM卡接口195,或从SIM卡接口195拔出,实现和电子设备100的接触和分离。电子设备100可以支持1个或N个SIM卡接口,N为大于1的正整数。SIM卡接口195可以支持Nano SIM卡,Micro SIM卡,SIM卡等。同一个SIM卡接口195可以同时插入多张卡。所述多张卡的类型可以相同,也可以不同。SIM卡接口195也可以兼容不同类型的SIM卡。SIM卡接口195也可以兼容外部存储卡。电子设备100通过SIM卡和网络交互,实现通话以及数据通信等功能。在一些实施例中,电子设备100采用eSIM,即:嵌入式SIM卡。eSIM卡可以嵌在电子设备100中,不能和电子设备100分离。
电子设备100的软件系统可以采用分层架构,事件驱动架构,微核架构,微服务架构,或云架构。本申请实施例以分层架构的Android系统为例,示例性说明电子设备100的软件结构。请参考附图1B,图1B是本申请实施例提供的一种电子设备100的软件结构框图。
可以理解的是,本申请实施例示意的软件结构框图并不构成对电子设备100的软件结构框图具体限定。
分层架构将软件分成若干个层,每一层都有清晰的角色和分工。层与层之间通过软件接口通信。在一些实施例中,将Android系统分为四层,从上至下分别为应用程序层,应用程序框架层,安卓运行时(Android runtime)和系统库,以及内核层。
应用程序层可以包括一系列应用程序包。
如图1B所示,应用程序包可以包括相机,图库,日历,通话,地图,导航,WLAN,蓝牙,音乐,视频,短信息等应用程序。
应用程序框架层为应用程序层的应用程序提供应用编程接口(application programming interface,API)和编程框架。应用程序框架层包括一些预先定义的函数。
如图1B所示,应用程序框架层可以包括窗口管理器,内容提供器,视图系统,电话管理器,资源管理器,通知管理器等。
窗口管理器用于管理窗口程序。窗口管理器可以获取显示屏大小,判断是否有状态栏,锁定屏幕,截取屏幕等。
内容提供器用来存放和获取数据,并使这些数据可以被应用程序访问。所述数据可以包括视频,图像,音频,拨打和接听的电话,浏览历史和书签,电话簿等。
视图系统包括可视控件,例如显示文字的控件,显示图片的控件等。视图系统可用于构建应用程序。显示界面可以由一个或多个视图组成的。例如,包括短信通知图标的显示 界面,可以包括显示文字的视图以及显示图片的视图。
电话管理器用于提供电子设备100的通信功能。例如通话状态的管理(包括接通,挂断等)。
资源管理器为应用程序提供各种资源,比如本地化字符串,图标,图片,布局文件,视频文件等等。
通知管理器使应用程序可以在状态栏中显示通知信息,可以用于传达告知类型的消息,可以短暂停留后自动消失,无需用户交互。比如通知管理器被用于告知下载完成,消息提醒等。通知管理器还可以是以图表或者滚动条文本形式出现在系统顶部状态栏的通知,例如后台运行的应用程序的通知,还可以是以对话窗口形式出现在屏幕上的通知。例如在状态栏提示文本信息,发出提示音,电子设备振动,指示灯闪烁等。
Android Runtime包括核心库和虚拟机。Android runtime负责安卓系统的调度和管理。
核心库包含两部分:一部分是java语言需要调用的功能函数,另一部分是安卓的核心库。
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的java文件执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。
系统库可以包括多个功能模块。例如:表面管理器(surface manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),2D图形引擎(例如:SGL)等。
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了2D和3D图层的融合。
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,G.264,MP3,AAC,AMR,JPG,PNG等。
三维图形处理库用于实现三维图形绘图,图像渲染,合成,和图层处理等。
2D图形引擎是2D绘图的绘图引擎。
内核层是硬件和软件之间的层。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动。
图1B所示的软件系统涉及到使用分享能力的应用呈现(如图库,文件管理器),提供分享能力的即时分享模块,提供打印能力的打印服务(print service)和打印后台服务(print spooler),以及应用框架层提供打印框架、WLAN服务、蓝牙服务,以及内核和底层提供WLAN蓝牙能力和基本通信协议。
下面结合捕获拍照场景,示例性说明电子设备100软件以及硬件的工作流程。
当触摸传感器180K接收到触摸操作,相应的硬件中断被发给内核层。内核层将触摸操作加工成原始输入事件(包括触摸坐标,触摸操作的时间戳等信息)。原始输入事件被存储在内核层。应用程序框架层从内核层获取原始输入事件,识别该输入事件所对应的控件。以该触摸操作是触摸操作,该触摸操作所对应的控件为相机应用图标的控件为例,相机应用调用应用框架层的接口,启动相机应用,进而通过调用内核层启动摄像头驱动,通过3D摄像模组193捕获静态图像或视频。
下面介绍本申请实施例涉及的几个应用场景以及各个应用场景下的用户界面(user interface,UI)实施例。需要说明的是,本申请实施例中提到的用户界面可以理解为本申请中用于分享观看视频的窗口。
应用场景1:基于人脸识别的视频生成
许多的用户间信息分享的操作方式都比较繁复,为了提升状态分享的可操控性和用户体验,可以使用图片自动生成的视频进行状态共享。当用户在某个旅游景点游玩、在郊外跑步等时,可以分享自己的游玩视频,此时用户可以首先通过人脸识别获得人脸识别结果,进而根据该人脸识别结果直接从预先存储的图片中获取包含有用户的图片。方便直接生成有关于用户的状态视频,及时分享用户的当前状态。
在该场景下,根据手机用户的位置,通过传感器获取当前环境信息,如:当前天气信息为阴天、温度为26度、当前时间是上午10:40、当前地理位置是xx学校的操场上、当前运动状态为跑步,电子设备100根据所提供的周围的当前环境信息,构建出生成视频的大致环境,并以人脸识别的结果作为视频的主角,结合所选取的终端设备上的图像,生成一段用户在阴天条件下,温度大致为26度,在操场上跑步的一段状态视频,进行分享。
基于前述场景,下面介绍电子设备100上实现的一些UI实施例。
在该基于人脸识别的状态视频生成的场景下,电子设备100可以接收发送的视频生成指令,并响应于所述视频生成指令进行人脸识别获取人脸识别对应的人物图片,并响应于所述视频生成指令通过电子设备100的传感器180获取当前环境信息,用户的生理数据,将其转换为文本信息和获取的图像信息一起输入目标生成器网络中,获得目标视频。此外,检测到当前环境可以作用于生成视频的背景。
下面从以下几个方面进行详细说明。
(1)如何获取图片信息。
请参考附图2A,图2A是本申请实施例提供的一组接收视频生成指令的用户界面示意图。
具体的,如图2A中(1)所示,电子设备100可以通过触摸传感器180K检测到用户的触控操作(如,触摸传感器180K识别出用户在窗口显示区201的状态栏中做出的下拉操作),响应于该触控操作,如图2A中(2)所示,电子设备100可以显示完整的状态栏202并识别到用户对状态栏中即时分享203的触控操作(如,触摸传感器180K识别出用户在窗口显示区201的状态栏中做出的下拉操作)。当触摸传感器180K检测到对状态栏中即时分享203的触控操作时,即可获取到视频生成指令。
请参考附图2B,图2B是本申请实施例提供的一组获取图片信息的用户界面示意图,响应对状态栏中即时分享203的触控操作,电子设备100的用户界面如图2B中(1)所示,当触摸传感器180K检测用户对图2B中(1)所示的人脸识别控件204的触控操作时,可以开启人脸识别程序进行人脸识别。即:响应于所述人脸识别请求,通过所述电子设备100进行人脸识别并获得人脸识别结果。例如:用户点击进行识别人脸,电子设备100即可以响应于所述视频生成指令,进行人脸识别并获得人脸识别结果;根据所述人脸识别结果,从预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张图片。如图2B中(2)所示,电子设备100根据所述人脸识别结果,从所述第一终端预先存储的多张图片中,获 取与所述人脸识别结果匹配的至少一张人物图片为所述图片信息。例如,电子设备100根据人脸识别结果获得了两张人物图片。
可选的,若根据人脸识别结果,无法在预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张人物图片,则可以根据该人脸识别时的人脸图像直接生成对应的人物图片为所述图片信息。
可选的,若根据人脸识别结果,在预先存储的多张图片中获得很多张与所述人脸识别结果匹配的人物图片,则可以根据拍摄的时间、人物图片的图片质量、人物图片的大小,从多张人物图片选择预设数量的几张图片。例如,预先存储的多张图片中有一百张人物图片,电子设备根据拍摄时间,选择与当前时间最近的五张人物图片作为输入生成器网络的图片信息。
可选的,所述获取的所述N张图片中每张图片的图片质量均大于预设阈值。实施本申请实施例,在获取图片信息前,需要对即将选择的图片进行图片质量评分,当图片质量评分低于预设阈值时,则不使用该图片生成视频。而用图片质量均大于预设阈值的图片生成视频,可因此保证最终由此图片生成的目标视频的画质较高,满足了用户的观看体验。
可选的,所述方法还包括:将获取的所述N张图片进行图片质量评分,获得所述N张图片中每张图片对应的图片质量评分结果;将所述图片质量评分结果小于预设阈值的图片进行图片质量增强处理,并将图片质量增强后的图片更新至所述N张图片中。用户的终端设备或者云端一般都存储着用户一定量的图片,可通过美学评估的方式,自动选取高质量的静态图片。例如:当两张图片的质量相差甚远时,一个图像质量清晰度较高,而另一个图像比较模糊,无法捕捉到图像的具体细节,不利于实时状态视频的生成。因此,可以采用现有的图像打分网络,分别把两张图片输入到图像打分网络里,分别得到两张图片的图片质量分数,分数越高则说明该图片的质量越好,则选取评分高的图片,作为静态图片输入到生成器网络里,以提升在通过该图片生成视频时的视频质量,有利于视频的生成,满足用户的观看体验。例如:请参考附图2C,图2C是本申请实施例提供的一种图片质量评分示意图。如图2C所示,响应于所述视频生成指令,获取两张图片,图片A和图片B,将获取的两张图片分别输入到图片质量评分模型中,获得每张图片对应的图片质量评分结果;将所述图片质量评分结果大于预设阈值的图片作为视频图片添加至所述图片信息中。
不限于上述列出的通过触摸传感器180K识别触控操作时的用户操作,在具体实现中还可以有其他的识别用户操作的方式。例如,电子设备100还可以通过红外传感器或手势传感器等识别用户的在电子设备屏幕前的手势操作等,本申请实施例对此不作限定。
(2)如何获取文本信息。
具体的,电子设备100可以响应所述视频生成指令获取文本信息,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个,获取所述文本信息,其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息中的一个或多个。所述文本信息为由文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个通过关键字提取获取的文本,其中,获取的一个或多个关键字可以包括人物、时间、地点、事件或动作等等,用于指示生成视频的视频内容。例如:电子设备100可以通过用户输入获取文本输入信息、语音输入信息, 并通过关键字提取,从所述文本输入信息、语音输入信息中获取关于人物、时间、地点、事件中的一个或多个的文本信息;电子设备100还可以通过用户的历史浏览记录、历史输入系信息等获取用户的偏好信息,再从所述用户偏好信息中的获取关于用户兴趣、用户浏览、搜索或出现频率最高的文本信息;电子设备100还可以通过传感器180获取用户生理数据信息或当前环境信息,然后通过关键字提取获取关于人物、时间、地点、事件、运动状态、心理状态中的一个或多个的文本信息;其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息、当前运动状态中的一个或多个。
例如:请参考附图2D,图2D是本申请实施例提供的一种显示文本信息的用户界面示意图。如图2D所示,电子设备100可以通过GPS定位系统获取当前地理位置信息,进一步的根据所述当前时间信息和当前地理位置信息获取改地点对应的当前天气信息,还可以获取用户的生理数据。即,电子设备100可以根据当前天气信息为阴天、温度为26度、当前时间是上午10:40、当前地理位置是xx运动场、当前运动状态为跑步,获取时间、天气、地点、运动状态的文本信息。电子设备100这种利用多模态的信息提取所述文本信息以辅助视频的生成,使得所生成的视频可以反应出当前用户状态(如:生成的视频中的天气环境和所述用户当前所处的天气环境相同),其中,多模态信息可以包括文字、偏好信息、环境信息等等。例如,当用户没有或无法进行手动或语音输入时,还可以只依赖传感器获取的当前环境信息或者从历史交互信息中提取的偏好作为输入的文本信息,与输入的图片信息一起生成目标视频。
需要说明的是,在视频生成过程中对响应于所述视频生成指令,获取文本信息和图片信息的先后顺序不做具体的限定。例如:可以先获取文字信息再获取图片信息;也可以先获取图片信息再获取文字信息;还可以文本信息和图片信息同时获取。
不限于上述列出的响应所述视频生成指令获取文本信息的方式,在具体实现中还可以有其他的获取文本信息的方式。例如,电子设备100还可以通过图像识别,从获得图片信息中提取出有关于人物、时间、地点、事件中的一个或多个的文本信息等,本申请实施例对此不作限定。
(3)如何生成视频。
具体的,电子设备100根据所述一个或多个关键字提取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征;将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。即,将获得的所述文本信息和所述图像信息输入到目标生成器网络中,获得目标视频,所述目标视频用于描述所述文本信息。这种利用文本和图像共同生成视频,使得生成的视频可以根据输入的文本信息对输入的图片信息进行调整,大大的丰富了视频内容,避免了现有终端设备上的由多张图片直接堆叠生成的视频,只局限于以幻灯片的形式切换式放映,缺少内容的丰富性。例如:电子设备100获取到用户A喝奶茶的图片,以及在操场上散步的文本信息,电子设备100可以根据文本信息提取出上述图片中的用户A的图像特征,通过生成器网络生成M张图片并合成用户A在操场上散步的目标视频。
可选的,提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变 量;提取所述N张图片的图像特征分别在向量空间上对应的所述第二空间变量;将所述第一空间变量和所述第二空间变量输入所述目标生成器网络中,生成所述目标视频。实施本申请实施例,可以首先分别提取文本信息对应的所述第一空间变量以及图片信息中每张图片的图像特征的第二空间向量,其中,所述第一空间变量可以是在隐向量空间中标识所述文本信息的词向量,所述每张图片的第二空间向量可以是在隐向量空间中标识该图片的图像特征的向量,通过提取空间向量有利于生成器网络更好地生成目标视频。
例如:通过所述Word2vec模型提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量,通过下采样的卷积网络提取所述N张图片的图像特征分别在向量空间上对应的所述第二空间变量。首先对输入的图片采用一个下采样的卷积网络,提取图片的空间向量(latent space),对输入的文本信息应用Word2vec模型提取文本的latent space,把图片和文本的latent space作为视频生成器网络的输入,进行视频的生成。
可选的,在生成目标视频后,还可以获取与所述一个或多个关键字中至少一个关键字匹配的目标音频信息,将所述目标音频信息添加至目标视频中,以获得带有声音的视频,满足了用户在视觉和听觉的共同需求。
可选的,接收视频分享指令,并响应于所述视频分享指令将所述目标视频发送至目标设备。电子设备100通过将生成的目标视频分享至目标终端设备中,可以促进社交平台友好发展,提升用户生活交友体验。请参考附图2E,图2E是本申请实施例提供的一组视频生成后分享至好友的用户界面。如图2E中(1)所示,电子设备100将上述图2D中获取时间、天气、地点、运动状态的文本信息以及上述图2B中获取的图片信息,通过生成器网络生成目标视频,其中该视频可以进行查看或分享。如图2E中(2)所示,该用户界面可以是聊天工具提供的状态视频分享界面。不限于此,该用户界面还可以是其他应用程序提供的用于状态视频分享界面,其他应用程序可以是社交软件等。
可选的,在生成目标视频之前,电子设备还需要训练目标生成器网络。即,获取样本文本信息、样本图片信息以及真实视频数据集,并构建判别器网络和基于视频生成的生成器网络;将所述样本文本信息和所述样本图片信息输入所述生成器网络中,获得样本视频;所述样本视频和所述真实视频数据集作为所述判别器网络的输入,获得判别损失结果,并根据所述判别损失结果训练所述生成器网络获得所述目标生成器网络,其中,所述判别损失结果在所述样本视频属于所述真实视频数据集时为真。通过样本数据进行生成器网络和判别器网络的训练。其中,请参考附图2F,图2F是本申请实施例提供的一种生成器网络训练流程示意图,如图2F所示,根据样本数据通过所述生成器网络生成视频,再将<生成的视频,真实的视频>输入到判别器中,其中,真实的视频为真实世界获取的视频,判别器对输入进行判断来源,如果来源于生成的视频则判断为假0,否则判断为真1,通过这样多次重复的对抗训练方式,能进一步的规范生成视频的内容,逐步的提升生成视频的真实性以及提升生成视频的质量,有利于实时状态视频的共享。
应用场景2:基于用户输入的视频生成。
用户A手头上有一些之前的老照片,但遗憾于当时有些事情没有去完成想要重新体验一下该画面感,则可以通过语音输入或者文字输入,来描述当时的画面。比如:用户A手 头上有一张孙子的老照片,用户A想看看孙子踢球的样子,那么用户A对着终端设备说:“孙子在绿油油的草地上,无忧无虑的踢足球”,那么状态视频生成系统则自动提取“绿油油的草地”、“踢足球”、“孙子”,这几个关键字,并基于用户A终端设备上的孙子的照片,则可以生成一段符合用户A需求的视频。
基于前述场景,下面介绍电子设备100上实现的一些UI实施例。
在该基于用户输入的状态视频生成的场景下,电子设备100可以接收发送的视频生成指令,并响应于所述视频生成指令获取图片信息,同时响应于所述视频生成指令通过电子设备100的传感器180获取当前环境信息,将其转换为文本信息和获取的图像信息一起输入目标生成器网络中,获得目标视频。此外,检测到当前环境信息可以作用于生成视频的背景。
在该场景下,请参考附图2G,图2G是本申请实施例提供的一种视频生成的流程示意图。如图2G所示,根据用户的文本输入信息、语音输入信息获取文本信息,如:人物是孙子、地点是绿油油的草地,事件是踢足球,电子设备100根据用户的语音输入构建出生成视频的大致环境,并以“孙子”作为视频的主角,结合选取的终端设备上的图像,生成一段孙子在在绿油油的草地上,无忧无虑的踢足球的状态视频。
下面从以下几个方面进行详细说明。
(1)如何获取图片信息。
具体的,电子设备响应于所述视频生成指令,从预先存储的多张图片中,获取与所述一个或多个关键字中至少一个关键字对应的图片。即,电子设备可以根据文本信息,获取与文本信息相关的图片信息。例如:用户在故宫游览时,可以获得故宫相关的图片,用于合成目标视频,方便用户实时分享自己的生活状态。获取图片信息时,还可以根据用户输入的人物信息获取对应地点的图片,用于生成目标视频,满足了用户需求。例如:用户输入“小明在操场踢足球”,可以获得关键字“小明”、“操场”、“足球”的至少一张相关图片,用于合成目标视频。
在该场景下,请参考附图2H,图2H是本申请实施例提供的一组获取文本信息的用户界面。
根据图2H中(1)所示,当触摸传感器180K检测用户对窗口显示区201中视频生成控件(如:跳过人脸识别,直接生成视频)的触控操作时,电子设备100可以响应该指令,切换至图2H中(2)所示界面。当触摸传感器180K检测用户对语音输入控件的触控操作时,电子设备100可以响应该触控操作接收用户的语音输入:“孙子在绿油油的草地上,无忧无虑的踢足球”,根据该语音输入可以获取关于人物、地点和事件的文本信息,根据该文本信息获得关键字“孙子”、“草地”、“踢足球”等。请参考附图2I,图2I是本申请实施例提供的一组根据关键字获取图片信息的用户界面。如图2I中(1)所示,根据该文本信息获得关键字“孙子”、“草地”,进而,如图2I中(2)所示,电子设备100可以根据“孙子”、“草地”这两个关键字从预先存储的多张图片中,获取与该关键字中对应的至少一张图片。
可选的,所述视频生成指令包括至少一个图片标签,所述至少一个图片标签中每一个图片标签与预先存储的多张图片中的至少一张图片对应;所述响应于所述视频生成指令获取图片信息,包括:响应于所述视频生成指令,根据所述至少一个图片标签,从预先存储 的多张图片中,获取与所述至少一个图片标签中每一个图片标签对应的至少一张图片。实施本申请实施例,电子设备在获取图片信息时,可以通过视频生成指令中携带的至少一个图片标签,获取对应的至少一个图片,用于生成的目标视频。当用户想将几张图片,生成有趣的视频时,可以直接筛选出用户感兴趣或者需要的图片生成视频,满足了用户的观看需求。例如:用户可以选择图片标签为“猫”,在获取猫的多张图片后,与文本信息一起生成一段以猫为主角的动态视频;又例如:用户还可以选择图片标签为“童年的小明”,在获取到小明小时候的多张图片后,与文本信息一起生成一段有关小明童年的动态视频。
需要说明的是,关于如何获取图片信息的相关描述还可以对应参考上述应用场景1中有关如何获取图片信息的相关描述,本申请实施例此处不再赘述。
(2)如何获取文本信息。
具体的,响应于所述视频生成指令,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个,获取所述文本信息,其中,如图2I中(1)所示,根据用户的语音输入信息获取一个或多个关键字,如:人物是孙子、地点是绿油油的草地,事件是踢足球。
需要说明的是,关于如何获取文本信息的相关描述还可以对应参考上述应用场景1中有关如何获取文本信息的相关描述,本申请实施例此处不再赘述。
(3)如何生成视频。
具体的,电子设备100利用生成器网络作为视频生成的核心模块,其中,生成器网络可以采用RNN循环神经网络把上下帧的语义信息考虑进来,促进所生成的视频的帧间稳定性。生成器网络是生成式对抗网络里的一部分,生成器通过对噪声分布进行采样,并以此作为输入,而判别器则是对输入的数据判断来源。这样的博弈形式,在整个对抗网络中能够很好的促进两个网络的进步。请参考附图2J,图2J是本申请实施例提供的一种生成视频的用户界面。如图2J所示,电子设备100将获得的一个或多个关键字和该关键字对应的图像特征输入到生成器网络后,生成了一段孙子在草地上踢足球的视频,用户可以在如图2J所示的界面中查看或分享视频。
需要说明的是,关于如何获取生成视频的相关描述还可以对应参考上述应用场景1中有关如何生成视频的相关描述,本申请实施例此处不再赘述。
应用场景3:基于用户偏好的视频生成
电子设备100可以获取用户在终端设备上的行为信息,并提取有效的关键字,进行视频生成。比如某用户非常喜欢旅游,在和好友聊天中经常提到想去“巴厘岛”旅游或者在浏览器中经常搜索有关“巴厘岛旅游”的信息,则电子设备100可以根据该关键字信息,生成一段用户在巴厘岛旅游的场景。
基于前述场景,下面介绍电子设备100上实现的一些UI实施例。
在该基于用户输入的状态视频生成的场景下,参考附图2K,图2K是本申请实施例提供的一种基于用户偏好的视频生成流程示意图。如图2K所示,电子设备100可以接收发送的视频生成指令,并响应于所述视频生成指令从用户的历史偏好信息、当前环境信息中获取文本信息和图片信息,或者,同时响应于所述视频生成指令获取用户输入的偏好信息, 再将其转换为文本信息和获取的图像信息一起输入目标生成器网络中,获得目标视频。其中,检测到当前环境信息可以作用于生成视频的背景。
下面从以下几个方面进行详细说明。
(1)如何获取图片信息。
具体的,在当前场景下,用户没有对生成视频的视频内容做相关输入,因此为了充实视频内容,电子设备100可以根据用户偏好,作为视频的输入信息。请参考附图2L,图2L是本申请实施例提供的另一组获取文本信息的用户界面。如图2L中(1)所示,电子设备100可以通过触摸传感器180K检测到用户的触控操作(如,触摸传感器180K识别出用户在窗口显示区202中对于状态视频的点击操作),响应于该触控操作,可以使得电子设备开始进行视频的生成。如图2L中(2)所示,电子设备100可以响应该点击操作后,获取用户偏好信息和当前环境信息。并从上述信息中获取所述文本信息,即,提取所述一个或多个关键字,以及根据该一个或多个关键字获取至少一个关键字对应的图片。例如:根据关键字“巴厘岛”获取有关巴厘岛的图片,并根据当前的时间获取到巴厘岛的天气信息,使得在生成的视频中巴厘岛的天气环境和当前的巴厘岛的天气环境一致。
需要说明的是,关于如何获取图片信息的相关描述还可以对应参考上述应用场景1或2中有关如何获取图片信息的相关描述,本申请实施例对此不再赘述。
(2)如何获取文本信息。
具体的,电子设备获取到用户的偏好信息,在没有接收到用户其他输入的情况下,从用户偏好信息和当前环境信息中提取一个或多个关键字,获取视频输入的文本信息。例如:如图2L中(2)所示,电子设备获取到用户喜欢巴厘岛游玩的信息,结合当前的时间、巴厘岛的天气,进而获得有关时间、地点和人物的至少一个关键字。
需要说明的是,关于如何获取文本信息的相关描述还可以对应参考上述应用场景一种有关如何获取文本信息的相关描述,本申请实施例对此不再赘述。
(3)如何生成视频。
具体的,电子设备100将获得的所述文本信息和所述图像信息输入到目标生成器网络中,获得目标视频,所述目标视频用于描述所述文本信息。其中,根据所述文本信息提取所述图片信息中每张图片的第一空间向量,所述每张图片的第一空间向量用于标识图片中与所述文本信息对应的图像特征。例如:请参考附图2M,图2M是本申请实施例提供的另一种视频生成的用户界面。如图2M所示,电子设备根据获得的文本信息和图片信息生成用户在巴厘岛游玩的视频,用户Lisa可以将该视频发送给朋友Emmy,丰富了朋友间的交流形式,甚至还可以生成和朋友一起游玩巴厘岛的视频,满足不能和朋友一起出游的遗憾。
需要说明的是,关于如何获取生成视频的相关描述还可以对应参考上述应用场景1或2中有关如何生成视频的相关描述,本申请实施例此处不再赘述。
因此,通过文本,声音,电子设备的传感器信息和历史偏好信息,少量图片,生成视频进行分享。其中,历史偏好信息是从终端设备的用户交互信息中或者是浏览器搜索记录中提取而来的,主要指的是用户兴趣。利用多模态的信息输入,能较好的对当前用户的状态进行有效描述,进而对生成的视频进行约束。用状态视频进行分享,相比于地理位置或者方位信息,能在视觉和听觉上都满足用户需求,能为用户带来的更加丰富的体验。
需要说明的是,上述三种应用场景的只是本申请实施例中的几种示例性的实施方式,本申请实施例中的应用场景包括但不仅限于以上应用场景。
基于上述电子设备和上述应用场景,本申请实施例提供一种应用于上述图1A所述电子设备中的视频生成装置,请参见图3A,图3A是本申请实施例提供的一种视频生成装置的结构示意图。如图3A所示,该视频生成装置可包括输入模块、离线模块、在线模块三个模块组成。其中,所述输入模块包括:静态图像获取子模块,传感器信息获取子模块,偏好信息获取子模块,用户输入获取子模块;所述离线模块包括:视频生成子模块和视频优化子模块。需要说明的是下述实施例提及的终端设备,相当于本申请中的电子设备100。
(1)输入模块
输入模块是为生成视频的生成器网络提供原始的输入,辅助生成器网络完成视频的生成,明确的输入条件信息,以及质量较高的静态图片,都有利于生成较好的实时状态视频;用户能根据自己的主观意愿加入所想要生成的状态视频要素以丰富视频内容,可以用文字或者语音进行呈现。
其中,静态图像获取子模块:一般来说,用户终端设备上都有许多的照片可供选择,当用户想要生成偏向于含有人物活动的实时状态视频时,则终端设备自动选取含有用户自身的照片,也可以采用图片质量评估等方法来选取优质图片。比如:有些图片由于拍照时相机抖动等原因造成效果模糊,或者光线较暗,导致照片结果不佳,那么这类图片应当过滤掉,而不被用来当做视频生成的输入。
传感器信息获取子模块:终端设备上含有许多的传感器原件。比如GPS位置传感器,能够获取用户的地理位置信息;温度传感器,能获取用户周围的温度信息;气压传感器,能获取用户的相对高度。还有许多传感器,都有着很好的提供用户周围实时信息的功能。
偏好信息获取子模块:获取终端设备上用户的历史交互信息,基于该交互信息提取用户的历史偏好信息。例如:社交软件上的聊天记录,搜索记录等等,这些终端设备的应用程序都可提取大量的用户交互信息。或者,从浏览器搜集用户搜索信息,提取用户兴趣信息。
用户输入获取子模块:用户输入的形式可以是语音也可以是文本。如果是语音,则可通过手机语音助手功能,提取关键字,把关键字变成文本存储下来,等着后续和文字输入组合成最终的输入。用户通过文本输入,可以在终端设备输入所要生成实时的状态的一些关键字,来描述要生成的状态视频的场景。例如:时间、人物、地点、事件等等。
(2)离线模块
离线模块主要用于模型训练,利用生成式对抗网络进行视频生成,并且对生成器网络优化。其中视频生成子模块主要由生成式对抗网络里的生成器组成,而视频优化子模块主要由对抗网络组成,使得生成的视频更加逼真,而且视频优化的结果可以反馈至视频生成子模块,用于视频生成子模块训练生成器网络模型。
其中,视频生成子模块:视频生成模块中的生成器网络(Generator)可以采用RNN网络进行实现,RNN网络具有良好的记忆上下文信息的能力。生成器网络是全卷积网络,通过多层卷积以及多层上采样层组成,其中,输入可以由<图片信息,文本信息>组成。生成 器能够产生丰富的视频,通过输入的约束条件,能进一步的规范生成通过生成器的采样与真实世界中的视频数据一起送入到判别器网络中。以提升生成视频的质量,有利于实时状态视频的生成。
视频优化子模块:由判别器网络(Discriminator)构成。判别器网络接收来自视频生成模块的数据结果以及真实世界中采集的视频数据,来进行对抗训练。其主要目的是让生成的视频更加逼真,避免生成的视频过于光滑或者斑块效应较为明显。判别器的输入是<生成的视频,真实的视频>,其中,判别器通过输入的两种视频判断生成的视频是否可以认为真实的视频,即,当该生成的视频与真实的视频差别很小时,判别器判定生成的视频为真实的视频,此时判别器的判别损失结果为1;当该生成的视频与真实的视频差别较大时,判别器判定生成的视频不是真实的视频,此时判别器的判别损失结果为0,生成器网络生成的视频质量较差,还需要继续训练。因此,当生成器网络生成的视频可以被判断为真实的视频时,则认为生成器网络训练优化成功。通过这样的对抗训练方式,逐步的提升生成的视频的真实性。
(3)在线模块
在线模块式利用离线模块训练好的生成器模型,来进行用户在终端设备上,进行实时的状态共享。这时候生成视频子模块也需要<静态图像,传感器信息,用户输入信息>作为输入来,而视频优化子模块这时候则不需要,减少了模型参数量,降低手机功耗。
需要说明的是,当视频生成装置需要部署到终端设备上时,离线模块中还可以只需要部署训练完毕的视频生成子模块,从而完成终端设备的实时状态共享,本申请对此不作具体的限定。
基于图3A提供的视频生成装置,基于前述图2A-图2M提供的三种场景及各个场景下的UI实施例,接下来介绍本申请实施例提供的一种视频生成方法,对本申请中提出的技术问题进行具体分析和解决。
参见图3B,图3B是本申请实施例提供的一种视频生成方法的流程示意图,该方法可应用于上述图1A中所述的电子设备中,其中,视频生成装置可以用于支持并执行图3B中所示的方法流程步骤S301-步骤S307。下面将结合附图3B从视频生成装置侧进行描述。该方法可以包括以下步骤S301-步骤S307。
步骤S301:接收视频生成指令。
具体的,视频生成装置接收视频生成指令,该视频生成指令可以通过触控操作识别、手势操作识别、语音控制识别等接收用户的视频生成指令。
步骤S302:响应于视频生成指令获取文本信息和图片信息。
具体的,视频生成装置响应于视频生成指令获取文本信息和图片信息。视频生成装置接收到视频生成指令后,响应该指令获取文本信息和图片信息。其中文本信息用于描述后续生成的视频的内容,所述图片信息包括N张图片,该N张图片用于视频生成装置根据文本信息和该N张图片生成视频中的M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。
在一种可能实现的方式中,所述响应于所述视频生成指令获取文本信息,包括:响应 于所述视频生成指令,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个,获取所述文本信息,其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息中的一个或多个。电子设备可以获取用户特定输入的信息(文本输入信息、语音输入信息);或者利用该电子设备上的传感器获取当前环境信息;又或者从历史交互信息中提取的用户偏好信息中提取文本信息。例如,当用户没有或无法进行手动或语音输入时,还可以只依赖传感器获取的当前环境信息或者从历史交互信息中提取的偏好作为输入的文本信息,与输入的图片信息一起生成目标视频。
在一种可能实现的方式中,所述响应于所述视频生成指令获取图片信息,包括:响应于所述视频生成指令,从预先存储的多张图片中,获取与所述一个或多个关键字中至少一个关键字对应的图片。电子设备可以根据文本信息,获取与文本信息相关的图片信息。例如:获取图片信息时,可以根据当前的地理位置信息或者用户输入的地点信息获取对应地点的图片,用于生成目标视频。例如:用户在故宫游览时,可以获得故宫相关的图片,用于合成目标视频,方便用户实时分享自己的生活状态。获取图片信息时,还可以根据用户输入的人物信息获取对应地点的图片,用于生成目标视频,满足了用户需求。例如:用户输入“小明在操场踢足球”,可以获得关键字“小明”、“操场”、“足球”的至少一张相关图片,用于合成目标视频。
在一种可能实现的方式中,所述视频生成指令包括人脸识别请求;所述响应于所述视频生成指令获取图片信息,包括:响应于所述视频生成指令,进行人脸识别并获得人脸识别结果;根据所述人脸识别结果,从预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张图片。电子设备在获取图片信息时,可以首先通过人脸识别获得人脸识别结果,进而根据该人脸识别结果直接从预先存储的图片中获取包含有用户的图片。方便直接生成有关于用户的状态视频,及时分享用户的当前状态。例如:通过人脸识别,识别出是用户A后,可以从预先存储的多张图片中,获取用户A的图片,这种不需要用户筛选图片,就可以生成包含有用户A的视频,方便了用户操作,提升了用户体验。
在一种可能实现的方式中,所述视频生成指令包括至少一个图片标签,所述至少一个图片标签中每一个图片标签与预先存储的多张图片中的至少一张图片对应;所述响应于所述视频生成指令获取图片信息,包括:响应于所述视频生成指令,根据所述至少一个图片标签,从预先存储的多张图片中,获取与所述至少一个图片标签中每一个图片标签对应的至少一张图片。例如:用户可以选择图片标签为“猫”,在获取猫的多张图片后,与文本信息一起生成一段以猫为主角的动态视频;又例如:用户还可以选择图片标签为“童年的小明”,在获取到小明小时候的多张图片后,与文本信息一起生成一段有关小明童年的动态视频。电子设备在获取图片信息时,可以通过视频生成指令中携带的至少一个图片标签,获取对应的至少一个图片,用于生成的目标视频。当用户想将几张图片,生成有趣的视频时,可以直接筛选出用户感兴趣或者需要的图片生成视频,满足了用户的观看需求。
在一种可能实现的方式中,所述获取的所述N张图片中每张图片的图片质量均大于预设阈值。在获取图片信息前,需要对即将选择的图片进行图片质量评分,当图片质量评分低于预设阈值时,则不使用该图片生成视频。而用图片质量均大于预设阈值的图片生成视频,可因此保证最终由此图片生成的目标视频的画质较高,满足了用户的观看体验。
在一种可能实现的方式中,所述方法还包括:将获取的所述N张图片进行图片质量评分,获得所述N张图片中每张图片对应的图片质量评分结果;将所述图片质量评分结果小于预设阈值的图片进行图片质量增强处理,并将图片质量增强后的图片更新至所述N张图片中。在获取图片信息后,需要对获取的所有图片进行图片质量评分,当图片质量较差时可以对该图片进行图片质量增强,以提升在通过该图片生成视频时的视频质量,满足用户的观看体验。
步骤S303:根据一个或多个关键字获取N张图片的每张图片中与一个或多个关键字对应的图像特征。
具体的,视频生成装置根据一个或多个关键字提取N张图片的每张图片中与一个或多个关键字对应的图像特征。例如:文本信息中包括关键字足球,则视频生成装置需要将所述N张图片的每张图片中足球的图像特征提取出来,以便视频生成装置基于该足球的图像特征生成视频。
步骤S304:提取一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量。
具体的,视频生成装置可以分别提取一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量。第一空间变量为该关键字在向量空间上的词向量。
步骤S305:提取N张图片的图像特征分别在向量空间上对应的第二空间变量。
具体的,视频生成装置可以提取N张图片每张图片中与一个或多个关键字对应的图像特征,分别在向量空间上对应的第二空间变量。第二空间变量为该图像特征在向量空间上的向量,用于表示该图像特征。
步骤S306:将第一空间变量和所述第二空间变量输入目标生成器网络中,生成目标视频。
具体的,视频生成装置将第一空间变量和所述第二空间变量输入目标生成器网络中,生成目标视频。即,将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频。
在一种可能实现的方式中,所述将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,包括:提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量;提取所述N张图片的图像特征分别在向量空间上对应的第二空间变量;将所述第一空间变量和所述第二空间变量输入所述目标生成器网络中,生成所述目标视频。可以首先分别提取文本信息对应的所述第一空间变量以及图片信息中每张图片的图像特征的第二空间向量,其中,所述第一空间变量可以是在隐向量空间中标识所述文本信息的词向量,所述每张图片的第二空间向量可以是在隐向量空间中标识该图片的图像特征的向量,通过提取空间向量有利于生成器网络更好地生成目标视频。
在一种可能实现的方式中,所述方法还包括:获取样本文本信息、样本图片信息以及真实视频数据集,并构建判别器网络和基于视频生成的生成器网络;将所述样本文本信息和所述样本图片信息输入所述生成器网络中,生成样本视频;将所述样本视频和所述真实视频数据集作为所述判别器网络的输入,获得判别损失结果,其中,在所述样本视频属于所述真实视频数据集时,所述判别损失结果为1;根据所述判别损失结果,训练所述生成器网络获得所述目标生成器网络。视频生成装置需要通过样本数据进行生成器网络和判别 器网络的训练。其中,首先根据样本数据通过所述生成器网络生成视频,再将<生成的视频,真实的视频>输入到判别器中,判别器对输入进行判断来源,如果来源于生成的视频则判断为假0,否则判断为真1,通过这样多次重复的对抗训练方式,能进一步的规范生成视频的内容,逐步的提升生成视频的真实性以及提升生成视频的质量。
步骤S307:接收视频分享指令,并响应于视频分享指令将目标视频发送至目标设备。
具体的,视频生成装置可以接收视频分享指令,并响应于视频分享指令将目标视频发送至目标设备。当生成目标视频后,用户还可以将该视频分享至朋友圈,通过将生成的目标视频分享至目标终端设备中,可以促进社交平台友好发展,提升用户生活交友体验。
需要说明的是,本申请实施例中步骤S301-步骤S307的相关描述,还可以对应参考上述图2A-图2M实施例的相关描述,此处不再赘述。
实施本申请实施例,电子设备可以根据文本信息和图片信息生成视频,以便用户可以实时分享自己的生活状态。当电子设备接收视频生成指令后,可以响应于该视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片,由于所述文本信息可以用于描述生成视频的视频内容(如:所述一个或多个关键字可以包括人物、时间、地点、事件或动作等等),所述图片信息可以用于提取或生成每一帧的视频图片,因此可以根据一个或多个关键字获取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征,再将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,其中,所述目标视频可以包括M张基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片。因此,这种利用文本和图像共同生成视频,使得生成的视频可以根据输入的文本信息对输入的图片信息进行调整,大大的丰富了视频内容,避免了现有终端设备上的由多张图片直接堆叠生成的视频,只局限于以幻灯片的形式切换式放映,缺少内容的丰富性,同时也满足了用户在视觉上和听觉上的需求。
上述详细阐述了本申请实施例的方法,下面提供了本申请实施例的相关装置。
请参见图4,图4是本申请实施例提供的另一种视频生成装置的结构示意图,该视频生成装置10可以包括接收响应单元401、提取单元402和生成单元403,还可以包括:评分单元404、增强单元405和训练单元406其中,各个单元的详细描述如下。
接收响应单元401,用于接收视频生成指令,并响应于所述视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片,N为大于或等于1的正整数;
提取单元402,用于根据所述一个或多个关键字获取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征;
生成单元403,用于将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。
在一种可能实现的方式中,所述接收响应单元401,具体用于:响应于所述视频生成指令,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信 息中的一个或多个,获取所述文本信息,其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息中的一个或多个。
在一种可能实现的方式中,所述接收响应单元401,具体用于:响应于所述视频生成指令,从预先存储的多张图片中,获取与所述一个或多个关键字中至少一个关键字对应的图片。
在一种可能实现的方式中,所述视频生成指令包括人脸识别请求;所述接收响应单元401,具体用于:响应于所述视频生成指令,进行人脸识别并获得人脸识别结果;根据所述人脸识别结果,从预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张图片。
在一种可能实现的方式中,所述视频生成指令包括至少一个图片标签,所述至少一个图片标签中每一个图片标签与预先存储的多张图片中的至少一张图片对应;所述接收响应单元401,具体用于:响应于所述视频生成指令,根据所述至少一个图片标签,从预先存储的多张图片中,获取与所述至少一个图片标签中每一个图片标签对应的至少一张图片。
在一种可能实现的方式中,所述获取的所述N张图片中每张图片的图片质量均大于预设阈值。
在一种可能实现的方式中,所述装置还包括:评分单元404,用于将获取的所述N张图片进行图片质量评分,获得所述N张图片中每张图片对应的图片质量评分结果;增强单元405,用于将所述图片质量评分结果小于预设阈值的图片进行图片质量增强处理,并将图片质量增强后的图片更新至所述N张图片中。
在一种可能实现的方式中,所述生成单元403,具体用于:提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量;提取所述N张图片的图像特征分别在向量空间上对应的所述第二空间变量;将所述第一空间变量和所述第二空间变量输入所述目标生成器网络中,生成所述目标视频。
在一种可能实现的方式中,所述装置还包括:训练单元406,所述训练单元406,用于:获取样本文本信息、样本图片信息以及真实视频数据集,并构建判别器网络和基于视频生成的生成器网络;将所述样本文本信息和所述样本图片信息输入所述生成器网络中,生成样本视频;将所述样本视频和所述真实视频数据集作为所述判别器网络的输入,获得判别损失结果,其中,在所述样本视频属于所述真实视频数据集时,所述判别损失结果为1;根据所述判别损失结果,训练所述生成器网络获得所述目标生成器网络。
需要说明的是,本申请实施例中所描述的视频生成装置10中各功能单元的功能可参见上述图3B中所述的方法实施例中步骤S301-步骤S307的相关描述,此处不再赘述。
如图5所示,图5是本申请实施例提供的又一种视频生成装置的结构示意图,该装置20包括至少一个处理器501,至少一个存储器502、至少一个通信接口503。此外,该设备还可以包括天线等通用部件,在此不再详述。
处理器501可以是通用中央处理器(CPU),微处理器,特定应用集成电路(application-specific integrated circuit,ASIC),或一个或多个用于控制以上方案程序执行的集成电路。
通信接口503,用于与其他设备或通信网络通信,如以太网,无线接入网(RAN),核 心网,无线局域网(Wireless Local Area Networks,WLAN)等。
存储器502可以是只读存储器(read-only memory,ROM)或可存储静态信息和指令的其他类型的静态存储设备,随机存取存储器(random access memory,RAM)或者可存储信息和指令的其他类型的动态存储设备,也可以是电可擦可编程只读存储器(Electrically Erasable Programmable Read-Only Memory,EEPROM)、只读光盘(Compact Disc Read-Only Memory,CD-ROM)或其他光盘存储、光碟存储(包括压缩光碟、激光碟、光碟、数字通用光碟、蓝光光碟等)、磁盘存储介质或者其他磁存储设备、或者能够用于携带或存储具有指令或数据结构形式的期望的程序代码并能够由计算机存取的任何其他介质,但不限于此。存储器可以是独立存在,通过总线与处理器相连接。存储器也可以和处理器集成在一起。
其中,所述存储器502用于存储执行以上方案的应用程序代码,并由处理器501来控制执行。所述处理器501用于执行所述存储器202中存储的应用程序代码。
存储器202存储的代码可执行以上视频生成方法,比如接收视频生成指令,并响应于所述视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片,N为大于或等于1的正整数;根据所述一个或多个关键字获取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征;将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。
需要说明的是,本申请实施例中所描述的视频生成装置20中各功能单元的功能可参见上述图3B中所述的方法实施例中的步骤S301-步骤S307相关描述,此处不再赘述。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其他实施例的相关描述。
需要说明的是,对于前述的各方法实施例,为了简单描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本申请并不受所描述的动作顺序的限制,因为依据本申请,某些步骤可能可以采用其他顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定是本申请所必须的。
在本申请所提供的几个实施例中,应该理解到,所揭露的装置,可通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如上述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性或其它的形式。
上述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各实施例中的各功能单元可以集成在一个处理单元中,也可以是各个 单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
上述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以为个人计算机、服务端或者网络设备等,具体可以是计算机设备中的处理器)执行本申请各个实施例上述方法的全部或部分步骤。其中,而前述的存储介质可包括:U盘、移动硬盘、磁碟、光盘、只读存储器(Read-Only Memory,缩写:ROM)或者随机存取存储器(Random Access Memory,缩写:RAM)等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (22)

  1. 一种视频生成方法,其特征在于,包括:
    接收视频生成指令,并响应于所述视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片,N为大于或等于1的正整数;
    根据所述一个或多个关键字获取所述N张图片中与所述一个或多个关键字对应的图像特征;
    将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。
  2. 根据权利要求1所述方法,其特征在于,所述响应于所述视频生成指令获取文本信息,包括:
    响应于所述视频生成指令,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个,获取所述文本信息,其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息中的一个或多个。
  3. 根据权利要求1或2所述方法,其特征在于,所述响应于所述视频生成指令获取图片信息,包括:
    响应于所述视频生成指令,从预先存储的多张图片中,获取与所述一个或多个关键字中至少一个关键字对应的图片。
  4. 根据权利要求1或2所述方法,其特征在于,所述视频生成指令包括人脸识别请求;所述响应于所述视频生成指令获取图片信息,包括:
    响应于所述视频生成指令,进行人脸识别并获得人脸识别结果;
    根据所述人脸识别结果,从预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张图片。
  5. 根据权利要求1或2所述方法,其特征在于,所述视频生成指令包括至少一个图片标签,所述至少一个图片标签中每一个图片标签与预先存储的多张图片中的至少一张图片对应;所述响应于所述视频生成指令获取图片信息,包括:
    响应于所述视频生成指令,根据所述至少一个图片标签,从预先存储的多张图片中,获取与所述至少一个图片标签中每一个图片标签对应的至少一张图片。
  6. 根据权利要求3-5所述的任意一项方法,其特征在于,所述获取的所述N张图片中每张图片的图片质量均大于预设阈值。
  7. 根据权利要求1-5所述的任意一项方法,其特征在于,所述方法还包括:
    将获取的所述N张图片进行图片质量评分,获得所述N张图片中每张图片对应的图片质量评分结果;
    将所述图片质量评分结果小于预设阈值的图片进行图片质量增强处理,并将图片质量增强后的图片更新至所述N张图片中。
  8. 根据权利要求1-7所述的任意一项方法,其特征在于,所述将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,包括:
    提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量;
    提取所述N张图片的图像特征分别在向量空间上对应的第二空间变量;
    将所述第一空间变量和所述第二空间变量输入所述目标生成器网络中,生成所述目标视频。
  9. 根据权利要求1-8所述的任意一项方法,其特征在于,所述方法还包括:
    获取样本文本信息、样本图片信息以及真实视频数据集,并构建判别器网络和基于视频生成的生成器网络;
    将所述样本文本信息和所述样本图片信息输入所述生成器网络中,生成样本视频;
    将所述样本视频和所述真实视频数据集作为所述判别器网络的输入,获得判别损失结果,其中,在所述样本视频属于所述真实视频数据集时,所述判别损失结果为1;
    根据所述判别损失结果,训练所述生成器网络获得所述目标生成器网络。
  10. 一种视频生成装置,其特征在于,包括:
    接收响应单元,用于接收视频生成指令,并响应于所述视频生成指令获取文本信息和图片信息,所述文本信息包括一个或多个关键字,所述图片信息包括N张图片,N为大于或等于1的正整数;
    提取单元,用于根据所述一个或多个关键字获取所述N张图片的每张图片中与所述一个或多个关键字对应的图像特征;
    生成单元,用于将所述一个或多个关键字和所述N张图片的图像特征输入目标生成器网络中,生成目标视频,所述目标视频包括M张图片,所述M张图片为基于所述N张图片的图像特征生成的、且与所述一个或多个关键字对应的图片,M为大于1的正整数。
  11. 根据权利要求10所述装置,其特征在于,所述接收响应单元,具体用于:
    响应于所述视频生成指令,从文本输入信息、语音输入信息、用户偏好信息、用户生理数据信息、当前环境信息中的一个或多个,获取所述文本信息,其中,所述当前环境信息包括当前天气信息、当前时间信息、当前地理位置信息中的一个或多个。
  12. 根据权利要求10或11所述装置,其特征在于,所述接收响应单元,具体用于:
    响应于所述视频生成指令,从预先存储的多张图片中,获取与所述一个或多个关键字中至少一个关键字对应的图片。
  13. 根据权利要求10或11所述装置,其特征在于,所述视频生成指令包括人脸识别请求;所述接收响应单元,具体用于:
    响应于所述视频生成指令,进行人脸识别并获得人脸识别结果;
    根据所述人脸识别结果,从预先存储的多张图片中,获取与所述人脸识别结果匹配的至少一张图片。
  14. 根据权利要求10或11所述装置,其特征在于,所述视频生成指令包括至少一个图片标签,所述至少一个图片标签中每一个图片标签与预先存储的多张图片中的至少一张图片对应;所述接收响应单元,具体用于:
    响应于所述视频生成指令,根据所述至少一个图片标签,从预先存储的多张图片中,获取与所述至少一个图片标签中每一个图片标签对应的至少一张图片。
  15. 根据权利要求12-14所述的任意一项装置,其特征在于,所述获取的所述N张图片中每张图片的图片质量均大于预设阈值。
  16. 根据权利要求10-14所述的任意一项装置,其特征在于,所述装置还包括:
    评分单元,用于将获取的所述N张图片进行图片质量评分,获得所述N张图片中每张图片对应的图片质量评分结果;
    增强单元,用于将所述图片质量评分结果小于预设阈值的图片进行图片质量增强处理,并将图片质量增强后的图片更新至所述N张图片中。
  17. 根据权利要求10-16所述的任意一项装置,其特征在于,所述生成单元,具体用于:
    提取所述一个或多个关键字中每一个关键字在向量空间上对应的第一空间变量;
    提取所述N张图片的图像特征分别在向量空间上对应的第二空间变量;
    将所述第一空间变量和所述第二空间变量输入所述目标生成器网络中,生成所述目标视频。
  18. 根据权利要求10-17所述的任意一项装置,其特征在于,所述装置还包括:训练单元,所述训练单元,用于:
    获取样本文本信息、样本图片信息以及真实视频数据集,并构建判别器网络和基于视频生成的生成器网络;
    将所述样本文本信息和所述样本图片信息输入所述生成器网络中,生成样本视频;
    将所述样本视频和所述真实视频数据集作为所述判别器网络的输入,获得判别损失结果,其中,在所述样本视频属于所述真实视频数据集时,所述判别损失结果为1;
    根据所述判别损失结果,训练所述生成器网络获得所述目标生成器网络。
  19. 一种电子设备,其特征在于,包括处理器和存储器,其中,所述存储器用于存储信 息发送程序代码,所述处理器用于调用所述程序代码来执行权利要求1-9任一项所述的方法。
  20. 一种芯片系统,其特征在于,所述芯片系统包括至少一个处理器,存储器和接口电路,所述存储器、所述接口电路和所述至少一个处理器通过线路互联,所述至少一个存储器中存储有指令;所述指令被所述处理器执行时,权利要求1-9中任意一项所述的方法得以实现。
  21. 一种计算机存储介质,其特征在于,所述计算机存储介质存储有计算机程序,该计算机程序被处理器执行时实现上述权利要求1-9任意一项所述的方法。
  22. 一种计算机程序,其特征在于,所述计算机程序包括指令,当所述计算机程序被计算机执行时,使得所述计算机执行如权利要求1-9中任意一项所述的方法。
PCT/CN2021/097047 2020-05-30 2021-05-29 一种视频生成方法及相关装置 WO2021244457A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21818681.5A EP4149109A4 (en) 2020-05-30 2021-05-29 VIDEO PRODUCTION METHOD AND APPARATUS
US18/070,689 US20230089566A1 (en) 2020-05-30 2022-11-29 Video generation method and related apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010480675.6 2020-05-30
CN202010480675.6A CN111669515B (zh) 2020-05-30 2020-05-30 一种视频生成方法及相关装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/070,689 Continuation US20230089566A1 (en) 2020-05-30 2022-11-29 Video generation method and related apparatus

Publications (1)

Publication Number Publication Date
WO2021244457A1 true WO2021244457A1 (zh) 2021-12-09

Family

ID=72385342

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/097047 WO2021244457A1 (zh) 2020-05-30 2021-05-29 一种视频生成方法及相关装置

Country Status (4)

Country Link
US (1) US20230089566A1 (zh)
EP (1) EP4149109A4 (zh)
CN (1) CN111669515B (zh)
WO (1) WO2021244457A1 (zh)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111669515B (zh) * 2020-05-30 2021-08-20 华为技术有限公司 一种视频生成方法及相关装置
CN112165585A (zh) * 2020-09-28 2021-01-01 北京小米移动软件有限公司 短视频生成方法、装置、设备及存储介质
CN112995537B (zh) * 2021-02-09 2023-02-24 成都视海芯图微电子有限公司 一种视频构建方法及系统
CN113434733B (zh) * 2021-06-28 2022-10-21 平安科技(深圳)有限公司 基于文本的视频文件生成方法、装置、设备及存储介质
WO2023044669A1 (en) * 2021-09-23 2023-03-30 Intel Corporation Methods and apparatus to implement scalable video coding for distributed source and client applications
CN114237785A (zh) * 2021-11-17 2022-03-25 维沃移动通信有限公司 图像生成方法、装置
CN117035004A (zh) * 2023-07-24 2023-11-10 北京泰策科技有限公司 基于多模态学习技术的文本、图片、视频生成方法及系统
CN117135416B (zh) * 2023-10-26 2023-12-22 环球数科集团有限公司 一种基于aigc的视频生成系统
CN118102031A (zh) * 2024-04-19 2024-05-28 荣耀终端有限公司 一种视频生成方法及电子设备

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100131571A1 (en) * 2008-11-25 2010-05-27 Reuveni Yoseph Method application and system for characterizing multimedia content
CN103842997A (zh) * 2011-08-04 2014-06-04 K·波波夫 搜索和创建自适应内容
CN104732593A (zh) * 2015-03-27 2015-06-24 厦门幻世网络科技有限公司 一种基于移动终端的3d动画编辑方法
CN105893412A (zh) * 2015-11-24 2016-08-24 乐视致新电子科技(天津)有限公司 图像分享方法及装置
CN106210450A (zh) * 2016-07-20 2016-12-07 罗轶 基于slam的影视人工智能
CN109658369A (zh) * 2018-11-22 2019-04-19 中国科学院计算技术研究所 视频智能生成方法及装置
CN109978021A (zh) * 2019-03-07 2019-07-05 北京大学深圳研究生院 一种基于文本不同特征空间的双流式视频生成方法
CN111669515A (zh) * 2020-05-30 2020-09-15 华为技术有限公司 一种视频生成方法及相关装置

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5305108A (en) * 1992-07-02 1994-04-19 Ampex Systems Corporation Switcher mixer priority architecture
JP4970302B2 (ja) * 2008-02-14 2012-07-04 富士フイルム株式会社 画像処理装置,画像処理方法及び撮像装置
CN103188608A (zh) * 2011-12-31 2013-07-03 中国移动通信集团福建有限公司 基于用户位置的彩像图片生成方法、装置及系统
US10673977B2 (en) * 2013-03-15 2020-06-02 D2L Corporation System and method for providing status updates
US20160014482A1 (en) * 2014-07-14 2016-01-14 The Board Of Trustees Of The Leland Stanford Junior University Systems and Methods for Generating Video Summary Sequences From One or More Video Segments
CN106844677A (zh) * 2017-01-24 2017-06-13 宇龙计算机通信科技(深圳)有限公司 一种信息分享的方法及装置
CN108460104B (zh) * 2018-02-06 2021-06-18 北京奇虎科技有限公司 一种实现内容定制的方法和装置
CN108665414A (zh) * 2018-05-10 2018-10-16 上海交通大学 自然场景图片生成方法
US11442986B2 (en) * 2020-02-15 2022-09-13 International Business Machines Corporation Graph convolutional networks for video grounding

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100131571A1 (en) * 2008-11-25 2010-05-27 Reuveni Yoseph Method application and system for characterizing multimedia content
CN103842997A (zh) * 2011-08-04 2014-06-04 K·波波夫 搜索和创建自适应内容
CN104732593A (zh) * 2015-03-27 2015-06-24 厦门幻世网络科技有限公司 一种基于移动终端的3d动画编辑方法
CN105893412A (zh) * 2015-11-24 2016-08-24 乐视致新电子科技(天津)有限公司 图像分享方法及装置
CN106210450A (zh) * 2016-07-20 2016-12-07 罗轶 基于slam的影视人工智能
CN109658369A (zh) * 2018-11-22 2019-04-19 中国科学院计算技术研究所 视频智能生成方法及装置
CN109978021A (zh) * 2019-03-07 2019-07-05 北京大学深圳研究生院 一种基于文本不同特征空间的双流式视频生成方法
CN111669515A (zh) * 2020-05-30 2020-09-15 华为技术有限公司 一种视频生成方法及相关装置

Also Published As

Publication number Publication date
US20230089566A1 (en) 2023-03-23
EP4149109A1 (en) 2023-03-15
CN111669515A (zh) 2020-09-15
EP4149109A4 (en) 2023-10-25
CN111669515B (zh) 2021-08-20

Similar Documents

Publication Publication Date Title
WO2021244457A1 (zh) 一种视频生成方法及相关装置
WO2020078299A1 (zh) 一种处理视频文件的方法及电子设备
WO2021104485A1 (zh) 一种拍摄方法及电子设备
WO2021036568A1 (zh) 辅助健身的方法和电子装置
WO2020029306A1 (zh) 一种图像拍摄方法及电子设备
WO2021013132A1 (zh) 输入方法及电子设备
WO2022095788A1 (zh) 目标用户追焦拍摄方法、电子设备及存储介质
WO2021052139A1 (zh) 手势输入方法及电子设备
CN112214636A (zh) 音频文件的推荐方法、装置、电子设备以及可读存储介质
CN112840635A (zh) 智能拍照方法、系统及相关装置
WO2022160958A1 (zh) 一种页面分类方法、页面分类装置和终端设备
WO2022073417A1 (zh) 融合场景感知机器翻译方法、存储介质及电子设备
WO2022156473A1 (zh) 一种播放视频的方法及电子设备
CN114242037A (zh) 一种虚拟人物生成方法及其装置
WO2022037479A1 (zh) 一种拍摄方法和拍摄系统
WO2022033432A1 (zh) 内容推荐方法、电子设备和服务器
WO2022022406A1 (zh) 一种灭屏显示的方法和电子设备
WO2021238371A1 (zh) 生成虚拟角色的方法及装置
US20230162529A1 (en) Eye bag detection method and apparatus
CN114444000A (zh) 页面布局文件的生成方法、装置、电子设备以及可读存储介质
WO2023179490A1 (zh) 应用推荐方法和电子设备
EP4383096A1 (en) Search method, terminal, server and system
WO2022007757A1 (zh) 跨设备声纹注册方法、电子设备及存储介质
CN114080258B (zh) 一种运动模型生成方法及相关设备
WO2021238338A1 (zh) 语音合成方法及装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21818681

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021818681

Country of ref document: EP

Effective date: 20221209

NENP Non-entry into the national phase

Ref country code: DE