CN114286169A - Video generation method, device, terminal, server and storage medium

Video generation method, device, terminal, server and storage medium

Info

Publication number
CN114286169A
Authority
CN
China
Prior art keywords
target
video
keyword
keywords
audio
Prior art date
Legal status
Granted
Application number
CN202111013239.9A
Other languages
Chinese (zh)
Other versions
CN114286169B (en)
Inventor
康洪文
Current Assignee
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202111013239.9A priority Critical patent/CN114286169B/en
Publication of CN114286169A publication Critical patent/CN114286169A/en
Priority to PCT/CN2022/112842 priority patent/WO2023029984A1/en
Priority to US18/140,296 priority patent/US12026354B2/en
Application granted granted Critical
Publication of CN114286169B publication Critical patent/CN114286169B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06F 3/167: Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • G06F 16/783: Retrieval of video data using metadata automatically derived from the content
    • G06F 3/0482: Interaction with lists of selectable items, e.g. menus
    • G06F 3/04842: Selection of displayed objects or displayed text elements
    • G06F 3/0488: Interaction techniques using a touch-screen or digitiser, e.g. input of commands through traced gestures
    • H04N 21/234336: Reformatting of video elementary streams by media transcoding, e.g. video transformed into a slideshow of still pictures or audio converted into text
    • H04N 21/42203: Input-only peripherals: sound input device, e.g. microphone
    • H04N 21/439: Processing of audio elementary streams
    • H04N 21/440236: Reformatting of video elementary streams for household redistribution, storage or real-time display by media transcoding
    • H04N 21/4884: Data services for displaying subtitles
    • H04N 21/8405: Descriptive data, e.g. content descriptors, represented by keywords
    • H04N 21/854: Content authoring

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Television Signal Processing For Recording (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application discloses a video generation method, apparatus, terminal, server, and storage medium, and relates to the field of video processing. The method comprises the following steps: in response to an audio input operation on an audio input interface, displaying a keyword selection interface based on the acquired initial audio, where the keyword selection interface contains at least one recommended keyword obtained by performing audio recognition on the initial audio; determining at least one target keyword in response to an editing operation on the recommended keywords in the keyword selection interface; and in response to a video synthesis operation in the keyword selection interface, displaying a video display interface, where the video display interface contains a target video, the target video is synthesized from target video segments, and the target video segments are obtained by matching based on the target keywords. In a video generation scenario, the user can obtain a video related to a piece of speech simply by entering that speech, which improves video generation efficiency.

Description

Video generation method, device, terminal, server and storage medium
Technical Field
The embodiments of this application relate to the field of video processing, and in particular to a video generation method, apparatus, terminal, server, and storage medium.
Background
With the development of Internet technology, creators attract traffic and attention by publishing audio and video content on platforms.
In the related art, generating publishable audio/video requires either manually recording it and then cutting and editing the recording, or collecting existing video material and splicing it together.
Clearly, both generation modes rely on manual work, so video generation is inefficient and the timeliness of video publication suffers.
Disclosure of Invention
The embodiments of this application provide a video generation method, apparatus, terminal, server, and storage medium that can improve video generation efficiency. The technical solution is as follows:
according to an aspect of the present application, there is provided a video generation method, the method including:
in response to an audio input operation on an audio input interface, displaying a keyword selection interface based on the acquired initial audio, where the keyword selection interface contains at least one recommended keyword, and the recommended keyword is obtained by performing audio recognition on the initial audio;
determining at least one target keyword in response to an editing operation on the recommended keywords in the keyword selection interface; and
in response to a video synthesis operation in the keyword selection interface, displaying a video display interface, where the video display interface contains a target video, the target video is synthesized from target video segments, and the target video segments are obtained by matching based on the target keywords.
According to another aspect of the present application, there is provided a video generation method, the method including:
in response to receiving initial audio, performing audio recognition on the initial audio and determining at least one recommended keyword;
in response to a video synthesis request, performing video segment matching based on the obtained target keywords to obtain at least one target video segment, where the target keywords are determined by an editing operation on a keyword selection interface, and the keyword selection interface contains the recommended keywords; and
generating the target video based on the target video segments.
According to another aspect of the present application, there is provided a video generating apparatus, the apparatus including:
a first display module, configured to display, in response to an audio input operation on an audio input interface, a keyword selection interface based on the acquired initial audio, where the keyword selection interface contains at least one recommended keyword obtained by performing audio recognition on the initial audio;
a first determination module, configured to determine at least one target keyword in response to an editing operation on the recommended keywords in the keyword selection interface; and
a second display module, configured to display a video display interface in response to a video synthesis operation in the keyword selection interface, where the video display interface contains a target video, the target video is synthesized from target video segments, and the target video segments are obtained by matching based on the target keywords.
According to another aspect of the present application, there is provided a video generating apparatus, the apparatus including:
a second determination module, configured to perform audio recognition on received initial audio and determine at least one recommended keyword;
a third determination module, configured to perform, in response to a video synthesis request, video segment matching based on the obtained target keywords to obtain at least one target video segment, where the target keywords are determined by an editing operation on a keyword selection interface, and the keyword selection interface contains the recommended keywords; and
a first generation module, configured to generate the target video based on the target video segments.
According to another aspect of the present application, there is provided a terminal comprising a processor and a memory, the memory having stored therein at least one program, the at least one program being loaded and executed by the processor to implement the video generation method as described above.
According to another aspect of the present application, there is provided a server comprising a processor and a memory, the memory having stored therein at least one program, the at least one program being loaded and executed by the processor to implement the video generation method as described above.
According to another aspect of the present application, there is provided a computer-readable storage medium having stored therein at least one program which is loaded and executed by a processor to implement the video generation method as described above.
According to another aspect of the application, a computer program product or computer program is provided, comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the video generation method provided in the above optional implementations.
The beneficial effects of the technical solutions provided in the embodiments of this application include at least the following:
Audio recognition is performed on the initial audio entered by the user to determine recommended keywords, video segment matching is performed based on those keywords, and the target video is generated from the matched target video segments. Audio can thus be converted into a related video: in a video generation scenario, the user obtains a video related to a piece of speech simply by entering that speech, which improves video generation efficiency and, in turn, video publication efficiency. In addition, a keyword selection interface is provided so that the user can manually adjust the recommended keywords, making the generated target video better match the user's needs.
Drawings
To illustrate the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of this application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 illustrates a schematic diagram of an implementation environment shown in an exemplary embodiment of the present application;
FIG. 2 illustrates a flow chart of a video generation method provided by an exemplary embodiment of the present application;
FIG. 3 shows a schematic diagram of a video generation process shown in an exemplary embodiment of the present application;
FIG. 4 illustrates a flow chart of a video generation method provided by another exemplary embodiment of the present application;
FIG. 5 is a diagram illustrating an initial audio acquisition process according to an exemplary embodiment of the present application;
FIG. 6 is a schematic diagram illustrating an initial audio acquisition process according to another exemplary embodiment of the present application;
FIG. 7 illustrates an editing process diagram for a keyword selection interface in accordance with an exemplary embodiment of the present application;
FIG. 8 shows a schematic diagram of a video generation process shown in another exemplary embodiment of the present application;
FIG. 9 illustrates a flow chart of a video generation method provided by an exemplary embodiment of the present application;
FIG. 10 shows a flow diagram of a video generation method provided by another example embodiment of the present application;
FIG. 11 illustrates a flow chart of a video generation method shown in an exemplary embodiment of the present application;
fig. 12 is a block diagram of a video generating apparatus according to an exemplary embodiment of the present application;
fig. 13 is a block diagram of a video generation apparatus according to an exemplary embodiment of the present application;
fig. 14 shows a schematic structural diagram of a computer device provided in an embodiment of the present application.
Detailed Description
To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application are described in further detail below with reference to the accompanying drawings.
An embodiment of this application provides a method for converting audio into video. Please refer to FIG. 1, which shows a schematic diagram of an implementation environment according to an exemplary embodiment of this application. The implementation environment includes a terminal 110 and a server 120.
The terminal 110 is a device running a video application, such as a video editing application, a video publishing application, or a video playback application. In this embodiment, the terminal 110 provides an audio-to-video function: after the user enters a piece of initial audio, recommended keywords can be obtained based on it, the user selects the target keywords to be synthesized into a video, and a video synthesis request is sent to the server 120. Optionally, the terminal 110 may also upload the initial audio to the server 120, and the server 120 feeds the extracted recommended keywords back to the terminal 110.
Optionally, the terminal 110 includes, but is not limited to, a smart phone, a computer, a smart voice interaction device, a smart appliance, a vehicle-mounted terminal, and the like.
The terminal 110 and the server 120 are directly or indirectly connected through wired or wireless communication.
The server 120 is backed by a cloud computing resource pool in which multiple types of virtual resources are deployed for external customers to use on demand. The pool mainly comprises computing devices (virtualized machines including operating systems), storage devices, and network devices. The cloud server may be an independent physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery network (CDN), big data, and artificial intelligence platforms. In this embodiment, the server 120 may be a backend server of the video application: it receives the video synthesis request sent by the terminal 110, obtains target video segments by matching against the target keywords contained in the request, synthesizes the target video from those segments, and feeds the target video back to the terminal 110. Optionally, the server 120 may also receive the initial audio sent by the terminal 110, perform audio recognition on it, determine the recommended keywords, and feed them back to the terminal 110 so that the user can select the target keywords from them.
Referring to FIG. 2, a flowchart of a video generation method according to an exemplary embodiment of this application is shown. This embodiment is described using the example in which the method is applied to the terminal shown in FIG. 1. The method includes the following steps.
Step 201: in response to an audio input operation on an audio input interface, display a keyword selection interface based on the acquired initial audio, where the keyword selection interface contains at least one recommended keyword obtained by performing audio recognition on the initial audio.
To improve video generation efficiency without manually shooting and editing video, this embodiment provides a way to automatically generate a related video directly from audio entered by the user.
Optionally, the audio-to-video capability may be integrated as a feature of a video application. For example, in a video editing application (video editing platform), the user clicks an audio-to-video control and enters the audio input interface, where the initial audio to be converted into a related video can be entered.
Optionally, an audio input control may be displayed in the audio input interface. When the user clicks it, the terminal receives the trigger operation on the control, collects a sound signal through the microphone, and determines the collected sound signal as the initial audio.
Optionally, a file upload control may be displayed in the audio input interface. When the user clicks it, the terminal receives the trigger operation on the control, the user uploads a specified audio file, and the terminal thereby obtains the initial audio.
To implement the audio-to-video function, in one possible implementation the terminal performs audio recognition on the acquired initial audio to obtain the corresponding initial text content, extracts keywords from that text to obtain the recommended keywords usable for subsequent video generation, and displays them in the keyword selection interface so that the user can confirm whether they meet the user's needs and were recognized accurately.
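As a rough illustration of this pipeline, the sketch below shows one way the initial audio could flow through recognition and keyword extraction. It is a minimal sketch, not the patented method: `recognize_speech` is a hypothetical stand-in for an ASR engine, and the frequency-based extractor is only a placeholder for whatever extraction model an implementation actually uses.

```python
from collections import Counter

def recognize_speech(initial_audio: bytes) -> str:
    """Hypothetical ASR call: initial audio -> initial text content."""
    raise NotImplementedError("plug in a speech recognition engine here")

STOPWORDS = {"the", "a", "an", "and", "of", "to", "is", "in", "on", "for"}

def extract_recommended_keywords(text: str, limit: int = 10) -> list[str]:
    # Placeholder extraction: rank non-stopword tokens by frequency.
    words = [w.strip(".,!?") for w in text.lower().split()]
    counts = Counter(w for w in words if w and w not in STOPWORDS)
    return [word for word, _ in counts.most_common(limit)]

def recommended_keywords_for(initial_audio: bytes) -> list[str]:
    # initial audio -> initial text content -> recommended keywords
    return extract_recommended_keywords(recognize_speech(initial_audio))
```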
FIG. 3 shows a schematic diagram of the video generation process according to an exemplary embodiment of this application. An audio input control 302 is displayed in the audio input interface 301. After the user clicks the control 302, the terminal receives the audio input operation on the interface 301 and acquires the initial audio; once the recommended keywords have been determined from the initial audio, the keyword selection interface 303 is displayed, showing a number of recommended keywords 304.
Optionally, to reduce terminal power consumption, in one possible implementation the server may instead carry out the recommended-keyword extraction: after acquiring the initial audio, the terminal uploads it to the server, which performs audio recognition to obtain at least one recommended keyword and feeds the result back to the terminal for display in the keyword selection interface.
Alternatively, the recommended keywords may be taken only from the initial audio, that is, only recommended keywords contained in the initial audio are displayed in the keyword selection interface.
To enrich the set of recommended keywords and thereby support a more informative target video later, one possible implementation extracts candidate keywords from the initial text content corresponding to the initial audio and then performs associated recommendation based on them: hot words related to the candidate keywords are obtained, and the hot words and the candidate keywords together serve as the recommended keywords.
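One possible reading of this associated-recommendation step is sketched below; the `hot_word_index` mapping (keyword to related trending words) is an assumed data source, not something the application specifies.

```python
def expand_with_hot_words(candidates: list[str],
                          hot_word_index: dict[str, list[str]]) -> list[str]:
    # Candidate keywords plus the hot words related to each of them,
    # de-duplicated while preserving order, jointly form the recommendations.
    recommended = list(candidates)
    for keyword in candidates:
        for hot_word in hot_word_index.get(keyword, []):
            if hot_word not in recommended:
                recommended.append(hot_word)
    return recommended
```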
Step 202: determine at least one target keyword in response to an editing operation on the recommended keywords in the keyword selection interface.
In one possible implementation, the keyword selection interface provides an editing function for the recommended keywords: the user can manually correct misrecognized keywords, delete redundant ones, add keywords that were not spoken in the initial audio, and so on. Once the user has finished editing, the terminal determines the recommended keywords finally displayed in the keyword selection interface as the target keywords.
Step 203: in response to a video synthesis operation in the keyword selection interface, display a video display interface, where the video display interface contains a target video, the target video is synthesized from target video segments, and the target video segments are obtained by matching based on the target keywords.
In one possible implementation, after finishing editing the recommended keywords, the user clicks the video synthesis control in the keyword selection interface; the terminal receives the video synthesis operation and displays the video display interface, in which the target video is shown.
As shown in FIG. 3, after finishing editing the recommended keywords in the keyword selection interface 303, the user may click the synthesis control 305; the terminal receives the video synthesis operation, performs the subsequent synthesis, and displays the synthesized target video 307 in the video display interface 306. Optionally, the video display interface 306 may further include a publishing control 308; when the user clicks it, the terminal receives the publishing operation on the target video 307 and can publish it to the target display platform.
To generate the target video, video segment matching is performed based on the target keywords to obtain hot video segments (target video segments) consistent with them, and the target video is then synthesized from those segments.
Optionally, synthesizing the target video from the target video segments may proceed as follows. If there is a single target keyword, one target video segment is matched, and the target video is synthesized from that segment together with the target subtitles and target dubbing. If there are two or more target keywords, multiple target video segments are matched and spliced together to form the target video; the dubbing and subtitles may reuse the original subtitles and dubbing of the source segments, or the segments may instead be combined with target subtitles and target dubbing regenerated from the target keywords.
Optionally, the target subtitles contained in the target video may be generated from the target keywords and the initial text corresponding to the initial audio; the target dubbing is obtained by performing speech synthesis on the target subtitles to produce an audio stream, after which video rendering is performed on the subtitles, the audio stream, and the video segments to obtain the target video finally shown in the video display interface.
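Putting those pieces together, a synthesis routine might look roughly like the sketch below. Every helper here (`match_segment`, `build_subtitles`, `text_to_speech`, `render_video`) is a hypothetical stand-in for the retrieval, subtitle-generation, speech-synthesis, and rendering components; the flow, not the implementation, is the point.

```python
def match_segment(keyword: str) -> dict:
    """Hypothetical retrieval of the hot video segment matching a keyword."""
    raise NotImplementedError

def build_subtitles(keywords: list[str], initial_text: str) -> str:
    """Hypothetical subtitle generation from keywords + recognized text."""
    raise NotImplementedError

def text_to_speech(subtitles: str) -> bytes:
    """Hypothetical TTS: target subtitles -> dubbing audio stream."""
    raise NotImplementedError

def render_video(segments: list[dict], subtitles: str, audio: bytes) -> bytes:
    """Hypothetical renderer combining segments, subtitles, and dubbing."""
    raise NotImplementedError

def synthesize_target_video(target_keywords: list[str], initial_text: str) -> bytes:
    # One segment per keyword; with a single keyword this degenerates
    # to the one-segment case described above.
    segments = [match_segment(kw) for kw in target_keywords]
    subtitles = build_subtitles(target_keywords, initial_text)
    dubbing = text_to_speech(subtitles)                # subtitles -> audio stream
    return render_video(segments, subtitles, dubbing)  # final target video
```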
Optionally, the target-video generation process may be executed by the server: the terminal sends the target keywords to the server, which matches them to obtain the target video segments, synthesizes the target video from those segments, and feeds it back to the terminal; the terminal then displays the received target video in the video display interface.
In summary, in this embodiment, audio recognition is performed on the initial audio entered by the user to obtain recommended keywords, video segment matching is performed based on those keywords, and the target video is generated from the matched target video segments. Audio is thus converted into a related video: in a video generation scenario, the user obtains a video related to a piece of speech simply by entering that speech, which improves video generation efficiency and video publication efficiency. In addition, the keyword selection interface lets the user manually adjust the recommended keywords, so the generated target video better matches the user's needs.
To improve the accuracy of generating the target video from the recommended keywords, and to avoid both misrecognized keywords caused by speech-recognition errors and repeated audio entry caused by user mistakes, in one possible implementation the terminal displays the recommended keywords in a keyword selection interface that provides an editing function, so that the user can correct the keywords manually and the accuracy of the subsequently generated target video improves.
In an illustrative example, FIG. 4 shows a flowchart of a video generation method provided by another exemplary embodiment of this application. This embodiment is described using the example in which the method is applied to the terminal shown in FIG. 1. The method includes the following steps.
Step 401: in response to an audio input operation on an audio input interface, display a keyword selection interface based on the acquired initial audio, where the keyword selection interface contains at least one recommended keyword obtained by performing audio recognition on the initial audio.
Optionally, the initial audio used for converting audio into video may be recorded on the spot by the user or taken from a pre-recorded audio file. In an illustrative example, the terminal may acquire the initial audio in either of the following two ways.
First: in response to a trigger operation on the audio input control in the audio input interface, collect the initial audio through the microphone.
To meet the user's timeliness requirements for converting audio into video, in one possible implementation an audio input control is displayed in the audio input interface; the user can enter the initial audio on the spot through this control, and the terminal, upon receiving the trigger operation on it, collects the initial audio through the microphone.
Optionally, the trigger operation on the audio input control may be a click operation: the first click starts collection, a further click stops it, and the audio collected between the two clicks is determined as the initial audio. Alternatively, the trigger operation may be a long-press operation: collection starts when the user presses and holds the control, stops when the user releases it, and the audio collected during the press is determined as the initial audio.
Optionally, to keep an overly long recording from complicating subsequent video generation, one possible implementation sets an audio entry duration: when the user starts recording the initial audio, a countdown is displayed, and when it ends the recording stops automatically even if the user has not stopped it. Illustratively, the audio entry duration may be 30 s; optionally, it may also be customized by the user.
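A minimal sketch of that capped recording loop follows, assuming a hypothetical `read_chunk` callable that returns the next audio chunk together with a user-stopped flag; the 30-second default mirrors the example above.

```python
MAX_ENTRY_SECONDS = 30.0  # default cap from the example above; user-adjustable

def record_initial_audio(read_chunk, chunk_seconds: float = 0.5) -> bytes:
    """Collect audio until the user stops or the countdown expires."""
    chunks, elapsed = [], 0.0
    while elapsed < MAX_ENTRY_SECONDS:
        data, user_stopped = read_chunk()  # hypothetical microphone source
        chunks.append(data)
        elapsed += chunk_seconds
        if user_stopped:
            break
    return b"".join(chunks)
```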
FIG. 5 shows a schematic diagram of the initial-audio acquisition process according to an exemplary embodiment of this application. An audio input control 502 is displayed in the audio input interface 501. When the user clicks the control 502, the terminal receives the click operation and collects the initial audio through the microphone; when the user stops clicking the control 502, the interface 501 may display the prompt information 503 "uploading" to indicate that the terminal is acquiring the initial audio.
Second: in response to a trigger operation on the audio upload control in the audio input interface, obtain the audio file corresponding to the initial audio.
Optionally, if the terminal already stores the initial audio that the user wants to convert into video, then to spare the user a repeated audio entry operation, in one possible implementation the audio input interface includes an audio upload control. After the user clicks it, the terminal receives the trigger operation on the control, the user selects the audio file to be converted, and the terminal obtains the audio file corresponding to the initial audio.
Optionally, the trigger operation may be any of a click operation, a double-click operation, or a long-press operation, which is not limited in the embodiments of this application.
FIG. 6 shows a schematic diagram of the initial-audio acquisition process according to another exemplary embodiment of this application. An audio upload control 602 is displayed in the audio input interface 601. When the user clicks the control 602, the terminal receives the click operation and opens a folder, from which the user selects an audio file; upon receiving the upload operation on that file, the interface 601 may display the prompt information 603 "uploading" to indicate that the terminal is obtaining the audio file corresponding to the initial audio.
Optionally, the audio recognition of the initial audio is performed by the server. In an illustrative example, step 401 may include step 401A and step 401B.
Step 401A: in response to an audio input operation on the audio input interface, send the acquired initial audio to the server, which performs audio recognition on it and determines at least one recommended keyword.
In one possible implementation, after receiving the audio input operation on the audio input interface, the terminal sends the acquired initial audio to the server; the server performs audio recognition on it and feeds the determined recommended keywords back to the terminal.
Optionally, after performing audio recognition on the initial audio, the server obtains the initial text content corresponding to it and then extracts the recommended keywords from that text.
For the server's audio recognition and keyword extraction process, refer to the following embodiments; it is not repeated here.
Step 401B: display the keyword selection interface based on the recommended keywords sent by the server.
The terminal receives the recommended keywords sent by the server and can display them in the keyword selection interface.
Optionally, when displaying the recommended keywords, they may be shown in the order in which they appear in the initial text content corresponding to the initial audio.
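That ordering rule could be implemented along the lines below; placing keywords not found in the text (such as recommended hot words) at the end is an assumption of this sketch, not something the application specifies.

```python
def order_for_display(keywords: list[str], initial_text: str) -> list[str]:
    # Sort by position of first occurrence in the recognized text;
    # keywords absent from the text sort after all the others.
    def first_occurrence(keyword: str) -> int:
        index = initial_text.find(keyword)
        return index if index >= 0 else len(initial_text)
    return sorted(keywords, key=first_occurrence)
```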
Step 402: in response to a trigger operation on the add control in the keyword selection interface, add a new recommended keyword in the interface.
As noted above, providing an editing function in the keyword selection interface lets the user correct the recommended keywords manually, which improves the accuracy of the subsequently generated target video while avoiding misrecognized keywords and repeated audio entry.
Optionally, the editing function of the keyword selection interface includes at least one of: adding a recommended keyword, deleting a recommended keyword, and modifying a recommended keyword.
Because the recommended keywords are only some of the words in the initial text content rather than all of it, keywords contained in the initial audio may be absent from the keyword selection interface; the user may also have forgotten to speak some important keyword when recording the initial audio. To keep the user from re-recording the audio, which would raise the cost of converting audio into video, one possible implementation provides an add control in the keyword selection interface through which the user can add the desired recommended keywords; when the terminal receives the trigger operation on the add control, a new recommended keyword is added in the interface.
FIG. 7 shows the editing process for the keyword selection interface according to an exemplary embodiment of this application. The keyword selection interface 701 includes an add control 702, and each recommended keyword has a corresponding delete control 703. When the user clicks the add control 702, the terminal receives the trigger operation on it, and a new recommended keyword 705 is added in the interface 701.
Because the target video segments matched to the recommended keywords will later be synthesized into a video, and to keep the target video fluent and free of abrupt or disconnected segments, when the user adds a new keyword in the keyword selection interface the terminal can decide whether to add it directly by comparing the degree of association between the new keyword and the other recommended keywords. In an illustrative example, step 402 may include steps 402A through 402C.
Step 402A: in response to a trigger operation on the add control in the keyword selection interface, obtain the newly added keyword.
In one possible implementation, after receiving the trigger operation on the add control, the terminal obtains the newly added keyword and then determines whether to add it directly by comparing its degree of association with the other recommended keywords.
Step 402B: determine the degree of association between the newly added keyword and each recommended keyword.
The recommended keywords need a certain degree of mutual association so that the target video segments later determined from them are related to one another, which benefits the fluency of the target video. In one possible implementation, therefore, after the newly added keyword is obtained, its degree of association with each recommended keyword is determined and used to judge whether the keyword is added directly.
Optionally, if the newly added keyword is a word in the initial text content corresponding to the initial audio, it may be displayed in the keyword selection interface directly, without determining the degree of association.
Step 402C: in response to any degree of association exceeding the association threshold, add the new recommended keyword in the keyword selection interface.
In one possible implementation, if the degree of association between the newly added keyword and some recommended keyword is above the association threshold, the new keyword will benefit the subsequent synthesis of the target video, and the recommended keyword can be added directly in the keyword selection interface.
Illustratively, the association threshold may be preset by the developer, for example 85%.
Optionally, if the degree of association between the newly added keyword and every recommended keyword is below the threshold, the target video segment determined from it may relate poorly to the other target video segments and hurt the continuity of the subsequent target video, so the keyword is not added directly.
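Steps 402A through 402C might be gated roughly as follows; `association` stands in for whatever relatedness scorer an implementation uses (for example, cosine similarity of word embeddings), and the 0.85 default mirrors the 85% example above.

```python
from typing import Callable

ASSOCIATION_THRESHOLD = 0.85  # the 85% example threshold above

def try_add_keyword(new_keyword: str,
                    recommended: list[str],
                    initial_text: str,
                    association: Callable[[str, str], float]) -> bool:
    """Add the keyword if it comes from the original text or relates
    closely enough to at least one existing recommended keyword."""
    if new_keyword in initial_text:  # words from the initial audio bypass the check
        recommended.append(new_keyword)
        return True
    if any(association(new_keyword, kw) > ASSOCIATION_THRESHOLD for kw in recommended):
        recommended.append(new_keyword)
        return True
    return False  # low association everywhere: adding it may hurt video continuity
```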
Step 403: in response to a trigger operation on a target delete control in the keyword selection interface, delete the recommended keyword corresponding to that control.
Because the keyword selection interface displays not only the keywords extracted from the user's initial audio but also other hot words recommended based on them, the user may want to keep only the recommended keywords actually needed. In one possible implementation, the interface therefore includes target delete controls, and a trigger operation on a target delete control deletes the corresponding recommended keyword in the keyword selection interface.
As shown in FIG. 7, when the user needs to delete the recommended keyword 704, the user may click its delete control 703; the terminal receives the trigger operation on the control 703 and deletes the keyword 704 in the keyword selection interface 701.
More recommended keywords make the generated target video richer and the matched target video segments more accurate, so a count threshold is set: when the number of remaining recommended keywords falls below it, the user is reminded in time so that generation of the subsequent target video is not impaired. In an illustrative example, step 403 may include step 403A and step 403B.
Step 403A: in response to the trigger operation on the target delete control in the keyword selection interface, obtain the number of remaining recommended keywords.
In one possible implementation, after the trigger operation on the target delete control is received, the number of remaining recommended keywords is obtained so that it can then be judged whether they suffice to generate the target video.
Step 403B: in response to the keyword count being above the count threshold, delete the recommended keyword corresponding to the target delete control in the keyword selection interface.
When the keyword count is determined to be above the count threshold, the remaining recommended keywords suffice to generate an information-rich target video, and the recommended keyword corresponding to the target delete control is deleted directly.
Optionally, if the keyword count is below the count threshold, the remaining recommended keywords may not suffice to generate an information-rich target video, and the user must be further asked whether to delete the keyword; the terminal then displays first prompt information indicating the number of remaining keywords.
Illustratively, the count threshold may be set by the developer, for example 5.
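Steps 403A and 403B reduce to a small guard like the sketch below; the threshold of 5 mirrors the example above, and the returned prompt string is only an illustrative stand-in for the "first prompt information".

```python
from typing import Optional

KEYWORD_COUNT_THRESHOLD = 5  # example threshold from the embodiment above

def delete_keyword(recommended: list[str], keyword: str) -> Optional[str]:
    """Delete immediately if enough keywords remain; otherwise prompt first."""
    remaining = len(recommended) - 1
    if remaining > KEYWORD_COUNT_THRESHOLD:
        recommended.remove(keyword)
        return None
    # First prompt information: remind the user how many keywords would remain.
    return f"Only {remaining} keywords would remain; a rich target video may need more."
```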
Step 404: in response to an operation modifying a recommended keyword in the keyword selection interface, display the modified recommended keyword in the interface.
When the initial audio is misrecognized, the recommended keywords may differ from those the user intended. To keep the user from re-entering the initial audio and re-running keyword extraction, one possible implementation provides a modification operation on the recommended keywords in the keyword selection interface: the user modifies a keyword as needed and enters the revised version, and the terminal displays the modified keyword in the interface.
As shown in FIG. 7, when the user needs to modify the recommended keyword 704, long-pressing it lets the user edit it, and the modified recommended keyword 706 is then displayed in the keyword selection interface 701.
Step 405: in response to a trigger operation on the video synthesis control in the keyword selection interface, determine the recommended keywords displayed in the interface as the target keywords.
So that the terminal can tell that the user has finished editing the recommended keywords, one possible implementation displays a video synthesis control in the keyword selection interface; when the terminal receives a trigger operation on it, editing is considered complete, and the recommended keywords finally displayed in the interface are determined as the target keywords.
Step 406: in response to a video synthesis operation in the keyword selection interface, display a video display interface, where the video display interface contains a target video, the target video is synthesized from target video segments, and the target video segments are obtained by matching based on the target keywords.
Optionally, the video synthesis process is performed by the server. In an illustrative example, step 406 may include step 406A and step 406B.
Step 406A: in response to a trigger operation on the video synthesis control in the keyword selection interface, send the target keywords to the server, which performs video segment matching based on them to obtain at least one target video segment and synthesizes the target video from the segments.
In one possible implementation, after receiving the trigger operation on the video synthesis control, the terminal determines the target keywords from the recommended keywords finally displayed in the keyword selection interface and sends the server a video synthesis request containing those keywords; upon receiving the request, the server performs video segment matching based on the target keywords, matches at least one target video segment, and synthesizes the target video from the segments.
Step 406B: display the video display interface based on the target video sent by the server.
Optionally, the server feeds the synthesized target video back to the terminal, and the terminal displays the video display interface based on it.
FIG. 8 shows a schematic diagram of the video generation process according to another exemplary embodiment of this application. When the user clicks the audio input control 802 in the audio input interface 801, the terminal receives the audio input operation on the interface 801 and acquires the initial audio; the recommended keywords 804 obtained by audio recognition of the initial audio are displayed in the keyword selection interface 803; when the user clicks the synthesis control 805 in the interface 803, the terminal receives the video synthesis operation and sends a video synthesis request to the server; the terminal then displays the target video 807 in the video display interface 806 based on the video fed back by the server.
Step 407, responding to the playing operation of the target video in the video display interface, playing the target video, wherein the target video comprises a target subtitle, and the target subtitle comprises a target keyword.
Optionally, the target video is spliced by the target video segments and may also include target subtitles and dubbing, and the target video is indirectly generated by the target keywords, and correspondingly, the target subtitles included in the target video should also include the target keywords.
For the process of generating the target subtitle and dubbing, refer to the following embodiments; details are not repeated here.
Step 408, in response to a trigger operation on the re-synthesis control in the video display interface, displaying the keyword selection interface.
Optionally, a re-synthesis control is displayed in the video display interface.
If the target video does not meet the user's expectations, the user would otherwise have to re-input the audio and repeat the entire audio-to-video conversion. To avoid this, in one possible implementation manner the user may click the re-synthesis control in the video display interface to return to the keyword selection interface, re-edit the recommended keywords, and perform video synthesis again.
As shown in fig. 8, when the user clicks the re-synthesis control 809 in the video display interface 806, the keyword selection interface 803 may be displayed, and the recommended keyword editing operation can be performed again. Optionally, a publishing control 808 and a re-input control 810 are also displayed in the video display interface 806; the publishing control 808 is used to publish the target video 807 to other video platforms, and the re-input control 810 is used to return to the audio input interface 801 to perform the audio input operation again.
In this embodiment, the user can modify, delete, and add recommended keywords through the editing function provided by the keyword selection interface, so that the finally determined target keywords better match the user's expectations. This avoids repeatedly performing the audio-to-video conversion, improves the generation efficiency of high-quality videos, and can improve video publishing efficiency.
The above embodiments mainly describe the video generation process on the terminal side. Since video generation is completed through interaction between the terminal side and the server side, this embodiment mainly describes the video generation process on the server side.
Referring to fig. 9, a flowchart of a video generation method according to an exemplary embodiment of the present application is shown. In the embodiment of the present application, the method is described by taking an example in which the method is applied to the server shown in fig. 1, and the method includes:
Step 901, in response to receiving initial audio, performing audio recognition on the initial audio and determining at least one recommended keyword.
It should be noted that, in this embodiment, the process of performing audio recognition on the initial audio and the video synthesis process are both performed on the server side.
In one possible implementation manner, after the terminal acquires the initial audio, it may send the initial audio to the server; the server receives the initial audio, performs audio recognition on it, and determines at least one recommended keyword.
Optionally, after the server determines the recommended keywords, it does not directly perform video segment matching and subsequent video synthesis based on them; instead, it feeds the recommended keywords back to the terminal, so that the terminal can display a keyword selection interface based on the received recommended keywords.
Step 902, in response to a video synthesis request, performing video segment matching based on the obtained target keywords to obtain at least one target video segment, wherein the target keywords are determined by an editing operation in the keyword selection interface, and the keyword selection interface contains the recommended keywords.
To improve the accuracy of subsequent video generation, the accuracy of the target keywords (recommended keywords) must be ensured. Therefore, in one possible implementation, after the user edits the recommended keywords in the keyword selection interface of the terminal, the terminal sends the user-confirmed target keywords to the server, and the server executes the subsequent video generation process based on these target keywords.
Optionally, after receiving the video synthesis request, the server may obtain a target keyword from the video synthesis request, and further perform video segment matching based on the obtained target keyword to obtain at least one target video segment.
In one possible implementation manner, the similarity between the target keyword and the video tag corresponding to a candidate hotspot video may be computed, and if the similarity is higher than a similarity threshold, the candidate hotspot video is determined as the target video segment corresponding to the target keyword. Illustratively, the similarity threshold may be 85%.
Optionally, the candidate hotspot video may be input into a video understanding model, which extracts the spatio-temporal information of the candidate hotspot video, performs scene recognition, motion capture, and emotion analysis, and uses the extracted scene information, object information, character expressions, and motion information as the video tags of the candidate hotspot video.
Optionally, when calculating the similarity between the target keyword and a video tag, both may be converted into feature vectors, and the similarity between the two feature vectors is then computed.
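As a minimal sketch of this matching step (assuming keyword and tag embeddings are already produced by some text encoder, which this application does not name), cosine similarity can be compared against the illustrative 85% threshold:

```python
import numpy as np

SIMILARITY_THRESHOLD = 0.85  # the illustrative 85% threshold mentioned above

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def match_segments(keyword_vec, candidates):
    """candidates: list of (segment_id, tag_vec) pairs for candidate hotspot
    videos. Returns the ids of segments whose video-tag embedding is similar
    enough to the target keyword embedding."""
    return [seg_id for seg_id, tag_vec in candidates
            if cosine_similarity(keyword_vec, tag_vec) > SIMILARITY_THRESHOLD]
```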
Step 903, generating the target video based on the target video segments.
In one possible implementation manner, the server synthesizes the target video segments to obtain the target video and feeds the target video back to the terminal, which displays it in a video display interface.
Optionally, when the server synthesizes the target video from the target video segments, if a plurality of target keywords exist and two or more target video segments are matched, the target video can be synthesized by splicing the target video segments; the dubbing and subtitles in the target video can use the original dubbing and subtitles of the target video segments. Alternatively, the server can generate new target dubbing and target subtitles based on the target keywords, splice the target video segments, and synthesize them with the target dubbing and target subtitles to obtain the target video. Optionally, if only a single target keyword exists and it is matched to a single target video segment, the target video may be synthesized based on that target video segment, the target dubbing, and the target subtitle.
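As an illustration of the splicing step only, the following sketch uses the moviepy library (an assumption — this application does not name an implementation; the moviepy 1.x API is used) to concatenate matched segments and optionally replace the soundtrack with generated target dubbing:

```python
from moviepy.editor import VideoFileClip, AudioFileClip, concatenate_videoclips

def splice_segments(segment_paths, dubbing_path=None, out_path="target_video.mp4"):
    """Concatenate the matched target video segments in order; if a dubbing
    track is supplied, it replaces the segments' original audio."""
    clips = [VideoFileClip(p) for p in segment_paths]
    video = concatenate_videoclips(clips)
    if dubbing_path is not None:
        video = video.set_audio(AudioFileClip(dubbing_path))
    video.write_videofile(out_path)
    return out_path
```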
In summary, in the embodiments of the application, recommended keywords are obtained by performing audio recognition on the initial audio input by the user, video segment matching is performed based on the recommended keywords, and the target video is generated from the matched target video segments, realizing conversion from audio to related video. In a video generation scene, the user can obtain a video related to a piece of speech simply by inputting it, which improves both video generation efficiency and video publishing efficiency. In addition, a keyword selection interface is provided so that the user can manually adjust the recommended keywords, making the generated target video better match the user's needs.
When a plurality of target video segments are matched, they need to be spliced and synthesized, and the splicing order affects the fluency of the generated target video. In one possible implementation, the initial audio can provide a reference for the splicing order of the target video segments.
Referring to fig. 10, a flowchart of a video generation method according to another exemplary embodiment of the present application is shown. In the embodiment of the present application, the method is described by taking an example in which the method is applied to the server shown in fig. 1, and the method includes:
Step 1001, in response to receiving the initial audio, performing audio recognition on the initial audio to obtain initial text content.
In one possible implementation manner, when the server receives the initial audio sent by the terminal, it first performs audio recognition on the initial audio and converts the audio into initial text content.
The initial audio is converted into the initial text content using a speech recognition (audio recognition) method, for example, an algorithm based on dynamic time warping, a hidden Markov model based on a parametric model, a vector quantization method based on a non-parametric model, or an algorithm based on an artificial neural network.
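To make the first of these concrete, the sketch below implements the dynamic time warping distance that template-matching recognizers use to compare two feature sequences; extracting the per-frame features themselves (e.g., MFCC vectors) is assumed to happen elsewhere:

```python
import numpy as np

def dtw_distance(x, y):
    """Dynamic time warping distance between feature sequences x (length n)
    and y (length m), e.g. sequences of per-frame acoustic feature vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(x[i - 1] - y[j - 1])   # local frame distance
            cost[i, j] = d + min(cost[i - 1, j],       # insertion
                                 cost[i, j - 1],       # deletion
                                 cost[i - 1, j - 1])   # match
    return float(cost[n, m])
```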
Step 1002, extracting keywords from the initial text content, and determining at least one recommended keyword.
After the server acquires the initial text content, keyword extraction can be performed on the initial text content, and at least one recommended keyword is extracted from the initial text content.
Optionally, keyword extraction may be performed using artificial-intelligence natural language processing techniques: the initial text content is input into a keyword extraction model, and the keyword extraction model outputs a keyword sequence. The keyword extraction model is composed of an Embedding layer, a hidden layer combining a Long Short-Term Memory (LSTM) network with a normalized exponential (SoftMax) function, and a Conditional Random Field (CRF) supervision layer.
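A minimal sketch of such a sequence tagger is shown below (PyTorch is an assumed framework; the hyperparameters and B/I/O tag scheme are illustrative). It covers the Embedding and bidirectional LSTM/SoftMax layers; the CRF supervision layer, which would constrain the per-token scores to valid tag transitions, is noted in a comment and omitted for brevity:

```python
import torch
import torch.nn as nn

class KeywordTagger(nn.Module):
    """Embedding -> BiLSTM -> per-token SoftMax over B/I/O tags.
    A CRF supervision layer would normally sit on top of these emission
    scores to enforce valid tag transitions (omitted here)."""
    def __init__(self, vocab_size, num_tags=3, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim // 2,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(hidden_dim, num_tags)      # emission scores

    def forward(self, token_ids):                       # (batch, seq_len)
        hidden, _ = self.lstm(self.embed(token_ids))    # (batch, seq_len, hidden_dim)
        return torch.log_softmax(self.fc(hidden), dim=-1)

# Greedy decoding: contiguous tokens tagged B/I are collected as keywords.
model = KeywordTagger(vocab_size=30000)
tags = model(torch.randint(0, 30000, (1, 12))).argmax(dim=-1)
```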
Optionally, after the server obtains the initial text content, meaningless text segments, such as filler and modal words, may be deleted, and the initial text content with these segments removed is then input into the keyword extraction model for keyword extraction, which can improve keyword extraction efficiency.
Step 1003, in response to the video synthesis request, performing video segment matching based on the obtained target keywords to obtain at least one target video segment, wherein the target keywords are determined by an editing operation in the keyword selection interface, and the keyword selection interface contains the recommended keywords.
Optionally, the video composition request is sent by the terminal after receiving a video composition operation on the keyword selection interface, and in an exemplary example, step 1003 may include step 1003A and step 1003B.
Step 1003A, acquiring a target keyword based on a video synthesis request sent by the terminal, wherein the video synthesis request is sent by the terminal after receiving a video synthesis operation in the keyword selection interface.
In one possible implementation manner, when the terminal receives a trigger operation on the video synthesis control in the keyword selection interface, it determines that a video synthesis operation has been received and sends a video synthesis request to the server. The video synthesis request contains the target keywords, so that the server can perform the subsequent video synthesis operation based on them.
And 1003B, performing video segment matching based on the target keywords to obtain at least one target video segment.
Optionally, when video segment matching is performed based on the target keywords, a single target video segment may be determined for each target keyword, and one target video is then synthesized from the target video segments corresponding to the respective keywords.
Optionally, two or more target video segments may be matched for a single target keyword; combinations of different target video segments can then be used to generate a plurality of target videos that are pushed to the terminal, so that the user can select the target video with the best effect for publishing.
Step 1004, generating target text content based on the target keywords and the initial text content.
Since the target keywords may include not only recommended keywords from the initial text content but also keywords newly added by the user, in one possible implementation the initial text content may need to be modified based on the target keywords to generate the target text content.
In an illustrative example, step 1004 may include steps 1004A-1004C.
Step 1004A, in response to the target keywords belonging to the initial text content, generating the target text content based on the initial text content.
When the user has only deleted some of the recommended keywords, part of the initial text content is irrelevant to the target keywords. In one possible implementation, the initial text content is pruned based on the target keywords: text content that does not include any target keyword is deleted, and the pruned text content is determined as the target text content.
Optionally, if the initial text content includes all the target keywords and contains no irrelevant text content, the initial text content may be directly determined as the target text content.
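A minimal sketch of this pruning step, assuming sentence-level granularity and simple substring matching (both illustrative choices not fixed by this application):

```python
import re

def prune_text(initial_text, target_keywords):
    """Keep only the sentences that mention at least one target keyword.
    If no sentence matches, fall back to the unmodified initial text."""
    sentences = re.split(r"(?<=[.!?])\s+", initial_text.strip())
    kept = [s for s in sentences if any(kw in s for kw in target_keywords)]
    return " ".join(kept) if kept else initial_text

# Example
print(prune_text("We went surfing. Lunch was late. The sunset was red.",
                 ["surfing", "sunset"]))
# -> "We went surfing. The sunset was red."
```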
Step 1004B, in response to a target keyword not belonging to the initial text content, generating a target description text corresponding to the target keyword based on the target keyword.
In one possible implementation manner, if a target keyword is a newly added recommended keyword that the initial text content does not include, then in order for the generated target text content to include the target keyword, a target description text corresponding to it needs to be generated based on the target keyword and the context semantics of the initial text content; the target description text is then added to the initial text content to obtain the final target text content.
Optionally, since the final target text content is used to determine the target subtitle information in the target video, in other possible implementations the target description text corresponding to the target keyword may also be generated based on the target video segment corresponding to the target keyword. In an illustrative example, step 1004B further includes step three and step four.
Step three, acquiring the target video clip corresponding to the target keyword.
Optionally, if the initial text content does not include the target keyword, a target video segment corresponding to the target keyword may be obtained first, and then a target description text related to the target keyword is generated based on the target video segment.
Step four, determining the target description text corresponding to the target keyword based on the target video clip.
Optionally, a target description text related to the target keyword may be generated based on the original subtitle information of the target video segment.
Step 1004C, generating the target text content based on the initial text content and the target description text.
The target description text is added into the initial text content based on the context semantics of the initial text content, thereby generating the target text content.
Optionally, after the target text content is generated, a corresponding target subtitle may be generated based on the target text content, so as to be subsequently added to the target video.
Optionally, after the target subtitle is generated, the target subtitle may be converted into voice (dubbing) through a voice synthesis technique for subsequent addition to the target video.
Optionally, the dubbing may use the user's own voice: the user's voiceprint features are extracted from the initial audio, and speech synthesis is then performed based on these voiceprint features to generate dubbing in the user's own voice.
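As a generic stand-in for this step, the sketch below converts the target subtitle text into a dubbing track with the pyttsx3 text-to-speech library (an assumption; it does not perform the voiceprint conditioning described above, which would require a speaker-cloning synthesis model):

```python
import pyttsx3

def synthesize_dubbing(target_subtitle_text, out_path="dubbing.wav"):
    """Render the target subtitle text to an audio file. A production system
    conditioned on the user's voiceprint would replace this generic TTS."""
    engine = pyttsx3.init()
    engine.save_to_file(target_subtitle_text, out_path)
    engine.runAndWait()          # blocks until the file has been written
    return out_path
```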
Step 1005, splicing and synthesizing the target video segments based on the order of the target keywords in the target text content, to generate the target video.
To make the finally generated target video conform to the user's speaking habits (i.e., to the initial audio), in one possible implementation the target video segments can be spliced and synthesized in the order in which the target keywords appear in the target text content, thereby generating the target video.
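A minimal sketch of this ordering step (the keyword-to-segment mapping and file names are illustrative):

```python
def order_segments(target_text, keyword_to_segment):
    """Return segment paths in the order their keywords first appear in the
    target text, so the spliced video follows the user's speaking order."""
    ordered = sorted(keyword_to_segment,
                     key=lambda kw: target_text.find(kw))
    return [keyword_to_segment[kw] for kw in ordered]

# Example
order_segments(
    "First we surfed at the beach, then watched the sunset.",
    {"sunset": "clip_sunset.mp4", "beach": "clip_beach.mp4"},
)  # -> ["clip_beach.mp4", "clip_sunset.mp4"]
```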
Optionally, the target subtitle can be added to the spliced video, which is then combined with the dubbing and rendered to obtain the target video (recommended video) that is finally fed back to the terminal.
Step 1006, sending the target video to a terminal, where the terminal is configured to display the target video in a video display interface.
In a possible implementation manner, after the server generates the target video, the target video can be fed back to the terminal, so that the terminal can display the target video in the video display interface.
In this embodiment, the order of the target keywords in the target text content is obtained and the target video segments are spliced accordingly, so that the generated target video better conforms to the user's speaking habits, which improves the accuracy of the target video.
Referring to fig. 11, a flow chart of a video generation method according to an exemplary embodiment of the present application is shown. The method comprises the following steps:
Step 1101, the user inputs a piece of initial audio.
Step 1102, converting the initial audio into preliminary text content through speech recognition techniques, and deleting meaningless text segments.
Step 1103, performing keyword recognition and extraction on the recognized preliminary text content.
Step 1104, the user adds or adjusts the selected keywords.
Step 1105, expanding or pruning the initial text content according to the keywords to generate the final text content.
Step 1106, synthesizing the final text content into speech using speech synthesis techniques.
Step 1107, generating subtitle information from the final text content.
Step 1108, automatically labeling videos using deep learning techniques based on the video content.
Step 1109, performing tag retrieval and matching in the tag system of the video big-data set according to the keywords, and outputting the video data with a high matching degree.
Step 1110, synthesizing the matched video, subtitles, and audio to generate the recommended video.
The following are apparatus embodiments of the present application; for details not described in the apparatus embodiments, refer to the method embodiments above.
Fig. 12 is a block diagram of a video generation apparatus according to an exemplary embodiment of the present application. The device includes:
the first display module 1201 is configured to display a keyword selection interface based on an obtained initial audio in response to an audio input operation on an audio input interface, where the keyword selection interface includes at least one recommended keyword, and the recommended keyword is obtained by performing audio recognition on the initial audio;
a first determining module 1202, configured to determine at least one target keyword in response to an editing operation on the recommended keyword in the keyword selection interface;
a second display module 1203, configured to display a video display interface in response to a video synthesis operation in the keyword selection interface, where the video display interface includes a target video, the target video is obtained by synthesizing target video segments, and the target video segments are obtained by matching based on the target keywords.
Optionally, the first determining module 1202 includes:
the adding unit is used for responding to the triggering operation of adding the control in the keyword selection interface and newly adding recommendation keywords in the keyword selection interface;
the deleting unit is used for responding to triggering operation of a target deleting control in the keyword selecting interface and deleting the recommended keywords corresponding to the target deleting control in the keyword selecting interface;
the modification unit is used for responding to modification operation of the recommended keywords in the keyword selection interface and displaying the modified recommended keywords in the keyword selection interface;
and the first determining unit is used for responding to the triggering operation of the video synthesis control in the keyword selection interface and determining the recommended keywords displayed in the keyword selection interface as the target keywords.
Optionally, the deleting unit is further configured to:
responding to the trigger operation on the target deletion control in the keyword selection interface, and acquiring the number of remaining recommended keywords;
deleting the recommended keywords corresponding to the target deletion control in the keyword selection interface in response to the number of the keywords being higher than a number threshold;
the device further comprises:
and the third display module is used for responding to the condition that the number of the keywords is lower than the number threshold value, and displaying first prompt information, wherein the first prompt information is used for prompting the number of the remaining keywords.
Optionally, the adding unit is further configured to:
responding to the trigger operation of the adding control in the keyword selection interface, and acquiring a newly added keyword;
determining the association degree between the newly added keywords and each recommended keyword;
in response to the fact that the association degree is larger than an association degree threshold value, newly adding recommendation keywords in the keyword selection interface;
the device further comprises:
and the fourth display module is used for responding to the relevance degree smaller than the relevance degree threshold value and displaying second prompt information, wherein the second prompt information is used for prompting the relevance degree information.
Optionally, a re-synthesis control is displayed in the video display interface;
the device further comprises:
and the fifth display module is used for responding to the triggering operation of the re-synthesis control in the video display interface and displaying the keyword selection interface.
Optionally, the first display module 1201 includes:
the first acquisition unit is used for responding to the triggering operation of an audio input control in the audio input interface and acquiring the initial audio through a microphone;
or,
and the second acquisition unit is used for responding to the triggering operation of the audio uploading control in the audio input interface and acquiring the audio file corresponding to the initial audio.
Optionally, the apparatus further comprises:
and the playing module is used for responding to the playing operation of the target video in the video display interface and playing the target video, wherein the target video comprises a target caption, and the target caption comprises the target keyword.
Optionally, the first display module 1201 includes:
the first sending unit is used for responding to an audio input operation on the audio input interface and sending the obtained initial audio to a server, and the server is used for performing audio recognition on the initial audio and determining at least one recommended keyword;
and the first display unit is used for displaying the keyword selection interface based on the recommended keywords sent by the server.
Optionally, the second display module 1203 includes:
the second sending unit is used for responding to triggering operation of a video synthesis control in the keyword selection interface and sending the target keywords to a server, and the server is used for performing video segment matching based on the target keywords to obtain at least one target video segment and synthesizing the target video based on the target video segment;
and the second display unit is used for displaying the video display interface based on the target video sent by the server.
In summary, in the embodiments of the application, recommended keywords are obtained by performing audio recognition on the initial audio input by the user, video segment matching is performed based on the recommended keywords, and the target video is generated from the matched target video segments, realizing conversion from audio to related video. In a video generation scene, the user can obtain a video related to a piece of speech simply by inputting it, which improves both video generation efficiency and video publishing efficiency. In addition, a keyword selection interface is provided so that the user can manually adjust the recommended keywords, making the generated target video better match the user's needs.
Fig. 13 is a block diagram of a video generation apparatus according to an exemplary embodiment of the present application. The device includes:
a second determining module 1301, configured to perform audio recognition on an initial audio in response to receiving the initial audio, and determine at least one recommended keyword;
a third determining module 1302, configured to, in response to a video synthesis request, perform video segment matching based on an obtained target keyword to obtain at least one target video segment, where the target keyword is determined by an editing operation on a keyword selection interface, and the keyword selection interface includes the recommended keyword;
a first generating module 1303, configured to generate the target video based on the target video segment.
Optionally, the second determining module 1301 includes:
the identification unit is used for carrying out audio identification on the initial audio to obtain initial text content;
the second determining unit is used for extracting keywords from the initial text content and determining at least one recommended keyword;
the device further comprises:
the second generation module is used for generating target text content based on the target keywords and the initial text content;
the first generation module includes:
and the first generation unit is used for splicing and synthesizing all the target video segments based on the keyword sequence of the target keywords in the target text content to generate the target video.
Optionally, the second generating module includes:
a second generation unit configured to generate the target text content based on the initial text content in response to the target keyword belonging to the initial text content;
or,
a third generating unit, configured to generate, in response to a target keyword not belonging to the initial text content, a target description text corresponding to the target keyword based on the target keyword;
and the fourth generating unit is used for generating the target text content based on the initial text content and the target description text.
Optionally, the third generating unit is further configured to:
acquiring the target video clip corresponding to the target keyword;
and determining the target description text corresponding to the target keyword based on the target video segment.
Optionally, the third determining module 1302 includes:
a third obtaining unit, configured to obtain the target keyword based on the video composition request sent by a terminal, where the video composition request is sent by the terminal after receiving a video composition operation in the keyword selection interface;
a third determining unit, configured to perform video segment matching based on the target keyword to obtain at least one target video segment;
after the target video is generated based on the target video segment, the apparatus further includes:
and the sending module is used for sending the target video to the terminal, and the terminal is used for displaying the target video in a video display interface.
In summary, in the embodiments of the application, recommended keywords are obtained by performing audio recognition on the initial audio input by the user, video segment matching is performed based on the recommended keywords, and the target video is generated from the matched target video segments, realizing conversion from audio to related video. In a video generation scene, the user can obtain a video related to a piece of speech simply by inputting it, which improves both video generation efficiency and video publishing efficiency. In addition, a keyword selection interface is provided so that the user can manually adjust the recommended keywords, making the generated target video better match the user's needs.
An embodiment of the present application provides a computer device, which includes a processor and a memory, where the memory stores at least one program, and the at least one program is loaded and executed by the processor to implement the video generation method provided in the above optional implementation manner. Optionally, the computer device may be a terminal or a server.
When the computer device is a terminal, the terminal may be configured to execute the video generation method at the terminal side in the above-described alternative embodiment; when the computer device is a server, the server may be configured to perform the video generation method on the server side in the above-described alternative embodiment.
Referring to fig. 14, a schematic structural diagram of a computer device according to an embodiment of the present application is shown. The computer apparatus 1400 includes a Central Processing Unit (CPU) 1401, a system Memory 1404 including a Random Access Memory (RAM) 1402 and a Read-Only Memory (ROM) 1403, and a system bus 1405 connecting the system Memory 1404 and the Central Processing unit 1401. The computer device 1400 also includes a basic Input/Output system (I/O) 1406 that facilitates transfer of information between devices within the computer, and a mass storage device 1407 for storing an operating system 1413, application programs 1414, and other program modules 1415.
The basic input/output system 1406 includes a display 1408 for displaying information and an input device 1409, such as a mouse, keyboard, etc., for user input of information. Wherein the display 1408 and input device 1409 are both connected to the central processing unit 1401 via an input/output controller 1410 connected to the system bus 1405. The basic input/output system 1406 may also include an input/output controller 1410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, an input/output controller 1410 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1407 is connected to the central processing unit 1401 through a mass storage controller (not shown) connected to the system bus 1405. The mass storage device 1407 and its associated computer-readable media provide non-volatile storage for the computer device 1400. That is, the mass storage device 1407 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact disk Read-Only Memory) drive.
Without loss of generality, the computer-readable media may comprise computer storage media and communication media. Computer storage media include volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules, or other data. Computer storage media include RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash memory or other solid-state memory technology, CD-ROM, DVD (Digital Video Disc) or other optical storage, and magnetic disk or other magnetic storage devices. Of course, those skilled in the art will appreciate that computer storage media are not limited to the foregoing. The system memory 1404 and mass storage device 1407 described above may collectively be referred to as memory.
According to various embodiments of the present application, the computer device 1400 may also run by being connected to a remote computer on a network through a network such as the Internet. That is, the computer device 1400 may be connected to the network 1412 through the network interface unit 1411 connected to the system bus 1405, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 1411.
The memory also includes one or more programs stored in the memory and configured to be executed by the one or more central processing units 1401.
The present application further provides a computer-readable storage medium having at least one instruction, at least one program, a set of codes, or a set of instructions stored therein, which is loaded and executed by a processor to implement the video generation method provided by any of the above exemplary embodiments.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to execute the video generation method provided in the above-described alternative implementation.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (20)

1. A method of video generation, the method comprising:
responding to an audio input operation on an audio input interface, and displaying a keyword selection interface based on the acquired initial audio, wherein the keyword selection interface comprises at least one recommended keyword, and the recommended keyword is obtained by performing audio recognition on the initial audio;
responding to editing operation of the recommended keywords in the keyword selection interface, and determining at least one target keyword;
responding to the video synthesis operation in the keyword selection interface, and displaying a video display interface, wherein the video display interface comprises a target video, the target video is obtained by synthesizing target video segments, and the target video segments are obtained by matching based on the target keywords.
2. The method of claim 1, wherein determining at least one target keyword in response to the editing operation on the recommended keyword in the keyword selection interface comprises:
responding to the trigger operation of adding a control in the keyword selection interface, and newly adding a recommendation keyword in the keyword selection interface;
in response to a trigger operation of a target deleting control in the keyword selection interface, deleting a recommended keyword corresponding to the target deleting control in the keyword selection interface;
responding to modification operation of the recommended keywords in the keyword selection interface, and displaying the modified recommended keywords in the keyword selection interface;
and determining the recommended keywords displayed in the keyword selection interface as the target keywords in response to the triggering operation of the video synthesis control in the keyword selection interface.
3. The method of claim 2, wherein deleting the recommended keyword corresponding to the target deletion control in the keyword selection interface in response to the triggering operation on the target deletion control in the keyword selection interface comprises:
responding to the trigger operation on the target deletion control in the keyword selection interface, and acquiring the number of remaining recommended keywords;
deleting the recommended keywords corresponding to the target deletion control in the keyword selection interface in response to the number of the keywords being higher than a number threshold;
the method further comprises the following steps:
and responding to the condition that the number of the keywords is lower than the number threshold value, and displaying first prompt information, wherein the first prompt information is used for prompting the number of the remaining keywords.
4. The method of claim 2, wherein the newly adding a recommendation keyword in the keyword selection interface in response to the triggering operation of adding a control in the keyword selection interface comprises:
responding to the trigger operation of the adding control in the keyword selection interface, and acquiring a newly added keyword;
determining the association degree between the newly added keywords and each recommended keyword;
in response to the fact that the association degree is larger than an association degree threshold value, newly adding recommendation keywords in the keyword selection interface;
the method further comprises the following steps:
and responding to the relevance degree smaller than the relevance degree threshold value, and displaying second prompt information, wherein the second prompt information is used for prompting relevance degree information.
5. The method according to any one of claims 1 to 4, wherein a re-synthesis control is displayed in the video display interface;
after the displaying a video display interface in response to the video synthesis operation in the keyword selection interface, the method further comprises:
and responding to the triggering operation of the re-synthesis control in the video display interface, and displaying the keyword selection interface.
6. The method of any of claims 1 to 4, wherein responding to an audio input operation to the audio input interface comprises:
collecting the initial audio through a microphone in response to a triggering operation of an audio input control in the audio input interface;
or,
and responding to the triggering operation of the audio uploading control in the audio input interface, and acquiring an audio file corresponding to the initial audio.
7. The method of any of claims 1 to 4, wherein after displaying a video presentation interface in response to the video composition operation within the keyword selection interface, the method further comprises:
responding to the playing operation of the target video in the video display interface, and playing the target video, wherein the target video comprises a target subtitle, and the target subtitle comprises the target keyword.
8. The method according to any one of claims 1 to 4, wherein the displaying a keyword selection interface based on the obtained initial audio in response to an audio input operation on the audio input interface comprises:
responding to an audio input operation of the audio input interface, sending the obtained initial audio to a server, wherein the server is used for performing audio recognition on the initial audio and determining at least one recommended keyword;
and displaying the keyword selection interface based on the recommended keywords sent by the server.
9. The method of any of claims 1 to 4, wherein displaying a video presentation interface in response to the video composition operation within the keyword selection interface comprises:
responding to a triggering operation of a video synthesis control in the keyword selection interface, sending the target keyword to a server, wherein the server is used for performing video segment matching based on the target keyword to obtain at least one target video segment, and synthesizing the target video based on the target video segment;
and displaying the video display interface based on the target video sent by the server.
10. A method of video generation, the method comprising:
in response to receiving an initial audio, performing audio recognition on the initial audio, and determining at least one recommended keyword;
responding to a video synthesis request, and performing video segment matching based on the obtained target keywords to obtain at least one target video segment, wherein the target keywords are determined by editing operation on a keyword selection interface, and the keyword selection interface comprises the recommended keywords;
and generating a target video based on the target video segment.
11. The method of claim 10, wherein the performing audio recognition on the initial audio and determining at least one recommended keyword comprises:
performing audio recognition on the initial audio to obtain initial text content;
extracting keywords from the initial text content, and determining at least one recommended keyword;
before generating the target video based on the target video segment, the method further includes:
generating target text content based on the target keywords and the initial text content;
the generating the target video based on the target video segment comprises:
and splicing and synthesizing all the target video segments based on the keyword sequence of the target keywords in the target text content to generate the target video.
12. The method of claim 11, wherein generating target textual content based on the target keywords and the initial textual content comprises:
in response to the target keyword belonging to the initial textual content, generating the target textual content based on the initial textual content;
or,
in response to a target keyword not belonging to the initial text content, generating a target description text corresponding to the target keyword based on the target keyword;
and generating the target text content based on the initial text content and the target description text.
13. The method of claim 12, wherein generating the target description text corresponding to the target keyword based on the target keyword comprises:
acquiring the target video clip corresponding to the target keyword;
and determining the target description text corresponding to the target keyword based on the target video segment.
14. The method according to any one of claims 10 to 13, wherein the performing, in response to the video composition request, video segment matching based on the obtained target keyword to obtain at least one target video segment comprises:
acquiring the target keyword based on the video synthesis request sent by a terminal, wherein the video synthesis request is sent by the terminal after receiving the video synthesis operation in the keyword selection interface;
performing video segment matching based on the target keywords to obtain at least one target video segment;
after the target video is generated based on the target video segment, the method further comprises:
and sending the target video to the terminal, wherein the terminal is used for displaying the target video in a video display interface.
15. A video generation apparatus, characterized in that the apparatus comprises:
the first display module is used for responding to an audio input operation on an audio input interface, displaying a keyword selection interface based on the acquired initial audio, wherein the keyword selection interface comprises at least one recommended keyword, and the recommended keyword is obtained by performing audio recognition on the initial audio;
the first determination module is used for responding to the editing operation of the recommended keywords in the keyword selection interface and determining at least one target keyword;
and the second display module is used for responding to the video synthesis operation in the keyword selection interface and displaying a video display interface, wherein the video display interface comprises a target video, the target video is obtained by synthesizing target video clips, and the target video clips are obtained by matching based on the target keywords.
16. A video generation apparatus, characterized in that the apparatus comprises:
the second determination module is used for responding to the received initial audio, performing audio recognition on the initial audio and determining at least one recommendation keyword;
the third determining module is used for responding to a video synthesis request, performing video segment matching based on the obtained target keywords to obtain at least one target video segment, wherein the target keywords are determined by editing operation on a keyword selection interface, and the keyword selection interface comprises the recommended keywords;
a first generation module to generate the target video based on the target video segment.
17. A terminal, characterized in that it comprises a processor and a memory, in which at least one program is stored, which is loaded and executed by the processor to implement the video generation method according to any one of claims 1 to 9.
18. A server, characterized in that the server comprises a processor and a memory, in which at least one program is stored, which is loaded and executed by the processor to implement the video generation method according to any one of claims 10 to 14.
19. A computer-readable storage medium, in which at least one program is stored, the at least one program being loaded and executed by a processor to implement the video generation method according to any one of claims 1 to 9, or to implement the video generation method according to any one of claims 10 to 14.
20. A computer program product comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium by a processor of a computer device, the processor executing the computer instructions to implement the video generation method of any of claims 1 to 9, or to implement the video generation method of any of claims 10 to 14.
CN202111013239.9A 2021-08-31 2021-08-31 Video generation method, device, terminal, server and storage medium Active CN114286169B (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN202111013239.9A CN114286169B (en) 2021-08-31 2021-08-31 Video generation method, device, terminal, server and storage medium
PCT/CN2022/112842 WO2023029984A1 (en) 2021-08-31 2022-08-16 Video generation method and apparatus, terminal, server, and storage medium
US18/140,296 US12026354B2 (en) 2021-08-31 2023-04-27 Video generation

Publications (2)

Publication Number Publication Date
CN114286169A true CN114286169A (en) 2022-04-05
CN114286169B CN114286169B (en) 2023-06-20

Family ID: 80868479

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111013239.9A Active CN114286169B (en) 2021-08-31 2021-08-31 Video generation method, device, terminal, server and storage medium

Country Status (2)

Country Link
CN (1) CN114286169B (en)
WO (1) WO2023029984A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023029984A1 (en) * 2021-08-31 2023-03-09 腾讯科技(深圳)有限公司 Video generation method and apparatus, terminal, server, and storage medium
US12026354B2 (en) 2021-08-31 2024-07-02 Tencent Technology (Shenzhen) Company Limited Video generation

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105868176A (en) * 2016-03-02 2016-08-17 北京同尘世纪科技有限公司 Text based video synthesis method and system
WO2018214772A1 (en) * 2017-05-22 2018-11-29 腾讯科技(深圳)有限公司 Media data processing method and apparatus, and storage medium
CN109543102A (en) * 2018-11-12 2019-03-29 百度在线网络技术(北京)有限公司 Information recommendation method, device and storage medium based on video playing
CN110121116A (en) * 2018-02-06 2019-08-13 上海全土豆文化传播有限公司 Video generation method and device
CN111935537A (en) * 2020-06-30 2020-11-13 百度在线网络技术(北京)有限公司 Music video generation method and device, electronic equipment and storage medium
CN112752121A (en) * 2020-05-26 2021-05-04 腾讯科技(深圳)有限公司 Video cover generation method and device
CN112929746A (en) * 2021-02-07 2021-06-08 北京有竹居网络技术有限公司 Video generation method and device, storage medium and electronic equipment

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3542360A4 (en) * 2016-11-21 2020-04-29 Microsoft Technology Licensing, LLC Automatic dubbing method and apparatus
KR102407013B1 (en) * 2020-01-30 2022-06-08 목포대학교산학협력단 A method for controlling video playback application recommending search keywords and an apparatus therefor
CN112684913B (en) * 2020-12-30 2023-07-14 维沃移动通信有限公司 Information correction method and device and electronic equipment
CN114286169B (en) * 2021-08-31 2023-06-20 腾讯科技(深圳)有限公司 Video generation method, device, terminal, server and storage medium


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
谭乐娟 (Tan Lejuan): "Application Practice of Artificial Intelligence Technology in Video Editing", no. 08 *
飘零雪 (Piaolingxue): "Converting Text into Online Video", 电脑迷 (Computer Fan), no. 14 *


Also Published As

Publication number Publication date
CN114286169B (en) 2023-06-20
WO2023029984A1 (en) 2023-03-09
US20230259253A1 (en) 2023-08-17


Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code: HK; Ref legal event code: DE; Ref document number: 40067614
SE01 Entry into force of request for substantive examination
GR01 Patent grant