CN118474476A - AIGC-based travel scene video generation method, system, equipment and storage medium - Google Patents
AIGC-based travel scene video generation method, system, equipment and storage medium
- Publication number
- CN118474476A (application number CN202410370258.4A)
- Authority
- CN
- China
- Prior art keywords
- video
- sub
- image
- aigc
- script
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N21/00—Selective content distribution, e.g. interactive television or video on demand [VOD]
- H04N21/80—Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
- H04N21/81—Monomedia components thereof
- H04N21/816—Monomedia components thereof involving special video data, e.g. 3D video
- H04N21/8106—Monomedia components thereof involving special audio data, e.g. different tracks for different languages
- H04N21/8126—Monomedia components thereof involving additional data, e.g. news, sports, stocks, weather forecasts
- H04N21/8133—Monomedia components thereof involving additional data specifically related to the content, e.g. biography of the actors in a movie, detailed information about an article seen in a video program
- H04N21/8146—Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
- H04N21/8153—Monomedia components thereof involving graphical data comprising still images, e.g. texture, background image
Landscapes
- Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Signal Processing (AREA)
- Computer Graphics (AREA)
- Television Signal Processing For Recording (AREA)
Abstract
The invention provides an AIGC-based travel scene video generation method, system, equipment and storage medium. The method comprises: collecting relevant texts from the network based on a travel scene name, and extracting a tag combination from the texts, wherein the tag combination comprises at least two categories of tag subgroups; generating a video script sequence from the tag combination through a text generation model; matching a corresponding image for each video sub-script through an image search model, and establishing a mapping relation between the video sub-script and the matched image; generating a sub-video from the image by an AIGC video generation method; and arranging the sub-videos in order according to their corresponding video sub-scripts, then adding a subtitle file and a speech-converted audio file to generate the travel scene video. The invention combines multiple AIGC technologies with traditional video processing methods, and chains the content generation modules together through matching technology to form a complete automated video production pipeline.
Description
Technical Field
The invention relates to the field of speech recognition, and in particular to an AIGC-based travel scene video generation method, system, equipment and storage medium.
Background
With the rise of short videos, video has become one of the most popular and effective forms of advertising content. For marketing, video can attract users' attention and convey information, achieving the goal of promotion. However, creating high-quality video requires professional shooting and time-consuming editing.
Generative artificial intelligence, AIGC (Artificial Intelligence Generated Content), is a technology that, built on generative methods such as generative adversarial networks and large-scale pre-trained models, learns from and recognizes existing data to generate related content with appropriate generalization ability. The key idea of AIGC is to use artificial intelligence algorithms to generate content with a degree of creativity and quality. By training models on large amounts of data, AIGC can generate related content according to input conditions or instructions. For example, given keywords, descriptions, or samples, AIGC can generate matching articles, images, audio, and so on. At present, AIGC in China mainly appears as single-model applications, chiefly divided into text generation, image generation, video generation, and audio generation, with text generation serving as the basis of the other content-generation modes.
(1) Text generation
Text generation (AI Text Generation) uses artificial intelligence (AI) algorithms and models to generate text that mimics human-written content. It involves training a machine learning model on a large dataset of existing text so that it can generate new text similar in style, tone, and content to the input data.
(2) Image generation
Image generation (AI Image Generation): artificial intelligence can be used to generate images that are not the work of human artists. Such images are referred to as "AI-generated images". They may be realistic or abstract, and may also convey a specific theme or message.
(3) Speech generation
Speech generation (AI Audio Generation): the audio generation techniques of AIGC fall into two categories, text-to-speech synthesis and voice cloning. Text-to-speech synthesis takes text as input and outputs speech in a particular speaker's voice, and is mainly used for robotics and voice-broadcast tasks.
(4) Video generation
Video generation (AI Video Generation): the workflow is similar to image generation, processing the video frame by frame at the frame level, and then detecting the video clips using AI algorithms.
In travel scenarios, marketing personnel often need to produce videos for travel products to meet marketing demands and attract user traffic. Although there is plenty of software on the market to assist video editing, a great deal of manual work is still required and labor costs are high. Moreover, videos can only be edited one at a time: they cannot be produced in batches for different products, which makes publishing at scale difficult.
Accordingly, the invention provides an AIGC-based travel scene video generation method, system, equipment and storage medium.
Disclosure of Invention
In view of the problems in the prior art, the invention aims to provide an AIGC-based travel scene video generation method, system, equipment and storage medium that overcome the difficulties of the prior art, combine multiple AIGC technologies with traditional video processing methods, and chain the content generation modules together through matching technology to form a complete automated video production pipeline.
An embodiment of the invention provides an AIGC-based travel scene video generation method, comprising the following steps:
S110, collecting relevant texts from the network based on the travel scene name, and extracting a tag combination from the texts, wherein the tag combination comprises at least two categories of tag subgroups, and each category of tag subgroup comprises at least the most popular tag under that category;
S120, generating a video script sequence from the tag combination through a text generation model, wherein the video script sequence comprises a plurality of ordered video sub-scripts, and each video sub-script comprises a plurality of category tag groups;
S130, matching a corresponding image for each video sub-script through an image search model based on a preset matching threshold, and establishing a mapping relation between the video sub-script and the image that satisfies the match;
S140, generating a sub-video from the image by an AIGC video generation method; and
S150, arranging the sub-videos in order according to their corresponding video sub-scripts, and adding a subtitle file and a speech-converted audio file to generate the travel scene video.
Preferably, step S130 further comprises:
when some video sub-script fails to match an image satisfying the matching condition, inputting the unmatched video sub-script into an AI drawing model to generate an image, and establishing the mapping relation.
Preferably, step S130 further comprises:
when some video sub-script fails to match an image satisfying the matching condition, inputting the combination of the unmatched video sub-script and a transition script into an AI drawing model to generate an image, and establishing the mapping relation, wherein the transition script is the intersection of the category tag groups of the two already-matched video sub-scripts arranged before and after the unmatched video sub-script.
Preferably, the transition script further includes the average of the image color temperature parameters and the intersection of the image style labels of the two already-matched video sub-scripts arranged before and after the unmatched video sub-script.
Preferably, in step S110, the tag combination includes at least a travel scene name category tag group, a sub-scenic-spot category tag group, a best travel season category tag group, and a best viewing time period category tag group.
Preferably, in step S120 the text generation model is a GPT model; in step S130 the image search model is a CLIP model and the AI drawing model is a Midjourney model.
Preferably, step S150 comprises:
S151, arranging the sub-videos in order according to their corresponding video sub-scripts, and adding a subtitle file and a speech-converted audio file;
S152, reading the video data of the sub-video corresponding to each video sub-script, analyzing the video frames, and, taking the image parameters of a target frame as the reference, unifying the other videos to that standard frame, wherein the target frame is the image corresponding to the video sub-script;
S153, generating the travel scene video based on the time sequence of the sub-videos, the subtitle file, and the audio file.
An embodiment of the invention also provides an AIGC-based travel scene video generation system, used to implement the above AIGC-based travel scene video generation method, the system comprising:
a tag extraction module for collecting relevant texts from the network based on the travel scene name and extracting a tag combination from the texts, wherein the tag combination comprises at least two categories of tag subgroups, and each category of tag subgroup comprises at least the most popular tag under that category;
a video script module for generating a video script sequence from the tag combination through a text generation model, wherein the video script sequence comprises a plurality of ordered video sub-scripts, and each video sub-script comprises a plurality of category tag groups;
an image generation module for matching a corresponding image for each video sub-script through an image search model based on a preset matching threshold, and establishing a mapping relation between the video sub-script and the image that satisfies the match;
a video generation module for generating a sub-video from the image by an AIGC video generation method; and
a scene video module for arranging the sub-videos in order according to their corresponding video sub-scripts, and adding a subtitle file and a speech-converted audio file to generate the travel scene video.
An embodiment of the invention also provides AIGC-based travel scene video generation equipment, comprising:
A processor;
A memory having stored therein executable instructions of the processor;
Wherein the processor is configured to perform the steps of the above AIGC-based travel scene video generation method via execution of the executable instructions.
Embodiments of the present invention also provide a computer-readable storage medium storing a program that, when executed, implements the steps of the above AIGC-based travel scene video generation method.
The invention aims to provide an AIGC-based travel scene video generation method, system, equipment and storage medium that combine multiple AIGC technologies with traditional video processing methods and chain the content generation modules together through matching technology, forming a complete automated video production pipeline.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings.
FIG. 1 is a flow chart of a method of generating a travel scene video based on AIGC of the present invention.
FIG. 2 is a schematic diagram of an AIGC-based travel scene video generation system according to the present invention.
FIG. 3 is a schematic diagram of an AIGC-based travel scene video generation device of the present invention.
Fig. 4 is a schematic structural view of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Description
Other advantages and effects of the present application will be readily apparent to those skilled in the art from the following disclosure, which describes the embodiments of the present application by way of specific examples. The application may be practiced or carried out in other embodiments and with various details, and various modifications and alterations may be made to the details of the application from various points of view and applications without departing from the spirit of the application. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
The embodiments of the present application will be described in detail below with reference to the attached drawings so that those skilled in the art to which the present application pertains can easily implement the present application. This application may be embodied in many different forms and is not limited to the embodiments described herein.
In the context of the present description, reference to the terms "one embodiment," "some embodiments," "examples," "particular examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present application. Furthermore, the particular features, structures, materials, or characteristics may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples, as well as features of various embodiments or examples, presented herein may be combined and combined by those skilled in the art without conflict.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the context of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
For the purpose of clarity of explanation of the present application, components that are not related to the explanation are omitted, and the same or similar components are given the same reference numerals throughout the description.
Throughout the specification, when a device is said to be "connected" to another device, this includes not only the case of "direct connection" but also the case of "indirect connection" with other elements interposed therebetween. In addition, when a certain component is said to be "included" in a certain device, unless otherwise stated, other components are not excluded, but it means that other components may be included.
When a device is said to be "on" another device, this may be directly on the other device, but may also be accompanied by other devices therebetween. When a device is said to be "directly on" another device in contrast, there is no other device in between.
Although the terms first, second, etc. may be used herein to describe various elements in some instances, the elements should not be limited by these terms. The terms are only used to distinguish one element from another, for example, a first interface from a second interface. Furthermore, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises," "comprising," "includes," and/or "including" specify the presence of stated features, steps, operations, elements, components, items, categories, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, items, categories, and/or groups. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means "any of the following: A; B; C; A and B; A and C; B and C; A, B and C". An exception to this definition occurs only when a combination of elements, functions, steps or operations is in some way inherently mutually exclusive.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the language clearly indicates the contrary. The meaning of "comprising" in the specification is to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not preclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.
Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Terms defined in commonly used dictionaries are to be interpreted as having meanings consistent with their context in the relevant technical literature and the present disclosure, and are not to be interpreted in an idealized or overly formal sense unless expressly so defined.
Online travel (Online Travel Agency, OTA) is a term of the travel e-commerce industry. It refers to travel consumers ordering travel products or services from travel service providers over the network and paying online or offline; that is, each travel agent can conduct product marketing or product sales over the network. The invention provides a method for automatically editing AIGC-based product-introduction videos in a travel scenario, mainly for online travel. FIG. 1 is a flow chart of the AIGC-based travel scene video generation method of the present invention. As shown in FIG. 1, the method comprises:
S110, collecting relevant texts from the network based on the travel scene name, and extracting a tag combination from the texts, wherein the tag combination comprises at least two categories of tag subgroups, and each category of tag subgroup comprises at least the most popular tag under that category;
S120, generating a video script sequence from the tag combination through a text generation model, wherein the video script sequence comprises a plurality of ordered video sub-scripts, and each video sub-script comprises a plurality of category tag groups;
S130, matching a corresponding image for each video sub-script through an image search model based on a preset matching threshold, and establishing a mapping relation between the video sub-script and the matched image;
S140, generating a sub-video from the image by an AIGC video generation method; and
S150, arranging the sub-videos in order according to their corresponding video sub-scripts, and adding a subtitle file and a speech-converted audio file to generate the travel scene video.
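The five steps S110 through S150 above can be sketched as a single pipeline. The following is only an illustrative sketch, not the patent's implementation: each stage is an injected callable standing in for the component the patent names (text collection, the text generation model, the image search model, AIGC video generation, final assembly), and every name and stand-in value below is hypothetical.

```python
def generate_travel_video(scene_name, collect_texts, extract_tags,
                          write_script, match_image, make_sub_video,
                          assemble):
    """S110-S150 as one pipeline; each stage is an injected callable."""
    texts = collect_texts(scene_name)                     # S110: web texts
    tag_combination = extract_tags(texts)                 # S110: tag subgroups
    sub_scripts = write_script(tag_combination)           # S120: ordered sub-scripts
    mapping = [(s, match_image(s)) for s in sub_scripts]  # S130: script->image
    sub_videos = [make_sub_video(img) for _, img in mapping]  # S140
    return assemble(sub_videos, sub_scripts)              # S150: subtitles + audio

# Trivial stand-ins so the flow can be exercised end to end.
video = generate_travel_video(
    "West Lake",
    collect_texts=lambda name: [f"{name} travel notes"],
    extract_tags=lambda texts: {"scene_name": ["West Lake"]},
    write_script=lambda tags: ["shot 1: lake at dawn", "shot 2: pagoda"],
    match_image=lambda script: f"img<{script}>",
    make_sub_video=lambda img: f"vid<{img}>",
    assemble=lambda vids, scripts: "|".join(vids),
)
```

Because the stages are injected, each AIGC model can be swapped independently, which mirrors how the patent chains separately generated content modules through matching.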
The invention provides a method that goes automatically from a travel product's introduction text to a finished video, enabling fast, simple, batch video production. This automatic video editing technology can replace manual work in script writing, material collection, video integration, and so on, saving editing time and improving production efficiency. To fit the specific scenario of the video generation application, the invention builds a video material library from manual experience and knowledge. The invention covers text generation, speech processing, image search, image generation, and video generation technologies. Compared with traditional video editing software, the invention saves 75% of the manpower.
In a preferred embodiment, step S130 further comprises: when some video sub-script fails to match an image satisfying the matching condition, inputting the unmatched video sub-script into an AI drawing model to generate an image and establishing the mapping relation, although this is not limiting.
In a preferred embodiment, step S130 further comprises: when some video sub-script fails to match an image satisfying the matching condition, inputting the combination of the unmatched video sub-script and a transition script into an AI drawing model to generate an image and establishing the mapping relation, wherein the transition script is the intersection of the category tag groups of the two already-matched video sub-scripts arranged before and after the unmatched video sub-script, although this is not limiting.
When one part of the video sub-scripts can be matched to corresponding images while another part cannot, directly generating images for the unmatched sub-scripts easily produces obvious differences in visual experience between the generated and matched images, in color temperature, contrast, histogram, color style labels, image style labels, and so on; the videos produced from them would likewise differ visibly, harming the overall visual experience of the travel scene video. In a preferred embodiment, the transition script therefore further includes the intersection of the image style labels and the average of the image color temperature parameters of the two already-matched video sub-scripts arranged before and after the unmatched one, so that the generated image is visually as close as possible to the images matched to the adjacent sub-scripts, softening abrupt style changes, strengthening the visual consistency of the travel scene video, and improving its quality.
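Under this embodiment, the transition script for an unmatched sub-script can be assembled from its two already-matched neighbours. The sketch below (field names and values are hypothetical) combines the three ingredients just described: the tag-group intersection, the colour-temperature average, and the style-label intersection.

```python
def build_transition_script(prev_tags, next_tags,
                            prev_color_temp, next_color_temp,
                            prev_styles, next_styles):
    """Transition script for an unmatched sub-script, derived from the
    matched sub-scripts placed immediately before and after it."""
    return {
        # shared category tags of the two neighbours
        "tags": sorted(set(prev_tags) & set(next_tags)),
        # mean of the neighbours' image colour-temperature parameters
        "color_temp": (prev_color_temp + next_color_temp) / 2,
        # shared image style labels of the two neighbours
        "styles": sorted(set(prev_styles) & set(next_styles)),
    }

t = build_transition_script(
    prev_tags=["West Lake", "dawn"], next_tags=["West Lake", "pagoda"],
    prev_color_temp=5200, next_color_temp=6000,
    prev_styles=["photoreal", "warm"], next_styles=["photoreal"],
)
```

Feeding these shared attributes to the drawing model together with the unmatched sub-script is what keeps the generated frame visually close to its neighbours.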
In a preferred embodiment, in step S110 the tag combination includes at least, but is not limited to, a travel scene name category tag group, a sub-scenic-spot category tag group, a best travel season category tag group, and a best viewing time period category tag group.
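A minimal sketch of how step S110's popularity-based extraction could populate these four category tag groups. The tag lexicon, the sample texts, and the use of raw occurrence counts as "popularity" are all illustrative assumptions, not the patent's method.

```python
from collections import Counter

def extract_tag_combination(texts, tag_lexicon, top_k=1):
    """Count how often each known tag appears in the collected texts
    (a stand-in for 'popularity'), then keep the top-k tags per category."""
    counts = {cat: Counter() for cat in tag_lexicon}
    for text in texts:
        for cat, tags in tag_lexicon.items():
            for tag in tags:
                counts[cat][tag] += text.count(tag)
    return {cat: [t for t, n in counts[cat].most_common(top_k) if n > 0]
            for cat in counts}

# Hypothetical lexicon mirroring the four category tag groups above.
lexicon = {
    "scene_name": ["West Lake"],
    "sub_spot": ["Broken Bridge", "Leifeng Pagoda"],
    "best_season": ["autumn", "spring"],
    "best_time": ["sunset", "dawn"],
}
texts = [
    "West Lake in autumn: watch the sunset from the Broken Bridge",
    "The Broken Bridge at West Lake is best at sunset in autumn",
]
combo = extract_tag_combination(texts, lexicon)
```

The resulting `combo` holds, per category, the most popular tag found in the collected texts, matching the requirement that each tag subgroup carries at least the most popular tag under its category.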
In a preferred embodiment, in step S120 the text generation model is a GPT model, and in step S130 the image search model is a CLIP model and the AI drawing model is a Midjourney model, although these are not limiting. The GPT model uses GPT technology: given an input script output template and specific text information, it rewrites the input and outputs a concise, vivid video script. GPT, in full Generative Pre-trained Transformer, is a deep-learning model for text generation trained on Internet-scale data. The CLIP model (Contrastive Language-Image Pre-training) is a large-scale image-text pre-training model based on contrastive learning. In the training phase, for the data in one batch, the images or videos and the texts are first passed through the image encoder and text encoder respectively; inner products are then computed between all image/video features and all text features, yielding a similarity matrix. The objective function maximizes the inner products of matching image-text pairs and minimizes those of unrelated pairs. In the inference phase, the image encoding is obtained through a ViT (Vision Transformer) and the text encoding through a Text Transformer; the text and image features are embedded through a linear mapping layer into the same feature dimension and L2-normalized to keep their numerical scales consistent; the similarity between text and image is computed; and the image/video-text pair with the highest similarity within a batch is taken. The Midjourney model is an image generation technique based on a diffusion process, using a diffusion-model algorithm to generate high-quality images.
The core idea of this model is to simulate a physical diffusion process in the image: an initial temperature distribution changes gradually, with each pixel interacting with its neighbors and spreading heat by diffusion. This process can be modeled with a heat-conduction equation, and the temperature of each pixel computed by numerical solution. The key innovation of the Midjourney model is the use of a reversible diffusion process, which means it can generate high-quality images without loss of information. Because the diffusion process is reversible, the original image can be restored by reverse diffusion, so a more realistic and detailed image can be generated without losing any important information.
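The CLIP retrieval step described above — encode, L2-normalize, score by inner product, keep only matches that clear the preset threshold — reduces to cosine similarity between embeddings. A toy sketch with hand-made 2-D "embeddings" (the real model uses high-dimensional ViT/Transformer features; the threshold value here is arbitrary):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length so inner product = cosine similarity."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def best_match(text_vec, image_vecs, threshold=0.8):
    """Return the index of the best-scoring image, or None when even the
    best score fails the preset matching threshold (the S130 miss case)."""
    t = l2_normalize(text_vec)
    scores = [sum(a * b for a, b in zip(t, l2_normalize(v)))
              for v in image_vecs]
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] >= threshold else None

images = [[1.0, 0.0], [0.6, 0.8]]
hit = best_match([0.9, 0.1], images)          # close to image 0
miss = best_match([0.0, 1.0], images, 0.9)    # nothing clears 0.9
```

The `None` branch is exactly the situation the preferred embodiments handle by falling back to the AI drawing model with a transition script.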
In a preferred embodiment, step S150 includes:
S151, sequentially arranging the sub-videos based on the corresponding video sub-script language and adding a subtitle file and an audio file converted from voice;
S152, reading in video data of the sub-video corresponding to the video sub-script language, analyzing the video frames, and unifying the other videos to a standard frame according to a target frame, based on the image parameters of the target frame as a reference, wherein the target frame is the image corresponding to the video sub-script language;
S153, generating the travel scene video based on the time sequence of the sub-videos, the subtitle file, and the audio file, but not limited thereto.
The present invention relates to text generation, image search, image generation, and video generation technologies in AI-generated content (AIGC, Artificial Intelligence Generated Content), as well as conventional text-to-speech (TTS) speech synthesis. The text generation technology adopts GPT-3.5 to distill and polish a long, redundant product introduction into a video script text. The speech processing technology is mainly TTS: a timbre is selected, and the generated video script text is converted into voice-over audio used as the video's dubbing. For collecting video material, an image search and matching technology is adopted, mainly the CLIP model and the image ViT technique; when images are insufficient, Midjourney generates content-related images from the video script text. Video generation involves two technologies: one adds dynamic effects to one or more pictures to produce video through animation changes, frame processing methods, and the like; the other converts an image into a moving image through an Image2Video AIGC technique applied via a template. The final video editing process involves resizing, automatic cropping, automatic subtitle addition, audio synthesis, dynamic-effect addition, automatic splicing, and similar operations on the video clips.
The invention provides an automatic video generation method under a specific travel scene. The invention mainly comprises a GPT video script generation module, an image video search module, an image generation module, a video generation module, a text-to-speech module and a video automatic editing module. The technical scheme is realized in the following way:
(1) GPT video script generation module
By adopting GPT technology, a given input script output template and specific text information are rewritten and output as a concise and vivid video scripting language.
(2) Image video searching module
The CLIP (Contrastive Language-Image Pre-training) model is used, a large-scale image-text pre-training model based on contrastive learning. In the training phase, for the data in one batch, the image encoder and text encoder first extract features from the images (or videos) and the texts; the inner products between all image/video features and all text features are then computed, yielding a similarity matrix. The objective function maximizes the inner products of matched image-text feature pairs and minimizes the inner products of unrelated pairs. In the inference stage, the image code is obtained through a ViT (Vision Transformer) and the text code through a Text Transformer; the text and image features are embedded through a linear mapping layer, mapped to the same feature dimension, and L2-normalized to ensure consistency of numerical scale; the similarity of text and image is calculated; and the image/video-text result with the highest similarity in the batch is taken.
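The inference-stage matching described above — L2-normalize both embeddings, take the inner product, keep the highest-scoring candidate — can be sketched as follows. The function names and toy feature vectors are illustrative assumptions, not CLIP's actual API:

```python
import math

def l2_normalize(v):
    # Scale a feature vector to unit length (the L2 standardization step).
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def best_match(text_feat, image_feats):
    """Return the index of the image whose normalized embedding has the
    highest inner product (cosine similarity) with the text embedding --
    the batch similarity matrix described above, reduced to one text query."""
    t = l2_normalize(text_feat)
    sims = [sum(a * b for a, b in zip(t, l2_normalize(img)))
            for img in image_feats]
    return max(range(len(sims)), key=sims.__getitem__)
```

In the real pipeline, `text_feat` and `image_feats` would come from the Text Transformer and ViT encoders after the linear mapping layer.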
(3) Image production module
Script sentences for which the image/video search module cannot obtain a result are input into Midjourney. Midjourney's main principle is to collect a large amount of existing artwork data, perform algorithmic analysis on that data, and generate various types of images from keywords.
(4) Video generation module
The video generation module has two modes. One adopts an AIGC video generation method that processes each frame of the image and links the frames with motion. The other produces a sliding video from images through dynamic effects, generating different sliding effects on a single image or applying an effect to each of a plurality of pictures.
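The "sliding" mode can be sketched as a sequence of crop windows panning across a still image (a Ken Burns style effect); each rectangle would then be cropped and resampled into one output frame. The function name and parameters are illustrative assumptions:

```python
def pan_crops(img_w, img_h, crop_w, crop_h, n_frames):
    """Crop rectangles (x, y, w, h) for a left-to-right pan across a still
    image: one rectangle per output frame, vertically centered, with the
    horizontal offset interpolated linearly from 0 to the right edge."""
    max_x = img_w - crop_w
    y = (img_h - crop_h) // 2
    return [(round(max_x * i / (n_frames - 1)), y, crop_w, crop_h)
            for i in range(n_frames)]
```

Varying the interpolation path (top-to-bottom, zoom-in) yields the "different sliding effects" mentioned in the text.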
(5) Text-to-speech module
Speech synthesis technology is used to convert text into speech: an acoustic model converts characters and phonemes into acoustic features, and a vocoder converts the acoustic features into waveforms to generate speech.
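The two-stage split (acoustic model, then vocoder) can be illustrated with a deliberately tiny sketch: the "acoustic model" maps characters to per-phoneme pitch and duration, and the "vocoder" renders those features as sine-wave samples. The phoneme table, pitches, and function names are all toy assumptions, not a real TTS system:

```python
import math

# Toy grapheme-to-pitch table (Hz) -- illustrative only.
PHONEME_PITCH = {"a": 220.0, "o": 196.0, "e": 247.0}

def acoustic_features(text):
    # "Acoustic model": map each known character to (pitch_hz, duration_s).
    return [(PHONEME_PITCH[c], 0.1) for c in text if c in PHONEME_PITCH]

def vocoder(features, sample_rate=8000):
    # "Vocoder": turn acoustic features into waveform samples,
    # here a pure sine tone per phoneme.
    samples = []
    for pitch, dur in features:
        n = int(dur * sample_rate)
        samples.extend(math.sin(2 * math.pi * pitch * i / sample_rate)
                       for i in range(n))
    return samples
```

A production pipeline would replace both stages with neural models, but the data flow — text to features to waveform — is the same.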
(6) Video editing module
The different sub-videos are arranged in script order, and the subtitle file and the audio file converted from voice are added. The different video data are read in through scripts, the video frames are analyzed, and the other videos are unified to a standard frame according to the target frame, resolving the visual artifacts that appear when videos with different frame parameters are spliced; finally, all videos are read in sequence.
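Unifying clips of different sizes to a standard frame typically means scaling each clip to fit the target resolution while preserving aspect ratio, then padding (letterboxing) the remainder. A sketch of that geometry, with assumed function and parameter names:

```python
def fit_to_standard(src_w, src_h, std_w, std_h):
    """Scale source dimensions to fit inside the standard frame without
    distortion, and report the symmetric padding needed on each axis so
    clips of different sizes splice cleanly."""
    scale = min(std_w / src_w, std_h / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    pad_x, pad_y = (std_w - new_w) // 2, (std_h - new_h) // 2
    return new_w, new_h, pad_x, pad_y
```

In practice the actual pixel work would be done by a video library (e.g. an FFmpeg scale-and-pad filter), with this arithmetic supplying the target geometry.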
The technical effects of the invention include: fully automating the generation of video from product introduction text in a specific field; and matching images and videos that fit the text description within a given POI gallery and video gallery, making the content text information richer.
The invention provides an AI video generation and production tool: a method flow that uses AIGC algorithms to generate videos from description texts in a specific field. It helps the user save time and cost, and improves efficiency and the degree of fit between video and text. By inputting the product profile to be generated and all POI scenic spots, the method automatically calls each module to generate the video.
The embodiment of the invention designs an automatic method for automatically producing introduction videos from product information in the travel field, which mainly comprises the following steps:
Firstly, selecting a tourist product, and obtaining product introduction information and POI under the tourist product;
converting product introduction information of the travel product into a document conforming to video editing by using a GPT technology;
Dividing the text according to punctuation; combining each divided segment with the POI information it contains as search terms; searching the video library for the video or image that best fits the text information; and, if no video is obtained but an image is, generating a video fragment through AIGC video generation;
generating a dubbing file from the segmented text;
Generating a subtitle file from the segmented text;
synthesizing the video clips, dubbing files and subtitle files into sub-videos;
And (3) adjusting the sizes and cutting all the sub videos, and adding a link animation to generate a complete video.
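The steps above can be sketched as one orchestration function. Every callable argument is a hypothetical stand-in for the corresponding module (GPT rewrite and split, CLIP search, Midjourney fallback, TTS, subtitles, editing); none of these names come from the source:

```python
def make_travel_video(product_intro, poi_list,
                      gen_script, search_clip, gen_image, image_to_video,
                      gen_dubbing, gen_subtitles, assemble):
    """Run the pipeline: script generation, per-sentence clip search with an
    AIGC image-generation fallback, dubbing and subtitles, final assembly."""
    sentences = gen_script(product_intro, poi_list)        # GPT rewrite + split
    sub_videos = []
    for s in sentences:
        # Prefer a searched clip; otherwise generate an image and animate it.
        clip = search_clip(s) or image_to_video(gen_image(s))
        sub_videos.append((clip, gen_dubbing(s), gen_subtitles(s)))
    return assemble(sub_videos)                            # resize, crop, splice
```

The control flow mirrors the enumerated steps: only sentences the search module cannot serve fall through to image generation.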
One embodiment of the present invention includes:
First, based on the travel scene name, relevant texts are collected on the network, and at least a tag combination is extracted from the texts. The tag combination comprises at least two kinds of tag subgroups, each kind comprising at least one tag with the highest heat under its category; the tag combination at least comprises a travel scene name category tag subgroup, a sub-scenic-spot category tag subgroup, an optimal travel season category tag subgroup, and a scene optimal scenic time period category tag subgroup.
Then, a video script sequence is generated based on tag combinations through a text generation model, the video script sequence comprising a number of order-based video sub-scripting languages, the video sub-scripting languages comprising a number of classes of tag groups, the text generation model being a GPT model.
Then, for each video sub-script language, a corresponding image is matched through an image search model based on a preset matching threshold, and a mapping relation between the video sub-script language and the matched image is established; the image search model is a CLIP model. When some video sub-script languages are not matched to an image satisfying the matching condition, the combination of the unmatched video sub-script language and a transition script language is input into an AI painting model to generate an image, and the mapping relation is established. The transition script language is the intersection of the corresponding class tag groups of the matched video sub-script languages arranged before and after the unmatched one; the AI painting model is a Midjourney model. The transition script language further comprises the intersection of the image style labels, and the average of the image color temperature parameters, of the two matched video sub-script languages arranged before and after the unmatched one. If image generation were performed directly on the unmatched video sub-script language, the generated image would easily differ noticeably from the matched images in color temperature, contrast, histogram, color style labels, and image style labels. The videos produced from generated and matched images would then show obvious visual differences and feel disjointed, harming the overall visual experience of the travel scene video.
In a preferred embodiment, the transition scripting language further includes the average of the image color temperature parameters and the intersection of the image style labels of the two matched video sub-scripting languages arranged before and after the video sub-scripting language that matched no image, but is not limited thereto. The invention analyzes and identifies the images matched to the adjacent video sub-script languages of the one whose image must be generated, obtains the visual parameters of those images (such as color temperature parameters, contrast parameters, histogram parameters, color style labels, and image style labels), takes the average or intersection of the two label sets, combines this information as a transition script language with the unmatched video sub-script language, and inputs the result into the AI drawing model to generate the image. The generated image is thereby as visually similar as possible to the images matched to the adjacent video sub-script languages, weakening abrupt changes in video style, enhancing the visual consistency of the travel scene video, and improving its quality.
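The construction of the transition scripting language reduces to a set intersection plus a numeric average over the two neighbouring matches. A minimal sketch, with assumed function and field names:

```python
def transition_script(prev_tags, next_tags, prev_color_temp, next_color_temp):
    """Build the transition scripting language from the two matched
    neighbours: the intersection of their image-style labels plus the
    mean of their colour-temperature parameters."""
    shared_styles = sorted(set(prev_tags) & set(next_tags))
    avg_color_temp = (prev_color_temp + next_color_temp) / 2
    return {"style_labels": shared_styles, "color_temp": avg_color_temp}
```

The resulting dictionary would be serialized into the prompt alongside the unmatched video sub-script language before it is sent to the AI drawing model.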
An AIGC video generation method is used to generate the sub-videos based on the images.
Finally, the sub-videos are sequentially arranged based on the corresponding video sub-script language, and the subtitle file and the audio file converted from voice are added. The video data of the sub-video corresponding to each video sub-script language are read in, the video frames are analyzed, and the other videos are unified to a standard frame according to the target frame, based on the image parameters of the target frame as a reference, wherein the target frame is the image corresponding to the video sub-script language. The travel scene video is generated based on the time sequence of the sub-videos, the subtitle file, and the audio file.
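Arranging sub-videos "based on the time sequence" amounts to computing cumulative start offsets from the clip durations, so that each sub-video's subtitle and dubbing audio can be placed at the right point on the final timeline. A sketch, with assumed names:

```python
def build_timeline(durations):
    """Cumulative start times for sequentially arranged sub-videos.

    Returns the per-clip start offsets (seconds) and the total length,
    used to align each clip's subtitle and audio on the final timeline."""
    starts, t = [], 0.0
    for d in durations:
        starts.append(t)
        t += d
    return starts, t
```

The offsets double as subtitle timestamps: the i-th caption spans `starts[i]` to `starts[i] + durations[i]`.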
FIG. 2 is a schematic diagram of an AIGC-based travel scene video generation system according to the present invention. As shown in FIG. 2, an embodiment of the present invention further provides an AIGC-based travel scene video generation system, configured to implement the above AIGC-based travel scene video generation method, the system comprising:
The tag extraction module 51 collects relevant text on the web based on the travel scene name, and extracts at least tag combinations from the text, the tag combinations including at least two types of tag subgroups, each type of tag subgroup including at least one tag having a maximum heat under the type.
The video script module 52 generates a video script sequence based on tag combinations via a text generation model, the video script sequence comprising a number of order-based video sub-scripting languages, the video sub-scripting languages comprising a number of category tag groups.
The image generation module 53 matches the corresponding image based on the preset matching threshold through the image search model according to each video sub-script language, and establishes the mapping relation between the video sub-script language and the image meeting the matching.
The video generation module 54 generates sub-videos based on the images using AIGC video generation methods.
And
The scene video module 55 sequentially arranges and adds subtitle files and audio files converted from voice to generate travel scene videos based on the corresponding video sub-script language.
In a preferred embodiment, the image generation module 53 is configured to input the unmatched video subscripting language into the AI drawing model to generate an image and establish a mapping relationship when a part of the video subscripting language does not match the image satisfying the matching condition.
In a preferred embodiment, the image generation module 53 is configured to input a combination of the non-matched video sub-scripting language and the transition scripting language, which is an intersection of corresponding class label groups of the video sub-scripting language that have completed matching arranged before and after the video sub-scripting language that has not been matched to the image, into the AI drawing model to generate an image when the partial video sub-scripting language has not been matched to the image that satisfies the matching condition, and to establish a mapping relationship.
In a preferred embodiment, the transition scripting language further includes an intersection of an image style label and an average of image color temperature parameters of the video sub-scripting language that have completed matching before and after the video sub-scripting language that did not match the image.
In a preferred embodiment, the tag extraction module 51 is configured such that the tag combinations include at least a travel scene name category tag subgroup, a sub-attraction category tag subgroup, an optimal travel season category tag subgroup, a scene optimal scenic time period category tag subgroup.
In a preferred embodiment, the video script module 52 is configured such that the text generation model is a GPT model, the image search model is a CLIP model, and the AI painting model is a Midjourney model.
In a preferred embodiment, the scene video module 55 is configured to sequentially arrange the sub-videos based on the corresponding video sub-script language and add the subtitle file and the audio file converted from voice; to read in the video data of the sub-video corresponding to the video sub-script language, analyze the video frames, and unify the other videos to a standard frame according to the target frame, based on the image parameters of the target frame as a reference, wherein the target frame is the image corresponding to the video sub-script language; and to generate the travel scene video based on the time sequence of the sub-videos, the subtitle file, and the audio file.
The system for generating the video of the tourist scene based on AIGC can combine various technologies of AIGC and the traditional video processing method, and the modules for generating the contents are connected in series through the matching technology, so that a complete automatic video manufacturing process is formed.
The embodiment of the invention also provides an AIGC-based travel scene video generation device, comprising a processor and a memory in which executable instructions of the processor are stored, wherein the processor is configured to execute the steps of the AIGC-based travel scene video generation method via execution of the executable instructions.
As shown above, the AIGC-based travel scene video generating device according to the embodiment of the present invention can combine various technologies of AIGC and conventional video processing methods, and connect the modules of each content generation in series through the matching technology, so as to form a complete automatic video production process.
Those skilled in the art will appreciate that the various aspects of the invention may be implemented as a system, method, or program product. Accordingly, aspects of the invention may be embodied in the following forms, namely: an entirely hardware embodiment, an entirely software embodiment (including firmware, micro-code, etc.), or an embodiment combining hardware and software aspects, which may be referred to herein as a "circuit," "module," or "platform."
FIG. 3 is a schematic diagram of an AIGC-based travel scene video generation device of the present invention. An electronic device 600 according to this embodiment of the invention is described below with reference to FIG. 3. The electronic device 600 shown in FIG. 3 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 3, the electronic device 600 is embodied in the form of a general purpose computing device. Components of electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one memory unit 620, a bus 630 connecting the different platform components (including memory unit 620 and processing unit 610), a display unit 640, etc.
Wherein the storage unit stores program code that is executable by the processing unit 610 such that the processing unit 610 performs the steps according to various exemplary embodiments of the invention described in the above method section of the present specification. For example, the processing unit 610 may perform the steps as shown in fig. 1.
The storage unit 620 may include readable media in the form of volatile storage units, such as Random Access Memory (RAM) 6201 and/or cache memory unit 6202, and may further include Read Only Memory (ROM) 6203.
The storage unit 620 may also include a program/utility 6204 having a set (at least one) of program modules 6205, such program modules 6205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
Bus 630 may represent one or more of several types of bus structures, including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, or a local bus using any of a variety of bus architectures.
The electronic device 600 may also communicate with one or more external devices 700 (e.g., keyboard, pointing device, bluetooth device, etc.), one or more devices that enable a user to interact with the electronic device 600, and/or any device (e.g., router, modem, etc.) that enables the electronic device 600 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 650. Also, electronic device 600 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 660. The network adapter 660 may communicate with other modules of the electronic device 600 over the bus 630. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 600, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage platforms, and the like.
The embodiment of the invention also provides a computer readable storage medium for storing a program, and the steps of the AIGC-based travel scene video generation method are realized when the program is executed. In some possible embodiments, the aspects of the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps according to the various exemplary embodiments of the invention as described in the method portions of this specification, when the program product is run on the terminal device.
As shown above, the AIGC-based travel scene video generation system according to the embodiment of the present invention can combine various technologies of AIGC and conventional video processing methods, and connect the modules of each content generation in series through the matching technology, so as to form a complete automatic video production process.
Fig. 4 is a schematic structural view of a computer-readable storage medium of the present invention. Referring to fig. 4, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described, which may employ a portable compact disc read only memory (CD-ROM) and include program code, and may be run on a terminal device, such as a personal computer. However, the program product of the present invention is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium would include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random Access Memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The computer readable storage medium may include a data signal propagated in baseband or as part of a carrier wave, with readable program code embodied therein. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable storage medium may also be any readable medium that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device, partly on a remote computing device, or entirely on the remote computing device or server. In the case of remote computing devices, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., connected over the Internet using an Internet service provider).
In summary, the invention aims to provide an AIGC-based travel scene video generation method, system, equipment, and storage medium, which can combine various AIGC technologies with traditional video processing methods, connecting the content-generation modules in series through the matching technology to form a complete automatic video production process.
The foregoing is a further detailed description of the invention in connection with the preferred embodiments, and it is not intended that the invention be limited to the specific embodiments described. It will be apparent to those skilled in the art that several simple deductions or substitutions may be made without departing from the spirit of the invention, and these should be considered to be within the scope of the invention.
Claims (10)
1. The method for generating the video of the tourist scene based on AIGC is characterized by comprising the following steps of:
S110, collecting relevant texts on the basis of travel scene names on a network, and extracting at least tag combinations from the texts, wherein the tag combinations comprise at least two types of tag subgroups, and each type of tag subgroup comprises at least one tag with the highest heat under the type;
S120, generating a video script sequence based on the label combination through a text generation model, wherein the video script sequence comprises a plurality of video sub-script languages based on sequence, and the video sub-script languages comprise a plurality of category label groups;
S130, matching corresponding images based on a preset matching threshold through an image search model according to each video sub-script language, and establishing a mapping relation between the video sub-script language and the images meeting the matching;
S140, generating a sub-video by adopting a AIGC video generation method based on the image; and
And S150, sequentially arranging and adding subtitle files and audio files converted by voice to generate a travel scene video based on the corresponding video sub-script language.
2. The method for generating a video of a travel scene based on AIGC as set forth in claim 1, wherein the step S130 further includes:
And when part of the video sub-script language is not matched with the image meeting the matching condition, inputting the unmatched video sub-script language into an AI drawing model to generate an image, and establishing a mapping relation.
3. The method for generating a video of a travel scene based on AIGC as set forth in claim 1, wherein the step S130 further includes:
And when part of the video sub-scripting languages are not matched with the images meeting the matching conditions, inputting the combination of the non-matched video sub-scripting languages and transition scripting languages into an AI drawing model to generate images, and establishing a mapping relation, wherein the transition scripting languages are intersections of corresponding class label groups of the video sub-scripting languages which are arranged before and after the video sub-scripting languages which are not matched with the images and have completed matching.
4. The travel scenario video generation method based on AIGC of claim 3, wherein the transition scripting language further comprises an intersection of image style labels and an average of image color temperature parameters of two of the video sub-scripting languages that have completed matching arranged before and after the video sub-scripting language that did not match images.
5. The method of claim 1, wherein in step S110, the tag combinations include at least a travel scene name category tag group, a sub-scenic spot category tag group, an optimal travel season category tag group, and a scene optimal scenic time period category tag group.
6. The method for generating a video of a travel scene based on AIGC as set forth in claim 2, wherein in step S120, the text generation model is a GPT model, and in step S130, the image search model is a CLIP model, and the AI drawing model is a midjourney model.
7. The method for generating a video of a travel scene based on AIGC as set forth in claim 2, wherein the step S150 includes:
S151, sequentially arranging the sub-videos based on the corresponding video sub-script language and adding a subtitle file and an audio file converted by voice;
S152, reading video data of the sub-video corresponding to the video sub-script language, analyzing video frames, and unifying other videos to a standard frame based on image parameters of a target frame as a reference according to the target frame, wherein the target frame is an image corresponding to the video sub-script language;
And S153, generating the travel scene video based on the time sequence of the sub-video, the subtitle file and the audio file.
8. A AIGC-based travel scenario video generation system for implementing the AIGC-based travel scenario video generation method of claim 1, comprising:
The system comprises a label extraction module, a label extraction module and a label extraction module, wherein the label extraction module is used for collecting relevant texts on the basis of travel scene names in a network, and extracting at least label combinations from the texts, wherein the label combinations comprise at least two label subgroups, and each label subgroup comprises at least one label with the maximum heat under the category;
The video script module is used for generating a video script sequence based on the label combination through a text generation model, wherein the video script sequence comprises a plurality of video sub-script languages based on sequence, and the video sub-script languages comprise a plurality of category label groups;
the image generation module is used for matching corresponding images based on a preset matching threshold through an image search model according to each video sub-script language, and establishing a mapping relation between the video sub-script language and the images meeting the matching;
the video generation module is used for generating a sub-video by adopting a AIGC video generation method based on the image; and
And the scene video module is used for sequentially arranging and adding subtitle files and audio files converted by voice to generate travel scene videos based on the corresponding video sub-script language.
9. A AIGC-based travel scene video generation device, comprising:
A processor;
A memory having stored therein executable instructions of the processor;
wherein the processor is configured to perform the steps of the AIGC-based travel scenario video generation method of any one of claims 1 to 7 via execution of the executable instructions.
10. A computer readable storage medium storing a program, wherein the program when executed by a processor implements the steps of the AIGC-based travel scene video generation method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410370258.4A CN118474476A (en) | 2024-03-28 | 2024-03-28 | AIGC-based travel scene video generation method, system, equipment and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202410370258.4A CN118474476A (en) | 2024-03-28 | 2024-03-28 | AIGC-based travel scene video generation method, system, equipment and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
CN118474476A true CN118474476A (en) | 2024-08-09 |
Family
ID=92152004
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202410370258.4A Pending CN118474476A (en) | 2024-03-28 | 2024-03-28 | AIGC-based travel scene video generation method, system, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN118474476A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN119383426A (en) * | 2024-12-30 | 2025-01-28 | 深圳麦风科技有限公司 | Video synthesis method, device, equipment and storage medium |
CN119496964A (en) * | 2025-01-16 | 2025-02-21 | 合肥白泽前沿科技有限公司 | A relatively controllable video generation system based on AIGC large model |
CN119496964B (en) * | 2025-01-16 | 2025-04-01 | 合肥白泽前沿科技有限公司 | A relatively controllable video generation system based on AIGC large model |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US20240107127A1 (en) | Video display method and apparatus, video processing method, apparatus, and system, device, and medium | |
CN113157965B (en) | Audio visual model training and audio visual method, device and equipment | |
CN114390217B (en) | Video synthesis method, device, computer equipment and storage medium | |
CN108509465A (en) | A kind of the recommendation method, apparatus and server of video data | |
CN118474476A (en) | AIGC-based travel scene video generation method, system, equipment and storage medium | |
WO2024235271A1 (en) | Movement generation method and apparatus for virtual character, and construction method and apparatus for movement library of virtual avatar | |
CN112287168A (en) | Method and apparatus for generating video | |
CN113259763B (en) | Teaching video processing method and device and electronic equipment | |
WO2022089427A1 (en) | Video generation method and apparatus, and electronic device and computer-readable medium | |
WO2019245033A1 (en) | Moving image editing server and program | |
CN108563622A (en) | A kind of poem of four lines generation method and device with style varied | |
JP7113000B2 (en) | Method and apparatus for generating images | |
CN118381971B (en) | Video generation method, device, storage medium, and program product | |
JP2020005309A (en) | Moving image editing server and program | |
CN118468224A (en) | A multimodal sarcasm detection method based on visual instruction fine-tuning and demonstration learning enhancement | |
CN117880443A (en) | Script-based multi-mode feature matching video editing method and system | |
CN117009581A (en) | Video generation method and device | |
CN118138854A (en) | Video generation method, device, computer equipment and medium | |
CN117478975A (en) | Video generation method, device, computer equipment and storage medium | |
CN113411674A (en) | Video playing control method and device, electronic equipment and storage medium | |
CN112565875B (en) | Method, device, equipment and computer readable storage medium for automatically generating video | |
CN111523069B (en) | Method and system for realizing electronic book playing 3D effect based on 3D engine | |
CN118695044A (en) | Method, device, computer equipment, readable storage medium and program product for generating promotional video | |
CN118338072A (en) | Video editing method, device, equipment, medium and product based on large model | |
CN118262003A (en) | Storyboard generation method based on decoupling and re-integration control |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||