CN116320607A - Intelligent video generation method, device, equipment and medium - Google Patents

Intelligent video generation method, device, equipment and medium

Info

Publication number
CN116320607A
CN116320607A (application CN202310267966.0A)
Authority
CN
China
Prior art keywords
video
text
preset
clause
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310267966.0A
Other languages
Chinese (zh)
Inventor
潘芸倩
奚悦
叶静娴
陈又新
吴伟杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310267966.0A priority Critical patent/CN116320607A/en
Publication of CN116320607A publication Critical patent/CN116320607A/en
Pending legal-status Critical Current


Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs
    • H04N21/44016Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs involving splicing one content stream with another content stream, e.g. for substituting a video clip
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439Processing of audio elementary streams
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/488Data services, e.g. news ticker
    • H04N21/4884Data services, e.g. news ticker for displaying subtitles
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8106Monomedia components thereof involving special audio data, e.g. different tracks for different languages
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/81Monomedia components thereof
    • H04N21/8146Monomedia components thereof involving graphical data, e.g. 3D object, 2D graphics
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Graphics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the technical field of artificial intelligence and provides an intelligent video generation method, device, equipment and medium. The method comprises: obtaining text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result; retrieving, according to keywords contained in the text of the clause result, a plurality of pictures associated with the keywords from a preset material library as a picture set to generate an initial video; converting the text of the clause result into the voice of the initial video through a preset voice generation model, and using the text of the clause result as the subtitle of the initial video; and generating a target video from the initial video, the voice and the subtitle and feeding it back to the user. The invention also relates to the technical field of blockchain: the picture set, the voice and the subtitle can be stored in a blockchain node.

Description

Intelligent video generation method, device, equipment and medium
Technical Field
The present invention relates to the field of artificial intelligence technologies, and in particular, to an intelligent video generating method, apparatus, device, and medium.
Background
With the rapid development of the internet, rapid video editing and production has become an important topic. Typically, a user produces videos on a PC with professional tools such as Adobe AE/PR, or with a video-editing APP on a mobile device. Although these tools can meet users' needs, video production still has a professional threshold and requires rich production experience, such as editing, dubbing, subtitling, material searching and creative conception, before a high-quality video can be produced.
If an ordinary user wants to make a video, a great deal of time is spent on learning and searching, which leads to high labor costs for video production.
Disclosure of Invention
In view of the above, the present invention provides an intelligent video generation method, device, apparatus and medium, which aims to solve the technical problem of high labor cost in video production in the prior art.
In order to achieve the above object, the present invention provides an intelligent video generation method, which includes:
acquiring text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result;
according to keywords contained in the text of the sentence dividing result, searching a plurality of pictures associated with the keywords from a preset material library as a picture set to generate an initial video;
converting the text of the clause result into the voice of the initial video through a preset voice generation model, and taking the text of the clause result as a subtitle of the initial video;
and generating a target video according to the initial video, the voice and the caption and feeding back to the user.
Preferably, the inputting the text information into a preset semantic clause model, outputting a clause result, includes:
dividing the text information according to punctuation marks contained in the text information to obtain one or more text fragments;
and cutting each text segment to obtain sentence results of each text segment.
Preferably, the retrieving, from a preset material library, a plurality of pictures associated with the keywords as a picture set to generate an initial video includes:
taking the keywords as search sentences, and acquiring description sentences of each picture in the material library;
and calculating the semantic similarity between the search statement and the description statement, and generating an initial video by taking a plurality of associated pictures with the semantic similarity larger than a preset value as a picture set.
Preferably, the acquiring the description sentence of each picture in the material library includes:
inputting any picture into a preset visual text generation model, identifying the category of each object seen in the picture, searching a label matched with the category from a preset label library, and generating label characteristics of the objects according to the label;
analyzing coordinate information of the position of the object in the picture to determine a region image of the object, and generating region features of the picture according to the region image;
extracting texts contained in the pictures to obtain text characteristics of the pictures;
and generating a description sentence of the picture according to the region characteristics, the label characteristics and the text characteristics.
Preferably, the converting, by a preset speech generation model, the text of the clause result into the speech of the initial video includes:
the word vector of each word of the sentence result is used as the network input of the encoder of the speech generation model, and the phoneme information, the word segmentation speed information and the word segmentation energy information of the sentence result are extracted;
and generating a Mel frequency spectrum by the phoneme information, and inputting the Mel frequency spectrum, the word segmentation speed information and the word segmentation energy information into a decoder of the voice generation model for decoding and synthesizing to obtain the voice of the initial video.
Preferably, before said generating the target video from said initial video, said voice, and said subtitle, the method further comprises:
and editing the playing time length of the initial video and the switching speed of the picture according to the text length of the clause result and the length of the voice.
Preferably, after said generating the target video from said initial video, said voice, and said subtitle, the method further comprises:
inputting the clause result into a preset acoustic model to obtain acoustic characteristics of the clause result;
according to the visual characteristics and the audio characteristics of the target video, calculating a matching value between the acoustic characteristics of the clause result and the acoustic characteristics of each piece of music in a preset music library;
and selecting one piece of music with the largest matching value from the calculation result as the background music of the target video.
In order to achieve the above object, the present invention further provides an intelligent video generating apparatus, the apparatus comprising:
sentence module: the method comprises the steps of obtaining text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result;
and a retrieval module: the method comprises the steps of searching a plurality of pictures associated with keywords from a preset material library as a picture set to generate an initial video according to keywords contained in texts of sentence results;
and a conversion module: the method comprises the steps of converting texts of clause results into voices of an initial video through a preset voice generation model, and taking the texts of the clause results as subtitles of the initial video;
and a synthesis module: and generating a target video according to the initial video, the voice and the subtitle and feeding back the target video to the user.
To achieve the above object, the present invention also provides an electronic device including:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a program executable by the at least one processor to enable the at least one processor to perform the intelligent video generation method of any one of claims 1 to 7.
To achieve the above object, the present invention also provides a computer readable medium storing an intelligent video generation program which, when executed by a processor, implements the steps of the intelligent video generation method as claimed in any one of claims 1 to 7.
With the method and device, the user only needs to input text information. After the text information is split into clauses, a plurality of pictures associated with the keywords of the clause result are retrieved from the preset material library as a picture set to generate the initial video, which effectively reduces the time the user spends searching for materials and improves working efficiency.
The text of the clause result is converted into the voice of the initial video through a preset voice generation model, and the text of the clause result is used as the subtitle of the initial video; the initial video, the voice and the subtitle are then synthesized into the target video. The pictures, voice, subtitles and background music of the video are thus generated fully automatically, and the switching of video frames is adjusted automatically according to the theme, scene and content of the text information. The target video is therefore generated quickly and simply: the user does not participate in the synthesis process and does not need rich production experience, which lowers the threshold and labor cost of producing high-quality video and improves production efficiency.
Drawings
FIG. 1 is a flow chart diagram of a preferred embodiment of the intelligent video generation method of the present invention;
FIG. 2 is a schematic block diagram of an intelligent video generating apparatus according to a preferred embodiment of the present invention;
FIG. 3 is a schematic diagram of an electronic device according to a preferred embodiment of the present invention;
the achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The embodiment of the invention can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (Artificial Intelligence, AI) is the theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The invention provides an intelligent video generation method. Referring to fig. 1, a method flow diagram of an embodiment of an intelligent video generating method according to the present invention is shown. The method may be performed by an electronic device, which may be implemented in software and/or hardware. The intelligent video generation method comprises the following steps S10-S40:
step S10: and acquiring text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result.
In this embodiment, the text information includes information such as the video theme, the video scene and the video script. The preset semantic clause model is the SnowNLP model, a Python-based toolkit that can process Chinese text (Chinese word segmentation, part-of-speech tagging, sentiment analysis, text classification, pinyin conversion, sentence splitting and the like). Text information input by the user is received and processed with the Chinese word segmentation, part-of-speech tagging and sentence-splitting functions of the preset semantic clause model to obtain the clause result of the text information.
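As a minimal sketch of this step, and assuming the SnowNLP library is installed (pip install snownlp), sentence splitting, word segmentation and part-of-speech tagging can be obtained roughly as follows; the sample text is illustrative only:

```python
# Minimal sketch of the clause step with SnowNLP; the sample text is illustrative.
from snownlp import SnowNLP

text_information = "保险是指投保人根据合同约定向保险人支付保险费。保险人对合同约定的事故造成的财产损失承担赔偿责任。"

s = SnowNLP(text_information)
clause_result = s.sentences      # sentence splitting
words = s.words                  # Chinese word segmentation
pos_tags = list(s.tags)          # part-of-speech tagging as (word, tag) pairs

for clause in clause_result:
    print(clause)
```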
In one embodiment, the inputting the text information into a preset semantic clause model, outputting a clause result, includes:
dividing the text information according to punctuation marks contained in the text information to obtain one or more text fragments;
and cutting each text segment to obtain sentence results of each text segment.
Punctuation marks are the commas, periods, exclamation marks and semicolons contained in the text information. A text fragment is a stretch of text delimited by the pauses that arise from turns, emphasis or breaks while expressing the ideas of the text. A clause result is a long or short sentence, composed of the words and phrases of the text information, that expresses a complete meaning.
For example, text segment a is: the insurance refers to that an insurance applicant pays insurance fees to an insurance person according to contract agreements, property loss of the insurance person caused by possible accidents of the contract agreements bears compensation insurance policy liabilities, or business insurance actions of paying insurance policy liabilities when the insured person dies, disabilities, diseases or reaches the conditions of age, period and the like of the contract agreements are born, sentence results obtained after a preset semantic sentence model is divided into sentences are the sentence A1 and insurance, and the insurance applicant pays insurance fees to the insurance person according to the contract agreements. Sentence A2, insurer undertakes the liability of reimbursement for the property loss due to the occurrence of a potentially occurring accident of contractual agreement, or undertakes the business insurance act of paying the liability for the insured death, disability, illness or attainment of the contractual age, period, etc.
In one embodiment, the segmenting each text segment to obtain sentence results of each text segment includes:
obtaining vectors of all text fragments and extracting features to obtain phrase labels;
and cutting each text segment according to the phrase label to obtain sentence results of each text segment.
The vector of a text fragment is built from the unique index that the dictionary of the preset semantic clause model assigns to each word of the fragment (for example, for the fragment "means that the applicant, as agreed in the contract", the vector over the 11 dictionary entries is 1,1,1,1,1,0,0,0,1,1,1). The phrase labels are the labels that the preset semantic clause model has learned in advance from a large number of samples (text fragments). By looking up the phrase labels of a text fragment in the preset semantic clause model, the fragment can be split into clauses accurately and quickly, which improves text-processing efficiency.
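One way to read the presence-vector example above is sketched below; the dictionary and text fragment are hypothetical stand-ins, not the model's actual vocabulary:

```python
# Illustrative sketch of the binary presence vector over dictionary entries.
dictionary = ["是指", "投保人", "根据", "合同", "约定", "赔偿", "责任", "疾病", "向", "保险人", "支付"]
segment_words = ["是指", "投保人", "根据", "合同", "约定", "向", "保险人", "支付"]

# 1 if the dictionary entry occurs in the text fragment, 0 otherwise.
segment_vector = [1 if entry in segment_words else 0 for entry in dictionary]
print(segment_vector)  # [1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1]
```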
Step S20: and according to the keywords contained in the text of the sentence result, searching out a plurality of pictures associated with the keywords from a preset material library to serve as a picture set to generate an initial video.
In this embodiment, the preset material library is a picture library built in advance by an enterprise according to its business development. The material library contains pictures as well as video clips formed from series of continuously changing pictures; a typical video clip is a picture set of multiple pictures that expresses a complete meaning or event. The preset material library effectively reduces the time the user spends searching for materials and improves working efficiency.
A keyword is input to initiate picture retrieval against the preset material library and obtain pictures related to the keyword. Before each picture is stored in the material library, its description sentences are generated in advance and stored alongside it; that is, every picture in the material library has corresponding description sentences, and each picture may correspond to one or more of them.
Because the keywords can be matched quickly against the description sentences, the pictures associated with the input keywords can be narrowed down without building a search index table for the material library, which reduces that workload and also lowers the later maintenance cost of the library (an index table occupies physical space and must be maintained dynamically whenever its data are added, deleted or modified, which slows down data maintenance).
In one embodiment, the retrieving, from a preset material library, a plurality of pictures associated with the keywords as a picture set to generate an initial video includes:
taking the keywords as search sentences, and acquiring description sentences of each picture in the material library;
and calculating the semantic similarity between the search statement and the description statement, and generating an initial video by taking a plurality of associated pictures with the semantic similarity larger than a preset value as a picture set.
Each picture in the material library corresponds to N description sentences, where N is a positive integer greater than or equal to 1. Using the preset visual text generation model, the N semantic similarities between the search sentence and the N description sentences of each picture are calculated, and the associated pictures whose semantic similarity exceeds a preset value (for example, a preset value of 1) are taken as the picture set corresponding to the clause result, that is, the frames of that clause within its playing period of the initial video. The preset visual text generation model is a Vision-Language model composed of a convolutional neural network, an encoding structure and a decoding structure.
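The retrieval step can be sketched as follows. The embed function is a character-hashing stand-in so the example runs end to end; in the patent it would be the text encoder of the visual-language model, and the preset value of 0.5 is an assumed threshold:

```python
import numpy as np

def embed(sentence: str) -> np.ndarray:
    # Stand-in text embedding (character hashing) so the sketch is runnable;
    # the visual-language model's text encoder would be used here instead.
    vec = np.zeros(256)
    for ch in sentence:
        vec[hash(ch) % 256] += 1.0
    return vec

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def retrieve_picture_set(keywords, material_library, preset_value=0.5):
    """material_library: list of (picture_path, [description sentences]) pairs.
    Keeps pictures whose best description similarity exceeds the preset value."""
    query = embed(" ".join(keywords))
    hits = []
    for path, descriptions in material_library:
        best = max(cosine(query, embed(d)) for d in descriptions)
        if best > preset_value:
            hits.append((best, path))
    # Highest-similarity pictures first; these become the frames of the initial video.
    return [path for _, path in sorted(hits, reverse=True)]
```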
In other embodiments, after obtaining the picture set of the initial video, the method further comprises:
and the user selects or replaces each picture of the picture set of the initial video according to own preference or demand.
If the user has no special requirement, the method and the device can realize the whole-course automatic generation of the video without manual participation of the user; if the user has own preference or demand, the user can participate in selecting or replacing each picture after obtaining the picture set so as to meet the service demand of the user.
In one embodiment, the acquiring the description sentence of each picture in the material library includes:
inputting any picture into a preset visual text generation model, identifying the category of each object seen in the picture, searching a label matched with the category from a preset label library, and generating label characteristics of the objects according to the label;
analyzing coordinate information of the position of the object in the picture to determine a region image of the object, and generating region features of the picture according to the region image;
extracting texts contained in the pictures to obtain text characteristics of the pictures;
and generating a description sentence of the picture according to the region characteristics, the label characteristics and the text characteristics.
Any picture is read and passed in turn through the convolutional neural network, the encoding structure and the decoding structure of the visual text generation model, and the category of each object in the picture is identified (for example, categories such as people, animals and buildings, each containing subcategories such as children, teenagers and middle-aged people). The label of the object is then looked up in a preset label library (for example, the label library defines scene labels for each category, such as labels for application scenes like teenagers playing basketball or swimming), and the label feature (Token) of the object is generated from that label. The label features define the objects of each picture precisely and help quickly find the picture scene required by the clause result.
The coordinates of the upper-left corner and the lower-right corner of the object in the picture are analyzed; based on these two coordinates, the rectangular area where the object is located, that is, the region image of the object, can be determined. The region feature (Patch) of the picture is then obtained from the region image of the object rather than from the whole picture, so the region feature can represent the object. Through the region features of the picture, the positional and action relationships of the objects in the picture are obtained, and the type of event expressed in the picture's scene is identified accurately.
The picture is processed by an optical character recognition module (Optical Character Recognition, OCR) of the visual text generation model, text contained in the picture is extracted from the picture, and text characteristics of the picture are obtained.
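A sketch of the region-image and OCR steps, assuming Pillow and pytesseract are installed; the picture path and bounding-box coordinates are illustrative, and in the patent the coordinates would come from the object detector of the visual text generation model:

```python
from PIL import Image
import pytesseract

picture = Image.open("material_library/example.jpg")  # illustrative path

# Upper-left and lower-right coordinates of one detected object (illustrative values).
x1, y1, x2, y2 = 40, 60, 320, 400
region_image = picture.crop((x1, y1, x2, y2))  # rectangular area containing the object

# OCR step: extract any text printed on the picture as its text feature.
text_feature = pytesseract.image_to_string(picture, lang="chi_sim")
print(text_feature.strip())
```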
The region features, the label features and the text features are fused, and the fused features are input into the visual-language model to generate the description sentence of the picture. This fusion of region, label and text features is carried out for every picture.
In one embodiment, before retrieving pictures according to the keywords contained in the text of the clause result, the method further comprises:
performing word-segmentation preprocessing on the clause result to obtain a plurality of segmented words, and converting each segmented word into a word vector;
dividing all the word vectors of each clause result into a preset number of clusters according to a preset clustering algorithm;
adding together all the word vectors contained in any cluster to form the cluster vector of that cluster, and inputting the cluster vector into a preset keyword extraction model to obtain the keywords of the clause result.
The preset clustering algorithm is a k-means clustering algorithm. For example, if the preset number of clusters is 3, all word vectors of each clause result are divided into 3 clusters according to the preset clustering algorithm, and each cluster contains several word vectors; if the first cluster includes word vector 1 (a1, a2, a3) and word vector 2 (b1, b2, b3), then the cluster vector of that cluster is (a1+b1, a2+b2, a3+b3). The preset keyword extraction model is generated by modeling each cluster based on a deep neural network model (e.g., CNN, RNN).
For any cluster, the pre-stored preset keyword whose word vector is most similar to the cluster vector is taken as the keyword of that cluster, which realizes keyword extraction for the clause result. Because the semantic dependencies within the clause result are fully considered, keyword extraction is not limited to texts of a specific field and has stronger generality, solving the problem that traditional methods usually target only domain-specific texts. Combining a clustering algorithm with a deep neural network also overcomes the weaknesses of clustering word vectors alone and taking the geometric center of a cluster as the keyword, improving the accuracy and objectivity of keyword extraction.
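A sketch of the cluster-based keyword extraction, assuming scikit-learn and pre-computed word vectors; the cluster count of 3 follows the example above, and the preset_keywords mapping is a hypothetical stand-in for the pre-stored keywords of the extraction model:

```python
import numpy as np
from sklearn.cluster import KMeans

def extract_keywords(word_vectors, preset_keywords, n_clusters=3):
    """word_vectors: array of shape (n_words, dim), one row per segmented word.
    preset_keywords: dict mapping a pre-stored keyword to its word vector."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(word_vectors)
    keywords = []
    for cluster_id in range(n_clusters):
        # Cluster vector = sum of all word vectors assigned to this cluster.
        cluster_vec = word_vectors[labels == cluster_id].sum(axis=0)
        # Keyword = pre-stored keyword whose vector is most similar to the cluster vector.
        best = max(
            preset_keywords,
            key=lambda k: float(np.dot(cluster_vec, preset_keywords[k]))
            / (np.linalg.norm(cluster_vec) * np.linalg.norm(preset_keywords[k]) + 1e-8),
        )
        keywords.append(best)
    return keywords
```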
Step S30: converting the text of the clause result into the voice of the initial video through a preset voice generation model, and taking the text of the clause result as the subtitle of the initial video.
In this embodiment, inputting the clause result into the preset speech generation model directly outputs a mel spectrogram; a waveform is then generated with the Griffin-Lim algorithm, synthesized by the WaveNet vocoder, and output as the voice of the corresponding initial video. The text of the clause result (every word) is used directly as the subtitle of the initial video.
In one embodiment, the converting the text of the clause result into the voice of the initial video through a preset voice generation model includes:
the word vector of each word of the sentence result is used as the network input of the encoder of the speech generation model, and the phoneme information, the word segmentation speed information and the word segmentation energy information of the sentence result are extracted;
and generating a Mel frequency spectrum by the phoneme information, and inputting the Mel frequency spectrum, the word segmentation speed information and the word segmentation energy information into a decoder of the voice generation model for decoding and synthesizing to obtain the voice of the initial video.
The preset speech generation model combines the Tacotron2 text-to-speech model with a WaveNet vocoder. Tacotron2 is an end-to-end TTS neural network: given only text or sentences, it directly outputs a mel spectrogram. The WaveNet vocoder is a deep neural network that can generate natural human speech; trained on real voice recordings, it models waveforms directly, so the text of the clause result can be turned into fairly realistic, human-like speech. Word-segmentation preprocessing is performed on the clause result to obtain segmented words, and each segmented word is converted into a word vector, that is, a vector mapping a word or phrase from the vocabulary to real numbers. Phoneme information is the unit information that carries the semantics of the clause result; the word-segmentation speed information is the rate at which the spoken symbols expressing the meaning of each segmented word are delivered per unit time; the word-segmentation energy information is the speech tone required for each segmented word according to its part of speech (for example, nouns, verbs and adjectives call for different tone levels).
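Only the spectrogram-to-waveform stage is sketched below, assuming librosa and soundfile are installed. The patent pairs Tacotron2 with a WaveNet vocoder; here the Griffin-Lim inversion mentioned above stands in for the vocoder, and the random mel spectrogram is a placeholder for the acoustic model's output:

```python
import numpy as np
import librosa
import soundfile as sf

sr = 22050
mel_spectrogram = np.abs(np.random.randn(80, 400))  # placeholder for the Tacotron2 output

# Invert the mel spectrogram to a waveform via Griffin-Lim phase reconstruction.
waveform = librosa.feature.inverse.mel_to_audio(mel_spectrogram, sr=sr, n_iter=60)

sf.write("clause_voice.wav", waveform, sr)  # the voice of the initial video
```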
Step S40: and generating a target video according to the initial video, the voice and the caption and feeding back to the user.
In this method, each frame of the target video is generated from the picture set of the initial video in the order of the clause results; the voice and the subtitle are synthesized and matched with the corresponding frames to obtain a candidate video; background music is then matched from the preset music library according to the visual features of the candidate video and the audio features of the voice, and the resulting target video is sent to the user.
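A minimal synthesis sketch assuming moviepy 1.x is installed (TextClip additionally requires ImageMagick); the picture paths, narration file and subtitle text are illustrative placeholders, not the patent's actual assets:

```python
from moviepy.editor import (AudioFileClip, CompositeVideoClip, ImageClip,
                            TextClip, concatenate_videoclips)

picture_set = ["pic1.jpg", "pic2.jpg", "pic3.jpg"]           # retrieved picture set
voice = AudioFileClip("clause_voice.wav")                     # synthesized narration
subtitle_text = "保险是指投保人根据合同约定向保险人支付保险费"      # clause text as subtitle

# Each picture is shown for an equal share of the narration (see the editing step below).
per_picture = voice.duration / len(picture_set)
initial_video = concatenate_videoclips(
    [ImageClip(p).set_duration(per_picture) for p in picture_set])

subtitle = (TextClip(subtitle_text, fontsize=36, color="white")
            .set_duration(voice.duration)
            .set_position(("center", "bottom")))

target_video = CompositeVideoClip([initial_video, subtitle]).set_audio(voice)
target_video.write_videofile("target_video.mp4", fps=24)
```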
In one embodiment, before said generating the target video from said initial video, said speech, said subtitles, the method further comprises:
and editing the playing time length of the initial video and the switching speed of the picture according to the text length of the clause result and the length of the voice.
Editing means adjusting the playback speed, by slowing down or speeding up, according to the text length and the voice length of the clause result, so that the pictures play back more smoothly and the user's viewing experience is satisfied.
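The editing rule can be sketched as simple arithmetic; the characters-per-second reading speed below is an illustrative assumption, not a figure from the patent:

```python
def edit_playback(text_length_chars, voice_duration_s, n_pictures, chars_per_second=4.0):
    """Returns the playing time of the segment and the per-picture switching interval."""
    reading_time = text_length_chars / chars_per_second      # time needed to read the subtitle
    playing_time = max(voice_duration_s, reading_time)       # never cut the narration short
    switching_interval = playing_time / max(n_pictures, 1)   # picture switching speed
    return playing_time, switching_interval

playing_time, interval = edit_playback(text_length_chars=42, voice_duration_s=9.5, n_pictures=4)
```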
In one embodiment, after said generating the target video from said initial video, said speech, said subtitles, the method further comprises:
inputting the clause result into a preset acoustic model to obtain acoustic characteristics of the clause result;
according to the visual characteristics and the audio characteristics of the target video, calculating a matching value between the acoustic characteristics of the clause result and the acoustic characteristics of each piece of music in a preset music library;
and selecting one piece of music with the largest matching value from the calculation result as the background music of the target video.
The preset acoustic model consists of a TTS speech synthesis module and a VGGish audio module. The visual features of the target video are features representing the semantic information of the target video as a whole; the audio features of the voice are characteristics such as pitch, intonation, energy and rhythm changes in the speech.
The visual features of the target video and the audio features of the voice are obtained as the initial matching parameters of the audio module; an average-weighted matching value is calculated between the acoustic features of the clause result and the acoustic features of each piece of music in the preset music library, and the piece with the largest matching value is selected as the background music of the target video. Using the visual features of the target video and the audio features of the voice as matching parameters improves the accuracy of background-music matching, so that the playback rhythm and frame switching of the target video fit the background music closely. High-quality video can thus be generated automatically to meet an enterprise's business needs, while the production process stays simple and the user's productivity is effectively improved.
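A sketch of the matching step, assuming every feature has been projected to a common dimension; the equal averaging of the three feature vectors is an assumed stand-in for the average weighting described above, and music_library holds pre-computed acoustic features per track:

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def select_background_music(clause_acoustic, visual_features, audio_features, music_library):
    """music_library: dict mapping track name -> acoustic feature vector."""
    # Visual and audio features of the target video initialise the match (equal weights assumed).
    query = np.mean([clause_acoustic, visual_features, audio_features], axis=0)
    scores = {name: cosine(query, feat) for name, feat in music_library.items()}
    # The track with the largest matching value becomes the background music.
    return max(scores, key=scores.get)
```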
Referring to fig. 2, a functional block diagram of an intelligent video generating apparatus 100 according to the present invention is shown.
The intelligent video generating apparatus 100 of the present invention may be installed in an electronic device. Depending on the implemented functions, the intelligent video generating apparatus 100 may include a clause module 110, a retrieval module 120, a conversion module 130, and a synthesis module 140. The modules of the present invention may also be referred to as units, meaning a series of computer program segments that are stored in the memory of the electronic device, can be executed by the processor of the electronic device, and perform fixed functions.
The functions of the respective modules/units are as follows in this embodiment:
clause module 110: the method comprises the steps of obtaining text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result;
the retrieval module 120: the method comprises the steps of searching a plurality of pictures associated with keywords from a preset material library as a picture set to generate an initial video according to keywords contained in texts of sentence results;
conversion module 130: the method comprises the steps of converting texts of clause results into voices of an initial video through a preset voice generation model, and taking the texts of the clause results as subtitles of the initial video;
the synthesis module 140: and generating a target video according to the initial video, the voice and the subtitle and feeding back the target video to the user.
In one embodiment, the inputting the text information into a preset semantic clause model, outputting a clause result, includes:
dividing the text information according to punctuation marks contained in the text information to obtain one or more text fragments;
and cutting each text segment to obtain sentence results of each text segment.
In one embodiment, the retrieving, from a preset material library, a plurality of pictures associated with the keywords as a picture set to generate an initial video includes:
taking the keywords as search sentences, and acquiring description sentences of each picture in the material library;
and calculating the semantic similarity between the search statement and the description statement, and generating an initial video by taking a plurality of associated pictures with the semantic similarity larger than a preset value as a picture set.
In one embodiment, the acquiring the description sentence of each picture in the material library includes:
inputting any picture into a preset visual text generation model, identifying the category of each object seen in the picture, searching a label matched with the category from a preset label library, and generating label characteristics of the objects according to the label;
analyzing coordinate information of the position of the object in the picture to determine a region image of the object, and generating region features of the picture according to the region image;
extracting texts contained in the pictures to obtain text characteristics of the pictures;
and generating a description sentence of the picture according to the region characteristics, the label characteristics and the text characteristics.
In one embodiment, the converting the text of the clause result into the voice of the initial video through a preset voice generation model includes:
the word vector of each word of the sentence result is used as the network input of the encoder of the speech generation model, and the phoneme information, the word segmentation speed information and the word segmentation energy information of the sentence result are extracted;
and generating a Mel frequency spectrum by the phoneme information, and inputting the Mel frequency spectrum, the word segmentation speed information and the word segmentation energy information into a decoder of the voice generation model for decoding and synthesizing to obtain the voice of the initial video.
In one embodiment, before said generating the target video from said initial video, said speech, said subtitles, the method further comprises:
and editing the playing time length of the initial video and the switching speed of the picture according to the text length of the clause result and the length of the voice.
In one embodiment, after said generating the target video from said initial video, said speech, said subtitles, the method further comprises:
inputting the clause result into a preset acoustic model to obtain acoustic characteristics of the clause result;
according to the visual characteristics and the audio characteristics of the target video, calculating a matching value between the acoustic characteristics of the clause result and the acoustic characteristics of each piece of music in a preset music library;
and selecting one piece of music with the largest matching value from the calculation result as the background music of the target video.
Referring to fig. 3, a schematic diagram of a preferred embodiment of an electronic device 1 according to the present invention is shown.
The electronic device 1 includes, but is not limited to: a memory 11, a processor 12, a display 13, and a network interface 14. The electronic device 1 is connected to a network through the network interface 14 to obtain the original data. The network may be a wireless or wired network such as an intranet (Intranet), the Internet, the Global System of Mobile communication (GSM), Wideband Code Division Multiple Access (WCDMA), a 4G network, a 5G network, Bluetooth, Wi-Fi, or a telephone network.
The memory 11 includes at least one type of readable medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 11 may be an internal storage unit of the electronic device 1, such as a hard disk or memory of the electronic device 1. In other embodiments, the memory 11 may also be an external storage device of the electronic device 1, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, or a flash card (Flash Card) with which the electronic device 1 is equipped. Of course, the memory 11 may also comprise both an internal storage unit and an external storage device of the electronic device 1. In this embodiment, the memory 11 is typically used to store the operating system and various application software installed on the electronic device 1, such as the program code of the intelligent video generation program 10. Further, the memory 11 may be used to temporarily store various types of data that have been output or are to be output.
The processor 12 may in some embodiments be a central processing unit (Central Processing Unit, CPU), controller, microcontroller, microprocessor, or other data processing chip. The processor 12 is typically used to control the overall operation of the electronic device 1, for example performing control and processing related to data interaction or communication. In this embodiment, the processor 12 is configured to run the program code stored in the memory 11 or process data, for example to run the program code of the intelligent video generation program 10.
The display 13 may be referred to as a display screen or a display unit. In some embodiments the display 13 may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (Organic Light-Emitting Diode, OLED) touch device, or the like. The display 13 is used to display information processed in the electronic device 1 and to display a visual work interface, for example displaying the results of data statistics.
The network interface 14 may alternatively comprise a standard wired interface, a wireless interface, such as a WI-FI interface, which network interface 14 is typically used for establishing a communication connection between the electronic device 1 and other electronic devices.
Fig. 3 shows only the electronic device 1 with components 11-14 and the intelligent video generation program 10, but it should be understood that not all of the illustrated components are required; more or fewer components may be implemented instead.
Optionally, the electronic device 1 may further comprise a user interface, which may include a display (Display), an input unit such as a keyboard (Keyboard), and a standard wired interface or a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (Organic Light-Emitting Diode, OLED) touch device, or the like. The display may also be referred to as a display screen or display unit, used to display information processed in the electronic device 1 and to show a visual user interface.
The electronic device 1 may also include radio frequency (Radio Frequency, RF) circuits, sensors, audio circuits and the like, which are not described in detail here.
In the above embodiment, the processor 12, when executing the intelligent video generation program 10 stored in the memory 11, may implement the following steps:
acquiring text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result;
according to keywords contained in the text of the sentence dividing result, searching a plurality of pictures associated with the keywords from a preset material library as a picture set to generate an initial video;
converting the text of the clause result into the voice of the initial video through a preset voice generation model, and taking the text of the clause result as a subtitle of the initial video;
and generating a target video according to the initial video, the voice and the caption and feeding back to the user.
The storage device may be the memory 11 of the electronic device 1, or may be another storage device communicatively connected to the electronic device 1.
For a detailed description of the above steps, please refer to the functional block diagram of the embodiment of the intelligent video generating apparatus 100 described above with reference to fig. 2 and the flowchart of the embodiment of the intelligent video generating method described above with reference to fig. 1.
Furthermore, an embodiment of the invention also provides a computer readable medium, which may be non-volatile or volatile. The computer readable medium may be any one or any combination of a hard disk, a multimedia card, an SD card, a flash memory card, an SMC, a read-only memory (ROM), an erasable programmable read-only memory (EPROM), a portable compact disc read-only memory (CD-ROM), a USB memory, and the like. The computer readable medium includes a data storage area and a program storage area: the data storage area stores data created from the use of blockchain nodes, and the program storage area stores the intelligent video generation program 10, which, when executed by a processor, implements the following operations:
acquiring text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result;
according to keywords contained in the text of the sentence dividing result, searching a plurality of pictures associated with the keywords from a preset material library as a picture set to generate an initial video;
converting the text of the clause result into the voice of the initial video through a preset voice generation model, and taking the text of the clause result as a subtitle of the initial video;
and generating a target video according to the initial video, the voice and the caption and feeding back to the user.
The embodiment of the computer readable medium of the present invention is substantially the same as the embodiment of the intelligent video generation method, and will not be described herein.
In another embodiment of the intelligent video generation method provided by the invention, in order to further ensure the privacy and security of all the data, all the data may be stored in a blockchain node. For example, the picture set, the voice and the subtitles may all be stored in a blockchain node.
It should be noted that, the blockchain referred to in the present invention is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, consensus mechanism, encryption algorithm, etc. The Blockchain (Blockchain), which is essentially a decentralised database, is a string of data blocks that are generated by cryptographic means in association, each data block containing a batch of information of network transactions for verifying the validity of the information (anti-counterfeiting) and generating the next block. The blockchain may include a blockchain underlying platform, a platform product services layer, an application services layer, and the like.
It should be noted that, the foregoing reference numerals of the embodiments of the present invention are merely for describing the embodiments, and do not represent the advantages and disadvantages of the embodiments. And the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, apparatus, article or method that comprises the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a medium as described above (e.g. ROM/RAM, magnetic disk, optical disk), comprising instructions for causing a terminal device (which may be a mobile phone, a computer, an electronic device, or a network device, etc.) to perform the method according to the embodiments of the present invention.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (10)

1. An intelligent video generation method, characterized in that the method comprises the following steps:
acquiring text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result;
according to keywords contained in the text of the sentence dividing result, searching a plurality of pictures associated with the keywords from a preset material library as a picture set to generate an initial video;
converting the text of the clause result into the voice of the initial video through a preset voice generation model, and taking the text of the clause result as a subtitle of the initial video;
and generating a target video according to the initial video, the voice and the caption and feeding back to the user.
2. The intelligent video generating method according to claim 1, wherein inputting the text information into a preset semantic clause model and outputting a clause result comprises:
dividing the text information according to punctuation marks contained in the text information to obtain one or more text fragments;
and cutting each text segment to obtain sentence results of each text segment.
3. The intelligent video generating method according to claim 1, wherein the retrieving a plurality of pictures associated with the keywords from a preset material library as a picture set to generate an initial video comprises:
taking the keywords as search sentences, and acquiring description sentences of each picture in the material library;
and calculating the semantic similarity between the search statement and the description statement, and generating an initial video by taking a plurality of associated pictures with the semantic similarity larger than a preset value as a picture set.
4. The intelligent video generating method according to claim 3, wherein the acquiring the description sentence of each picture in the material library comprises:
inputting any picture into a preset visual text generation model, identifying the category of each object seen in the picture, searching a label matched with the category from a preset label library, and generating label characteristics of the objects according to the label;
analyzing coordinate information of the position of the object in the picture to determine a region image of the object, and generating region features of the picture according to the region image;
extracting texts contained in the pictures to obtain text characteristics of the pictures;
and generating a description sentence of the picture according to the region characteristics, the label characteristics and the text characteristics.
5. The intelligent video generating method according to claim 1, wherein the converting text of the clause result into voice of the initial video by a preset voice generating model comprises:
the word vector of each word of the sentence result is used as the network input of the encoder of the speech generation model, and the phoneme information, the word segmentation speed information and the word segmentation energy information of the sentence result are extracted;
and generating a Mel frequency spectrum by the phoneme information, and inputting the Mel frequency spectrum, the word segmentation speed information and the word segmentation energy information into a decoder of the voice generation model for decoding and synthesizing to obtain the voice of the initial video.
6. The intelligent video generating method according to claim 1, wherein before said generating a target video from said initial video, said voice, said subtitle, the method further comprises:
and editing the playing time length of the initial video and the switching speed of the picture according to the text length of the clause result and the length of the voice.
7. The intelligent video generating method according to claim 1, wherein after said generating a target video from said initial video, said voice, said subtitle, the method further comprises:
inputting the clause result into a preset acoustic model to obtain acoustic characteristics of the clause result;
according to the visual characteristics and the audio characteristics of the target video, calculating a matching value between the acoustic characteristics of the clause result and the acoustic characteristics of each piece of music in a preset music library;
and selecting one piece of music with the largest matching value from the calculation result as the background music of the target video.
8. An intelligent video generating apparatus, the apparatus comprising:
sentence module: the method comprises the steps of obtaining text information input by a user, inputting the text information into a preset semantic clause model, and outputting a clause result;
and a retrieval module: the method comprises the steps of searching a plurality of pictures associated with keywords from a preset material library as a picture set to generate an initial video according to keywords contained in texts of sentence results;
and a conversion module: the method comprises the steps of converting texts of clause results into voices of an initial video through a preset voice generation model, and taking the texts of the clause results as subtitles of the initial video;
and a synthesis module: and generating a target video according to the initial video, the voice and the subtitle and feeding back the target video to the user.
9. An electronic device, the electronic device comprising:
at least one processor; and,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores a program executable by the at least one processor to enable the at least one processor to perform the intelligent video generation method of any one of claims 1 to 7.
10. A computer readable medium, wherein the computer readable medium stores an intelligent video generation program which, when executed by a processor, implements the intelligent video generation method of any one of claims 1 to 7.
CN202310267966.0A 2023-03-14 2023-03-14 Intelligent video generation method, device, equipment and medium Pending CN116320607A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310267966.0A CN116320607A (en) 2023-03-14 2023-03-14 Intelligent video generation method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310267966.0A CN116320607A (en) 2023-03-14 2023-03-14 Intelligent video generation method, device, equipment and medium

Publications (1)

Publication Number Publication Date
CN116320607A true CN116320607A (en) 2023-06-23

Family

ID=86786660

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310267966.0A Pending CN116320607A (en) 2023-03-14 2023-03-14 Intelligent video generation method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN116320607A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117082293A (en) * 2023-10-16 2023-11-17 成都华栖云科技有限公司 Automatic video generation method and device based on text creative
CN117082293B (en) * 2023-10-16 2023-12-19 成都华栖云科技有限公司 Automatic video generation method and device based on text creative
CN117440116A (en) * 2023-12-11 2024-01-23 深圳麦风科技有限公司 Video generation method, device, terminal equipment and readable storage medium
CN117440116B (en) * 2023-12-11 2024-03-22 深圳麦风科技有限公司 Video generation method, device, terminal equipment and readable storage medium

Similar Documents

Publication Publication Date Title
KR102577514B1 (en) Method, apparatus for text generation, device and storage medium
CN111883110B (en) Acoustic model training method, system, equipment and medium for speech recognition
CN107220235B (en) Speech recognition error correction method and device based on artificial intelligence and storage medium
WO2021072875A1 (en) Intelligent dialogue generation method, device, computer apparatus and computer storage medium
EP3616190A1 (en) Automatic song generation
CN116320607A (en) Intelligent video generation method, device, equipment and medium
CN110489750A (en) Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF
CN111985243B (en) Emotion model training method, emotion analysis device and storage medium
CN111783471B (en) Semantic recognition method, device, equipment and storage medium for natural language
CN109801349B (en) Sound-driven three-dimensional animation character real-time expression generation method and system
CN110750996B (en) Method and device for generating multimedia information and readable storage medium
CN111738016A (en) Multi-intention recognition method and related equipment
CN113033182B (en) Text creation assisting method, device and server
Wang et al. Comic-guided speech synthesis
CN114780582A (en) Natural answer generating system and method based on form question and answer
CN112199502A (en) Emotion-based poetry sentence generation method and device, electronic equipment and storage medium
CN116682411A (en) Speech synthesis method, speech synthesis system, electronic device, and storage medium
CN113408292A (en) Semantic recognition method and device, electronic equipment and computer-readable storage medium
CN115238708B (en) Text semantic recognition method, device, equipment, storage medium and program product
CN115132182B (en) Data identification method, device, equipment and readable storage medium
CN116595970A (en) Sentence synonymous rewriting method and device and electronic equipment
Shang Spoken Language Understanding for Abstractive Meeting Summarization
CN113470617B (en) Speech recognition method, electronic equipment and storage device
CN113990286A (en) Speech synthesis method, apparatus, device and storage medium
CN114492382A (en) Character extraction method, text reading method, dialog text generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination