CN113434733A - Text-based video file generation method, device, equipment and storage medium - Google Patents

Info

Publication number
CN113434733A
Authority
CN
China
Prior art keywords
clause, keyword, text, layer, clauses
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110717658.4A
Other languages
Chinese (zh)
Other versions
CN113434733B (en)
Inventor
胡向杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN202110717658.4A
Publication of CN113434733A
Application granted
Publication of CN113434733B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 40/205 Parsing
    • G06F 40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F 16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F 16/7834 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 13/00 Speech synthesis; Text to speech systems
    • G10L 13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention relates to the technical field of data display, and discloses a method, a device, equipment and a storage medium for generating a video file based on a text. The method comprises the following steps: acquiring a text to be processed and a material image; splitting the text to be processed according to preset clause symbols to obtain a plurality of clauses; performing keyword recognition on each clause through a keyword detection model to obtain a keyword result, and performing voice generation on each clause through a voice generation technology to obtain a clause audio; performing layer decomposition on the material image according to all the keyword results to decompose an element layer set; performing element configuration processing on each element layer in the element layer set according to all the clauses and all the clause audios to obtain an animation video; and synthesizing the video file. The invention thus recognizes keywords through a keyword detection model and, by applying the voice generation and layer decomposition technologies and configuring the elements of the element layers, generates the video file automatically.

Description

Text-based video file generation method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of data display, in particular to a method, a device, equipment and a storage medium for generating a video file based on a text.
Background
In recent years, with the rapid development of computer vision and the introduction of generative adversarial networks, research on image generation has received increasingly wide attention; such research has very positive significance for material accumulation and the automatic generation of data sets. Compared with images, video is more vivid but harder to generate, so exploration of video generation is all the more significant. At the same time, randomly generating a video, as most image generation methods effectively do, is of little practical use: users want to generate a video based on a written text and the mood of that text. For example, a user who inputs a script and a background map for that script expects video segments matching both, rather than random, meaningless videos that depart from the background map, misread the script, and do not fit the input. Traditional generation methods therefore fail to meet users' requirements that the generated result be directional and matching.
Disclosure of Invention
The invention provides a text-based video file generation method and device, computer equipment and a storage medium, which recognize keywords through a keyword detection model, generate clause audio through a voice generation technology, decompose element layers out of a material image and configure their elements, and thereby generate a video file automatically, improving the fit between the audio and the played images as well as the accuracy and efficiency of video file generation.
A text-based video file generation method includes:
acquiring a text to be processed and a material image associated with the text to be processed;
splitting the text to be processed according to preset clause symbols to obtain a plurality of clauses;
performing keyword recognition on each clause through a keyword detection model to obtain a keyword result corresponding to each clause, and performing voice generation on each clause through a voice generation technology to obtain a clause audio corresponding to each clause;
according to all the keyword results, carrying out layer decomposition on the material image to decompose an element layer set;
according to all the clauses and all the clause audios, carrying out element configuration processing on each element layer in the element layer set to obtain an animation video associated with each element layer;
and synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos.
A text-based video file generation apparatus, comprising:
the acquisition module is used for acquiring a text to be processed and a material image related to the text to be processed;
the splitting module is used for splitting the text to be processed according to preset clause symbols to obtain a plurality of clauses;
the processing module is used for carrying out keyword recognition on each clause through the keyword detection model to obtain a keyword result corresponding to each clause, and simultaneously carrying out voice generation on each clause through a voice generation technology to obtain a clause audio frequency corresponding to each clause;
the decomposition module is used for carrying out layer decomposition on the material image according to all the keyword results to decompose an element layer set;
the configuration module is used for carrying out element configuration processing on each element layer in the element layer set according to all the clauses and all the clause audios to obtain an animation video associated with each element layer;
and the synthesis module is used for synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos.
A computer device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the above text-based video file generation method when executing said computer program.
A computer-readable storage medium, storing a computer program which, when executed by a processor, implements the steps of the above-described text-based video file generation method.
The invention provides a text-based video file generation method and device, computer equipment and a storage medium. The method comprises: acquiring a text to be processed and a material image associated with the text to be processed; splitting the text to be processed according to preset clause symbols to obtain a plurality of clauses; performing keyword recognition on each clause through a keyword detection model to obtain a keyword result corresponding to each clause, and performing voice generation on each clause through a voice generation technology to obtain a clause audio corresponding to each clause; performing layer decomposition on the material image according to all the keyword results to decompose an element layer set; performing element configuration processing on each element layer in the element layer set according to all the clauses and all the clause audios to obtain an animation video associated with each element layer; and synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos. Keywords are thus recognized through the keyword detection model, clause audio is generated by the voice generation technology, the element layers in the material image are decomposed by the layer decomposition technology, the elements in each layer are configured to generate the animation videos, and the clause audios and animation videos are combined into the video file automatically. This improves the fit between the audio and the played images, configures the playback parameters of the images automatically so that the video fully matches the audio, improves the accuracy and efficiency of video file generation, and improves user satisfaction.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments of the present invention will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to these drawings without inventive labor.
FIG. 1 is a schematic diagram of an application environment of a text-based video file generation method according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for text-based video file generation in an embodiment of the present invention;
FIG. 3 is a flowchart of step S30 of a text-based video file generation method in an embodiment of the present invention;
FIG. 4 is a flowchart of step S40 of a text-based video file generation method in an embodiment of the present invention;
FIG. 5 is a schematic block diagram of an apparatus for generating a text-based video file according to an embodiment of the present invention;
FIG. 6 is a functional block diagram of a decomposition module of an apparatus for generating a text-based video file according to an embodiment of the present invention;
FIG. 7 is a schematic diagram of a computer device in an embodiment of the invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The text-based video file generation method provided by the invention can be applied to the application environment shown in fig. 1, wherein a client (computer equipment or terminal) communicates with a server through a network. The client (computer device or terminal) includes, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server may be implemented as a stand-alone server or as a server cluster consisting of a plurality of servers.
In an embodiment, as shown in fig. 2, a text-based video file generating method is provided, which mainly includes the following steps S10-S60:
and S10, acquiring the text to be processed and the material image associated with the text to be processed.
Understandably, in some application scenarios a text and an image need to be explained as required, for example a teacher explaining a picture in class, or a developer explaining the structure of a product to a user. The text to be processed is the text content that needs to be converted into a video file, and the material image is an image that matches the content of the text to be processed. When a video corresponding to the text to be processed and the material image needs to be generated, a generation request is triggered, and the text to be processed and its associated material image are acquired.
and S20, splitting the text to be processed according to preset clause symbols to obtain a plurality of clauses.
Understandably, the preset clause symbols may be ending symbols such as the semicolon, the full stop and the question mark. The splitting process may scan from the beginning of the text to be processed: whenever a symbol matching a preset clause symbol is scanned, the sentence from the current start position up to that symbol is determined as one clause, scanning resumes from the next character, and so on, so that the clauses are separated one by one. Alternatively, the splitting process may take a snapshot of the text to be processed to obtain a text image whose pixel positions correspond to character positions in the text, and feed the text image to a neural network model trained to recognize the preset clause symbols. The model extracts clause-symbol features (features related to the preset clause symbols) from the text image, identifies the pixel positions of the clause symbols, maps those pixel positions back to character positions in the text to be processed, and splits the text into clauses at those positions. Because the trained model locates the clause symbols directly, no character-by-character scan from the beginning of the text is needed, which improves splitting efficiency.
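As a concrete illustration of the scanning approach, the following minimal Python sketch cuts a text at sentence-ending marks; the symbol set and the regular expression are illustrative assumptions, not the patented scanning procedure.

```python
import re

CLAUSE_SYMBOLS = "。；！？;!?"  # assumed preset clause symbols (sentence-ending marks)

def split_clauses(text: str) -> list[str]:
    # Cut immediately after every preset clause symbol, keeping the symbol
    # with the clause it terminates.
    parts = re.split(f"(?<=[{CLAUSE_SYMBOLS}])", text)
    return [p.strip() for p in parts if p.strip()]

print(split_clauses("销量逐年上升。增长曲线呈抛物线形；末期出现回落。"))
# ['销量逐年上升。', '增长曲线呈抛物线形；', '末期出现回落。']
```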
And S30, performing keyword recognition on each clause through the keyword detection model to obtain a keyword result corresponding to each clause, and performing voice generation on each clause through a voice generation technology to obtain a clause audio corresponding to each clause.
Understandably, the keyword recognition includes entity recognition and element key feature recognition: the keyword detection model first performs entity recognition on each clause to recognize the clause's entity set, then extracts element key features from that entity set to identify the keyword result corresponding to the clause. The keyword result represents the words or terms in the clause that relate to an element represented by a layer. The voice generation technology, also referred to as TTS (text-to-speech), converts each word of a clause into audio, splices the pieces together, and finally adjusts the intonation to obtain an audio segment; the clause audio is the audio of the clause's content.
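The patent names only "TTS technology" for this step; as one hedged sketch, the open-source pyttsx3 engine can synthesize each clause to its own audio file (the library choice, file names and default voice settings are assumptions).

```python
import pyttsx3

engine = pyttsx3.init()
clauses = ["销量逐年上升。", "增长曲线呈抛物线形；"]  # output of the splitting step

for i, clause in enumerate(clauses, start=1):
    engine.save_to_file(clause, f"clause_{i:03d}.wav")  # queue synthesis of this clause
engine.runAndWait()                                     # block until all files are written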
Wherein the elements may include pie elements, line segment elements, arrow elements, tick elements, rectangle elements, rounded rectangle elements, circular arc elements, ellipse elements, sector elements, parabola elements, polygon elements, text elements, cube elements, bar chart elements, line chart elements, trigonometric function curve elements, and the like.
In an embodiment, as shown in fig. 3, in the step S30, that is, performing keyword recognition on each clause through the keyword detection model to obtain a keyword result corresponding to each clause, the method includes:
s301, unit splitting is carried out on each clause through the keyword detection model, and a plurality of unit words corresponding to each clause are obtained.
Understandably, the keyword detection model is a trained deep neural network model used to detect the characters or words related to elements in an input clause. Its network structure can be chosen as required, for example that of a BERT (Bidirectional Encoder Representations from Transformers) language model or that of a Long Short-Term Memory (LSTM) language model. Unit splitting is performed on each clause through the keyword detection model; unit splitting is the process of splitting a clause into single characters or words, each of which is recorded as a unit word, so that one clause is divided into the plurality of unit words corresponding to it.
And S302, performing entity identification on each unit word corresponding to each clause through the keyword detection model to obtain an entity name corresponding to each clause.
Understandably, entity recognition converts each unit word into a word vector, then performs context semantic recognition and entity extraction on the word vectors to obtain the entity names corresponding to each clause. Context semantic recognition determines whether the semantics presented by combinations of adjacent word vectors correspond to nouns, and entity extraction screens out, from the nouns so identified, those that can represent entities.
S303, extracting element keyword features of the entity names corresponding to the clauses through the keyword detection model, and identifying keyword results corresponding to the clauses according to the extracted element keyword features.
Understandably, element keyword features are extracted from each entity name through the keyword detection model, and the extracted features are classified to obtain the keyword result corresponding to each clause. Element keyword features are drawing-related features, such as lines, distributions and trends. Each clause corresponds to one keyword result, which represents the words in the clause that carry element keyword features.
Unit splitting is thus performed on each clause through the keyword detection model to obtain the plurality of unit words corresponding to each clause; entity recognition is performed on the unit words to obtain the entity names corresponding to each clause; and element keyword features are extracted from the entity names to identify the keyword result of each clause. In this way the entity names in the clauses, and the keyword results related to elements, are recognized automatically through the keyword detection model, which improves keyword recognition accuracy, provides a data basis for subsequently decomposing the element layer set, and improves the accuracy and quality of video file generation.
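A rough sketch of steps S301 to S303 under stated assumptions: the jieba tokenizer stands in for unit splitting, and a hand-written vocabulary plays the role of the trained BERT/LSTM keyword detection model described above.

```python
import jieba

ELEMENT_KEYWORDS = {"曲线", "折线", "趋势", "分布", "柱状图"}  # assumed element vocabulary

def keyword_result(clause: str) -> list[str]:
    units = jieba.lcut(clause)                              # S301: unit splitting into words
    entities = [u for u in units if len(u) > 1]             # S302: crude stand-in for entity recognition
    return [e for e in entities if e in ELEMENT_KEYWORDS]   # S303: element keyword filtering

print(keyword_result("增长曲线呈上升趋势。"))
# expected ['曲线', '趋势'] under typical jieba segmentation
```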
And S40, according to all the keyword results, carrying out layer decomposition on the material image, and decomposing out an element layer set.
Understandably, layer decomposition decomposes the material image layer by layer according to the elements in all the keyword results; that is, the layers corresponding to the elements summarized from all the keyword results are extracted from the material image to obtain the element layer set.
In an embodiment, as shown in fig. 4, in step S40, that is, performing layer decomposition on the material image according to all the keyword results to decompose an element layer set, includes:
s401, performing duplicate removal processing on all the keyword results to obtain first-appearing keyword results; the first-come keyword result includes a plurality of first-come keywords.
Understandably, the deduplication removes repeated keywords: among identical keywords, the one whose clause comes first in the order of the text to be processed is kept and determined as a first-appearing keyword, and all the first-appearing keywords after deduplication are recorded as the first-appearing keyword result. A first-appearing keyword is therefore the first occurrence, in text order, of a keyword shared by several clauses.
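The first-occurrence deduplication can be sketched in a few lines; modelling the keyword results as per-clause lists is an assumption about the data shape.

```python
def first_appearing(keyword_results: list[list[str]]) -> list[str]:
    # Flatten in clause order, then keep only the first occurrence of each
    # keyword; dict preserves insertion order in Python 3.7+.
    flat = [kw for result in keyword_results for kw in result]
    return list(dict.fromkeys(flat))

print(first_appearing([["line", "trend"], ["trend", "bar chart"], ["line"]]))
# ['line', 'trend', 'bar chart']
```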
S402, according to the first-appearing keyword result, carrying out layer decomposition on the material image, and decomposing out an element layer associated with each first-appearing keyword.
Understandably, this layer decomposition splits the material image according to the first-appearing keyword result and decomposes the element layers associated with each first-appearing keyword. The material image includes a plurality of layers, each obtained by processing one or more elements, i.e. each layer contains one or more elements. The decomposition process can be set as required, for example: first, separate the layers of the material image with an image processing tool to obtain a plurality of layers; then, match each first-appearing keyword against the elements of each layer to find the layer whose element matches the keyword, for example the first-appearing keyword "line" matching a layer whose separated element is a curve; finally, record the layer whose element matches the first-appearing keyword as that keyword's element layer.
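The matching stage might look like the following sketch, where layers are modelled as (name, element) pairs and the synonym table is an assumed stand-in for the real matching rules (e.g. "line" matching a "curve" element as in the example above).

```python
SYNONYMS = {"line": {"line", "curve"}, "trend": {"trend", "line chart"}}  # assumed

def match_element_layers(first_keywords, layers):
    element_layers = {}
    for kw in first_keywords:
        for layer_name, element in layers:
            if element in SYNONYMS.get(kw, {kw}):
                element_layers[kw] = layer_name  # layer holding the matched element
                break
    return element_layers

print(match_element_layers(["line"], [("layer_1", "curve"), ("layer_2", "axis")]))
# {'line': 'layer_1'}
```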
In an embodiment, in step S402, that is, performing layer decomposition on the material image according to the result of the first-appearing keyword, and decomposing an element layer associated with each first-appearing keyword, includes:
carrying out element identification on the material image to identify an element result; the element result includes the element.
Understandably, when the elements of each layer in the material image have not been acquired, or the dimensions of the layer elements differ from those of the first-appearing keywords (for example, the elements are described as normal distributions or discrete distributions while the first-appearing keywords speak of line segments, polygons or trends), element identification must be performed on the material image. Element identification recognizes element categories through a trained target detection model: the model extracts, from the input image, features of the elements in the same dimension as the first-appearing keywords, and classifies the element categories on that basis to obtain the element result. The network structure of the target detection model can be set as required, for example that of a YOLO target detection model or of a VGG-series target detection model. The element result represents the element categories contained in the material image.
And carrying out layer decomposition on the material image according to the element result to obtain an element layer corresponding to each element.
Understandably, according to the element result, the target area corresponding to each element in the element result is separated out of the material image, the layer corresponding to each target area is separated, each first-appearing keyword is matched against the element of each layer, and the layer whose element matches a first-appearing keyword is marked as that keyword's element layer.
The invention thus performs element identification on the material image to identify the element result, which includes the elements, and performs layer decomposition on the material image according to the element result to obtain the element layer corresponding to each element. Elements in the material image are identified automatically, and the element layers are decomposed and matched accordingly, which reduces manual identification, improves the efficiency of element layer output, improves the accuracy and quality of the layer decomposition, and saves cost.
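For the target detection model, one possible realization uses the ultralytics YOLO API; the weights file below is a hypothetical detector trained on chart-element classes, not a published checkpoint.

```python
from ultralytics import YOLO

model = YOLO("chart_elements.pt")        # hypothetical custom-trained element detector
results = model("material_image.png")    # detect element regions in the material image
element_result = {model.names[int(box.cls)] for r in results for box in r.boxes}
print(element_result)                    # e.g. {'line', 'bar_chart'}
```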
And S403, recording the image remaining in the material image after all the element layers have been decomposed out as a start-end image, and performing start-end identification on the start-end image to obtain a start-end identification result.
Understandably, the layers left after the layer decomposition are recorded as the start-end image, and start-end identification is performed on it. Start-end identification determines whether the image is a background: if it is, the layers need to be superimposed on top of the start-end image, and the identification result is a start image; if it is not, the remaining part of the start-end image is the content presented last, and the identification result is an end image. The start-end identification result is thus either a start image, the layer that appears first, or an end image, the layer that appears last.
The background is identified by extracting background features from the start-end image and classifying it as a start image or an end image on that basis; background features are features related to the background, such as coordinate axes and timelines.
S404, associating, according to the start-end identification result, the start-end image with the element layer associated with the first-appearing keyword matched with the start-end identification result.
Understandably, if the start-end identification result is a start image, the start-end image is associated with the first first-appearing keyword in sequence, and hence with that keyword's element layer; if the start-end identification result is an end image, the start-end image is associated with the last first-appearing keyword in sequence, and hence with that keyword's element layer.
S405, determining all the associated element layers as the element layer set.
Understandably, all the associated element layers are recorded as the element layer set.
The invention thus performs deduplication on all the keyword results to obtain the first-appearing keyword result, which includes a plurality of first-appearing keywords; performs layer decomposition on the material image according to the first-appearing keyword result to decompose the element layer associated with each first-appearing keyword; records the image remaining after all element layers have been decomposed out as the start-end image and performs start-end identification on it; associates the start-end image with the element layer of the matched first-appearing keyword according to the identification result; and determines all the associated element layers as the element layer set. The material image is thereby decomposed automatically, the start and end images are identified and associated automatically, and the element layer set is generated, which improves the accuracy of the layer decomposition and provides a basis for subsequent video synthesis.
And S50, according to all the clauses and all the clause audios, carrying out element configuration processing on each element layer in the element layer set to obtain an animation video associated with each element layer.
Understandably, element configuration processing sets the configuration parameters for element drawing, element motion and element special effects in an element layer, and the animation video is the video synthesized from the element layer after this processing.
Wherein the element drawing includes: (1) 2D and 3D pie elements; (2) line segment elements; (3) arrow elements, including arrow direction, angle and shape; (4) V-shaped or hook (tick) elements, including element orientation; (5) rectangle elements; (6) rounded-rectangle elements; (7) circle and circular-arc elements; (8) ellipse elements; (9) sector elements; (10) parabola elements; (11) polygon elements; (12) text elements; (13) 3D cube elements; (14) bar chart elements; (15) line chart elements; (16) trigonometric function curve elements. The element motion includes: (1) elements growing from small to large; (2) growth along a timeline; (3) element translation; (4) element rotation; (5) elements appearing by expanding from left to right and from top to bottom; (6) elements appearing crosswise. The element special effects include: (1) characters displayed one by one; (2) a parabola falling with a rebound effect; (3) characters displayed zooming between large and small; (4) text displayed from left to right and from top to bottom.
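One way to represent a configuration item result covering the three groups above is a small per-clause record; the field names are assumptions, not the patent's data format.

```python
clause_config = {
    "draw":   {"element": "line_chart"},                      # element drawing
    "motion": {"appear": "bottom_to_top", "timeline": True},  # element motion
    "effect": ["rebound"],                                    # element special effect
}
```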
In an embodiment, in step S50, that is, according to all the clauses and all the clause audios, the performing element configuration processing on each element layer in the element layer set to obtain an animation video associated with each element layer includes:
and acquiring the clause corresponding to the first-appearing keyword associated with the element map layer and the audio frame number in the clause audio.
Understandably, each element layer has an association with a first-appearing keyword, and each first-appearing keyword has a corresponding clause and that clause's audio; the clause and clause audio corresponding to the first-appearing keyword associated with an element layer can therefore be obtained from the element layer.
The clause audio has a number of audio frames, which corresponds to the duration of the clause audio.
And extracting the configuration characteristics of the obtained clauses through an element configuration extraction model, and determining the configuration item result of the clauses according to the extracted configuration characteristics.
Understandably, configuration features are extracted from the clause corresponding to the element layer through the element configuration extraction model. Configuration feature extraction pulls out of the clause the text content related to configuration features, which are the parameter features of element drawing, motion and special effects. For example, if the clause describes a development trend from bottom to top with a rebound occurring as the trend develops, the extracted text content yields the element motion "from bottom to top" and the element special effect "rebound", and the configuration item result of the clause is determined to include both.
The configuration item result represents the set of configuration items for all element drawing, element motion and element special effects in the clause.
And carrying out element configuration processing on the element layer according to the audio frame number corresponding to the element layer and the configuration item result to obtain the animation video corresponding to the element layer.
Understandably, element configuration processing is performed on the element layer corresponding to the clause according to the number of audio frames of the clause audio. The configuration item result drives the corresponding drawing, motion or special-effect processing of the element layer on a display, and a base video is synthesized; for example, "from bottom to top" is a gradual display from bottom to top, and the "rebound effect" is a rebound shown after the display finishes. The processed element layers are combined into the base video. Element configuration processing further compares the duration of the base video with the number of audio frames: if the duration of the base video exceeds the number of audio frames, the base video is compressed in equal proportion until its duration is less than or equal to the number of audio frames; otherwise, no compression is performed. The base video whose duration is less than or equal to the number of audio frames is finally determined as the animation video, which ensures that the synthesized animation video is no longer than the clause audio and that the whole animation has been displayed by the time the clause audio finishes playing.
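The duration check reduces to simple frame arithmetic; the sketch below returns the equal-proportion speed-up factor that brings the base video within the audio length (the frame-count interface is an assumption).

```python
def fit_to_audio(base_video_frames: int, audio_frames: int) -> float:
    # Speed-up factor for equal-proportion compression; 1.0 means no compression.
    if base_video_frames > audio_frames:
        return base_video_frames / audio_frames  # e.g. 300 frames vs 240 -> 1.25x
    return 1.0
```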
The invention thus obtains the clause corresponding to the first-appearing keyword associated with the element layer and the number of audio frames of its clause audio; extracts configuration features from the clause through the element configuration extraction model and determines the clause's configuration item result; and performs element configuration processing on the element layer according to the number of audio frames and the configuration item result to obtain the animation video corresponding to the element layer. The configuration features relevant to the element configuration items are recognized automatically, the element layers are configured automatically, and animation videos that closely fit the clause audio are generated without manual configuration or video compression operations, which improves the efficiency of animation video generation, enriches the playback effects available for subsequent video file generation, and improves the user experience.
And S60, synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos.
Understandably, all the clause audios are spliced into an audio file, all the animation videos are superimposed frame by frame according to the order of their clauses to obtain a superposed video with the same number of frames as the audio file (i.e. the duration of the video file), and the audio file and the superposed video are synthesized into the video file using a video synthesis technology. A video file is thus generated automatically by combining a text with a provided image, which improves the fit between the audio and the played images, configures the playback parameters of the images automatically so that the video fully matches the audio, improves the accuracy and efficiency of video file generation, and improves user satisfaction.
The method and the device thus acquire the text to be processed and the material image associated with it; split the text to be processed according to preset clause symbols to obtain a plurality of clauses; perform keyword recognition on each clause through the keyword detection model to obtain the keyword result corresponding to each clause, and perform voice generation on each clause through the voice generation technology to obtain the clause audio corresponding to each clause; perform layer decomposition on the material image according to all the keyword results to decompose the element layer set; perform element configuration processing on each element layer in the element layer set according to all the clauses and all the clause audios to obtain the animation video associated with each element layer; and synthesize the video file corresponding to the text to be processed according to all the clause audios and all the animation videos. Keywords are recognized through the keyword detection model, clause audio is generated by voice generation, the element layers in the material image are decomposed and their elements configured to generate the animation videos, and the clause audios and animation videos are combined into the video file automatically, improving the fit between the audio and the played images, the accuracy and efficiency of video file generation, and user satisfaction.
In an embodiment, the step S60, namely, the synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos includes:
and splicing the sentence audio corresponding to each sentence according to the sequence of the sentence in the text to be processed to obtain an audio file of the text to be processed.
Understandably, the clause audios corresponding to the clauses are spliced end to end in the order of the clauses' positions in the text to be processed, yielding the audio file of the text to be processed; the audio file plays the audio of the content of the text to be processed.
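End-to-end splicing of the clause audio can be sketched with pydub; the library choice and file names are assumptions.

```python
from pydub import AudioSegment

clips = [AudioSegment.from_wav(f"clause_{i:03d}.wav") for i in range(1, 4)]
audio_file = sum(clips[1:], clips[0])   # head-to-tail concatenation in clause order
audio_file.export("audio_file.wav", format="wav")
```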
And determining, from the animation video corresponding to each element layer, the animation video associated with the clause corresponding to that element layer, and, in the order of the clauses in the text to be processed, performing adjacent-last-frame filling for the clauses not associated with any animation video, so that every clause has an associated animation video.
Understandably, an animation video can be associated with a clause through the element layer the video corresponds to; the text to be processed may therefore contain clauses with no associated element layer. For these, an animation video is generated automatically by adjacent-last-frame filling, i.e. by filling in the image of the last frame of the adjacent preceding clause's animation, so that every clause ends up associated with an animation video.
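Adjacent-last-frame filling, sketched with animations modelled as lists of frames; None marks a clause with no associated element layer, and the first clause is assumed to have one.

```python
def fill_missing(animations):
    filled = []
    for anim in animations:
        if anim is None:               # clause without an associated layer
            anim = [filled[-1][-1]]    # hold the last frame of the previous clause
        filled.append(anim)
    return filled

print(fill_missing([["f1", "f2"], None, ["f3"]]))
# [['f1', 'f2'], ['f2'], ['f3']]
```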
And superposing the image frames of all the animation videos in the order of the clauses associated with them to obtain the superposed video of the text to be processed.
Understandably, image frame superposition stacks the layers of the animation videos layer by layer: the image of the current frame is superimposed on the previous frame or accumulated image, and through the continuous superposition of element layers the material image is finally output. The superposed video of the text to be processed, the set of all the animation videos, is thus obtained.
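Layer-by-layer superposition corresponds to ordinary alpha compositing; a Pillow-based sketch (the library choice is an assumption):

```python
from PIL import Image

def superimpose(base: Image.Image, layer: Image.Image) -> Image.Image:
    # Composite the current frame's layer on top of the accumulated image.
    return Image.alpha_composite(base.convert("RGBA"), layer.convert("RGBA"))
```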
And synthesizing the audio file and the superposed video by using a video synthesis technology to obtain the video file.
Understandably, the video synthesis technology plays each frame of the video while playing the audio, combining the audio with the silent video into a video with sound; through it, the audio file and the superposed video are synthesized into the video file, a video that presents the text to be processed and the material image.
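The final muxing step, sketched with the moviepy 1.x API; the patent does not prescribe a specific tool, and the paths are illustrative.

```python
from moviepy.editor import AudioFileClip, VideoFileClip

video = VideoFileClip("superposed.mp4").set_audio(AudioFileClip("audio_file.wav"))
video.write_videofile("video_file.mp4")  # the synthesized video file
```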
According to the invention, the clause audios are spliced in the order of the clauses in the text to be processed to obtain the audio file; the animation video associated with each clause is determined from the animation video of each element layer, with adjacent-last-frame filling applied to clauses not associated with any animation video; the image frames of all the animation videos are superposed in clause order to obtain the superposed video of the text to be processed; and the audio file and the superposed video are synthesized into the video file using a video synthesis technology. Splicing, adjacent-last-frame filling, image frame superposition and video synthesis thus automatically turn the clause audios and animation videos into the required video file matching the text to be processed and the material image, without manual synthesis, which saves cost, improves the accuracy and efficiency of video file generation, and improves user satisfaction.
It should be understood that, the sequence numbers of the steps in the foregoing embodiments do not imply an execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present invention.
In an embodiment, a text-based video file generation apparatus is provided, and the text-based video file generation apparatus corresponds to the text-based video file generation method in the above embodiment one to one. As shown in fig. 5, the text-based video file generating apparatus includes an obtaining module 11, a splitting module 12, a processing module 13, a decomposing module 14, a configuring module 15, and a synthesizing module 16. The functional modules are explained in detail as follows:
the acquisition module 11 is configured to acquire a text to be processed and a material image associated with the text to be processed;
the splitting module 12 is configured to split the text to be processed according to preset clause symbols to obtain a plurality of clauses;
the processing module 13 is configured to perform keyword recognition on each clause through the keyword detection model to obtain a keyword result corresponding to each clause, and perform speech generation on each clause through a speech generation technology to obtain a clause audio corresponding to each clause;
the decomposition module 14 is configured to perform layer decomposition on the material image according to all the keyword results, and decompose an element layer set;
a configuration module 15, configured to perform element configuration processing on each element layer in the element layer set according to all the clauses and all the clause audios, so as to obtain an animation video associated with each element layer;
and the synthesis module 16 is used for synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos.
In one embodiment, as shown in fig. 6, the decomposition module 14 includes:
a duplicate removal submodule 41, configured to perform duplicate removal processing on all the keyword results to obtain a first-appearing keyword result; the first-appearing keyword result includes a plurality of first-appearing keywords;
the decomposition submodule 42 is configured to perform layer decomposition on the material image according to the first-appearing keyword result, and decompose an element layer associated with each first-appearing keyword;
the identifying sub-module 43 is configured to record the image remaining in the material image after all the element layers have been decomposed out as a start-end image, and to perform start-end identification on the start-end image to obtain a start-end identification result;
a matching sub-module 44, configured to associate, according to the start-end identification result, the start-end image with the element layer associated with the first-appearing keyword matched with the start-end identification result;
and a determining submodule 45, configured to determine all the associated element layers as the element layer set.
For specific limitations of the text-based video file generation apparatus, reference may be made to the above limitations of the text-based video file generation method, which will not be described herein again. The respective modules in the text-based video file generation apparatus described above may be wholly or partially implemented by software, hardware, and a combination thereof. The modules can be embedded in a hardware form or independent from a processor in the computer device, and can also be stored in a memory in the computer device in a software form, so that the processor can call and execute operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a client or a server, and its internal structure diagram may be as shown in fig. 7. The computer device includes a processor, a memory, a network interface, and a database connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device comprises a readable storage medium and an internal memory. The readable storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for the operation of an operating system and computer programs in the readable storage medium. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a text-based video file generation method.
In one embodiment, a computer device is provided, which includes a memory, a processor, and a computer program stored on the memory and executable on the processor, and when the processor executes the computer program, the text-based video file generation method in the above embodiments is implemented.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored, which, when executed by a processor, implements the text-based video file generation method in the above-described embodiments.
It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments can be implemented by instructing relevant hardware through a computer program, which can be stored in a non-volatile computer-readable storage medium and, when executed, can include the processes of the above method embodiments. Any reference to memory, storage, databases, or other media used in embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in a variety of forms such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions.
The above-mentioned embodiments are only used for illustrating the technical solutions of the present invention, and not for limiting the same; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not substantially depart from the spirit and scope of the embodiments of the present invention, and are intended to be included within the scope of the present invention.

Claims (10)

1. A text-based video file generation method is characterized by comprising the following steps:
acquiring a text to be processed and a material image associated with the text to be processed;
splitting the text to be processed according to preset clause symbols to obtain a plurality of clauses;
performing keyword recognition on each clause through a keyword detection model to obtain a keyword result corresponding to each clause, and performing voice generation on each clause through a voice generation technology to obtain a clause audio corresponding to each clause;
according to all the keyword results, carrying out layer decomposition on the material image to decompose an element layer set;
according to all the clauses and all the clause audios, carrying out element configuration processing on each element layer in the element layer set to obtain an animation video associated with each element layer;
and synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos.
2. The method for generating a text-based video file according to claim 1, wherein the performing keyword recognition on each clause through a keyword detection model to obtain a keyword result corresponding to each clause comprises:
performing unit splitting on each clause through the keyword detection model to obtain a plurality of unit words corresponding to each clause;
performing entity recognition on each unit word corresponding to each clause through the keyword detection model to obtain an entity name corresponding to each clause;
and extracting element keyword features of the entity names corresponding to the clauses through the keyword detection model, and identifying keyword results corresponding to the clauses according to the extracted element keyword features.
3. The method according to claim 1, wherein the performing layer decomposition on the material image according to all the keyword results to decompose an element layer set comprises:
performing duplicate removal processing on all the keyword results to obtain a first-appearing keyword result; the first-appearing keyword result comprises a plurality of first-appearing keywords;
according to the first-appearing keyword result, carrying out layer decomposition on the material image, and decomposing element layers associated with each first-appearing keyword;
recording the image remaining in the material image after all the decomposed element layers have been removed as a start-end image, and performing start-end identification on the start-end image to obtain a start-end identification result;
according to the start-end identification result, associating the start-end image with the element layer associated with the first-appearing keyword that matches the start-end identification result;
and determining all the associated element layers as the element layer set.
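The de-duplication step of claim 3 keeps only the first appearance of each keyword across all clauses; a minimal order-preserving sketch:

    def first_appearing_keywords(all_keyword_results):
        """De-duplicate keyword results while preserving first-appearance order."""
        seen, first = set(), []
        for result in all_keyword_results:
            for keyword in result:
                if keyword not in seen:
                    seen.add(keyword)
                    first.append(keyword)
        return first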
4. The method of claim 3, wherein the performing layer decomposition on the material image according to the first-appearing keyword result to obtain the element layer associated with each first-appearing keyword comprises:
performing element identification on the material image to obtain an element result; the element result comprises a plurality of elements;
and carrying out layer decomposition on the material image according to the element result to obtain an element layer corresponding to each element.
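One plausible reading of claim 4 (the patent fixes no algorithm): segment the material image into elements and emit one layer per element, for example with OpenCV connected components. The thresholding choice is an assumption for illustration.

    import cv2
    import numpy as np

    def decompose_element_layers(material_image):
        """Identify elements and split a BGR image into one layer per element."""
        gray = cv2.cvtColor(material_image, cv2.COLOR_BGR2GRAY)
        _, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
        n_labels, labels = cv2.connectedComponents(mask)
        layers = []
        for label in range(1, n_labels):  # label 0 is the background
            layer = np.zeros_like(material_image)
            layer[labels == label] = material_image[labels == label]
            layers.append(layer)
        return layers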
5. The method for generating a text-based video file according to claim 1, wherein the performing element configuration processing on each element layer in the element layer set according to all the clauses and all the clause audios to obtain an animation video associated with each element layer comprises:
acquiring the clause corresponding to the first-appearing keyword associated with each element layer, and the number of audio frames of the clause audio corresponding to that clause;
carrying out configuration feature extraction on the obtained clauses through an element configuration extraction model, and determining configuration item results of the clauses according to the extracted configuration features;
and carrying out element configuration processing on the element layer according to the audio frame number corresponding to the element layer and the configuration item result to obtain the animation video corresponding to the element layer.
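A sketch of the claim-5 timing logic, assuming a 25 fps output and a toy "pan" configuration item in place of the element configuration extraction model:

    import numpy as np

    FPS = 25  # assumed output frame rate

    def audio_frame_count(audio_duration_seconds):
        """Number of video frames the clause audio must span."""
        return int(round(audio_duration_seconds * FPS))

    def animate_layer(layer, n_frames, dx_per_frame=2):
        """Toy configuration item: shift the layer a little further right each frame."""
        return [np.roll(layer, frame * dx_per_frame, axis=1) for frame in range(n_frames)]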
6. The method for generating a text-based video file according to claim 1, wherein the synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos comprises:
according to the sequence of the clauses in the text to be processed, splicing the clause audio corresponding to each clause to obtain an audio file of the text to be processed;
determining, from the animation video corresponding to each element layer, the animation video associated with the clause corresponding to that element layer; and, following the order of the clauses in the text to be processed, filling each clause not associated with any animation video with the last frame of the adjacent animation video, so as to obtain an animation video associated with each clause;
superposing the image frames of all the animation videos according to the clause order associated with the animation videos to obtain a superposed video of the text to be processed;
and synthesizing the audio file and the superposed video by using a video synthesis technology to obtain the video file.
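A muxing sketch for claim 6 using the moviepy 1.x API (an assumption; the claim only says "video synthesis technology"). Here frames and clause_audio_paths are hypothetical outputs of the earlier steps.

    from moviepy.editor import AudioFileClip, ImageSequenceClip, concatenate_audioclips

    def synthesize_video_file(frames, clause_audio_paths, out_path="out.mp4", fps=25):
        """Concatenate clause audios in clause order, then mux with the superposed frames."""
        audio = concatenate_audioclips([AudioFileClip(p) for p in clause_audio_paths])
        video = ImageSequenceClip(frames, fps=fps).set_audio(audio)
        video.write_videofile(out_path, codec="libx264", audio_codec="aac")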
7. A text-based video file generation apparatus, comprising:
the acquisition module is used for acquiring a text to be processed and a material image related to the text to be processed;
the splitting module is used for splitting the text to be processed according to preset clause symbols to obtain a plurality of clauses;
the processing module is used for performing keyword recognition on each clause through a keyword detection model to obtain a keyword result corresponding to each clause, and for performing voice generation on each clause through a voice generation technology to obtain a clause audio corresponding to each clause;
the decomposition module is used for performing layer decomposition on the material image according to all the keyword results to obtain an element layer set;
the configuration module is used for carrying out element configuration processing on each element layer in the element layer set according to all the clauses and all the clause audios to obtain an animation video associated with each element layer;
and the synthesis module is used for synthesizing a video file corresponding to the text to be processed according to all the clause audios and all the animation videos.
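The claim-7 apparatus mirrors the method with one module per step; an illustrative Python skeleton (method bodies omitted, and would delegate to the sketches above):

    class TextToVideoGenerator:
        """One method per module of claim 7; all names are illustrative."""
        def acquire(self, text, material_image): ...        # acquisition module
        def split(self, text): ...                          # splitting module
        def process(self, clauses): ...                     # keyword + voice module
        def decompose(self, material_image, keywords): ...  # decomposition module
        def configure(self, layers, clauses, audios): ...   # configuration module
        def synthesize(self, audios, animations): ...       # synthesis module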
8. The text-based video file generation apparatus of claim 7, wherein the decomposition module comprises:
the de-duplication submodule is used for performing de-duplication processing on all the keyword results to obtain a first-appearing keyword result; the first-appearing keyword result comprises a plurality of first-appearing keywords;
the decomposition submodule is used for performing layer decomposition on the material image according to the first-appearing keyword result to obtain the element layer associated with each first-appearing keyword;
the identification submodule is used for recording the image remaining in the material image after all the decomposed element layers have been removed as a start-end image, and for performing start-end identification on the start-end image to obtain a start-end identification result;
the matching submodule is used for associating, according to the start-end identification result, the start-end image with the element layer associated with the first-appearing keyword that matches the start-end identification result;
and the determining submodule is used for determining all the associated element layers as the element layer set.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements the text-based video file generation method according to any one of claims 1 to 6 when executing the computer program.
10. A computer-readable storage medium, in which a computer program is stored, which, when being executed by a processor, carries out a text-based video file generation method according to any one of claims 1 to 6.
CN202110717658.4A 2021-06-28 2021-06-28 Text-based video file generation method, device, equipment and storage medium Active CN113434733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110717658.4A CN113434733B (en) 2021-06-28 2021-06-28 Text-based video file generation method, device, equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113434733A true CN113434733A (en) 2021-09-24
CN113434733B CN113434733B (en) 2022-10-21

Family

ID=77754900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110717658.4A Active CN113434733B (en) 2021-06-28 2021-06-28 Text-based video file generation method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113434733B (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180004760A1 (en) * 2016-06-29 2018-01-04 Accenture Global Solutions Limited Content-based video recommendation
CN109145152A (en) * 2018-06-28 2019-01-04 中山大学 A kind of self-adapting intelligent generation image-text video breviary drawing method based on query word
CN110807126A (en) * 2018-08-01 2020-02-18 腾讯科技(深圳)有限公司 Method, device, storage medium and equipment for converting article into video
CN111669515A (en) * 2020-05-30 2020-09-15 华为技术有限公司 Video generation method and related device
CN111899155A (en) * 2020-06-29 2020-11-06 腾讯科技(深圳)有限公司 Video processing method, video processing device, computer equipment and storage medium
CN111935537A (en) * 2020-06-30 2020-11-13 百度在线网络技术(北京)有限公司 Music video generation method and device, electronic equipment and storage medium
CN112712575A (en) * 2020-12-28 2021-04-27 广州虎牙科技有限公司 Sticker template image generation method and device, anchor terminal equipment and storage medium

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114496173A (en) * 2021-12-31 2022-05-13 北京航天长峰股份有限公司 Short video operation report generation method and device, computer equipment and storage medium
CN114817456A (en) * 2022-03-10 2022-07-29 马上消费金融股份有限公司 Keyword detection method and device, computer equipment and storage medium
CN114817456B (en) * 2022-03-10 2023-09-05 马上消费金融股份有限公司 Keyword detection method, keyword detection device, computer equipment and storage medium
CN114513706A (en) * 2022-03-22 2022-05-17 中国平安人寿保险股份有限公司 Video generation method and device, computer equipment and storage medium
CN114900714A (en) * 2022-04-12 2022-08-12 科大讯飞股份有限公司 Video generation method based on neural network and related device
CN114900714B (en) * 2022-04-12 2023-11-21 科大讯飞股份有限公司 Video generation method and related device based on neural network
CN116108492A (en) * 2023-04-07 2023-05-12 安羚科技(杭州)有限公司 Laterally expandable data leakage prevention system

Also Published As

Publication number Publication date
CN113434733B (en) 2022-10-21

Similar Documents

Publication Publication Date Title
CN113434733B (en) Text-based video file generation method, device, equipment and storage medium
US20230042654A1 (en) Action synchronization for target object
CN110446063B (en) Video cover generation method and device and electronic equipment
CN109461437B (en) Verification content generation method and related device for lip language identification
CN112333179A (en) Live broadcast method, device and equipment of virtual video and readable storage medium
JP2021182438A (en) Moving image generating method, device, electronic apparatus, and computer readable medium
CN110929094A (en) Video title processing method and device
CN114401431A (en) Virtual human explanation video generation method and related device
CN114550239A (en) Video generation method and device, storage medium and terminal
CN114449310A (en) Video editing method and device, computer equipment and storage medium
CN111400524B (en) Variable-scale geological text vectorization method and system based on AI
CN114513706B (en) Video generation method and device, computer equipment and storage medium
CN111222854A (en) Interview method, device and equipment based on interview robot and storage medium
CN113283432A (en) Image recognition and character sorting method and equipment
CN116320659A (en) Video generation method and device
CN116188250A (en) Image processing method, device, electronic equipment and storage medium
CN110428668B (en) Data extraction method and device, computer system and readable storage medium
CN114911973A (en) Action generation method and device, electronic equipment and storage medium
CN114038451A (en) Quality inspection method and device for dialogue data, computer equipment and storage medium
CN114996510A (en) Teaching video segmentation and information point extraction method, device, electronic equipment and medium
CN111047922A (en) Pronunciation teaching method, device, system, computer equipment and storage medium
CN115115958A (en) Augmented reality display method and device and electronic equipment
CN116528015A (en) Digital human video generation method and device, electronic equipment and storage medium
CN117519483A (en) Media dynamic interaction method, system and medium based on digital virtual
CN116992094A (en) Content identification method, apparatus, device, storage medium, and program product

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant