CN113923521B - Video scripting method - Google Patents

Video scripting method

Info

Publication number
CN113923521B
Authority
CN
China
Prior art keywords
lens
split
mirror
image
frame
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111519420.7A
Other languages
Chinese (zh)
Other versions
CN113923521A (en)
Inventor
严华培
王红星
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Flash Scissor Intelligent Technology Co ltd
Original Assignee
Shenzhen Big Head Brothers Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Big Head Brothers Technology Co Ltd filed Critical Shenzhen Big Head Brothers Technology Co Ltd
Priority to CN202111519420.7A priority Critical patent/CN113923521B/en
Publication of CN113923521A publication Critical patent/CN113923521A/en
Application granted granted Critical
Publication of CN113923521B publication Critical patent/CN113923521B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/44Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs
    • H04N21/44008Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream, rendering scenes according to MPEG-4 scene graphs involving operations for analysing video streams, e.g. detecting features or characteristics in the video stream

Abstract

The invention discloses a video scripting method, which comprises: acquiring a video file to be processed; performing shot segmentation on the video file to obtain a plurality of shot files, each shot file comprising a shot segment and shot audio; for each shot file, extracting information from the shot segment to obtain the shot information corresponding to the shot segment; performing audio processing on the shot audio in the shot file to obtain the audio information corresponding to the shot audio; and generating a storyboard script corresponding to the video file according to the shot information and the audio information. The invention can automatically, conveniently and quickly convert a video into a script file.

Description

Video scripting method
Technical Field
The invention relates to interactive systems, and in particular to a video scripting method.
Background
Scripts are an essential part of film and television creation: they determine not only the outline of the story's development but also the effect of the video's final presentation, and they are the principal basis for shooting. The video that viewers ultimately see is the result of shooting according to a script. Video scripts therefore have considerable learning value; their shot breakdown, the way dialogue proceeds, the selection of material and so on can be of great help to learners. The script of an excellent video can serve as an object of study, while the script of a poor video can serve as a negative example and a target for improvement.
At present, to obtain the script of an existing video one can rely only on official publication or on producing it oneself. Officially published scripts are few and cover only some very famous film and television works, while producing a script oneself takes a great deal of time and energy. Scripting a video therefore currently requires substantial manpower and material resources and is extremely inefficient.
Disclosure of Invention
The technical problem to be solved by the invention is that existing video scripting relies mainly on manual analysis and production; in view of these defects of the prior art, a video scripting method is provided.
To solve the above technical problem, the invention adopts the following technical solution:
a method of scripting video, the method comprising:
acquiring a video file to be processed;
performing lens splitting on the video file to obtain a plurality of lens files, wherein the lens files comprise lens and lens audio;
for each lens splitting file, extracting information of a lens splitting lens in the lens splitting file to obtain lens information corresponding to the lens splitting lens; and the number of the first and second groups,
performing audio processing on the split-mirror audio in the split-mirror file to obtain audio information corresponding to the split-mirror audio;
and generating a split-mirror script corresponding to the video file according to the lens information and the audio information.
In the video scripting method, the step of performing shot segmentation on the video file to obtain a plurality of shot files comprises:
splitting the video file into an image track and an audio track;
calculating a frame difference between the Nth frame image and the (N+1)th frame image in the image track, wherein N is a positive integer less than or equal to M and M is the number of frame images in the image track;
when the frame difference is greater than a preset frame-difference threshold, determining that the Nth frame image is a shot end frame and the (N+1)th frame image is a shot start frame, wherein the first frame image is a shot start frame and the Mth frame image is a shot end frame;
splitting the image track according to the shot start frames and shot end frames to obtain a plurality of shot segments, and splitting the audio track to obtain the shot audio corresponding to each shot segment.
In the video scripting method, calculating the frame difference between the Nth frame image and the (N+1)th frame image in the image track comprises:
calculating a first color-level histogram corresponding to the Nth frame image and a second color-level histogram corresponding to the (N+1)th frame image;
and calculating the difference area between the first color-level histogram and the second color-level histogram to obtain the frame difference between the Nth frame image and the (N+1)th frame image.
In the video scripting method, the shot information comprises a shot label; for each shot file, extracting information from the shot segment in the shot file to obtain the shot information corresponding to the shot segment comprises:
for each frame image in each shot segment, performing object recognition on the frame image to obtain a plurality of element images corresponding to the frame image and a first attribute tag corresponding to each element image;
for each element image whose first attribute tag is a person tag, performing state recognition on the element image to obtain a second attribute tag corresponding to the element image;
and de-duplicating the first attribute tags and second attribute tags corresponding to the element images to obtain the shot label corresponding to the shot segment.
In the video scripting method, the first attribute tag comprises a person tag, an animal tag and/or an object tag; the second attribute tag comprises an action tag and/or an expression tag.
In the video scripting method, the audio information comprises music information and/or voice information; performing audio processing on the shot audio in the shot file to obtain the audio information corresponding to the shot audio comprises:
performing speech recognition on each shot audio to obtain text information corresponding to the shot audio;
when a preset music recognition model determines that music is present in the shot audio, performing song recognition on the shot audio to obtain the music information in the shot audio, wherein the music information comprises music lyrics;
and filtering the text information according to the music lyrics to obtain the voice text corresponding to the shot audio.
The video scripting method further comprises, after extracting information from the shot segment in each shot file to obtain the shot information corresponding to the shot segment:
clustering all element images whose first attribute tag is a person tag to obtain a plurality of person image sets;
and determining a character label corresponding to each person image set according to a preset character label set.
In the video scripting method, after filtering the text information according to the music lyrics to obtain the voice text corresponding to the shot audio, the method further comprises:
for each shot audio, extracting the voiceprint features in the shot audio;
and determining, according to the voiceprint features, the dialogue text in the voice text corresponding to each character label.
A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps of the video scripting method described in any of the above.
A terminal device, comprising: a processor, a memory and a communication bus; the memory stores a computer-readable program executable by the processor;
the communication bus implements connection and communication between the processor and the memory;
and the processor, when executing the computer-readable program, implements the steps of the video scripting method described in any of the above.
Advantageous effects: the method first acquires a video file and then performs shot segmentation, splitting the video file into a plurality of shot files. A script is composed of the content of each shot; therefore, after the shot files are obtained, information is extracted from the shot segment (i.e., the image set) and the shot audio of each shot file to obtain the corresponding shot information and audio information, and finally the storyboard script corresponding to the video file is generated from the shot information and the audio information. This scheme can automatically script an input video file without laborious manual work such as screen capture and dialogue transcription, which makes it convenient for users to modify and study the script content.
Drawings
Fig. 1 is an overall flowchart of the video scripting method provided by the present invention.
Fig. 2 is a schematic diagram of a first script file in the video scripting method provided by the present invention.
Fig. 3 is a schematic diagram of a second script file in the video scripting method provided by the present invention.
Fig. 4 is a flowchart of extracting a voice text in the video scripting method provided by the present invention.
Fig. 5 is a schematic structural diagram of a terminal device provided in the present invention.
Detailed Description
The present invention provides a video scripting method. To make the objects, technical solutions and effects of the invention clearer, the invention is described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein merely serve to illustrate the invention and are not intended to limit it.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of stated features, integers, steps, operations, elements and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element, or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For example, the embodiment of the present invention may be implemented by a server, a terminal, a software program, or a plug-in of video playing software.
It should be noted that the above application scenarios are only presented to facilitate understanding of the present invention, and the embodiments of the present invention are not limited in any way in this respect. Rather, embodiments of the present invention may be applied to any scenario where applicable.
The invention will be further explained by the description of the embodiments with reference to the drawings.
As shown in fig. 1, this embodiment provides a video scripting method; a PC is used as the execution subject in this embodiment to describe the scripting process. The video scripting method includes the following steps:
and S10, acquiring the video file to be processed.
Specifically, the PC reads a local file, thereby acquiring a video file to be processed. The parameters such as format and size of the video file are not particularly limited.
And S20, splitting the shot of the video file to obtain a plurality of split-mirror files.
Specifically, a video file essentially consists of a sequence of frame images and of audio, which change over time to present a dynamic video effect, and the content is presented by switching between shots. For example, the first shot shows A speaking, the second shows B, and the third cuts back to A speaking; combined with the audio, this presents the appearance of a conversation between A and B. A script records the content of each shot. Therefore, in order to generate the storyboard script, the video file is first segmented to obtain the shot files. The segmentation splits not only the frame images of the video file into shot segments but also the audio of the video file into shot audio. One shot file therefore comprises a shot segment consisting of a number of frame images and a shot audio consisting of a section of audio.
The video file is first split into an image track and an audio track; because images and audio are output on different tracks, the two can be separated directly.
Shot segmentation mainly consists in determining the shot start frames and shot end frames among the frame images. In the first method of determining shot start and end frames provided by this embodiment, shot segmentation is implemented using edge-contour change, which determines shot boundaries by calculating how much the edges change. First, each frame image is edge-detected to obtain an edge frame image. The global displacement between the Nth edge frame image and the (N+1)th edge frame image is then calculated, and the two are registered according to that displacement, where N is a positive integer less than or equal to M and M is the number of frame images in the image track. Next, the number and positions of the edges in the adjacent Nth and (N+1)th edge frame images are compared, and the proportion of edge change, i.e., the fraction of edges that shift between the Nth and (N+1)th edge frame images, is taken as the frame difference. The larger the frame difference, the greater the change in the shot and the more likely the position is a shot boundary.
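As an illustration only (not the claimed implementation), the following Python sketch computes an edge-change-ratio style frame difference between two frames. It assumes OpenCV (cv2) and NumPy are available, omits the global-displacement registration step described above, and the Canny thresholds and kernel size are arbitrary placeholder values.

    import cv2
    import numpy as np

    def edge_change_ratio(frame_a, frame_b, low=100, high=200, dilate_px=5):
        """Fraction of edge pixels that enter or leave between two frames."""
        edges_a = cv2.Canny(cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY), low, high) > 0
        edges_b = cv2.Canny(cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY), low, high) > 0
        kernel = np.ones((dilate_px, dilate_px), np.uint8)
        dil_a = cv2.dilate(edges_a.astype(np.uint8), kernel) > 0
        dil_b = cv2.dilate(edges_b.astype(np.uint8), kernel) > 0
        # Edges of B not near any edge of A have "entered"; edges of A not near B have "left".
        entering = np.logical_and(edges_b, ~dil_a).sum() / max(edges_b.sum(), 1)
        leaving = np.logical_and(edges_a, ~dil_b).sum() / max(edges_a.sum(), 1)
        return max(entering, leaving)  # frame difference in [0, 1]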
A frame-difference threshold is preset; when the frame difference between the Nth edge frame image and the (N+1)th edge frame image is greater than the frame-difference threshold, the frame image corresponding to the (N+1)th edge frame image is determined to be a shot start frame and the frame image corresponding to the Nth edge frame image is determined to be a shot end frame.
In a second method of determining shot start and end frames, a color-level histogram corresponding to each frame image is calculated. Color level refers to brightness; a color-level histogram intuitively represents the gray-scale and brightness distribution of a frame image, and when a shot change occurs between frame images this distribution often changes abruptly. Accordingly, a first color-level histogram corresponding to the Nth frame image and a second color-level histogram corresponding to the (N+1)th frame image are calculated, and the difference area between the first and second histograms is calculated to obtain the frame difference between the two frame images. The smaller the difference area, the higher the similarity between the two frame images and the lower the probability of a shot transition; the larger the difference area, the lower the similarity and the higher the probability of a shot transition. Compared with the first method, this approach is less computationally intensive and more general.
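A minimal sketch of the color-level-histogram frame difference, assuming NumPy and 8-bit frames; the bin count and the example threshold are illustrative choices, not values from the patent.

    import numpy as np

    def histogram_frame_difference(frame_a, frame_b, bins=256):
        """Frame difference as the non-overlapping area of two normalized
        color-level (brightness) histograms; 0 = identical, 1 = disjoint."""
        gray_a = frame_a.mean(axis=2) if frame_a.ndim == 3 else frame_a
        gray_b = frame_b.mean(axis=2) if frame_b.ndim == 3 else frame_b
        hist_a, _ = np.histogram(gray_a, bins=bins, range=(0, 256))
        hist_b, _ = np.histogram(gray_b, bins=bins, range=(0, 256))
        hist_a = hist_a / hist_a.sum()
        hist_b = hist_b / hist_b.sum()
        # Difference area = half the L1 distance between the two histograms.
        return 0.5 * np.abs(hist_a - hist_b).sum()

    def find_shot_boundaries(frames, threshold=0.35):
        """Return indices N where frame N ends a shot and frame N+1 starts the next."""
        return [n for n in range(len(frames) - 1)
                if histogram_frame_difference(frames[n], frames[n + 1]) > threshold]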
As with the first method, a frame-difference threshold is preset; when the frame difference is greater than the preset threshold, the Nth frame image is determined to be a shot end frame and the (N+1)th frame image a shot start frame.
Further, when N equals 1, i.e., for the first frame image, it defaults to a shot start frame, and when N equals M, i.e., for the last frame image, it defaults to a shot end frame. Besides the two methods above, the frame difference between the Nth frame image and the (N+1)th frame image may also be calculated based on the χ² histogram, blocked χ² histograms and the like, so as to determine the shot start and end frames.
Finally, the image track is split using the shot start frames and shot end frames to obtain a number of different shot segments. Because each frame image in the image track carries a timestamp and the image track covers the same time period as the audio track, the audio track can be divided into sections according to the timestamps of the shot start frames and shot end frames, yielding the shot audio corresponding to each shot segment. Each shot segment together with its shot audio is taken as one shot file.
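The splitting of the two tracks by timestamps could look roughly like the following sketch; the data layout (a frame list, per-frame timestamps, and a one-dimensional audio array with a known sample rate) is an assumption made for illustration.

    def split_tracks(frames, timestamps, boundaries, audio, sample_rate):
        """Split an image track and its audio track into per-shot pieces.

        frames      : list of frame images
        timestamps  : per-frame timestamps in seconds (same length as frames)
        boundaries  : indices N where frame N is a shot end frame (see sketch above)
        audio       : 1-D array of audio samples covering the same time span
        sample_rate : audio samples per second
        """
        shot_files = []
        start = 0
        ends = list(boundaries) + [len(frames) - 1]   # the last frame always ends a shot
        for end in ends:
            t_start, t_end = timestamps[start], timestamps[end]
            shot_segment = frames[start:end + 1]
            # Approximate: cut the audio at the timestamps of the boundary frames.
            shot_audio = audio[int(t_start * sample_rate):int(t_end * sample_rate)]
            shot_files.append({"segment": shot_segment, "audio": shot_audio,
                               "start": t_start, "end": t_end})
            start = end + 1
        return shot_files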
S30: for each shot file, extracting information from the shot segment in the shot file to obtain the shot information corresponding to the shot segment; and performing audio processing on the shot audio in the shot file to obtain the audio information corresponding to the shot audio.
Specifically, since a script is organized shot by shot, each shot file is processed individually. Taking one shot file as an example, information is extracted from the shot segment in the shot file to obtain the corresponding shot information.
As shown in fig. 2, in the first information-extraction method of this embodiment, a representative identification image in the shot segment is extracted directly as the shot information. For each frame image in the shot segment, a comprehensive frame difference between the frame image and each comparison image is calculated, the comparison images being the frame images in the shot segment other than that frame image. The comprehensive frame difference is a statistic of the frame differences between the frame image and the comparison images and is used to measure the variation between frame images; it may be an average, a median, a variance or the like. The frame image with the smallest comprehensive frame difference is determined to be the identification image corresponding to the shot segment.
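A sketch of selecting the identification image, using the mean as the comprehensive frame difference; frame_difference is any pairwise difference function, such as the histogram difference sketched above. This is illustrative, not the patented implementation.

    import numpy as np

    def pick_identification_image(shot_frames, frame_difference):
        """Pick the frame whose average difference to all other frames in the
        shot (the 'comprehensive frame difference', here a mean) is smallest."""
        best_idx, best_score = 0, float("inf")
        for i, frame in enumerate(shot_frames):
            diffs = [frame_difference(frame, other)
                     for j, other in enumerate(shot_frames) if j != i]
            score = float(np.mean(diffs)) if diffs else 0.0
            if score < best_score:
                best_idx, best_score = i, score
        return best_idx  # index of the identification image within the shot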
Although the identification image used as shot information in the first method appears in many scripts, it mainly serves shooting staff, who use it as a reference during shooting. Because the information in an image can be overly complex and a long video is not convenient for organizing information, the second information-extraction method of this embodiment uses a shot label as the shot information. The specific process comprises the following steps:
A10: for each frame image in each shot segment, performing object recognition on the frame image to obtain a plurality of element images corresponding to the frame image and a first attribute tag corresponding to each element image.
Specifically, object recognition is a widely used class of image-recognition technology comprising various algorithms and models; this embodiment may use any object-recognition algorithm or model to implement the recognition function.
Taking an object-recognition model as an example, a model for recognizing and extracting the objects in an input image is set up in advance. For each frame image in a shot segment, the frame image is input to the object-recognition model, which extracts the region of each object in the frame image to obtain the element image corresponding to that object. Based on preset classification categories, the most likely category of each element image is determined as its first attribute tag. The first attribute tag is the top-level category of the object, such as a person tag, an animal tag or an object tag, and can be further refined into tags such as "cat", "chair", "man" and so on.
Furthermore, after the element images have been extracted from the frame images in all the shot segments, and since the plot is driven mainly by the characters, the script can be further refined by dividing the person tags into character labels. For example, a shot file may contain person A and person B, and the first attribute tags of the element images containing A and of those containing B are both person tags. All element images whose first attribute tag is a person tag are clustered to obtain several person image sets, and a character label is then determined for each person image set according to a preset character label set; the character labels distinguish the different people across the entire video file. The clustering may use K-means or a similar algorithm, and the result is used later to distinguish the characters' dialogue. The preset character label set may be user-defined labels or default labels, for example naming the characters directly as Character A and Character B; the character label corresponding to person A is then Character A and that corresponding to person B is Character B.
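A hedged sketch of the clustering step, assuming the person element images have already been converted to fixed-length embeddings (e.g., face or appearance features) by some upstream model; scikit-learn's KMeans is used here only as one possible clustering algorithm.

    from sklearn.cluster import KMeans

    def assign_character_labels(person_embeddings, n_characters, label_set=None):
        """Cluster person element-image embeddings and map each cluster to a
        character label such as 'Character A', 'Character B', ...

        person_embeddings : array of shape (num_person_crops, feature_dim),
                            assumed to be precomputed by an upstream model
        n_characters      : expected number of distinct characters
        label_set         : optional preset character label set
        """
        if label_set is None:
            label_set = [f"Character {chr(ord('A') + i)}" for i in range(n_characters)]
        kmeans = KMeans(n_clusters=n_characters, n_init=10, random_state=0)
        cluster_ids = kmeans.fit_predict(person_embeddings)
        return [label_set[c] for c in cluster_ids]  # one character label per element image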
A20: for each element image whose first attribute tag is a person tag, performing state recognition on the element image to obtain a second attribute tag corresponding to the element image.
Specifically, if a person appears in a frame image, the person's expressions and actions are key elements in advancing the plot. Therefore, when the first attribute tag of an element image is a person tag, the state of the person in the image, such as an action or an expression, needs to be recognized to obtain the second attribute tag corresponding to the element image. The second attribute tag may include an action tag and/or an expression tag.
Again taking a model as an example, the state-recognition model may be a combination of a trained action-recognition model and a trained expression-recognition model, or may be obtained by training a single model. Each element image whose first attribute tag is a person tag is input to the state-recognition model, which performs state recognition on the element image to obtain the corresponding second attribute tag. For example, if the character label corresponding to a certain element image is Character A and the corresponding second attribute tag is "smile", the information "Character A is smiling" can be extracted.
A30: de-duplicating the first attribute tags and second attribute tags corresponding to the element images to obtain the shot label corresponding to the shot segment.
Specifically, because a shot segment contains a number of frame images that closely resemble one another, the first and second attribute tags must be de-duplicated.
When the first attribute tag is an object tag or an animal tag, such as "chair", only one "chair" tag is retained even if several "chair" tags are present.
Further, the number of "chair" object tags in each frame image may be counted, for example as 3; when the count is the same in every frame image, the count associated with the object tag "chair" is determined to be 3.
When the first attribute tag is a person tag, the second attribute tags are de-duplicated per the character label corresponding to the element image.
For example, in frame-image order, the second attribute tags of the element images containing A may be smile, smile, ...; only one is retained for the same element image, reducing the number of tags in the shot label.
As shown in fig. 2, the final shot label of the shot segment may be "chair (3); Character A - raise hands, smile; Character B - smile", where the number in parentheses is the tag count.
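A possible de-duplication routine, sketched under the assumption that the per-frame tags have already been collected into simple dictionaries; the data structure and field names are illustrative, not prescribed by the patent.

    from collections import Counter

    def build_shot_label(frame_tags):
        """De-duplicate per-frame tags into one shot label string.

        frame_tags: list (one entry per frame) of dicts like
            {"objects": ["chair", "chair", "chair"],
             "persons": {"Character A": ["smile", "raise hands"],
                         "Character B": ["smile"]}}
        """
        # Object/animal tags: keep one tag; keep a count if it is stable across frames.
        per_frame_counts = [Counter(f["objects"]) for f in frame_tags]
        object_parts = []
        for tag in sorted({t for c in per_frame_counts for t in c}):
            counts = {c.get(tag, 0) for c in per_frame_counts}
            n = counts.pop() if len(counts) == 1 else max(counts | {0})
            object_parts.append(f"{tag} ({n})" if n > 1 else tag)

        # Person tags: de-duplicate the second attribute tags per character label.
        characters = {}
        for f in frame_tags:
            for character, states in f["persons"].items():
                characters.setdefault(character, set()).update(states)
        person_parts = [f"{character} - {', '.join(sorted(states))}"
                        for character, states in sorted(characters.items())]

        return "; ".join(object_parts + person_parts)

For the example above this would produce a shot label such as "chair (3); Character A - raise hands, smile; Character B - smile".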
For the shot audio, the shot audio in the shot file is processed and the information it contains, i.e., the audio information, is extracted. If the video file is a documentary, the shot audio comprises narration and background music, so song recognition and speech recognition are both performed on it, and the audio information includes the narration text, the title of the background music and similar information. If the video is an MV (Music Video), the shot audio is a song and only song recognition is performed, so the audio information is the lyrics of the song. If the video is a TV drama, the shot audio is dialogue, so speech recognition is performed and the resulting audio information is the dialogue text.
Further, still taking a TV drama as an example, the music contained in the audio may carry lyrics, and after speech recognition the dialogue text may run together with the lyric content, which hampers distinguishing the dialogue text from the lyrics. For example, in a 10 s shot, music with vocals plays for the first 3 s, and during the following 7 s Character A talks to Character B. Therefore, to improve the accuracy of the dialogue text, this embodiment provides an audio-processing method, as shown in fig. 4, as follows:
and B10, performing voice recognition on each split-mirror audio to obtain text information corresponding to the split-mirror audio.
Specifically, taking a single split-mirror audio as an example, first, voice recognition is performed on the split-mirror audio to extract information such as dialogue and voice-over in the split-mirror audio, so as to obtain text information corresponding to the split-mirror audio.
Further, sometimes a person is speaking when background music with human voice is played, and in order to improve the accuracy of text information, noise reduction processing may be performed on the segmented audio to reduce the sound of the background music, so as to obtain noise-reduced audio. And then, voice recognition is carried out on the noise reduction audio to obtain text information, so that the accuracy of the text information is improved.
The speech recognition can be based on neural network, such as long-and-short-term memory module, convolutional neural network, cyclic neural network, hidden Markov model, Gaussian mixture model, etc.
B20: when a preset music recognition model determines that music is present in the shot audio, performing song recognition on the shot audio to obtain the music information in the shot audio.
Specifically, a music recognition model is preset to judge whether music is present in the shot audio. If music is present, the previously extracted text information may contain lyrics, which need to be separated from the spoken parts. The music recognition model may be implemented on the basis of audio fingerprints: for example, an FFT (Fast Fourier Transform) is applied to the audio, extreme points in the frequency domain are taken as feature points, and extreme points within a certain frequency range are paired at intervals. A matching threshold is preset; the best-matching song is taken as the shot song corresponding to the shot audio, and when the number of matches is below the threshold it is determined that no music is present in the shot audio. Once a shot song has been matched, the music information of the shot audio is retrieved from a preset database; it may include the song title, the music lyrics and similar information.
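To make the fingerprinting idea concrete, here is a rough landmark-style sketch assuming NumPy and SciPy; the peak-picking parameters, the hash format and the match threshold are illustrative assumptions, not the patented model.

    import numpy as np
    from scipy import signal
    from scipy.ndimage import maximum_filter

    def fingerprint(audio, sample_rate, peak_neighborhood=20, fan_out=5):
        """Landmark-style audio fingerprint: spectrogram peaks paired into hashes
        (peak_freq_1, peak_freq_2, time_delta). Matching counts shared hashes."""
        freqs, times, spec = signal.spectrogram(audio, fs=sample_rate, nperseg=2048)
        log_spec = np.log(spec + 1e-10)
        # Local maxima in the time-frequency plane are the feature (extreme) points.
        local_max = maximum_filter(log_spec, size=peak_neighborhood) == log_spec
        peaks = np.argwhere(local_max & (log_spec > np.median(log_spec)))
        peaks = peaks[np.argsort(peaks[:, 1])]           # sort by time bin
        hashes = set()
        for i, (f1, t1) in enumerate(peaks):
            for f2, t2 in peaks[i + 1:i + 1 + fan_out]:  # pair with nearby later peaks
                hashes.add((int(f1), int(f2), int(t2 - t1)))
        return hashes

    def best_matching_song(shot_audio, sample_rate, song_db, match_threshold=50):
        """song_db maps song title -> precomputed fingerprint set. Returns the best
        match, or None when the match count is below the threshold (no music)."""
        query = fingerprint(shot_audio, sample_rate)
        scores = {title: len(query & fp) for title, fp in song_db.items()}
        title, score = max(scores.items(), key=lambda kv: kv[1]) if scores else (None, 0)
        return title if score >= match_threshold else None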
B30: filtering the text information according to the music lyrics to obtain the voice text corresponding to the shot audio.
Specifically, after the music lyrics are obtained, the lyric content is filtered out of the text information, leaving a voice text that contains only the characters' dialogue.
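A minimal sketch of the lyric filtering, assuming the recognized text and the lyrics are available as lists of short strings; the fuzzy-matching threshold is an arbitrary illustrative value.

    from difflib import SequenceMatcher

    def filter_lyrics(text_segments, lyric_lines, similarity=0.8):
        """Drop recognized text segments that closely match a lyric line, keeping
        only the spoken (voice) text."""
        def is_lyric(segment):
            return any(SequenceMatcher(None, segment, line).ratio() >= similarity
                       for line in lyric_lines)
        return [seg for seg in text_segments if not is_lyric(seg)]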
Further, although the extracted voice text shows the content of the dialogue in the video, it does not indicate which character is actually speaking. To provide a script with clearer content, this embodiment therefore extracts voiceprint features to determine who is speaking. The process comprises:
C10: for each shot audio, extracting the voiceprint features in the shot audio.
Specifically, the voiceprint features in each shot audio are extracted first. Voiceprint features are highly specific: even if a speaker deliberately imitates another person's voice and tone, the voiceprint remains unchanged. Voiceprint features can therefore be used to distinguish the different speakers in the shot audio. The extraction may use a Gaussian mixture model or similar techniques.
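One possible voiceprint sketch, assuming librosa for MFCC extraction and scikit-learn for the Gaussian mixture model; using the mixture means as a fixed-length signature and cosine similarity for comparison are simplifying assumptions made for illustration only.

    import numpy as np
    import librosa
    from sklearn.mixture import GaussianMixture

    def voiceprint_features(audio, sample_rate, n_mfcc=20, n_components=8):
        """A simple GMM voiceprint: fit a Gaussian mixture over the MFCC frames of
        one speech region and use its means as a fixed-length speaker signature."""
        mfcc = librosa.feature.mfcc(y=audio, sr=sample_rate, n_mfcc=n_mfcc).T  # (frames, n_mfcc)
        gmm = GaussianMixture(n_components=n_components, covariance_type="diag",
                              random_state=0).fit(mfcc)
        return gmm.means_.flatten()  # voiceprint vector for this speech region

    def same_speaker(vp_a, vp_b, threshold=0.85):
        """Compare two voiceprint vectors by cosine similarity."""
        cos = np.dot(vp_a, vp_b) / (np.linalg.norm(vp_a) * np.linalg.norm(vp_b) + 1e-10)
        return cos >= threshold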
C20: determining, according to the voiceprint features, the dialogue text in the voice text corresponding to each character label.
Specifically, after the voiceprint features are obtained, they are matched to the character labels to obtain character-voiceprint relationships. The dialogue time period corresponding to each character label is then determined from the character-voiceprint relationships and from the time period corresponding to each voiceprint feature, and the voice text within that dialogue time period is taken as the dialogue text of the character label. As to determining the character-voiceprint relationships, in a first manner: in a shot file the actual situation is that Character A speaks and the camera then cuts to Character B speaking, so the character labels of the shot file are Character A and Character B, the voiceprint features are voiceprint A and voiceprint B, and the initial relationships corresponding to the shot file are "Character A corresponds to voiceprint A", "Character B corresponds to voiceprint A" and "Character B corresponds to voiceprint B".
In the first implementation, the audio time period corresponding to each voiceprint feature is mapped to the appearance time corresponding to each character label, and the character-voiceprint relationships of the shot file are obtained as "voiceprint A corresponds to Character A" and "voiceprint B corresponds to Character B".
This approach applies only to shots in which the characters appear and speak in turn, and is not applicable in some special scenes. In the second implementation of this embodiment, the relationships between character labels and voiceprint features determined by time are treated as candidate relationships, and the character-voiceprint relationships among those candidates are then determined on the basis of all candidate relationships across the entire video file.
This embodiment is described with the following four cases as examples:
In one shot file, the actual situation is that Character A speaks, so the character label of the shot file is Character A and the voiceprint feature is voiceprint A. The candidate relationship corresponding to this shot file is "Character A corresponds to voiceprint A".
In another shot file, the actual situation is that a narrator introduces some background and then Character A speaks, so the only character label is Character A while the voiceprint features comprise voiceprint A and voiceprint C. The candidate relationships corresponding to this shot file are "Character A corresponds to voiceprint A" and "Character A corresponds to voiceprint C".
In another shot file, the actual situation is that Character A converses with Character B; the character labels are Character A and Character B, the voiceprint features comprise voiceprint A and voiceprint B, and the candidate relationships corresponding to this shot file are "Character A corresponds to voiceprint A", "Character A corresponds to voiceprint B", "Character B corresponds to voiceprint A" and "Character B corresponds to voiceprint B".
In another shot file, the scene is that Character B is speaking to Character A, but the camera stays on Character A to highlight A's psychological change through A's expression; the character label is therefore Character A while the voiceprint feature is voiceprint B, and the candidate relationship corresponding to this shot is "Character A corresponds to voiceprint B".
For these candidate relationships, the number of occurrences of each is counted and the most frequent one is selected as the character-voiceprint relationship. For example, for Character A the scenes above yield 8 candidate relationships in total, among which "Character A corresponds to voiceprint A" occurs 3 times, the most, so "Character A corresponds to voiceprint A" is determined to be the character-voiceprint relationship.
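A sketch of the counting step, assuming the candidate relationships have been collected as (character label, voiceprint id) pairs across all shot files.

    from collections import Counter

    def resolve_character_voiceprints(candidate_relations):
        """For each character label, keep the most frequent (character, voiceprint)
        pairing as the character-voiceprint relationship."""
        counts = Counter(candidate_relations)
        best = {}
        for (character, voiceprint), n in counts.items():
            if character not in best or n > best[character][1]:
                best[character] = (voiceprint, n)
        return {character: vp for character, (vp, _) in best.items()}

    # For Character A, the four example shots above contribute the pairs
    # [("Character A", "a"), ("Character A", "a"), ("Character A", "c"),
    #  ("Character A", "a"), ("Character A", "b"), ("Character A", "b")]
    # -> resolves to {"Character A": "a"}, i.e. voiceprint A occurs most often.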
When, after the voiceprint feature corresponding to each character label has been determined, there remain voiceprint features that correspond to no character label, those voiceprint features are treated as narration voiceprints, and the dialogue text corresponding to them is treated as narration text.
After the character-voiceprint relationships have been determined, the dialogue text corresponding to each character label is determined from the character-voiceprint relationships and the correspondence between the voiceprint features and the voice text.
S40: generating a storyboard script corresponding to the video file according to the shot information and the audio information.
Specifically, after the shot information and the audio information are obtained, as shown in fig. 2 and fig. 3, the shot information and audio information corresponding to each shot file are written into a preset script file in a preset script format. When the information corresponding to all the shot files has been written into the script file, the storyboard script corresponding to the video file is generated.
In the script file, the shot information includes, in addition to the shot label, the identification image, the shot video (i.e., the image track over the time period corresponding to the shot file) and the shot duration, and the audio information includes the voice text, the music title and so on.
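As an illustration of assembling the storyboard script, the following sketch writes one record per shot file to a JSON file; the field names and the JSON format are assumptions made for illustration, since the patent does not prescribe a concrete file format.

    import json

    def write_storyboard_script(shot_records, path="storyboard_script.json"):
        """shot_records: one dict per shot file, for example
            {"index": 1, "duration": 6.2,
             "shot_label": "chair (3); Character A - raise hands, smile",
             "identification_image": "shot_001.jpg",
             "voice_text": [{"character": "Character A", "text": "Hello."}],
             "music": "song title or None"}
        Writes the records in shot order as the storyboard script."""
        script = {"shots": sorted(shot_records, key=lambda r: r["index"])}
        with open(path, "w", encoding="utf-8") as f:
            json.dump(script, f, ensure_ascii=False, indent=2)
        return path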
Based on the above video scripting method, the invention further provides a terminal device, as shown in fig. 5, which comprises at least one processor (processor) 20, a display screen 21 and a memory (memory) 22, and may further comprise a communication interface (Communications Interface) 23 and a bus 24. The processor 20, the display 21, the memory 22 and the communication interface 23 communicate with one another through the bus 24. The display screen 21 is configured to display a user-guidance interface preset in the initial setting mode. The communication interface 23 may transmit information. The processor 20 may call logic instructions in the memory 22 to perform the methods of the above embodiments.
In addition, the logic instructions in the memory 22 may be implemented as software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium.
The memory 22, as a computer-readable storage medium, may be configured to store software programs and computer-executable programs, such as the program instructions or modules corresponding to the methods in the embodiments of the present disclosure. The processor 20 executes functional applications and data processing, i.e., implements the methods of the above embodiments, by running the software programs, instructions or modules stored in the memory 22.
The memory 22 may include a program storage area and a data storage area, where the program storage area may store the operating system and the applications required by at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. In addition, the memory 22 may include high-speed random-access memory and may also include non-volatile memory. For example, various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random-access memory (RAM), a magnetic disk or an optical disk, may also be used as transient computer-readable storage media.
In addition, the specific processes loaded and executed by the computer-readable storage medium and by the instruction processors in the terminal device have been described in detail in the method above and are not repeated here.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the invention, not to limit it. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents, and that such modifications or replacements do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the invention.

Claims (7)

1. A video scripting method, the method comprising:
acquiring a video file to be processed;
performing shot segmentation on the video file to obtain a plurality of shot files, wherein each shot file comprises a shot segment and shot audio;
for each shot file, performing information extraction on the shot segment in the shot file to obtain shot information corresponding to the shot segment, wherein the shot information comprises a shot label, the shot label comprises a first attribute tag and a second attribute tag, the first attribute tag comprises a person tag, the second attribute tag comprises an action tag and/or an expression tag, and, for each shot file, performing information extraction on the shot segment in the shot file to obtain the shot information corresponding to the shot segment further comprises:
clustering all element images whose first attribute tag is a person tag to obtain a plurality of person image sets;
determining a character label corresponding to each person image set according to a preset character label set; and,
performing audio processing on the shot audio in the shot file to obtain audio information corresponding to the shot audio, wherein the audio information comprises music information and/or voice information, and performing audio processing on the shot audio in the shot file to obtain the audio information corresponding to the shot audio comprises:
performing speech recognition on each shot audio to obtain text information corresponding to the shot audio;
when a preset music recognition model determines that music is present in the shot audio, performing song recognition on the shot audio to obtain the music information in the shot audio, wherein the music information comprises music lyrics;
filtering the text information according to the music lyrics to obtain a voice text corresponding to the shot audio;
for each shot audio, extracting voiceprint features in the shot audio;
determining, according to the voiceprint features, the dialogue text in the voice text corresponding to each character label;
wherein determining, according to the voiceprint features, the dialogue text in the voice text corresponding to each character label comprises:
matching the voiceprint features to the character labels to obtain character-voiceprint relationships, wherein matching the voiceprint features to the character labels to obtain the character-voiceprint relationships comprises:
generating a plurality of candidate relationships corresponding to each character label according to the times corresponding to the character labels and to the voiceprint features;
selecting, for each character label, the candidate relationship with the largest count as the character-voiceprint relationship corresponding to that character label;
determining a dialogue time period corresponding to each character label according to the character-voiceprint relationships and the time period corresponding to each voiceprint feature;
for each character label, taking the voice text within the dialogue time period corresponding to the character label as the dialogue text corresponding to the character label;
and generating a storyboard script corresponding to the video file according to the shot information and the audio information.
2. The video scripting method of claim 1, wherein performing shot segmentation on the video file to obtain a plurality of shot files comprises:
splitting the video file into an image track and an audio track;
calculating a frame difference between the Nth frame image and the (N+1)th frame image in the image track, wherein N is a positive integer less than or equal to M and M is the number of frame images in the image track;
when the frame difference is greater than a preset frame-difference threshold, determining that the Nth frame image is a shot end frame and the (N+1)th frame image is a shot start frame, wherein the first frame image is a shot start frame and the Mth frame image is a shot end frame;
and splitting the image track according to the shot start frames and shot end frames to obtain a plurality of shot segments, and splitting the audio track to obtain the shot audio corresponding to each shot segment.
3. The video scripting method of claim 2, wherein calculating the frame difference between the Nth frame image and the (N+1)th frame image in the image track comprises:
calculating a first color-level histogram corresponding to the Nth frame image and a second color-level histogram corresponding to the (N+1)th frame image;
and calculating the difference area between the first color-level histogram and the second color-level histogram to obtain the frame difference between the Nth frame image and the (N+1)th frame image.
4. The video scripting method of claim 3, wherein, for each shot file, extracting information from the shot segment in the shot file to obtain the shot information corresponding to the shot segment comprises:
for each frame image in each shot segment, performing object recognition on the frame image to obtain a plurality of element images corresponding to the frame image and a first attribute tag corresponding to each element image;
for each element image whose first attribute tag is a person tag, performing state recognition on the element image to obtain a second attribute tag corresponding to the element image;
and de-duplicating the first attribute tags and second attribute tags corresponding to the element images to obtain the shot label corresponding to the shot segment.
5. The video scripting method of claim 4, wherein the first attribute tag further comprises an animal tag and/or an object tag.
6. A computer-readable storage medium storing one or more programs executable by one or more processors to implement the steps of the video scripting method according to any one of claims 1 to 5.
7. A terminal device, comprising: a processor, a memory and a communication bus, the memory storing a computer-readable program executable by the processor;
the communication bus implementing connection and communication between the processor and the memory;
and the processor, when executing the computer-readable program, implementing the steps of the video scripting method according to any one of claims 1 to 5.
CN202111519420.7A 2021-12-14 2021-12-14 Video scripting method Active CN113923521B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111519420.7A CN113923521B (en) 2021-12-14 2021-12-14 Video scripting method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111519420.7A CN113923521B (en) 2021-12-14 2021-12-14 Video scripting method

Publications (2)

Publication Number Publication Date
CN113923521A CN113923521A (en) 2022-01-11
CN113923521B true CN113923521B (en) 2022-03-08

Family

ID=79249146

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111519420.7A Active CN113923521B (en) 2021-12-14 2021-12-14 Video scripting method

Country Status (1)

Country Link
CN (1) CN113923521B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114465737B (en) * 2022-04-13 2022-06-24 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN115883818B (en) * 2022-11-29 2023-09-19 北京优酷科技有限公司 Video frame number automatic counting method and device, electronic equipment and storage medium

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530652A (en) * 2013-10-23 2014-01-22 北京中视广信科技有限公司 Face clustering based video categorization method and retrieval method as well as systems thereof
CN111629230A (en) * 2020-05-29 2020-09-04 北京市商汤科技开发有限公司 Video processing method, script generating method, device, computer equipment and storage medium
CN112487248A (en) * 2020-12-01 2021-03-12 深圳市易平方网络科技有限公司 Video file label generation method and device, intelligent terminal and storage medium
CN112562721A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, device and storage medium
CN113035199A (en) * 2021-02-01 2021-06-25 深圳创维-Rgb电子有限公司 Audio processing method, device, equipment and readable storage medium
CN113572976A (en) * 2021-02-05 2021-10-29 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and readable storage medium
CN113596283A (en) * 2021-07-28 2021-11-02 杭州更火数字科技有限公司 Video customization method and system and electronic equipment
CN113641859A (en) * 2021-10-18 2021-11-12 阿里巴巴达摩院(杭州)科技有限公司 Script generation method, system, computer storage medium and computer program product

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP0899737A3 (en) * 1997-08-18 1999-08-25 Tektronix, Inc. Script recognition using speech recognition
CN101620629A (en) * 2009-06-09 2010-01-06 中兴通讯股份有限公司 Method and device for extracting video index and video downloading system
US10218954B2 (en) * 2013-08-15 2019-02-26 Cellular South, Inc. Video to data
KR101994592B1 (en) * 2018-10-19 2019-06-28 인하대학교 산학협력단 AUTOMATIC VIDEO CONTENT Metadata Creation METHOD AND SYSTEM
CN111526382B (en) * 2020-04-20 2022-04-29 广东小天才科技有限公司 Live video text generation method, device, equipment and storage medium
CN113283327A (en) * 2021-05-17 2021-08-20 多益网络有限公司 Video text generation method, device, equipment and storage medium

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103530652A (en) * 2013-10-23 2014-01-22 北京中视广信科技有限公司 Face clustering based video categorization method and retrieval method as well as systems thereof
CN111629230A (en) * 2020-05-29 2020-09-04 北京市商汤科技开发有限公司 Video processing method, script generating method, device, computer equipment and storage medium
CN112562721A (en) * 2020-11-30 2021-03-26 清华珠三角研究院 Video translation method, system, device and storage medium
CN112487248A (en) * 2020-12-01 2021-03-12 深圳市易平方网络科技有限公司 Video file label generation method and device, intelligent terminal and storage medium
CN113035199A (en) * 2021-02-01 2021-06-25 深圳创维-Rgb电子有限公司 Audio processing method, device, equipment and readable storage medium
CN113572976A (en) * 2021-02-05 2021-10-29 腾讯科技(深圳)有限公司 Video processing method and device, electronic equipment and readable storage medium
CN113596283A (en) * 2021-07-28 2021-11-02 杭州更火数字科技有限公司 Video customization method and system and electronic equipment
CN113641859A (en) * 2021-10-18 2021-11-12 阿里巴巴达摩院(杭州)科技有限公司 Script generation method, system, computer storage medium and computer program product

Also Published As

Publication number Publication date
CN113923521A (en) 2022-01-11

Similar Documents

Publication Publication Date Title
CN106980624B (en) Text data processing method and device
CN110517689B (en) Voice data processing method, device and storage medium
US10108709B1 (en) Systems and methods for queryable graph representations of videos
CN113923521B (en) Video scripting method
WO2023197979A1 (en) Data processing method and apparatus, and computer device and storage medium
CN111785275A (en) Voice recognition method and device
US11120802B2 (en) Diarization driven by the ASR based segmentation
CN108470188B (en) Interaction method based on image analysis and electronic equipment
CN113395578A (en) Method, device and equipment for extracting video theme text and storage medium
CN111402892A (en) Conference recording template generation method based on voice recognition
CN112507311A (en) High-security identity verification method based on multi-mode feature fusion
CN110750996A (en) Multimedia information generation method and device and readable storage medium
CN111488813B (en) Video emotion marking method and device, electronic equipment and storage medium
CN111681678A (en) Method, system, device and storage medium for automatically generating sound effect and matching video
CN114281948A (en) Summary determination method and related equipment thereof
CN117152308B (en) Virtual person action expression optimization method and system
CN116708055B (en) Intelligent multimedia audiovisual image processing method, system and storage medium
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN113691382A (en) Conference recording method, conference recording device, computer equipment and medium
CN113762056A (en) Singing video recognition method, device, equipment and storage medium
CN111681676A (en) Method, system and device for identifying and constructing audio frequency by video object and readable storage medium
Hukkeri et al. Erratic navigation in lecture videos using hybrid text based index point generation
CN115512692B (en) Voice recognition method, device, equipment and storage medium
CN110444053B (en) Language learning method, computer device and readable storage medium
CN111681677B (en) Video object sound effect construction method, system, device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CP03 Change of name, title or address
Address after: 518000 Building 1901, 1902, 1903, Qianhai Kexing Science Park, Labor Community, Xixiang Street, Bao'an District, Shenzhen, Guangdong Province
Patentee after: Shenzhen Flash Scissor Intelligent Technology Co.,Ltd.
Address before: 518000 Unit 9ABCDE, Building 2, Haihong Industrial Plant Phase II, Haihong Industrial Plant, West Side of Xixiang Avenue, Labor Community, Xixiang Street, Bao'an District, Shenzhen, Guangdong
Patentee before: Shenzhen big brother Technology Co.,Ltd.