CN115272533A - Intelligent image-text-to-video conversion method and system based on video structured data - Google Patents

Intelligent image-text-to-video conversion method and system based on video structured data

Info

Publication number
CN115272533A
Authority
CN
China
Prior art keywords
video
text
identification
processing
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210907146.9A
Other languages
Chinese (zh)
Inventor
陈鹏 (Chen Peng)
张华伟 (Zhang Huawei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New One Beijing Technology Co ltd
Original Assignee
New One Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New One Beijing Technology Co ltd
Priority to CN202210907146.9A
Publication of CN115272533A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses an intelligent image-text-to-video conversion method and system based on video structured data. The method comprises: performing transcoding, shot segmentation and content recognition on the video files in a large-scale video data set, manually reviewing and correcting inaccurate recognition results, and storing the results in a database in structured form to generate a material library; processing the text submitted by a user according to its text type, and performing paragraph splitting, keyword extraction and named entity recognition on the processing result; generating a voice-over and a corresponding subtitle file based on the processing result; matching the keyword extraction and named entity recognition results with the materials in the material library to obtain the best matching material; and merging the best matching material with the voice-over and subtitle file. The advantages are that the tedious and time-consuming material sorting and production steps of the video production process are removed and video production efficiency is improved.

Description

Intelligent image-text-to-video conversion method and system based on video structured data
Technical Field
The invention relates to the technical fields of computer video synthesis and artificial-intelligence content generation, and in particular to an intelligent image-text-to-video conversion method and system based on video structured data.
Background
For text authors, video production is a specialist field with a high barrier to entry. Traditional video production requires writing a script, collecting and organizing material and building a material library, and then completing the video through rough cutting, fine cutting, track synthesis, proofreading and other steps. The whole process is time-consuming and tedious and cannot meet the demands of today's explosion of video information.
Disclosure of Invention
The invention aims to provide an intelligent image-text-to-video conversion method and system based on video structured data, so as to solve the problems existing in the prior art.
In order to achieve this purpose, the technical solution adopted by the invention is as follows:
An intelligent image-text-to-video conversion method based on video structured data comprises the following steps:
s1, establishing a material library:
transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
s2, image-text analysis:
the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
s3, material matching:
the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
s4, video synthesis:
the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
Preferably, step S1 specifically comprises the following steps:
s11, converting each video file stream into a stream of a preset format, so as to transcode all video files in the large-scale video data set;
s12, judging whether the transcoded video file needs shot segmentation; if so, segmenting it into short video clips and entering step S13; otherwise, entering step S13 directly;
when the cosine similarity of two adjacent frames of the video file is greater than or equal to a similarity threshold, segmentation is required; otherwise, no segmentation is required;
s13, identifying the shots, characters, scenes, events, objects and subtitles appearing in the short video clips;
s14, judging the accuracy of the recognition results, manually reviewing and correcting the inaccurate recognition results, and adding subjective description information including time, place, people and events;
s15, storing the accurate recognition results and the manually reviewed recognition results in a database in structured json format, and generating the material library.
Preferably, step S13 specifically comprises constructing a plurality of deep learning models, including shot recognition, face recognition, OCR recognition and speech recognition, by means of deep convolutional neural networks, and using these deep learning models to extract the shot information, character information, scene information, event information, object information and subtitle information of the short video clips.
Preferably, step S14 specifically comprises the following steps:
s141, capturing key frames from the short video clips to obtain pictures of characters, scenes and objects, computing the similarity between the character pictures, scene pictures and object pictures and the sample pictures in a face gallery, a scene gallery and an object gallery respectively, and judging whether the similarity score is greater than or equal to a score threshold; if so, the recognition is accurate; otherwise, the recognition is inaccurate;
s142, manually reviewing the inaccurately recognized information, correcting it manually, and adding subjective description information that the computer cannot recognize.
Preferably, step S2 specifically comprises the following steps:
s21, judging whether the text submitted by the user is plain text or a web page link; if it is plain text, entering step S22 directly; if it is a web page link, extracting the image-text content of the web page, formatting the extracted content to remove html tags and meaningless characters, and then entering step S22;
s22, splitting the text into paragraphs, and extracting keywords and named entities from the split text;
s23, converting the text processed in step S21 into a voice-over by speech synthesis technology, and generating a corresponding subtitle file.
Preferably, step S22 specifically comprises the following steps:
s221, extracting key sentences of the text with the TextRank algorithm, and splitting the text into paragraphs according to the key sentences;
s222, extracting keywords and named entities from the split paragraphs, wherein the keywords include time, scenes, people and events, and the named entities include person names, place names, organization names and verbs; and merging paragraphs whose word count does not reach the word-count threshold or from which no keywords and/or named entities have been extracted.
Preferably, step S23 specifically comprises synthesizing speech by speech synthesis technology: the text processed in step S21 is grammatically parsed, the subject, predicate and object are extracted and converted into speech waveforms, the speech waveforms are concatenated into complete audio by a time-domain waveform concatenation technique based on the PSOLA method, the audio is used as the voice-over, and the corresponding subtitle file is generated.
Preferably, step S3 specifically comprises performing text semantic matching between the materials in the material library and the processing result of step S22: text semantic similarity is computed separately for the four key elements of time, place, people and events, the materials are sorted in descending order of the similarity results, and the top-ranked material is taken as the best matching material.
Preferably, the text semantic similarity calculation for the key elements specifically comprises the following steps:
s31, representing each of the two texts as vectors and converting the texts into vector matrices;
s32, processing the two texts separately and encoding them with a deep neural network to obtain a composite representation of each text;
the composite representation comprises: a token embedding stage, in which the words are processed and each word is converted into a vector of fixed dimension; a segment embedding stage, in which the sentences are processed and a representation of each sentence is extracted; and a position embedding stage, which handles the same word appearing in different positions; the representations of the three stages are added element-wise to obtain the composite representation;
s33, computing the cosine similarity of the composite representations of the two texts to obtain the similarity result.
Another object of the invention is to provide an intelligent image-text-to-video conversion system based on video structured data, for implementing any one of the above methods, the system comprising:
a material library module: transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
an image-text analysis module: the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
a material matching module: the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
a video synthesis module: the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
The beneficial effects of the invention are as follows: 1. The video content is structurally analysed by a combination of manual and intelligent processing, and based on the parsed data the image-text content is converted into video output by the algorithm, which removes the tedious and time-consuming material sorting and production steps of the video production process and improves video production efficiency. 2. The barrier for text authors to produce videos is lowered: there is no need to write scripts, sort materials, rough-cut, fine-cut, synthesize audio tracks, proofread or dub; once the text is written, it can be converted into a video with one click. 3. The difficulty of searching for materials is reduced for video creators: the ideal material can be located quickly and accurately within massive video content, shortening the video production cycle.
Drawings
FIG. 1 is a flow chart of the image-text-to-video conversion method according to an embodiment of the present invention;
FIG. 2 is a flow chart of the establishment of a material library in an embodiment of the present invention;
FIG. 3 is a flow chart of graphics context analysis in an embodiment of the present invention;
FIG. 4 is a flow chart of material matching in an embodiment of the present invention;
fig. 5 is a block diagram of the image-text-to-video conversion system according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Example one
As shown in fig. 1, this embodiment provides an intelligent image-text-to-video conversion method based on video structured data, which comprises four parts: establishing a material library, image-text analysis, material matching, and video synthesis. The four parts are described below.
1. Establishing a material library
Transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate the material library.
As shown in fig. 2, this part specifically includes the following steps.
1. Transcoding the video files:
The stream of each video file is converted into a stream of a preset format, so that all video files in the large-scale video data set are transcoded; for example, all video file streams are converted into MP4 format.
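As a minimal sketch of this batch transcoding step, assuming ffmpeg is installed and using hypothetical directory names (the patent does not name a specific tool or codec profile), the conversion to MP4 could be driven from Python:

```python
import subprocess
from pathlib import Path

def transcode_to_mp4(src_dir: str, dst_dir: str) -> None:
    """Transcode every video file found in src_dir into MP4 under dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in Path(src_dir).glob("*"):
        if src.suffix.lower() not in {".avi", ".mov", ".mkv", ".flv", ".mp4"}:
            continue
        dst = out / (src.stem + ".mp4")
        # Re-encode video to H.264 and audio to AAC, a common MP4 profile.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src),
             "-c:v", "libx264", "-c:a", "aac", str(dst)],
            check=True,
        )

# transcode_to_mp4("raw_videos", "transcoded_videos")  # hypothetical directories
```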
2. Shot segmentation:
Whether the transcoded video file needs shot segmentation is judged; if so, it is segmented into short video clips and step 3 is entered; otherwise, step 3 is entered directly. Shot segmentation compares each pair of adjacent frames of the video file and computes the cosine similarity of the two frames; if the similarity is greater than or equal to the similarity threshold, the video is segmented at that point, so that a long video is cut into a series of short video clips; otherwise, the video file is treated as a short video clip without segmentation.
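A minimal sketch of the frame comparison, assuming OpenCV is available and using downsampled grayscale frames as the frame vectors (the patent does not specify the frame representation or the threshold value). Note that in common practice a shot boundary corresponds to a drop in similarity between adjacent frames, so the sketch marks a cut where the cosine similarity falls below the threshold:

```python
import cv2
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a = a.ravel().astype(np.float32)
    b = b.ravel().astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def find_shot_boundaries(video_path: str, threshold: float = 0.9) -> list[float]:
    """Return timestamps (seconds) where the similarity of adjacent frames drops below the threshold."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    boundaries: list[float] = []
    prev = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, (64, 64)), cv2.COLOR_BGR2GRAY)
        if prev is not None and cosine_similarity(prev, gray) < threshold:
            boundaries.append(idx / fps)
        prev, idx = gray, idx + 1
    cap.release()
    return boundaries
```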
3. Video content understanding for the short video clips:
The shots, characters, scenes, events, objects and subtitles appearing in the short video clips are identified.
A plurality of deep learning models, including shot recognition, face recognition, OCR recognition and speech recognition, are constructed with deep convolutional neural networks, and these models are used to extract the shot information, character information, scene information, event information, object information and subtitle information of the short video clips (specifically, the text, speech, objects, people and actions in the video are recognized).
4. Judging the accuracy of the recognition results and manual review:
The accuracy of the recognition results is judged, the inaccurate recognition results are manually reviewed and corrected, and subjective description information including time, place, people and events is added.
This step specifically comprises the following sub-steps.
4.1. Judging the accuracy of the recognition results:
Key frames are captured from the short video clips to obtain pictures of characters, scenes and objects; the similarity between the character pictures, scene pictures and object pictures and the sample pictures in the face gallery, scene gallery and object gallery is computed respectively, and whether the similarity score is greater than or equal to the score threshold is judged; if so, the recognition is accurate; otherwise, the recognition is inaccurate. (A sketch of this check is given after sub-step 4.2.)
4.2. Manual review:
The inaccurately recognized information is manually reviewed, whether the description information extracted by the computer is correct is confirmed, the description information is corrected manually, and subjective description information that the computer cannot recognize is added. The manual review mainly corrects key elements such as time, place, people and events, and manually fixes wrongly recognized information.
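A minimal sketch of the accuracy check in sub-step 4.1, assuming the recognition models already produce an embedding vector per detected face/scene/object and that each gallery stores sample embeddings; the function names, labels and threshold value are illustrative assumptions, not part of the patent:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def check_recognition(detected: dict[str, np.ndarray],
                      gallery: dict[str, np.ndarray],
                      score_threshold: float = 0.8) -> list[dict]:
    """Compare each detected item's embedding with its gallery sample;
    items scoring below the threshold are flagged as inaccurate for manual review."""
    report = []
    for label, emb in detected.items():
        sample = gallery.get(label)
        score = cosine(emb, sample) if sample is not None else 0.0
        report.append({
            "label": label,
            "score": round(score, 3),
            "accurate": score >= score_threshold,  # inaccurate items go to manual review
        })
    return report
```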
5. Establishing the material library:
The accurate recognition results and the manually reviewed recognition results are stored in the database in structured json format, forming the material library.
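The patent does not give the json schema; as an illustrative assumption, one structured record per short video clip could look like the following (all field names and values are hypothetical):

```python
import json

# Hypothetical structured record for one short video clip in the material library.
clip_record = {
    "clip_id": "clip_000123",
    "source_file": "transcoded_videos/news_20220729.mp4",
    "start": 12.4,              # seconds within the source video
    "end": 18.9,
    "shots": ["medium shot"],
    "people": ["anchor"],
    "scene": "news studio",
    "events": ["opening remarks"],
    "objects": ["desk", "microphone"],
    "subtitles": "Good evening, welcome to the news.",
    "manual_notes": "time: evening; place: studio",  # subjective description added by reviewers
}
print(json.dumps(clip_record, ensure_ascii=False, indent=2))
```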
2. Image-text analysis
The text submitted by the user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and the corresponding subtitle file are then generated based on the processing result.
As shown in fig. 3, this part specifically includes the following steps.
1. Judging the text submitted by the user:
After the user submits content, whether the submitted text is plain text or a web page link is first judged; if it is plain text, step 2 is entered directly; if it is a web page link, the image-text content of the web page is extracted, the extracted content is formatted to remove html tags and meaningless characters, and then step 2 is entered.
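A minimal sketch of the web page branch, assuming the requests and beautifulsoup4 packages are acceptable choices (the patent does not name a specific extraction library):

```python
import re
import requests
from bs4 import BeautifulSoup

def extract_article_text(url: str) -> str:
    """Fetch a web page and return its visible text with html tags removed."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()                      # drop non-content elements
    text = soup.get_text(separator="\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of whitespace
    text = re.sub(r"\n{2,}", "\n", text)     # drop empty lines left by the markup
    return text.strip()
```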
2. Text paragraph splitting, keyword extraction and named entity extraction:
The text is split into paragraphs, and keywords and named entities are extracted from the split text.
This step specifically comprises the following sub-steps (a code sketch follows after sub-step 2.2).
2.1. Key sentences of the text are extracted with the TextRank algorithm, and the text is split into paragraphs according to the key sentences. TextRank is a graph-based weighting algorithm for the sentences of a text: following a voting principle, each node votes for its neighbouring nodes, the weight of each vote depends on the number of votes the voter itself holds, the key sentences are ranked by vote count, and the article is then split into paragraphs according to these key sentences.
2.2. Keywords and named entities are extracted from the split paragraphs, where the keywords include time, scenes, people and events, and the named entities include person names, place names, organization names and verbs; paragraphs that do not reach the word-count threshold (e.g. fewer than 30 words) or from which no keywords and/or named entities were extracted are merged.
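A minimal sketch of sub-steps 2.1 and 2.2 for whitespace-tokenized text, assuming networkx is available and using word-overlap as the sentence similarity measure and simple threshold-based merging (the patent does not fix these choices; a Chinese tokenizer would be needed for Chinese input):

```python
import re
import networkx as nx

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"[。！？!?\.]\s*", text) if s.strip()]

def key_sentences(text: str, top_k: int = 3) -> list[str]:
    """Rank sentences with TextRank (PageRank over a word-overlap similarity graph)."""
    sents = split_sentences(text)
    graph = nx.Graph()
    for i, a in enumerate(sents):
        for j, b in enumerate(sents[i + 1:], start=i + 1):
            overlap = len(set(a.split()) & set(b.split()))
            if overlap:
                graph.add_edge(i, j, weight=overlap)
    scores = nx.pagerank(graph) if graph.number_of_edges() else {i: 1.0 for i in range(len(sents))}
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sents[i] for i in sorted(ranked)]   # keep original sentence order

def merge_short_paragraphs(paragraphs: list[str], min_words: int = 30) -> list[str]:
    """Merge paragraphs below the word-count threshold into the previous paragraph."""
    merged: list[str] = []
    for p in paragraphs:
        if merged and len(p.split()) < min_words:
            merged[-1] = merged[-1] + " " + p
        else:
            merged.append(p)
    return merged
```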
3. Voice-over and subtitle file generation:
The text processed in step 1 is converted into a voice-over by speech synthesis technology, and the corresponding subtitle file is generated.
The sound is synthesized by speech synthesis: the text processed in step 1 is grammatically parsed, the subject, predicate and object are extracted and converted into speech waveforms, the waveforms are concatenated into complete audio by a time-domain waveform concatenation technique based on the PSOLA method, the audio serves as the voice-over, and the corresponding subtitle file is generated.
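PSOLA-based waveform concatenation is a full signal-processing routine beyond a short sketch, but the subtitle side can be illustrated. Below is a minimal sketch that writes an SRT file from paragraph texts and per-paragraph audio durations; the duration values would come from the synthesized audio, and all names are illustrative assumptions:

```python
def to_timestamp(seconds: float) -> str:
    h, rem = divmod(int(seconds * 1000), 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(paragraphs: list[str], durations: list[float], path: str) -> None:
    """Write one SRT cue per paragraph, laid out back-to-back along the audio timeline."""
    t = 0.0
    with open(path, "w", encoding="utf-8") as f:
        for i, (text, dur) in enumerate(zip(paragraphs, durations), start=1):
            f.write(f"{i}\n{to_timestamp(t)} --> {to_timestamp(t + dur)}\n{text}\n\n")
            t += dur

# write_srt(["First paragraph...", "Second paragraph..."], [6.2, 4.8], "voiceover.srt")
```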
3. Material matching
The paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material.
As shown in fig. 4, this part specifically comprises performing text semantic matching between the materials in the material library and the processing result of step 2.2: text semantic similarity is computed separately for the four key elements of time, place, people and events, the materials are sorted in descending order of the similarity results, and the top-ranked material is taken as the best matching material.
The text semantic similarity calculation for the key elements specifically comprises the following sub-steps (a code sketch follows after sub-step 3.3).
3.1. Each of the two texts is represented as vectors and converted into a vector matrix;
3.2. the two texts are processed separately and encoded by a deep neural network (encode) to obtain a composite representation (embedding) of each text;
the composite representation is the sum of three representations: (1) a token embedding stage, in which the words are processed and each word is converted into a vector of fixed dimension; (2) a segment embedding stage, in which the sentences are processed and a representation of each sentence is extracted; (3) a position embedding stage, which handles the same word appearing in different positions; the representations of the three stages are added element-wise to obtain the composite representation;
3.3. the cosine similarity of the composite representations of the two texts is computed to obtain the similarity result.
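A minimal sketch of the matching loop, assuming each text is reduced to a single embedding vector by some encoder; the token/segment/position embeddings described above correspond to a BERT-style input layer, and the encode function below is an assumed stand-in for it. Aggregating the four element scores by a simple sum is also an assumption: the patent only states that similarity is computed per element and the materials are then ranked in descending order.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def best_match(paragraph_elements: dict[str, str],
               materials: list[dict],
               encode) -> dict:
    """Score each material over the four key elements (time, place, people, events),
    sort in descending order of the total score, and return the top-ranked material."""
    keys = ["time", "place", "people", "events"]
    para_vecs = {k: encode(paragraph_elements.get(k, "")) for k in keys}
    scored = []
    for m in materials:
        score = sum(cosine(para_vecs[k], encode(str(m.get(k, "")))) for k in keys)
        scored.append((score, m))
    scored.sort(key=lambda x: x[0], reverse=True)   # descending similarity
    return scored[0][1]
```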
4. Video synthesis
The best matching material is merged with the voice-over and the corresponding subtitle file by a video synthesis algorithm (e.g. the ffmpeg program) to generate a complete video file.
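A minimal sketch of this final merge with ffmpeg, assuming one matched clip, one voice-over audio file and one SRT subtitle file; the paths and filter choices are illustrative, and the subtitles filter requires an ffmpeg build with libass:

```python
import subprocess

def compose_video(clip: str, voiceover: str, subtitles: str, output: str) -> None:
    """Replace the clip's audio with the voice-over and burn the subtitles into the picture."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", clip, "-i", voiceover,
         "-vf", f"subtitles={subtitles}",    # burn the SRT into the video frames
         "-map", "0:v", "-map", "1:a",       # video from the clip, audio from the voice-over
         "-c:v", "libx264", "-c:a", "aac",
         "-shortest", output],
        check=True,
    )

# compose_video("clip_000123.mp4", "voiceover.wav", "voiceover.srt", "final_video.mp4")
```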
Example two
As shown in fig. 5, this embodiment provides an intelligent image-text-to-video conversion system based on video structured data, for implementing the above method, the system comprising:
a material library module: transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
an image-text analysis module: the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
a material matching module: the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
a video synthesis module: the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
By adopting the technical solution disclosed by the invention, the following beneficial effects are obtained:
The invention provides an intelligent image-text-to-video conversion method and system based on video structured data, which structurally analyse the video content by a combination of manual and intelligent processing and, based on the parsed data, convert the image-text content into video output by an algorithm, thereby removing the tedious and time-consuming material sorting and production steps of the video production process and improving video production efficiency. The barrier for text authors to produce videos is lowered: no script writing, material sorting, rough cutting, fine cutting, audio track synthesis, proofreading or dubbing is needed; once the text is written, it can be converted into a video with one click. The difficulty of searching for materials is reduced for video creators: the ideal material can be located quickly and accurately within massive video content, and the video production cycle is shortened.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (10)

1. An intelligent image-text-to-video conversion method based on video structured data, characterized by comprising the following steps:
s1, establishing a material library:
transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
s2, image-text analysis:
the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
s3, material matching:
the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
s4, video synthesis:
the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
2. The intelligent image-text-to-video conversion method based on video structured data according to claim 1, wherein step S1 specifically comprises the following steps:
s11, converting each video file stream into a stream of a preset format, so as to transcode all video files in the large-scale video data set;
s12, judging whether the transcoded video file needs shot segmentation; if so, segmenting it into short video clips and entering step S13; otherwise, entering step S13 directly;
when the cosine similarity of two adjacent frames of the video file is greater than or equal to a similarity threshold, segmentation is required; otherwise, no segmentation is required;
s13, identifying the shots, characters, scenes, events, objects and subtitles appearing in the short video clips;
s14, judging the accuracy of the recognition results, manually reviewing and correcting the inaccurate recognition results, and adding subjective description information including time, place, people and events;
s15, storing the accurate recognition results and the manually reviewed recognition results in a database in structured json format to generate the material library.
3. The intelligent image-text-to-video conversion method based on video structured data according to claim 2, wherein step S13 specifically comprises constructing a plurality of deep learning models, including shot recognition, face recognition, OCR recognition and speech recognition, by means of deep convolutional neural networks, and using these deep learning models to extract the shot information, character information, scene information, event information, object information and subtitle information of the short video clips.
4. The intelligent image-text-to-video conversion method based on video structured data according to claim 2, wherein step S14 specifically comprises the following steps:
s141, capturing key frames from the short video clips to obtain pictures of characters, scenes and objects, computing the similarity between the character pictures, scene pictures and object pictures and the sample pictures in a face gallery, a scene gallery and an object gallery respectively, and judging whether the similarity score is greater than or equal to a score threshold; if so, the recognition is accurate; otherwise, the recognition is inaccurate;
s142, manually reviewing the inaccurately recognized information, correcting it manually, and adding subjective description information that the computer cannot recognize.
5. The intelligent image-text-to-video conversion method based on video structured data according to claim 1, wherein step S2 specifically comprises the following steps:
s21, judging whether the text submitted by the user is plain text or a web page link; if it is plain text, entering step S22 directly; if it is a web page link, extracting the image-text content of the web page, formatting the extracted content to remove html tags and meaningless characters, and then entering step S22;
s22, splitting the text into paragraphs, and extracting keywords and named entities from the split text;
s23, converting the text processed in step S21 into a voice-over by speech synthesis technology, and generating a corresponding subtitle file.
6. The intelligent image-text-to-video conversion method based on video structured data according to claim 5, wherein step S22 specifically comprises the following steps:
s221, extracting key sentences of the text with the TextRank algorithm, and splitting the text into paragraphs according to the key sentences;
s222, extracting keywords and named entities from the split paragraphs, wherein the keywords include time, scenes, people and events, and the named entities include person names, place names, organization names and verbs; and merging paragraphs whose word count does not reach the word-count threshold or from which no keywords and/or named entities have been extracted.
7. The intelligent image-text-to-video conversion method based on video structured data according to claim 5, wherein step S23 specifically comprises synthesizing speech by speech synthesis technology: the text processed in step S21 is grammatically parsed, the subject, predicate and object are extracted and converted into speech waveforms, the speech waveforms are concatenated into complete audio by a time-domain waveform concatenation technique based on the PSOLA method, the audio is used as the voice-over, and the corresponding subtitle file is generated.
8. The intelligent image-text-to-video conversion method based on video structured data according to claim 5, wherein step S3 specifically comprises performing text semantic matching between the materials in the material library and the processing result of step S22: text semantic similarity is computed separately for the four key elements of time, place, people and events, the materials are sorted in descending order of the similarity results, and the top-ranked material is taken as the best matching material.
9. The intelligent image-text-to-video conversion method based on video structured data according to claim 8, wherein the text semantic similarity calculation for the key elements specifically comprises the following steps:
s31, representing each of the two texts as vectors and converting the texts into vector matrices;
s32, processing the two texts separately and encoding them with a deep neural network to obtain a composite representation of each text;
the composite representation comprises: a token embedding stage, in which the words are processed and each word is converted into a vector of fixed dimension; a segment embedding stage, in which the sentences are processed and a representation of each sentence is extracted; and a position embedding stage, which handles the same word appearing in different positions; the representations of the three stages are added element-wise to obtain the composite representation;
s33, computing the cosine similarity of the composite representations of the two texts to obtain the similarity result.
10. An intelligent image-text-to-video conversion system based on video structured data, characterized in that the system is used for implementing the method of any one of claims 1 to 9, the system comprising:
a material library module: transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
an image-text analysis module: the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
a material matching module: the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
a video synthesis module: the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
CN202210907146.9A 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data Pending CN115272533A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210907146.9A CN115272533A (en) 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210907146.9A CN115272533A (en) 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data

Publications (1)

Publication Number Publication Date
CN115272533A true CN115272533A (en) 2022-11-01

Family

ID=83770787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210907146.9A Pending CN115272533A (en) 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data

Country Status (1)

Country Link
CN (1) CN115272533A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115689833A (en) * 2022-12-29 2023-02-03 成都华栖云科技有限公司 Intelligent teaching spatial mode construction method based on multi-dimensional perception and pervasive computing
CN117082293A (en) * 2023-10-16 2023-11-17 成都华栖云科技有限公司 Automatic video generation method and device based on text creative
CN117082293B (en) * 2023-10-16 2023-12-19 成都华栖云科技有限公司 Automatic video generation method and device based on text creative

Similar Documents

Publication Publication Date Title
US11776267B2 (en) Intelligent cataloging method for all-media news based on multi-modal information fusion understanding
CN115272533A (en) Intelligent image-text video conversion method and system based on video structured data
CN107608960B (en) Method and device for linking named entities
CN101382937A (en) Multimedia resource processing method based on speech recognition and on-line teaching system thereof
Poignant et al. Unsupervised speaker identification in TV broadcast based on written names
CN112784078A (en) Video automatic editing method based on semantic recognition
JP2005167452A (en) Video scene interval information extracting method, apparatus, program, and recording medium with program recorded thereon
CN114547370A (en) Video abstract extraction method and system
CN116092472A (en) Speech synthesis method and synthesis system
CN111681678A (en) Method, system, device and storage medium for automatically generating sound effect and matching video
CN114547373A (en) Method for intelligently identifying and searching programs based on audio
CN114048335A (en) Knowledge base-based user interaction method and device
CN116958997B (en) Graphic summary method and system based on heterogeneous graphic neural network
CN114328899A (en) Text summary generation method, device, equipment and storage medium
US10958982B1 (en) Closed-caption processing using machine learning for media advertisement detection
CN117594036A (en) Method for converting audio frequency into video frequency based on video big data
CN117216008A (en) Knowledge graph-based archive multi-mode intelligent compiling method and system
Kim et al. Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens
Emad et al. Automatic Video summarization with Timestamps using natural language processing text fusion
CN115795026A (en) Chinese text abstract generation method based on comparative learning
Haloi et al. Unsupervised story segmentation and indexing of broadcast news video
CN115988149A (en) Method for generating video by AI intelligent graphics context
Hukkeri et al. Erratic navigation in lecture videos using hybrid text based index point generation
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN115481254A (en) Method, system, readable storage medium and equipment for analyzing video effect content of movie and television play script

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination