CN115272533A - Intelligent image-text-to-video conversion method and system based on video structured data - Google Patents

Intelligent image-text-to-video conversion method and system based on video structured data

Info

Publication number
CN115272533A
Authority
CN
China
Prior art keywords
video
text
identification
processing
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210907146.9A
Other languages
Chinese (zh)
Inventor
陈鹏 (Chen Peng)
张华伟 (Zhang Huawei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
New One Beijing Technology Co ltd
Original Assignee
New One Beijing Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by New One Beijing Technology Co ltd
Priority to CN202210907146.9A
Publication of CN115272533A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T11/00 2D [Two Dimensional] image generation
    • G06T11/60 Editing figures and text; Combining figures or text
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G06V20/635 Overlay text, e.g. embedded captions in a TV program
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/08 Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Acoustics & Sound (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

The invention discloses an intelligent image-text-to-video conversion method and system based on video structured data. The method comprises: performing transcoding, shot segmentation and content recognition on the video files in a large-scale video data set, manually reviewing and correcting inaccurate recognition results, and storing the results in a database in structured form to generate a material library; processing the text submitted by a user according to its text type, and performing paragraph splitting, keyword extraction and named entity recognition on the processing result; generating a voice-over and a corresponding subtitle file based on the processing result; matching the keyword extraction and named entity recognition results with the materials in the material library to obtain the best matching material; and merging the best matching material with the voice-over and subtitle file. The advantages are that the tedious and time-consuming material sorting and production steps of the video production process are removed and video production efficiency is improved.

Description

Intelligent image-text-to-video conversion method and system based on video structured data
Technical Field
The invention relates to the technical fields of computer video synthesis and artificial-intelligence content generation, and in particular to an intelligent image-text-to-video conversion method and system based on video structured data.
Background
For text authors, video production is a specialist field with a high barrier to entry. Traditional video production requires writing a script, collecting and organizing material and building a material library, and then completing the video through rough cutting, fine cutting, track synthesis, proofreading and other steps. The whole process is time-consuming and tedious and cannot meet the demands of today's explosion of video information.
Disclosure of Invention
The invention aims to provide an intelligent image-text-to-video conversion method and system based on video structured data, so as to solve the problems existing in the prior art.
In order to achieve this purpose, the technical solution adopted by the invention is as follows:
An intelligent image-text-to-video conversion method based on video structured data comprises the following steps:
s1, establishing a material library:
transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
s2, image-text analysis:
the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
s3, material matching:
the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
s4, video synthesis:
the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
Preferably, step S1 specifically comprises the following steps:
s11, converting each video file stream into a stream of a preset format, so as to transcode all video files in the large-scale video data set;
s12, judging whether the transcoded video file needs shot segmentation; if so, segmenting it into short video clips and entering step S13; otherwise, entering step S13 directly;
when the cosine similarity of two adjacent frames of the video file is greater than or equal to a similarity threshold, segmentation is required; otherwise, no segmentation is required;
s13, identifying the shots, characters, scenes, events, objects and subtitles appearing in the short video clips;
s14, judging the accuracy of the recognition results, manually reviewing and correcting the inaccurate recognition results, and adding subjective description information including time, place, people and events;
s15, storing the accurate recognition results and the manually reviewed recognition results in a database in structured json format, and generating the material library.
Preferably, step S13 specifically comprises constructing a plurality of deep learning models, including shot recognition, face recognition, OCR recognition and speech recognition, by means of deep convolutional neural networks, and using these deep learning models to extract the shot information, character information, scene information, event information, object information and subtitle information of the short video clips.
Preferably, step S14 specifically comprises the following steps:
s141, capturing key frames from the short video clips to obtain pictures of characters, scenes and objects, computing the similarity between the character pictures, scene pictures and object pictures and the sample pictures in a face gallery, a scene gallery and an object gallery respectively, and judging whether the similarity score is greater than or equal to a score threshold; if so, the recognition is accurate; otherwise, the recognition is inaccurate;
s142, manually reviewing the inaccurately recognized information, correcting it manually, and adding subjective description information that the computer cannot recognize.
Preferably, step S2 specifically comprises the following steps:
s21, judging whether the text submitted by the user is plain text or a web page link; if it is plain text, entering step S22 directly; if it is a web page link, extracting the image-text content of the web page, formatting the extracted content to remove html tags and meaningless characters, and then entering step S22;
s22, splitting the text into paragraphs, and extracting keywords and named entities from the split text;
s23, converting the text processed in step S21 into a voice-over by speech synthesis technology, and generating a corresponding subtitle file.
Preferably, step S22 specifically comprises the following steps:
s221, extracting key sentences of the text with the TextRank algorithm, and splitting the text into paragraphs according to the key sentences;
s222, extracting keywords and named entities from the split paragraphs, wherein the keywords include time, scenes, people and events, and the named entities include person names, place names, organization names and verbs; and merging paragraphs whose word count does not reach the word-count threshold or from which no keywords and/or named entities have been extracted.
Preferably, step S23 specifically comprises synthesizing speech by speech synthesis technology: the text processed in step S21 is grammatically parsed, the subject, predicate and object are extracted and converted into speech waveforms, the speech waveforms are concatenated into complete audio by a time-domain waveform concatenation technique based on the PSOLA method, the audio is used as the voice-over, and the corresponding subtitle file is generated.
Preferably, step S3 specifically comprises performing text semantic matching between the materials in the material library and the processing result of step S22: text semantic similarity is computed separately for the four key elements of time, place, people and events, the materials are sorted in descending order of the similarity results, and the top-ranked material is taken as the best matching material.
Preferably, the text semantic similarity calculation for the key elements specifically comprises the following steps:
s31, representing each of the two texts as vectors and converting the texts into vector matrices;
s32, processing the two texts separately and encoding them with a deep neural network to obtain a composite representation of each text;
the composite representation comprises: a token embedding stage, in which the words are processed and each word is converted into a vector of fixed dimension; a segment embedding stage, in which the sentences are processed and a representation of each sentence is extracted; and a position embedding stage, which handles the same word appearing in different positions; the representations of the three stages are added element-wise to obtain the composite representation;
s33, computing the cosine similarity of the composite representations of the two texts to obtain the similarity result.
Another object of the invention is to provide an intelligent image-text-to-video conversion system based on video structured data, for implementing any one of the above methods, the system comprising:
a material library module: transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
an image-text analysis module: the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
a material matching module: the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
a video synthesis module: the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
The beneficial effects of the invention are as follows: 1. The video content is structurally analysed by a combination of manual and intelligent processing, and based on the parsed data the image-text content is converted into video output by the algorithm, which removes the tedious and time-consuming material sorting and production steps of the video production process and improves video production efficiency. 2. The barrier for text authors to produce videos is lowered: there is no need to write scripts, sort materials, rough-cut, fine-cut, synthesize audio tracks, proofread or dub; once the text is written, it can be converted into a video with one click. 3. The difficulty of searching for materials is reduced for video creators: the ideal material can be located quickly and accurately within massive video content, shortening the video production cycle.
Drawings
FIG. 1 is a flow chart of the image-text-to-video conversion method according to an embodiment of the present invention;
FIG. 2 is a flow chart of the establishment of a material library in an embodiment of the present invention;
FIG. 3 is a flow chart of graphics context analysis in an embodiment of the present invention;
FIG. 4 is a flow chart of material matching in an embodiment of the present invention;
fig. 5 is a block diagram of the image-text-to-video conversion system according to an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail below with reference to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Example one
As shown in fig. 1, this embodiment provides an intelligent image-text-to-video conversion method based on video structured data, which comprises four parts: establishing a material library, image-text analysis, material matching, and video synthesis. The four parts are described below.
1. Establishing a material library
Transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate the material library.
As shown in fig. 2, this part specifically includes the following steps.
1. Transcoding the video files:
The stream of each video file is converted into a stream of a preset format, so that all video files in the large-scale video data set are transcoded; for example, all video file streams are converted into MP4 format.
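As a minimal sketch of this batch transcoding step, assuming ffmpeg is installed and using hypothetical directory names (the patent does not name a specific tool or codec profile), the conversion to MP4 could be driven from Python:

```python
import subprocess
from pathlib import Path

def transcode_to_mp4(src_dir: str, dst_dir: str) -> None:
    """Transcode every video file found in src_dir into MP4 under dst_dir."""
    out = Path(dst_dir)
    out.mkdir(parents=True, exist_ok=True)
    for src in Path(src_dir).glob("*"):
        if src.suffix.lower() not in {".avi", ".mov", ".mkv", ".flv", ".mp4"}:
            continue
        dst = out / (src.stem + ".mp4")
        # Re-encode video to H.264 and audio to AAC, a common MP4 profile.
        subprocess.run(
            ["ffmpeg", "-y", "-i", str(src),
             "-c:v", "libx264", "-c:a", "aac", str(dst)],
            check=True,
        )

# transcode_to_mp4("raw_videos", "transcoded_videos")  # hypothetical directories
```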
2. Shot segmentation:
Whether the transcoded video file needs shot segmentation is judged; if so, it is segmented into short video clips and step 3 is entered; otherwise, step 3 is entered directly. Shot segmentation compares each pair of adjacent frames of the video file and computes the cosine similarity of the two frames; if the similarity is greater than or equal to the similarity threshold, the video is segmented at that point, so that a long video is cut into a series of short video clips; otherwise, the video file is treated as a short video clip without segmentation.
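A minimal sketch of the frame comparison, assuming OpenCV is available and using downsampled grayscale frames as the frame vectors (the patent does not specify the frame representation or the threshold value). Note that in common practice a shot boundary corresponds to a drop in similarity between adjacent frames, so the sketch marks a cut where the cosine similarity falls below the threshold:

```python
import cv2
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    a = a.ravel().astype(np.float32)
    b = b.ravel().astype(np.float32)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def find_shot_boundaries(video_path: str, threshold: float = 0.9) -> list[float]:
    """Return timestamps (seconds) where the similarity of adjacent frames drops below the threshold."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 25.0
    boundaries: list[float] = []
    prev = None
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(cv2.resize(frame, (64, 64)), cv2.COLOR_BGR2GRAY)
        if prev is not None and cosine_similarity(prev, gray) < threshold:
            boundaries.append(idx / fps)
        prev, idx = gray, idx + 1
    cap.release()
    return boundaries
```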
3. Video content understanding for the short video clips:
The shots, characters, scenes, events, objects and subtitles appearing in the short video clips are identified.
A plurality of deep learning models, including shot recognition, face recognition, OCR recognition and speech recognition, are constructed with deep convolutional neural networks, and these models are used to extract the shot information, character information, scene information, event information, object information and subtitle information of the short video clips (specifically, the text, speech, objects, people and actions in the video are recognized).
4. Judging the accuracy of the recognition results and manual review:
The accuracy of the recognition results is judged, the inaccurate recognition results are manually reviewed and corrected, and subjective description information including time, place, people and events is added.
This step specifically comprises the following sub-steps.
4.1. Judging the accuracy of the recognition results:
Key frames are captured from the short video clips to obtain pictures of characters, scenes and objects; the similarity between the character pictures, scene pictures and object pictures and the sample pictures in the face gallery, scene gallery and object gallery is computed respectively, and whether the similarity score is greater than or equal to the score threshold is judged; if so, the recognition is accurate; otherwise, the recognition is inaccurate. (A sketch of this check is given after sub-step 4.2.)
4.2. Manual review:
The inaccurately recognized information is manually reviewed, whether the description information extracted by the computer is correct is confirmed, the description information is corrected manually, and subjective description information that the computer cannot recognize is added. The manual review mainly corrects key elements such as time, place, people and events, and manually fixes wrongly recognized information.
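A minimal sketch of the accuracy check in sub-step 4.1, assuming the recognition models already produce an embedding vector per detected face/scene/object and that each gallery stores sample embeddings; the function names, labels and threshold value are illustrative assumptions, not part of the patent:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def check_recognition(detected: dict[str, np.ndarray],
                      gallery: dict[str, np.ndarray],
                      score_threshold: float = 0.8) -> list[dict]:
    """Compare each detected item's embedding with its gallery sample;
    items scoring below the threshold are flagged as inaccurate for manual review."""
    report = []
    for label, emb in detected.items():
        sample = gallery.get(label)
        score = cosine(emb, sample) if sample is not None else 0.0
        report.append({
            "label": label,
            "score": round(score, 3),
            "accurate": score >= score_threshold,  # inaccurate items go to manual review
        })
    return report
```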
5. Establishing the material library:
The accurate recognition results and the manually reviewed recognition results are stored in the database in structured json format, forming the material library.
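The patent does not give the json schema; as an illustrative assumption, one structured record per short video clip could look like the following (all field names and values are hypothetical):

```python
import json

# Hypothetical structured record for one short video clip in the material library.
clip_record = {
    "clip_id": "clip_000123",
    "source_file": "transcoded_videos/news_20220729.mp4",
    "start": 12.4,              # seconds within the source video
    "end": 18.9,
    "shots": ["medium shot"],
    "people": ["anchor"],
    "scene": "news studio",
    "events": ["opening remarks"],
    "objects": ["desk", "microphone"],
    "subtitles": "Good evening, welcome to the news.",
    "manual_notes": "time: evening; place: studio",  # subjective description added by reviewers
}
print(json.dumps(clip_record, ensure_ascii=False, indent=2))
```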
2. Image-text analysis
The text submitted by the user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and the corresponding subtitle file are then generated based on the processing result.
As shown in fig. 3, this part specifically includes the following steps.
1. Judging the text submitted by the user:
After the user submits content, whether the submitted text is plain text or a web page link is first judged; if it is plain text, step 2 is entered directly; if it is a web page link, the image-text content of the web page is extracted, the extracted content is formatted to remove html tags and meaningless characters, and then step 2 is entered.
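A minimal sketch of the web page branch, assuming the requests and beautifulsoup4 packages are acceptable choices (the patent does not name a specific extraction library):

```python
import re
import requests
from bs4 import BeautifulSoup

def extract_article_text(url: str) -> str:
    """Fetch a web page and return its visible text with html tags removed."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "nav", "footer"]):
        tag.decompose()                      # drop non-content elements
    text = soup.get_text(separator="\n")
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of whitespace
    text = re.sub(r"\n{2,}", "\n", text)     # drop empty lines left by the markup
    return text.strip()
```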
2. Text paragraph splitting, keyword extraction and named entity extraction:
The text is split into paragraphs, and keywords and named entities are extracted from the split text.
This step specifically comprises the following sub-steps (a code sketch follows after sub-step 2.2).
2.1. Key sentences of the text are extracted with the TextRank algorithm, and the text is split into paragraphs according to the key sentences. TextRank is a graph-based weighting algorithm for the sentences of a text: following a voting principle, each node votes for its neighbouring nodes, the weight of each vote depends on the number of votes the voter itself holds, the key sentences are ranked by vote count, and the article is then split into paragraphs according to these key sentences.
2.2. Keywords and named entities are extracted from the split paragraphs, where the keywords include time, scenes, people and events, and the named entities include person names, place names, organization names and verbs; paragraphs that do not reach the word-count threshold (e.g. fewer than 30 words) or from which no keywords and/or named entities were extracted are merged.
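A minimal sketch of sub-steps 2.1 and 2.2 for whitespace-tokenized text, assuming networkx is available and using word-overlap as the sentence similarity measure and simple threshold-based merging (the patent does not fix these choices; a Chinese tokenizer would be needed for Chinese input):

```python
import re
import networkx as nx

def split_sentences(text: str) -> list[str]:
    return [s.strip() for s in re.split(r"[。！？!?\.]\s*", text) if s.strip()]

def key_sentences(text: str, top_k: int = 3) -> list[str]:
    """Rank sentences with TextRank (PageRank over a word-overlap similarity graph)."""
    sents = split_sentences(text)
    graph = nx.Graph()
    for i, a in enumerate(sents):
        for j, b in enumerate(sents[i + 1:], start=i + 1):
            overlap = len(set(a.split()) & set(b.split()))
            if overlap:
                graph.add_edge(i, j, weight=overlap)
    scores = nx.pagerank(graph) if graph.number_of_edges() else {i: 1.0 for i in range(len(sents))}
    ranked = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return [sents[i] for i in sorted(ranked)]   # keep original sentence order

def merge_short_paragraphs(paragraphs: list[str], min_words: int = 30) -> list[str]:
    """Merge paragraphs below the word-count threshold into the previous paragraph."""
    merged: list[str] = []
    for p in paragraphs:
        if merged and len(p.split()) < min_words:
            merged[-1] = merged[-1] + " " + p
        else:
            merged.append(p)
    return merged
```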
3. Voice-over and subtitle file generation:
The text processed in step 1 is converted into a voice-over by speech synthesis technology, and the corresponding subtitle file is generated.
The sound is synthesized by speech synthesis: the text processed in step 1 is grammatically parsed, the subject, predicate and object are extracted and converted into speech waveforms, the waveforms are concatenated into complete audio by a time-domain waveform concatenation technique based on the PSOLA method, the audio serves as the voice-over, and the corresponding subtitle file is generated.
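PSOLA-based waveform concatenation is a full signal-processing routine beyond a short sketch, but the subtitle side can be illustrated. Below is a minimal sketch that writes an SRT file from paragraph texts and per-paragraph audio durations; the duration values would come from the synthesized audio, and all names are illustrative assumptions:

```python
def to_timestamp(seconds: float) -> str:
    h, rem = divmod(int(seconds * 1000), 3_600_000)
    m, rem = divmod(rem, 60_000)
    s, ms = divmod(rem, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

def write_srt(paragraphs: list[str], durations: list[float], path: str) -> None:
    """Write one SRT cue per paragraph, laid out back-to-back along the audio timeline."""
    t = 0.0
    with open(path, "w", encoding="utf-8") as f:
        for i, (text, dur) in enumerate(zip(paragraphs, durations), start=1):
            f.write(f"{i}\n{to_timestamp(t)} --> {to_timestamp(t + dur)}\n{text}\n\n")
            t += dur

# write_srt(["First paragraph...", "Second paragraph..."], [6.2, 4.8], "voiceover.srt")
```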
3. Material matching
The paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material.
As shown in fig. 4, this part specifically comprises performing text semantic matching between the materials in the material library and the processing result of step 2.2: text semantic similarity is computed separately for the four key elements of time, place, people and events, the materials are sorted in descending order of the similarity results, and the top-ranked material is taken as the best matching material.
The text semantic similarity calculation for the key elements specifically comprises the following sub-steps (a code sketch follows after sub-step 3.3).
3.1. Each of the two texts is represented as vectors and converted into a vector matrix;
3.2. the two texts are processed separately and encoded by a deep neural network (encode) to obtain a composite representation (embedding) of each text;
the composite representation is the sum of three representations: (1) a token embedding stage, in which the words are processed and each word is converted into a vector of fixed dimension; (2) a segment embedding stage, in which the sentences are processed and a representation of each sentence is extracted; (3) a position embedding stage, which handles the same word appearing in different positions; the representations of the three stages are added element-wise to obtain the composite representation;
3.3. the cosine similarity of the composite representations of the two texts is computed to obtain the similarity result.
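A minimal sketch of the matching loop, assuming each text is reduced to a single embedding vector by some encoder; the token/segment/position embeddings described above correspond to a BERT-style input layer, and the encode function below is an assumed stand-in for it. Aggregating the four element scores by a simple sum is also an assumption: the patent only states that similarity is computed per element and the materials are then ranked in descending order.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def best_match(paragraph_elements: dict[str, str],
               materials: list[dict],
               encode) -> dict:
    """Score each material over the four key elements (time, place, people, events),
    sort in descending order of the total score, and return the top-ranked material."""
    keys = ["time", "place", "people", "events"]
    para_vecs = {k: encode(paragraph_elements.get(k, "")) for k in keys}
    scored = []
    for m in materials:
        score = sum(cosine(para_vecs[k], encode(str(m.get(k, "")))) for k in keys)
        scored.append((score, m))
    scored.sort(key=lambda x: x[0], reverse=True)   # descending similarity
    return scored[0][1]
```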
4. Video synthesis
The best matching material is merged with the voice-over and the corresponding subtitle file by a video synthesis algorithm (e.g. the ffmpeg program) to generate a complete video file.
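A minimal sketch of this final merge with ffmpeg, assuming one matched clip, one voice-over audio file and one SRT subtitle file; the paths and filter choices are illustrative, and the subtitles filter requires an ffmpeg build with libass:

```python
import subprocess

def compose_video(clip: str, voiceover: str, subtitles: str, output: str) -> None:
    """Replace the clip's audio with the voice-over and burn the subtitles into the picture."""
    subprocess.run(
        ["ffmpeg", "-y",
         "-i", clip, "-i", voiceover,
         "-vf", f"subtitles={subtitles}",    # burn the SRT into the video frames
         "-map", "0:v", "-map", "1:a",       # video from the clip, audio from the voice-over
         "-c:v", "libx264", "-c:a", "aac",
         "-shortest", output],
        check=True,
    )

# compose_video("clip_000123.mp4", "voiceover.wav", "voiceover.srt", "final_video.mp4")
```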
Example two
As shown in fig. 5, this embodiment provides an intelligent image-text-to-video conversion system based on video structured data, for implementing the above method, the system comprising:
a material library module: transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
an image-text analysis module: the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
a material matching module: the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
a video synthesis module: the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
By adopting the technical solution disclosed by the invention, the following beneficial effects are obtained:
The invention provides an intelligent image-text-to-video conversion method and system based on video structured data, which structurally analyse the video content by a combination of manual and intelligent processing and, based on the parsed data, convert the image-text content into video output by an algorithm, thereby removing the tedious and time-consuming material sorting and production steps of the video production process and improving video production efficiency. The barrier for text authors to produce videos is lowered: no script writing, material sorting, rough cutting, fine cutting, audio track synthesis, proofreading or dubbing is needed; once the text is written, it can be converted into a video with one click. The difficulty of searching for materials is reduced for video creators: the ideal material can be located quickly and accurately within massive video content, and the video production cycle is shortened.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and improvements can be made without departing from the principle of the present invention, and such modifications and improvements should also be considered within the scope of the present invention.

Claims (10)

1. An intelligent image-text-to-video conversion method based on video structured data, characterized by comprising the following steps:
s1, establishing a material library:
transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
s2, image-text analysis:
the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
s3, material matching:
the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
s4, video synthesis:
the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
2. The intelligent image-text-to-video conversion method based on video structured data according to claim 1, wherein step S1 specifically comprises the following steps:
s11, converting each video file stream into a stream of a preset format, so as to transcode all video files in the large-scale video data set;
s12, judging whether the transcoded video file needs shot segmentation; if so, segmenting it into short video clips and entering step S13; otherwise, entering step S13 directly;
when the cosine similarity of two adjacent frames of the video file is greater than or equal to a similarity threshold, segmentation is required; otherwise, no segmentation is required;
s13, identifying the shots, characters, scenes, events, objects and subtitles appearing in the short video clips;
s14, judging the accuracy of the recognition results, manually reviewing and correcting the inaccurate recognition results, and adding subjective description information including time, place, people and events;
s15, storing the accurate recognition results and the manually reviewed recognition results in a database in structured json format to generate the material library.
3. The intelligent image-text-to-video conversion method based on video structured data according to claim 2, wherein step S13 specifically comprises constructing a plurality of deep learning models, including shot recognition, face recognition, OCR recognition and speech recognition, by means of deep convolutional neural networks, and using these deep learning models to extract the shot information, character information, scene information, event information, object information and subtitle information of the short video clips.
4. The intelligent image-text-to-video conversion method based on video structured data according to claim 2, wherein step S14 specifically comprises the following steps:
s141, capturing key frames from the short video clips to obtain pictures of characters, scenes and objects, computing the similarity between the character pictures, scene pictures and object pictures and the sample pictures in a face gallery, a scene gallery and an object gallery respectively, and judging whether the similarity score is greater than or equal to a score threshold; if so, the recognition is accurate; otherwise, the recognition is inaccurate;
s142, manually reviewing the inaccurately recognized information, correcting it manually, and adding subjective description information that the computer cannot recognize.
5. The intelligent image-text-to-video conversion method based on video structured data according to claim 1, wherein step S2 specifically comprises the following steps:
s21, judging whether the text submitted by the user is plain text or a web page link; if it is plain text, entering step S22 directly; if it is a web page link, extracting the image-text content of the web page, formatting the extracted content to remove html tags and meaningless characters, and then entering step S22;
s22, splitting the text into paragraphs, and extracting keywords and named entities from the split text;
s23, converting the text processed in step S21 into a voice-over by speech synthesis technology, and generating a corresponding subtitle file.
6. The intelligent image-text-to-video conversion method based on video structured data according to claim 5, wherein step S22 specifically comprises the following steps:
s221, extracting key sentences of the text with the TextRank algorithm, and splitting the text into paragraphs according to the key sentences;
s222, extracting keywords and named entities from the split paragraphs, wherein the keywords include time, scenes, people and events, and the named entities include person names, place names, organization names and verbs; and merging paragraphs whose word count does not reach the word-count threshold or from which no keywords and/or named entities have been extracted.
7. The intelligent image-text-to-video conversion method based on video structured data according to claim 5, wherein step S23 specifically comprises synthesizing speech by speech synthesis technology: the text processed in step S21 is grammatically parsed, the subject, predicate and object are extracted and converted into speech waveforms, the speech waveforms are concatenated into complete audio by a time-domain waveform concatenation technique based on the PSOLA method, the audio is used as the voice-over, and the corresponding subtitle file is generated.
8. The intelligent image-text-to-video conversion method based on video structured data according to claim 5, wherein step S3 specifically comprises performing text semantic matching between the materials in the material library and the processing result of step S22: text semantic similarity is computed separately for the four key elements of time, place, people and events, the materials are sorted in descending order of the similarity results, and the top-ranked material is taken as the best matching material.
9. The intelligent image-text-to-video conversion method based on video structured data according to claim 8, wherein the text semantic similarity calculation for the key elements specifically comprises the following steps:
s31, representing each of the two texts as vectors and converting the texts into vector matrices;
s32, processing the two texts separately and encoding them with a deep neural network to obtain a composite representation of each text;
the composite representation comprises: a token embedding stage, in which the words are processed and each word is converted into a vector of fixed dimension; a segment embedding stage, in which the sentences are processed and a representation of each sentence is extracted; and a position embedding stage, which handles the same word appearing in different positions; the representations of the three stages are added element-wise to obtain the composite representation;
s33, computing the cosine similarity of the composite representations of the two texts to obtain the similarity result.
10. An intelligent image-text-to-video conversion system based on video structured data, characterized in that the system is used for implementing the method of any one of claims 1 to 9, the system comprising:
a material library module: transcoding, shot segmentation and content recognition are performed on the video files in a large-scale video data set, inaccurate recognition results are manually reviewed and corrected, and the accurate recognition results and the reviewed and corrected recognition results are stored in a database in structured form to generate a material library;
an image-text analysis module: the text submitted by a user is processed based on its text type, and paragraph splitting, keyword extraction and named entity recognition are performed on the processing result; a voice-over and a corresponding subtitle file are generated based on the processing result;
a material matching module: the paragraph-based keyword extraction and named entity recognition results are matched with the materials in the material library to obtain the best matching material;
a video synthesis module: the best matching material, the voice-over and the corresponding subtitle file are merged by a video synthesis algorithm to generate a complete video file.
CN202210907146.9A 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data Pending CN115272533A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210907146.9A CN115272533A (en) 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210907146.9A CN115272533A (en) 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data

Publications (1)

Publication Number Publication Date
CN115272533A true CN115272533A (en) 2022-11-01

Family

ID=83770787

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210907146.9A Pending CN115272533A (en) 2022-07-29 2022-07-29 Intelligent image-text video conversion method and system based on video structured data

Country Status (1)

Country Link
CN (1) CN115272533A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115689833A (en) * 2022-12-29 2023-02-03 成都华栖云科技有限公司 Intelligent teaching spatial mode construction method based on multi-dimensional perception and pervasive computing
CN117082293A (en) * 2023-10-16 2023-11-17 成都华栖云科技有限公司 Automatic video generation method and device based on text creative
CN117082293B (en) * 2023-10-16 2023-12-19 成都华栖云科技有限公司 Automatic video generation method and device based on text creative

Similar Documents

Publication Publication Date Title
US11776267B2 (en) Intelligent cataloging method for all-media news based on multi-modal information fusion understanding
CN115272533A (en) Intelligent image-text video conversion method and system based on video structured data
CN107608960B (en) Method and device for linking named entities
CN101382937A (en) Multimedia resource processing method based on speech recognition and on-line teaching system thereof
Poignant et al. Unsupervised speaker identification in TV broadcast based on written names
CN112784078A (en) Video automatic editing method based on semantic recognition
JP2005167452A (en) Video scene interval information extracting method, apparatus, program, and recording medium with program recorded thereon
CN114547370A (en) Video abstract extraction method and system
CN116092472A (en) Speech synthesis method and synthesis system
CN111681678A (en) Method, system, device and storage medium for automatically generating sound effect and matching video
CN114547373A (en) Method for intelligently identifying and searching programs based on audio
CN114048335A (en) Knowledge base-based user interaction method and device
CN116958997B (en) Graphic summary method and system based on heterogeneous graphic neural network
CN114328899A (en) Text summary generation method, device, equipment and storage medium
US10958982B1 (en) Closed-caption processing using machine learning for media advertisement detection
CN117594036A (en) Method for converting audio frequency into video frequency based on video big data
CN117216008A (en) Knowledge graph-based archive multi-mode intelligent compiling method and system
Kim et al. Towards practical and efficient image-to-speech captioning with vision-language pre-training and multi-modal tokens
Emad et al. Automatic Video summarization with Timestamps using natural language processing text fusion
CN115795026A (en) Chinese text abstract generation method based on comparative learning
Haloi et al. Unsupervised story segmentation and indexing of broadcast news video
CN115988149A (en) Method for generating video by AI intelligent graphics context
Hukkeri et al. Erratic navigation in lecture videos using hybrid text based index point generation
CN111681680B (en) Method, system, device and readable storage medium for acquiring audio frequency by video recognition object
CN115481254A (en) Method, system, readable storage medium and equipment for analyzing video effect content of movie and television play script

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination