CN111259109B - Method for converting audio into video based on video big data - Google Patents
Method for converting audio into video based on video big data
- Publication number
- CN111259109B CN111259109B CN202010025473.2A CN202010025473A CN111259109B CN 111259109 B CN111259109 B CN 111259109B CN 202010025473 A CN202010025473 A CN 202010025473A CN 111259109 B CN111259109 B CN 111259109B
- Authority
- CN
- China
- Prior art keywords
- video
- audio
- information
- big
- text information
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7834—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
Abstract
The invention discloses a method for converting audio into video based on video big data, comprising the following specific steps: a user inputs a piece of audio; the speech is transcribed into text using speech recognition; keywords are recognized and extracted from the text using artificial-intelligence natural language processing; the audio is characterized using deep learning; a video big-data set is automatically tagged, based on video understanding, using deep learning; tag retrieval and matching are performed in the tag system of the video big-data set, and the video data with the highest matching degree is output; the extracted text is rendered as subtitle information; and the video, subtitles and audio are combined to generate a recommended video. The invention converts user-supplied audio into corresponding video content, greatly reducing the time cost of manually assembling video and improving content-creation efficiency.
Description
Technical Field
The invention relates to the technical field of media resource management, and in particular to a method for converting audio into video based on video big data.
Background
For a content producer, audio by its nature cannot provide the audience with visual information, which hinders understanding and acceptance of the content; the traditional manual approach to producing a matching video requires collecting, browsing and annotating large amounts of video data by hand and then selecting clips that match the audio, consuming enormous time and effort.
Disclosure of Invention
Aiming at the defects and shortcomings of the prior art, the invention provides an audio-to-video method based on video big data that automatically matches and selects suitable video clips from the mass of existing video data and quickly converts audio into corresponding video content, giving users a stronger combined visual and auditory experience and conveying the author's message in a more vivid and intuitive form.
In order to achieve the above purpose, the invention adopts the following technical scheme, which comprises the following steps:
1. the user inputs a piece of audio information;
2. extracting the audio content as text information using a speech recognition technique;
3. performing keyword recognition and extraction on the extracted text information by using an artificial intelligence natural language processing technology;
4. identifying audio information using a deep learning technique;
5. automatically tagging a video big dataset based on video understanding by using a deep learning technique;
6. performing tag retrieval and matching in the tag system of the video big-data set, and outputting the video data with the highest matching degree;
7. generating the text information extracted in the second step into video subtitles;
8. and merging and rendering the video, the subtitle and the audio to generate a recommended video.
Further, the speech recognition in the second step uses a deep neural network;
Further, the specific method of the third step is as follows: word or character vectors are obtained through an embedding layer, fed into a bidirectional LSTM, a probability sequence over the BIO tagging scheme is computed through a softmax hidden layer, and the keyword sequence is extracted through a CRF supervision layer;
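The BiLSTM-CRF keyword extraction described in the third step can be illustrated with a minimal NumPy sketch. This is not the patented implementation: the emission scores below stand in for real BiLSTM + softmax outputs, and the transition matrix and example tokens are invented for illustration; only the CRF-style Viterbi decoding and BIO span collection are shown.

```python
import numpy as np

TAGS = ["B", "I", "O"]  # BIO tagging scheme: Begin / Inside / Outside a keyword

def viterbi_decode(emissions, transitions):
    """CRF-style Viterbi decoding.

    emissions: (seq_len, n_tags) per-token tag scores (in the patent these
    would come from the bidirectional LSTM + softmax hidden layer).
    transitions: (n_tags, n_tags) score of moving from tag i to tag j.
    """
    seq_len, _ = emissions.shape
    score = emissions[0].copy()
    backptr = np.zeros(emissions.shape, dtype=int)
    for t in range(1, seq_len):
        # cand[i, j]: best score ending at tag j when the previous tag was i
        cand = score[:, None] + transitions + emissions[t][None, :]
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    best = [int(score.argmax())]
    for t in range(seq_len - 1, 0, -1):
        best.append(int(backptr[t, best[-1]]))
    return [TAGS[i] for i in reversed(best)]

def extract_keywords(tokens, tags):
    """Collect B / B-I... spans as keyword phrases."""
    keywords, current = [], []
    for tok, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                keywords.append(" ".join(current))
            current = [tok]
        elif tag == "I" and current:
            current.append(tok)
        else:
            if current:
                keywords.append(" ".join(current))
            current = []
    if current:
        keywords.append(" ".join(current))
    return keywords

# Toy example: hand-made emission scores favour B on "sunset"/"sea", O elsewhere;
# the transition matrix penalises the illegal O -> I move.
tokens = ["sunset", "over", "the", "sea"]
emissions = np.array([[5.0, 0, 1], [0, 0, 5], [0, 0, 5], [5.0, 0, 1]])
transitions = np.zeros((3, 3))
transitions[2, 1] = -4.0  # O -> I discouraged
tags = viterbi_decode(emissions, transitions)
keywords = extract_keywords(tokens, tags)
```

Here `tags` decodes to `["B", "O", "O", "B"]` and `keywords` to `["sunset", "sea"]`; in a trained system both the emissions and the transitions would be learned.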
Further, the specific method of the fourth step is as follows: a GRU network is used to identify the melody rhythm, emotion and genre characteristics of the audio;
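As a rough illustration of the GRU mentioned in the fourth step, the sketch below implements a single GRU cell's forward pass over a sequence of per-frame audio feature vectors; the final hidden state is the kind of summary vector that rhythm, emotion and genre classifier heads would consume. The sizes and the random features are assumptions, and a real system would use a trained framework implementation rather than this forward-only toy.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class GRUCell:
    """Minimal GRU cell, forward pass only."""
    def __init__(self, input_size, hidden_size, seed=0):
        rng = np.random.default_rng(seed)
        s = 1.0 / np.sqrt(hidden_size)
        shape = (hidden_size, input_size + hidden_size)
        self.Wz = rng.uniform(-s, s, shape)  # update gate weights
        self.Wr = rng.uniform(-s, s, shape)  # reset gate weights
        self.Wh = rng.uniform(-s, s, shape)  # candidate state weights
        self.hidden_size = hidden_size

    def forward(self, xs):
        """xs: (seq_len, input_size) frame features -> (hidden_size,) summary."""
        h = np.zeros(self.hidden_size)
        for x in xs:
            xh = np.concatenate([x, h])
            z = sigmoid(self.Wz @ xh)  # update gate: how much new state to take
            r = sigmoid(self.Wr @ xh)  # reset gate: how much old state to reuse
            h_cand = np.tanh(self.Wh @ np.concatenate([x, r * h]))
            h = (1 - z) * h + z * h_cand
        return h

# Encode 10 frames of 13-dim (MFCC-like) features into one summary vector.
frames = np.random.default_rng(1).normal(size=(10, 13))
summary = GRUCell(input_size=13, hidden_size=8).forward(frames)
```

The summary vector could then feed small linear heads for rhythm, emotion and genre; those heads are omitted here.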
Further, the specific method of the fifth step is as follows: a deep 3D convolutional neural network extracts the spatio-temporal information of each video, performs scene recognition, motion capture and emotion analysis, and extracts the scene, object, facial-expression and motion information of the video as its tag content;
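The core operation of the deep 3D convolutional network in the fifth step, filtering jointly over time and space, can be shown with a naive single-channel sketch. A real network stacks many learned filters plus nonlinearities; the temporal-difference kernel here is an invented example that responds to frame-to-frame change, i.e. motion.

```python
import numpy as np

def conv3d_valid(video, kernel):
    """Naive 'valid' 3D convolution over a (T, H, W) single-channel volume."""
    t, h, w = video.shape
    kt, kh, kw = kernel.shape
    out = np.zeros((t - kt + 1, h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for k in range(out.shape[2]):
                out[i, j, k] = np.sum(video[i:i+kt, j:j+kh, k:k+kw] * kernel)
    return out

# A 4-frame clip whose brightness rises by 1 per frame, filtered by a
# temporal-difference kernel [1, -1] along T: the uniform response exposes
# the constant frame-to-frame motion.
video = np.repeat(np.arange(4.0), 16).reshape(4, 4, 4)
kernel = np.array([1.0, -1.0]).reshape(2, 1, 1)
motion = conv3d_valid(video, kernel)
```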
Further, the specific method of the sixth step is as follows: the feature values of the keyword sequence extracted in the third step are compared for similarity against the tag feature values in the video tag library built in the fifth step, and videos whose similarity exceeds 0.85 are output; the melody rhythm, emotion and genre feature values extracted in the fourth step are then matched a second time against the tag values of those output videos, and videos whose similarity exceeds 0.8 are output.
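The two-stage matching in the sixth step amounts to thresholded similarity filtering. The sketch below uses cosine similarity with the 0.85 and 0.8 thresholds from the text; the index layout (one tag vector and one melody-feature vector per video) and the toy 2-D vectors are assumptions made for illustration, not the patent's actual feature encoding.

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def two_stage_match(keyword_vec, melody_vec, video_index,
                    text_thresh=0.85, melody_thresh=0.8):
    """Stage 1: keep videos whose tag vector matches the keyword features.
    Stage 2: re-filter the survivors by melody/emotion/genre similarity."""
    stage1 = [(vid, mvec) for vid, tag_vec, mvec in video_index
              if cosine(keyword_vec, tag_vec) > text_thresh]
    return [vid for vid, mvec in stage1
            if cosine(melody_vec, mvec) > melody_thresh]

# Hypothetical 2-D feature vectors for two indexed videos.
video_index = [
    ("A", np.array([1.0, 0.1]), np.array([1.0, 0.9])),  # passes both stages
    ("B", np.array([0.0, 1.0]), np.array([1.0, 1.0])),  # fails the text stage
]
matches = two_stage_match(np.array([1.0, 0.0]), np.array([1.0, 1.0]), video_index)
```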
After the scheme is adopted, the beneficial effects of the invention are as follows: the method for converting audio into video based on video big data, built on a video big-data tag-matching system constructed with artificial-intelligence natural language processing and deep learning, converts user-supplied audio into corresponding video content, greatly reducing the time cost of manually assembling video and improving content-creation efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in their description are briefly introduced below; the drawings described below are obviously only some embodiments of the invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flow chart of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
Referring to fig. 1, the technical scheme adopted in this embodiment comprises the following steps:
1. the user inputs a piece of audio information;
2. extracting the audio content as text information using deep neural network technology;
3. performing keyword recognition and extraction on the extracted text information using artificial-intelligence natural language processing: word or character vectors are obtained through an embedding layer, fed into a bidirectional LSTM, a probability sequence over the BIO tagging scheme is computed through a softmax hidden layer, and the keyword sequence is extracted through a CRF supervision layer;
4. identifying the audio information using deep learning, specifically identifying the melody rhythm, emotion and genre characteristics of the audio with a GRU network;
5. based on video understanding, automatically tagging the video big-data set using deep learning, specifically: a deep 3D convolutional neural network extracts the spatio-temporal information of each video and performs scene recognition, motion capture, emotion analysis and the like, and the extracted scene, object, facial-expression and motion information serves as the video's tag content;
6. performing tag retrieval and matching in the tag system of the video big-data set and outputting the video data with the highest matching degree, specifically: the feature values of the keyword sequence extracted in the third step are compared for similarity against the tag feature values in the video tag library built in the fifth step, and videos whose similarity exceeds 0.85 are output; the melody rhythm, emotion and genre feature values extracted in the fourth step are then matched a second time against the tag values of those output videos, and videos whose similarity exceeds 0.8 are output;
7. generating the text information extracted in the second step into video subtitles;
8. and merging and rendering the video, the subtitle and the audio to generate a recommended video.
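Taken together, the eight embodiment steps form a linear pipeline. The skeleton below wires them up with trivial stand-in functions; every function body and the toy matching logic are placeholders for the models described above (hypothetical names, not the patent's implementation), included only to make the data flow concrete.

```python
def transcribe(audio):                  # step 2: ASR stub (real: deep neural net)
    return audio["speech"]

def keywords_of(text):                  # step 3: NLP keyword stub
    return {w for w in text.split() if len(w) > 3}

def audio_mood(audio):                  # step 4: GRU melody/emotion/genre stub
    return audio.get("mood", "neutral")

def match_tags(keywords, mood, index):  # step 6: tag retrieval + second-pass filter
    return [vid for vid, tags, vid_mood in index
            if keywords & tags and vid_mood == mood]

def make_subtitles(text):               # step 7: subtitle generation stub
    return [text]

def render(videos, subtitles, audio):   # step 8: merge video + subtitles + audio
    return {"videos": videos, "subtitles": subtitles, "audio": audio["speech"]}

def audio_to_video(audio, index):
    text = transcribe(audio)
    clips = match_tags(keywords_of(text), audio_mood(audio), index)
    return render(clips, make_subtitles(text), audio)

# Toy index of pre-tagged videos: (id, tag set, mood tag).
index = [("v1", {"sunset", "ocean"}, "calm"),
         ("v2", {"city", "traffic"}, "tense")]
result = audio_to_video({"speech": "sunset over calm ocean", "mood": "calm"}, index)
```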
The method for converting audio into video based on video big data automatically matches and selects suitable video clips from the mass of existing video data, quickly converts audio into corresponding video content, gives users a stronger combined visual and auditory experience, and conveys the author's message in a more vivid and intuitive form.
The foregoing is merely illustrative of the present invention and not restrictive; other modifications and equivalents may occur to those skilled in the art without departing from the spirit and scope of the present invention.
Claims (3)
1. A method for converting audio into video based on video big data, characterized by comprising the following steps:
1. the user inputs a piece of audio information;
2. extracting the audio content as text information using a speech recognition technique;
3. performing keyword recognition and extraction on the extracted text information using artificial-intelligence natural language processing, wherein the keyword recognition and extraction comprises: obtaining word or character vectors through an embedding layer, inputting them into a bidirectional long short-term memory (LSTM) network, computing a probability sequence over the BIO tagging scheme through a softmax hidden layer, and extracting the keyword sequence through a conditional random field (CRF) supervision layer;
4. identifying audio information using a deep learning technique;
5. automatically tagging a video big-data set, based on video understanding, using deep learning, wherein the automatic tagging comprises: extracting the spatio-temporal information of each video with a deep 3D convolutional neural network, performing scene recognition, motion capture and emotion analysis, and extracting the scene, object, facial-expression and motion information of the video as its tag content;
6. performing tag retrieval and matching against the keywords in the tag system of the video big-data set and outputting the video data with high matching degree; then comparing, in a second pass, the similarity between the tag values of the output video data and the audio information, and outputting the video data that matches the audio feature values, wherein the audio feature values comprise at least one of: melody rhythm, emotion, genre;
7. generating the text information extracted in the second step into video subtitles;
8. and merging and rendering the video, the subtitle and the audio to generate a recommended video.
2. The method for converting audio into video based on video big data according to claim 1, wherein the speech recognition technique in the second step uses a deep neural network.
3. The method for converting audio into video based on video big data according to claim 1, wherein the specific method of the fourth step is: identifying the melody rhythm, emotion and genre features of the audio using a GRU network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010025473.2A CN111259109B (en) | 2020-01-10 | 2020-01-10 | Method for converting audio into video based on video big data
Publications (2)
Publication Number | Publication Date |
---|---|
CN111259109A (en) | 2020-06-09
CN111259109B (en) | 2023-12-05
Family
ID=70950316
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010025473.2A Active CN111259109B (en) | 2020-01-10 | 2020-01-10 | Method for converting audio frequency into video frequency based on video big data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111259109B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CA3184814A1 (en) * | 2020-07-03 | 2022-01-06 | Nazar Yurievych PONOCHEVNYI | A system (variants) for providing a harmonious combination of video files and audio files and a related method |
CN113656643A (en) * | 2021-08-20 | 2021-11-16 | 珠海九松科技有限公司 | Algorithm for analyzing film-watching mood by using AI (artificial intelligence) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103793446A (en) * | 2012-10-29 | 2014-05-14 | 汤晓鸥 | Music video generation method and system |
CN107832382A (en) * | 2017-10-30 | 2018-03-23 | 百度在线网络技术(北京)有限公司 | Method, apparatus, equipment and storage medium based on word generation video |
CN108986186A (en) * | 2018-08-14 | 2018-12-11 | 山东师范大学 | The method and system of text conversion video |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
GB2359918A (en) * | 2000-03-01 | 2001-09-05 | Sony Uk Ltd | Audio and/or video generation apparatus having a metadata generator |
EP3532906A4 (en) * | 2016-10-28 | 2020-04-15 | Vilynx, Inc. | Video tagging system and method |
KR102085908B1 (en) * | 2018-05-10 | 2020-03-09 | 네이버 주식회사 | Content providing server, content providing terminal and content providing method |
- 2020-01-10: application CN202010025473.2A filed in China; granted as patent CN111259109B (active)
Patent Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103793446A (en) * | 2012-10-29 | 2014-05-14 | 汤晓鸥 | Music video generation method and system |
CN107832382A (en) * | 2017-10-30 | 2018-03-23 | 百度在线网络技术(北京)有限公司 | Method, apparatus, equipment and storage medium based on word generation video |
CN108986186A (en) * | 2018-08-14 | 2018-12-11 | 山东师范大学 | The method and system of text conversion video |
Also Published As
Publication number | Publication date |
---|---|
CN111259109A (en) | 2020-06-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107844469B (en) | Text simplification method based on word vector query model | |
CN111259196B (en) | Method for converting article into video based on video big data | |
CN110119444B (en) | Drawing type and generating type combined document abstract generating model | |
Chen et al. | Efficient spatial temporal convolutional features for audiovisual continuous affect recognition | |
CN110633683A (en) | Chinese sentence-level lip language recognition method combining DenseNet and resBi-LSTM | |
WO2023065617A1 (en) | Cross-modal retrieval system and method based on pre-training model and recall and ranking | |
Alexiou et al. | Exploring synonyms as context in zero-shot action recognition | |
CN111259109B (en) | Method for converting audio into video based on video big data | |
CN111242033A (en) | Video feature learning method based on discriminant analysis of video and character pairs | |
CN111325571B (en) | Automatic generation method, device and system for commodity comment labels for multitask learning | |
Zhao et al. | Videowhisper: Toward discriminative unsupervised video feature learning with attention-based recurrent neural networks | |
CN110866129A (en) | Cross-media retrieval method based on cross-media uniform characterization model | |
CN108805036A (en) | A kind of new non-supervisory video semanteme extracting method | |
CN114861082A (en) | Multi-dimensional semantic representation-based aggressive comment detection method | |
CN115129934A (en) | Multi-mode video understanding method | |
CN112949284B (en) | Text semantic similarity prediction method based on Transformer model | |
CN114399646B (en) | Image description method and device based on transform structure | |
CN115457309A (en) | Image unsupervised classification method based on natural language | |
CN110879843B (en) | Method for constructing self-adaptive knowledge graph technology based on machine learning | |
Rafi et al. | A linear sub-structure with co-variance shift for image captioning | |
Chandaran et al. | Image Captioning Using Deep Learning Techniques for Partially Impaired People | |
Agrahari et al. | Text Recognition using Deep Learning: A Review | |
Xu et al. | Humor Detection System for MuSE 2023: Contextual Modeling, Pesudo Labelling, and Post-smoothing | |
Agarwal et al. | A Deep Learning Framework for Visual to Caption Translation | |
CN117688220A (en) | Multi-mode information retrieval method and system based on large language model |
Legal Events
Date | Code | Title | Description
---|---|---|---
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| TA01 | Transfer of patent application right | Effective date of registration: 2021-02-20. Applicant after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd., Tencent Building, 35 Floors, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 518000. Applicant before: Hangzhou Huichuan Intelligent Technology Co.,Ltd., No. 2-10, north of Building 13, 199 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province, 310012. |
| GR01 | Patent grant | |