CN111259109B - Method for converting audio into video based on video big data - Google Patents

Method for converting audio into video based on video big data

Info

Publication number
CN111259109B
Authority
CN
China
Prior art keywords
video
audio
information
big
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010025473.2A
Other languages
Chinese (zh)
Other versions
CN111259109A (en)
Inventor
康洪文 (Kang Hongwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010025473.2A
Publication of CN111259109A
Application granted
Publication of CN111259109B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention discloses a method for converting audio into video based on video big data, comprising the following steps: a user inputs a piece of audio information; the speech is extracted as text information using speech recognition; keyword recognition and extraction are performed on the extracted text using artificial-intelligence natural language processing; the audio information is identified using deep learning; a video big data set is automatically tagged, based on video understanding, using deep learning; tag retrieval and matching are performed in the tagged video data set, and video data with a high matching degree are output; the extracted text is rendered as subtitle information; and the video, subtitles and audio are merged to generate a recommended video. The invention converts user-supplied audio into corresponding video content, greatly reducing the time cost of manually synthesizing video and improving content-creation efficiency.

Description

Method for converting audio into video based on video big data
Technical Field
The invention relates to the technical field of media resource management, and in particular to a method for converting audio into video based on video big data.
Background
For a content producer, audio alone cannot give the audience visual information, which hinders understanding and acceptance of the content. The traditional manual approach to converting audio into video requires collecting, browsing and annotating large amounts of video data by hand and then selecting clips that match the audio, consuming enormous time and effort.
Disclosure of Invention
To address the defects and shortcomings of the prior art, the invention provides a method for converting audio into video based on video big data. It automatically matches and selects suitable video clips from a massive library of existing video data and quickly converts audio into corresponding video content, giving users a stronger visual and auditory experience and conveying the author's message in a more vivid and intuitive form.
In order to achieve the above purpose, the invention adopts the following technical scheme, comprising the steps below (an illustrative code skeleton of the whole pipeline follows the list):
1. a user inputs a piece of audio information;
2. the audio content is extracted as text information using a speech recognition technique;
3. keyword recognition and extraction are performed on the extracted text information using artificial-intelligence natural language processing;
4. the audio information is identified using a deep learning technique;
5. a video big data set is automatically tagged, based on video understanding, using a deep learning technique;
6. tag retrieval and matching are performed in the tagged video big data set, and video data with a high matching degree are output;
7. the text information extracted in step 2 is rendered as video subtitles;
8. the video, subtitles and audio are merged and rendered to generate a recommended video.
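The eight steps compose into a simple pipeline. The skeleton below is purely illustrative: every function name is hypothetical and every body is a placeholder standing in for the corresponding step, not the patent's implementation.

# Hypothetical skeleton of the eight-step pipeline; nothing here is
# prescribed by the patent, each stub stands in for one step.

def transcribe(audio_path):                    # step 2: speech recognition
    return "transcript of the input audio"

def extract_keywords(text):                    # step 3: embedding -> BiLSTM -> softmax -> CRF
    return ["keyword"]

def analyze_audio(audio_path):                 # step 4: GRU melody/emotion/genre features
    return {"rhythm": "slow", "emotion": "calm", "genre": "ambient"}

def match_clips(keywords, audio_features):     # steps 5-6: tag retrieval and matching
    return ["clip_001.mp4"]

def make_subtitles(text):                      # step 7: subtitles from the transcript
    return "1\n00:00:00,000 --> 00:00:10,000\n" + text + "\n"

def render(clips, subtitles, audio_path):      # step 8: merge and render
    return "recommended_video.mp4"

if __name__ == "__main__":
    text = transcribe("input.wav")
    clips = match_clips(extract_keywords(text), analyze_audio("input.wav"))
    print(render(clips, make_subtitles(text), "input.wav"))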
Further, the speech recognition technique in step 2 uses a deep neural network;
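The patent does not name a concrete network. As a stand-in only, the sketch below transcribes audio with torchaudio's pretrained wav2vec2 ASR bundle and greedy CTC decoding; the model choice and the file name input.wav are assumptions, not the patent's method.

import torch
import torchaudio

# Stand-in DNN speech recognizer (wav2vec2), not the patent's model.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()                   # ('-', '|', 'E', 'T', ...)

waveform, sr = torchaudio.load("input.wav")    # hypothetical input file
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)             # per-frame character logits

# Greedy CTC decode: collapse repeated frames, drop the blank token '-'.
ids = emissions[0].argmax(dim=-1).tolist()
collapsed = [i for i, prev in zip(ids, [None] + ids) if i != prev]
text = "".join(labels[i] for i in collapsed if labels[i] != "-").replace("|", " ")
print(text)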
Further, the specific method of step 3 is as follows: word or character vectors are obtained through an embedding layer, input into a bidirectional LSTM, an unsupervised probability sequence over the BIO labeling scheme is computed through a softmax hidden layer, and the keyword sequence is extracted through a CRF supervision layer;
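A minimal sketch of such a tagger, assuming the third-party pytorch-crf package for the CRF layer; the vocabulary size, layer dimensions and three-tag BIO scheme (O/B/I) are illustrative choices, not values from the patent.

import torch
import torch.nn as nn
from torchcrf import CRF          # third-party package: pip install pytorch-crf

# embedding -> BiLSTM -> per-token emission scores -> CRF over BIO tags.
class KeywordTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256, num_tags=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(hidden, num_tags)   # the "softmax hidden layer"
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        if tags is not None:
            return -self.crf(emissions, tags)     # training loss (neg. log-likelihood)
        return self.crf.decode(emissions)         # inference: best BIO sequence

tagger = KeywordTagger()
tokens = torch.randint(0, 10000, (1, 12))         # one sentence, 12 token ids
print(tagger(tokens))                             # e.g. [[0, 1, 2, 0, ...]]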
Further, the specific method of step 4 is as follows: a GRU network is used to recognize the melodic rhythm, emotion and genre features of the audio;
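One possible realization, sketched under assumptions the patent leaves open (log-mel input frames of dimension 64, illustrative class counts, one shared GRU feeding three classification heads):

import torch
import torch.nn as nn

# GRU over frame-level audio features with heads for rhythm/emotion/genre.
class AudioTagger(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_rhythm=3, n_emotion=6, n_genre=10):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.rhythm = nn.Linear(hidden, n_rhythm)
        self.emotion = nn.Linear(hidden, n_emotion)
        self.genre = nn.Linear(hidden, n_genre)

    def forward(self, frames):                 # frames: (batch, time, n_mels)
        _, h = self.gru(frames)                # h: (1, batch, hidden)
        h = h[-1]                              # final hidden state per clip
        return self.rhythm(h), self.emotion(h), self.genre(h)

model = AudioTagger()
logits = model(torch.randn(2, 200, 64))        # two clips, 200 frames each
print([t.shape for t in logits])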
Further, the specific method of step 5 is as follows: a deep 3D convolutional neural network is used to extract spatio-temporal information from the video and perform scene recognition, motion capture and emotion analysis, extracting the video's scene information, object information, facial expressions and motion information as its tag content;
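As an illustration only, the sketch below pushes a clip through torchvision's pretrained R3D-18 action-recognition network; the patent's actual 3D CNN, its tag vocabulary and its heads for scenes, objects and expressions are not specified, so Kinetics action classes stand in for the motion tags.

import torch
from torchvision.models.video import r3d_18

# Pretrained 3D CNN as a stand-in video tagger (Kinetics-400 action classes).
model = r3d_18(weights="DEFAULT").eval()

clip = torch.randn(1, 3, 16, 112, 112)         # (batch, rgb, frames, height, width)
with torch.inference_mode():
    logits = model(clip)                       # (1, 400) class scores

top5 = logits.softmax(-1).topk(5)
print(top5.indices.tolist(), top5.values.tolist())  # candidate motion tags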
Further, the specific method of step 6 is as follows: similarity is computed between the keyword-sequence feature values extracted in step 3 and the tag feature values in the video tag library built in step 5, and videos with similarity above 0.85 are output; the audio feature values extracted in step 4 (melodic rhythm, emotion, genre and the like) are then matched a second time against the tag values of those output videos, and videos with similarity above 0.8 are output.
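The two thresholds (0.85 and 0.8) are the patent's; everything else in the sketch below is assumed, including cosine similarity as the metric, 128-dimensional feature vectors and a randomly generated library (random vectors will usually pass neither stage):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

rng = np.random.default_rng(0)
keyword_vec = rng.normal(size=128)             # step-3 keyword feature values
melody_vec = rng.normal(size=128)              # step-4 rhythm/emotion/genre values
library = {f"clip_{i:03d}": (rng.normal(size=128), rng.normal(size=128))
           for i in range(1000)}               # id -> (tag vector, audio-style vector)

# Stage 1: keyword features vs. the video tag library, threshold 0.85.
stage1 = {vid: vecs for vid, vecs in library.items()
          if cosine(keyword_vec, vecs[0]) > 0.85}
# Stage 2: audio melody features vs. the survivors' tags, threshold 0.8.
stage2 = [vid for vid, vecs in stage1.items()
          if cosine(melody_vec, vecs[1]) > 0.8]
print(stage2)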
With this scheme, the beneficial effects of the invention are as follows: the video synthesis method, built on a video big data tag-matching system constructed with artificial-intelligence natural language processing and deep learning, converts user-supplied audio into corresponding video content, greatly reducing the time cost of manually synthesizing video and improving content-creation efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flow chart of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
Referring to fig. 1, the technical scheme adopted in this embodiment comprises the following steps (a combined sketch of steps 7 and 8 follows this list):
1. a user inputs a piece of audio information;
2. the audio content is extracted as text information using a deep neural network;
3. keyword recognition and extraction are performed on the extracted text information using artificial-intelligence natural language processing: word or character vectors are obtained through an embedding layer, input into a bidirectional LSTM, an unsupervised probability sequence over the BIO labeling scheme is computed through a softmax hidden layer, and the keyword sequence is extracted through a CRF supervision layer;
4. the audio information is identified using a deep learning technique, specifically a GRU network that recognizes the melodic rhythm, emotion and genre features of the audio;
5. based on video understanding, a video big data set is automatically tagged using a deep learning technique, specifically: a deep 3D convolutional neural network extracts spatio-temporal information from the video and performs scene recognition, motion capture, emotion analysis and the like, extracting the video's scene information, object information, facial expressions, motion information and the like as its tag content;
6. in the tagged video big data set, tag retrieval and matching are performed and video data with a high matching degree are output, specifically: similarity is computed between the keyword-sequence feature values extracted in step 3 and the tag feature values in the video tag library built in step 5, and videos with similarity above 0.85 are output; the audio feature values extracted in step 4 (melodic rhythm, emotion, genre and the like) are then matched a second time against the tag values of those output videos, and videos with similarity above 0.8 are output;
7. the text information extracted in step 2 is rendered as video subtitles;
8. the video, subtitles and audio are merged and rendered to generate a recommended video (see the sketch after this list).
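Steps 7 and 8 amount to subtitle generation plus an audio/video mux. A minimal sketch, assuming ffmpeg (with a libass-enabled build for subtitle rendering) is installed and using made-up file names and a single ten-second subtitle cue:

import subprocess

# Write the step-2 transcript as a one-cue SRT file (timing is made up).
text = "transcript produced in step 2"
with open("subs.srt", "w", encoding="utf-8") as f:
    f.write("1\n00:00:00,000 --> 00:00:10,000\n" + text + "\n")

# Mux: keep the matched clip's frames, replace its sound with the user's
# audio, and render the subtitles. Assumes ffmpeg with libass on PATH.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "matched_clip.mp4",        # video selected in step 6 (hypothetical name)
    "-i", "input.wav",               # the user's audio from step 1
    "-vf", "subtitles=subs.srt",     # burn the subtitles into the frames
    "-map", "0:v", "-map", "1:a",    # clip's video stream, user's audio stream
    "-c:v", "libx264", "-c:a", "aac",
    "-shortest", "recommended_video.mp4",
], check=True)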
The method for converting audio into video based on video big data can automatically match and select suitable video clips from massive existing video data, quickly convert audio into corresponding video content, give users a stronger visual and auditory experience, and convey the information the author intends in a more vivid and intuitive form.
The foregoing is merely illustrative of the present invention and not restrictive, and other modifications and equivalents thereof may occur to those skilled in the art without departing from the spirit and scope of the present invention.

Claims (3)

1. A method for converting audio into video based on video big data, characterized by comprising the following steps:
1. a user inputs a piece of audio information;
2. the audio content is extracted as text information using a speech recognition technique;
3. keyword recognition and extraction are performed on the extracted text information using artificial-intelligence natural language processing; the keyword recognition and extraction comprise: obtaining word or character vectors through an embedding layer, inputting them into a bidirectional long short-term memory (LSTM) network, computing an unsupervised probability sequence over the BIO labeling scheme through a softmax hidden layer, and extracting the keyword sequence through a conditional random field (CRF) supervision layer;
4. the audio information is identified using a deep learning technique;
5. a video big data set is automatically tagged, based on video understanding, using a deep learning technique; the automatic tagging comprises: extracting spatio-temporal information from the video with a deep 3D convolutional neural network, performing scene recognition, motion capture and emotion analysis, and extracting the video's scene information, object information, facial expressions and motion information as its tag content;
6. in the tagged video big data set, tag retrieval and matching are performed against the keywords, and video data with a high matching degree are output; a second similarity comparison is performed between the tag values of the matched videos and the audio information, and the video data conforming to the audio feature values are output, wherein the audio feature values comprise at least one of the following: melodic rhythm, emotion, genre;
7. the text information extracted in step 2 is rendered as video subtitles;
8. the video, subtitles and audio are merged and rendered to generate a recommended video.
2. The method for converting audio into video based on video big data according to claim 1, wherein the speech recognition technique in step 2 uses a deep neural network.
3. The method for converting audio into video based on video big data according to claim 1, wherein the specific method of step 4 is as follows: a GRU network is used to recognize the melodic rhythm, emotion and genre features of the audio.
CN202010025473.2A 2020-01-10 2020-01-10 Method for converting audio into video based on video big data Active CN111259109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010025473.2A CN111259109B (en) Method for converting audio into video based on video big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010025473.2A CN111259109B (en) Method for converting audio into video based on video big data

Publications (2)

Publication Number Publication Date
CN111259109A CN111259109A (en) 2020-06-09
CN111259109B (en) 2023-12-05

Family

ID=70950316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010025473.2A Active CN111259109B (en) Method for converting audio into video based on video big data

Country Status (1)

Country Link
CN (1) CN111259109B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3184814A1 (en) * 2020-07-03 2022-01-06 Nazar Yurievych PONOCHEVNYI A system (variants) for providing a harmonious combination of video files and audio files and a related method
CN113656643A (en) * 2021-08-20 2021-11-16 珠海九松科技有限公司 Algorithm for analyzing film-watching mood by using AI (artificial intelligence)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793446A * 2012-10-29 2014-05-14 汤晓鸥 (Tang Xiaoou) Music video generation method and system
CN107832382A (en) * 2017-10-30 2018-03-23 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on word generation video
CN108986186A (en) * 2018-08-14 2018-12-11 山东师范大学 The method and system of text conversion video

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2359918A (en) * 2000-03-01 2001-09-05 Sony Uk Ltd Audio and/or video generation apparatus having a metadata generator
EP3532906A4 (en) * 2016-10-28 2020-04-15 Vilynx, Inc. Video tagging system and method
KR102085908B1 (en) * 2018-05-10 2020-03-09 네이버 주식회사 Content providing server, content providing terminal and content providing method


Also Published As

Publication number Publication date
CN111259109A (en) 2020-06-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210220

Address after: 518000, 35th Floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Address before: 310012, No. 2-10, north of Building 13, 199 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant before: Hangzhou Huichuan Intelligent Technology Co.,Ltd.

GR01 Patent grant