CN111259109B - Method for converting audio into video based on video big data - Google Patents

Method for converting audio into video based on video big data

Info

Publication number
CN111259109B
Authority
CN
China
Prior art keywords
video
audio
information
big
text information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010025473.2A
Other languages
Chinese (zh)
Other versions
CN111259109A (en)
Inventor
康洪文 (Kang Hongwen)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202010025473.2A
Publication of CN111259109A
Application granted
Publication of CN111259109B
Legal status: Active


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73Querying
    • G06F16/735Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/16Speech classification or search using artificial neural networks
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/08Speech classification or search
    • G10L15/18Speech classification or search using natural language modelling
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue

Abstract

The invention discloses a method for converting audio into video based on video big data, comprising the following steps: a user inputs a piece of audio information; the speech is extracted as text information using speech recognition; keyword recognition and extraction are performed on the extracted text using artificial-intelligence natural language processing; the audio information is identified using deep learning; a video big data set is automatically tagged, based on video understanding, using deep learning; tag retrieval and matching are performed in the tagged video data set, and video data with a high matching degree are output; the extracted text is rendered as subtitle information; and the video, subtitles and audio are merged to generate a recommended video. The invention converts user-supplied audio into corresponding video content, greatly reducing the time cost of manually synthesizing video and improving content-creation efficiency.

Description

Method for converting audio into video based on video big data
Technical Field
The invention relates to the technical field of media resource management, and in particular to a method for converting audio into video based on video big data.
Background
For a content producer, audio alone cannot give the audience visual information, which hinders understanding and acceptance of the content. The traditional manual approach to converting audio into video requires collecting, browsing and annotating large amounts of video data by hand and then selecting clips that match the audio, consuming enormous time and effort.
Disclosure of Invention
To address the defects and shortcomings of the prior art, the invention provides a method for converting audio into video based on video big data. It automatically matches and selects suitable video clips from a massive library of existing video data and quickly converts audio into corresponding video content, giving users a stronger visual and auditory experience and conveying the author's message in a more vivid and intuitive form.
In order to achieve the above purpose, the invention adopts the following technical scheme, comprising the steps below (an illustrative code skeleton of the whole pipeline follows the list):
1. a user inputs a piece of audio information;
2. the audio content is extracted as text information using a speech recognition technique;
3. keyword recognition and extraction are performed on the extracted text information using artificial-intelligence natural language processing;
4. the audio information is identified using a deep learning technique;
5. a video big data set is automatically tagged, based on video understanding, using a deep learning technique;
6. tag retrieval and matching are performed in the tagged video big data set, and video data with a high matching degree are output;
7. the text information extracted in step 2 is rendered as video subtitles;
8. the video, subtitles and audio are merged and rendered to generate a recommended video.
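The eight steps compose into a simple pipeline. The skeleton below is purely illustrative: every function name is hypothetical and every body is a placeholder standing in for the corresponding step, not the patent's implementation.

# Hypothetical skeleton of the eight-step pipeline; nothing here is
# prescribed by the patent, each stub stands in for one step.

def transcribe(audio_path):                    # step 2: speech recognition
    return "transcript of the input audio"

def extract_keywords(text):                    # step 3: embedding -> BiLSTM -> softmax -> CRF
    return ["keyword"]

def analyze_audio(audio_path):                 # step 4: GRU melody/emotion/genre features
    return {"rhythm": "slow", "emotion": "calm", "genre": "ambient"}

def match_clips(keywords, audio_features):     # steps 5-6: tag retrieval and matching
    return ["clip_001.mp4"]

def make_subtitles(text):                      # step 7: subtitles from the transcript
    return "1\n00:00:00,000 --> 00:00:10,000\n" + text + "\n"

def render(clips, subtitles, audio_path):      # step 8: merge and render
    return "recommended_video.mp4"

if __name__ == "__main__":
    text = transcribe("input.wav")
    clips = match_clips(extract_keywords(text), analyze_audio("input.wav"))
    print(render(clips, make_subtitles(text), "input.wav"))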
Further, the speech recognition technique in step 2 uses a deep neural network;
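The patent does not name a concrete network. As a stand-in only, the sketch below transcribes audio with torchaudio's pretrained wav2vec2 ASR bundle and greedy CTC decoding; the model choice and the file name input.wav are assumptions, not the patent's method.

import torch
import torchaudio

# Stand-in DNN speech recognizer (wav2vec2), not the patent's model.
bundle = torchaudio.pipelines.WAV2VEC2_ASR_BASE_960H
model = bundle.get_model().eval()
labels = bundle.get_labels()                   # ('-', '|', 'E', 'T', ...)

waveform, sr = torchaudio.load("input.wav")    # hypothetical input file
if sr != bundle.sample_rate:
    waveform = torchaudio.functional.resample(waveform, sr, bundle.sample_rate)

with torch.inference_mode():
    emissions, _ = model(waveform)             # per-frame character logits

# Greedy CTC decode: collapse repeated frames, drop the blank token '-'.
ids = emissions[0].argmax(dim=-1).tolist()
collapsed = [i for i, prev in zip(ids, [None] + ids) if i != prev]
text = "".join(labels[i] for i in collapsed if labels[i] != "-").replace("|", " ")
print(text)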
Further, the specific method of step 3 is as follows: word or character vectors are obtained through an embedding layer, input into a bidirectional LSTM, an unsupervised probability sequence over the BIO labeling scheme is computed through a softmax hidden layer, and the keyword sequence is extracted through a CRF supervision layer;
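A minimal sketch of such a tagger, assuming the third-party pytorch-crf package for the CRF layer; the vocabulary size, layer dimensions and three-tag BIO scheme (O/B/I) are illustrative choices, not values from the patent.

import torch
import torch.nn as nn
from torchcrf import CRF          # third-party package: pip install pytorch-crf

# embedding -> BiLSTM -> per-token emission scores -> CRF over BIO tags.
class KeywordTagger(nn.Module):
    def __init__(self, vocab_size=10000, emb_dim=128, hidden=256, num_tags=3):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden // 2, bidirectional=True,
                            batch_first=True)
        self.proj = nn.Linear(hidden, num_tags)   # the "softmax hidden layer"
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, tokens, tags=None):
        emissions = self.proj(self.lstm(self.emb(tokens))[0])
        if tags is not None:
            return -self.crf(emissions, tags)     # training loss (neg. log-likelihood)
        return self.crf.decode(emissions)         # inference: best BIO sequence

tagger = KeywordTagger()
tokens = torch.randint(0, 10000, (1, 12))         # one sentence, 12 token ids
print(tagger(tokens))                             # e.g. [[0, 1, 2, 0, ...]]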
Further, the specific method of step 4 is as follows: a GRU network is used to recognize the melodic rhythm, emotion and genre features of the audio;
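One possible realization, sketched under assumptions the patent leaves open (log-mel input frames of dimension 64, illustrative class counts, one shared GRU feeding three classification heads):

import torch
import torch.nn as nn

# GRU over frame-level audio features with heads for rhythm/emotion/genre.
class AudioTagger(nn.Module):
    def __init__(self, n_mels=64, hidden=128, n_rhythm=3, n_emotion=6, n_genre=10):
        super().__init__()
        self.gru = nn.GRU(n_mels, hidden, batch_first=True)
        self.rhythm = nn.Linear(hidden, n_rhythm)
        self.emotion = nn.Linear(hidden, n_emotion)
        self.genre = nn.Linear(hidden, n_genre)

    def forward(self, frames):                 # frames: (batch, time, n_mels)
        _, h = self.gru(frames)                # h: (1, batch, hidden)
        h = h[-1]                              # final hidden state per clip
        return self.rhythm(h), self.emotion(h), self.genre(h)

model = AudioTagger()
logits = model(torch.randn(2, 200, 64))        # two clips, 200 frames each
print([t.shape for t in logits])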
Further, the specific method of step 5 is as follows: a deep 3D convolutional neural network is used to extract spatio-temporal information from the video and perform scene recognition, motion capture and emotion analysis, extracting the video's scene information, object information, facial expressions and motion information as its tag content;
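As an illustration only, the sketch below pushes a clip through torchvision's pretrained R3D-18 action-recognition network; the patent's actual 3D CNN, its tag vocabulary and its heads for scenes, objects and expressions are not specified, so Kinetics action classes stand in for the motion tags.

import torch
from torchvision.models.video import r3d_18

# Pretrained 3D CNN as a stand-in video tagger (Kinetics-400 action classes).
model = r3d_18(weights="DEFAULT").eval()

clip = torch.randn(1, 3, 16, 112, 112)         # (batch, rgb, frames, height, width)
with torch.inference_mode():
    logits = model(clip)                       # (1, 400) class scores

top5 = logits.softmax(-1).topk(5)
print(top5.indices.tolist(), top5.values.tolist())  # candidate motion tags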
Further, the specific method of step 6 is as follows: similarity is computed between the keyword-sequence feature values extracted in step 3 and the tag feature values in the video tag library built in step 5, and videos with similarity above 0.85 are output; the audio feature values extracted in step 4 (melodic rhythm, emotion, genre and the like) are then matched a second time against the tag values of those output videos, and videos with similarity above 0.8 are output.
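The two thresholds (0.85 and 0.8) are the patent's; everything else in the sketch below is assumed, including cosine similarity as the metric, 128-dimensional feature vectors and a randomly generated library (random vectors will usually pass neither stage):

import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

rng = np.random.default_rng(0)
keyword_vec = rng.normal(size=128)             # step-3 keyword feature values
melody_vec = rng.normal(size=128)              # step-4 rhythm/emotion/genre values
library = {f"clip_{i:03d}": (rng.normal(size=128), rng.normal(size=128))
           for i in range(1000)}               # id -> (tag vector, audio-style vector)

# Stage 1: keyword features vs. the video tag library, threshold 0.85.
stage1 = {vid: vecs for vid, vecs in library.items()
          if cosine(keyword_vec, vecs[0]) > 0.85}
# Stage 2: audio melody features vs. the survivors' tags, threshold 0.8.
stage2 = [vid for vid, vecs in stage1.items()
          if cosine(melody_vec, vecs[1]) > 0.8]
print(stage2)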
With this scheme, the beneficial effects of the invention are as follows: the video synthesis method, built on a video big data tag-matching system constructed with artificial-intelligence natural language processing and deep learning, converts user-supplied audio into corresponding video content, greatly reducing the time cost of manually synthesizing video and improving content-creation efficiency.
Drawings
In order to more clearly illustrate the embodiments of the invention or the technical solutions of the prior art, the drawings used in their description are briefly introduced below. Obviously, the drawings described below show only some embodiments of the invention, and a person skilled in the art can derive other drawings from them without inventive effort.
Fig. 1 is a flow chart of the present invention.
Detailed Description
The invention will be further described with reference to the accompanying drawings.
Referring to fig. 1, the technical scheme adopted in this embodiment comprises the following steps (a combined sketch of steps 7 and 8 follows this list):
1. a user inputs a piece of audio information;
2. the audio content is extracted as text information using a deep neural network;
3. keyword recognition and extraction are performed on the extracted text information using artificial-intelligence natural language processing: word or character vectors are obtained through an embedding layer, input into a bidirectional LSTM, an unsupervised probability sequence over the BIO labeling scheme is computed through a softmax hidden layer, and the keyword sequence is extracted through a CRF supervision layer;
4. the audio information is identified using a deep learning technique, specifically a GRU network that recognizes the melodic rhythm, emotion and genre features of the audio;
5. based on video understanding, a video big data set is automatically tagged using a deep learning technique, specifically: a deep 3D convolutional neural network extracts spatio-temporal information from the video and performs scene recognition, motion capture, emotion analysis and the like, extracting the video's scene information, object information, facial expressions, motion information and the like as its tag content;
6. in the tagged video big data set, tag retrieval and matching are performed and video data with a high matching degree are output, specifically: similarity is computed between the keyword-sequence feature values extracted in step 3 and the tag feature values in the video tag library built in step 5, and videos with similarity above 0.85 are output; the audio feature values extracted in step 4 (melodic rhythm, emotion, genre and the like) are then matched a second time against the tag values of those output videos, and videos with similarity above 0.8 are output;
7. the text information extracted in step 2 is rendered as video subtitles;
8. the video, subtitles and audio are merged and rendered to generate a recommended video (see the sketch after this list).
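Steps 7 and 8 amount to subtitle generation plus an audio/video mux. A minimal sketch, assuming ffmpeg (with a libass-enabled build for subtitle rendering) is installed and using made-up file names and a single ten-second subtitle cue:

import subprocess

# Write the step-2 transcript as a one-cue SRT file (timing is made up).
text = "transcript produced in step 2"
with open("subs.srt", "w", encoding="utf-8") as f:
    f.write("1\n00:00:00,000 --> 00:00:10,000\n" + text + "\n")

# Mux: keep the matched clip's frames, replace its sound with the user's
# audio, and render the subtitles. Assumes ffmpeg with libass on PATH.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "matched_clip.mp4",        # video selected in step 6 (hypothetical name)
    "-i", "input.wav",               # the user's audio from step 1
    "-vf", "subtitles=subs.srt",     # burn the subtitles into the frames
    "-map", "0:v", "-map", "1:a",    # clip's video stream, user's audio stream
    "-c:v", "libx264", "-c:a", "aac",
    "-shortest", "recommended_video.mp4",
], check=True)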
The method for converting audio into video based on video big data can automatically match and select suitable video clips from massive existing video data, quickly convert audio into corresponding video content, give users a stronger visual and auditory experience, and convey the information the author intends in a more vivid and intuitive form.
The foregoing is merely illustrative of the present invention and not restrictive, and other modifications and equivalents thereof may occur to those skilled in the art without departing from the spirit and scope of the present invention.

Claims (3)

1. A method for converting audio into video based on video big data, characterized by comprising the following steps:
1. a user inputs a piece of audio information;
2. the audio content is extracted as text information using a speech recognition technique;
3. keyword recognition and extraction are performed on the extracted text information using artificial-intelligence natural language processing; the keyword recognition and extraction comprise: obtaining word or character vectors through an embedding layer, inputting them into a bidirectional long short-term memory (LSTM) network, computing an unsupervised probability sequence over the BIO labeling scheme through a softmax hidden layer, and extracting the keyword sequence through a conditional random field (CRF) supervision layer;
4. the audio information is identified using a deep learning technique;
5. a video big data set is automatically tagged, based on video understanding, using a deep learning technique; the automatic tagging comprises: extracting spatio-temporal information from the video with a deep 3D convolutional neural network, performing scene recognition, motion capture and emotion analysis, and extracting the video's scene information, object information, facial expressions and motion information as its tag content;
6. in the tagged video big data set, tag retrieval and matching are performed against the keywords, and video data with a high matching degree are output; a second similarity comparison is performed between the tag values of the matched videos and the audio information, and the video data conforming to the audio feature values are output, wherein the audio feature values comprise at least one of the following: melodic rhythm, emotion, genre;
7. the text information extracted in step 2 is rendered as video subtitles;
8. the video, subtitles and audio are merged and rendered to generate a recommended video.
2. The method for converting audio into video based on video big data according to claim 1, wherein the speech recognition technique in step 2 uses a deep neural network.
3. The method for converting audio into video based on video big data according to claim 1, wherein the specific method of step 4 is as follows: a GRU network is used to recognize the melodic rhythm, emotion and genre features of the audio.
CN202010025473.2A 2020-01-10 2020-01-10 Method for converting audio into video based on video big data Active CN111259109B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010025473.2A CN111259109B (en) Method for converting audio into video based on video big data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010025473.2A CN111259109B (en) Method for converting audio into video based on video big data

Publications (2)

Publication Number Publication Date
CN111259109A CN111259109A (en) 2020-06-09
CN111259109B (en) 2023-12-05

Family

ID=70950316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010025473.2A Active CN111259109B (en) Method for converting audio into video based on video big data

Country Status (1)

Country Link
CN (1) CN111259109B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3184814A1 (en) * 2020-07-03 2022-01-06 Nazar Yurievych PONOCHEVNYI A system (variants) for providing a harmonious combination of video files and audio files and a related method
CN113656643A (en) * 2021-08-20 2021-11-16 珠海九松科技有限公司 Algorithm for analyzing film-watching mood by using AI (artificial intelligence)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103793446A * 2012-10-29 2014-05-14 汤晓鸥 (Tang Xiaoou) Music video generation method and system
CN107832382A (en) * 2017-10-30 2018-03-23 百度在线网络技术(北京)有限公司 Method, apparatus, equipment and storage medium based on word generation video
CN108986186A (en) * 2018-08-14 2018-12-11 山东师范大学 The method and system of text conversion video

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB2359918A (en) * 2000-03-01 2001-09-05 Sony Uk Ltd Audio and/or video generation apparatus having a metadata generator
EP3532906A4 (en) * 2016-10-28 2020-04-15 Vilynx, Inc. Video tagging system and method
KR102085908B1 (en) * 2018-05-10 2020-03-09 네이버 주식회사 Content providing server, content providing terminal and content providing method


Also Published As

Publication number Publication date
CN111259109A (en) 2020-06-09


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20210220

Address after: 518000, 35th Floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province

Applicant after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Address before: 310012, No. 2-10, north of Building 13, 199 Wensan Road, Xihu District, Hangzhou City, Zhejiang Province

Applicant before: Hangzhou Huichuan Intelligent Technology Co.,Ltd.

GR01 Patent grant