CN110781328A - Video generation method, system, device and storage medium based on voice recognition - Google Patents


Info

Publication number
CN110781328A
CN110781328A (application CN201910846382.2A)
Authority
CN
China
Prior art keywords
information
video
character
voice
picture
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910846382.2A
Other languages
Chinese (zh)
Inventor
呼伦夫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Lajin Zhongbo Technology Co ltd
Original Assignee
Tianmai Juyuan (hangzhou) Media Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianmai Juyuan (Hangzhou) Media Technology Co Ltd
Priority to CN201910846382.2A
Publication of CN110781328A
Legal status: Pending

Links

Images

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/50: Information retrieval; Database structures therefor; File system structures therefor of still image data
    • G06F 16/53: Querying
    • G06F 16/532: Query formulation, e.g. graphical querying
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/70: Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F 16/73: Querying
    • G06F 16/732: Query formulation
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/26: Speech to text systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a video generation method, system, device, and storage medium based on voice recognition. The method comprises the following steps: acquiring voice information, and generating text information after recognizing the voice information; analyzing the text information to obtain text features; acquiring picture information and/or video information by combining the text features with a preset retrieval model; and generating video data by combining the voice information with the picture information and/or video information. The invention automatically recognizes and analyzes the voice information input by the user and acquires picture information and/or video information according to the text features obtained by the analysis, so there is no need to manually search for and collect picture or video material. This greatly reduces material-collection time, improves video production efficiency, achieves rapid video production, and can be widely applied in the field of video production.

Description

Video generation method, system, device and storage medium based on voice recognition
Technical Field
The present invention relates to the field of video production, and in particular, to a method, a system, an apparatus, and a storage medium for generating a video based on speech recognition.
Background
With the development of Internet technology and self-media, many video platforms and corresponding video applications have emerged, such as Toutiao, Xigua Video, and Douyin, along with many internet celebrities and self-media bloggers. Bloggers earn clicks and attract followers by producing videos and publishing them through these applications, for example movie-review or current-affairs commentary videos. When making a video, a blogger must not only write a script but also collect picture or video material, so producing a video takes a great deal of time, which seriously limits production efficiency. Bloggers therefore urgently need a scheme that improves video production efficiency, but no such scheme currently exists.
Disclosure of Invention
In order to solve the above technical problem, an object of the present invention is to provide a method, system, apparatus, and storage medium capable of rapidly producing a video based on voice recognition.
The first technical scheme adopted by the invention is as follows:
a video generation method based on voice recognition comprises the following steps:
acquiring voice information, and generating text information after recognizing the voice information;
analyzing the text information to obtain text features;
acquiring picture information and/or video information by combining the text features with a preset retrieval model;
and generating video data by combining the voice information with the picture information and/or video information.
Further, the step of analyzing the text information to obtain text features specifically comprises the following steps:
identifying noun words in the text information, and counting the number of occurrences of each noun word;
and acquiring a plurality of key noun words as text features according to the number of occurrences of each noun word.
Further, the preset retrieval model is a web crawler model, and acquiring picture information and/or video information by combining the text features with the preset retrieval model specifically comprises:
scanning and retrieving in the network by combining the text features with the web crawler model, and acquiring picture information and/or video information corresponding to the text features.
Further, the step of generating video data by combining the voice information with the picture information and/or video information specifically comprises the following steps:
laying out the retrieved picture information and/or video information;
and synthesizing the voice information with the picture information and/or video information into video data by using a preset rendering engine.
Further, the step of synthesizing the voice information with the picture information and/or video information into video data by using a preset rendering engine specifically comprises:
acquiring a playing scene model by combining the text features with a preset model database;
and synthesizing the voice information, the playing scene model, and the picture information and/or video information into video data by using the preset rendering engine.
Further, the method also comprises a subtitle generation step, which specifically comprises the following steps:
dividing the text information into a plurality of subtitles in a preset manner, numbering and ordering each subtitle, and playing the subtitles in that order;
and recognizing the words of the voice information during video playback and controlling the display time of each subtitle according to the recognized words, so that the voice in the video is synchronized with the subtitles.
The second technical scheme adopted by the invention is as follows:
a video generation system based on speech recognition, comprising:
the voice conversion module is used for acquiring voice information, identifying the voice information and generating character information;
the character analysis module is used for analyzing the character information to obtain character characteristics;
the picture acquisition module is used for acquiring picture information and/or video information by combining the character characteristics and a preset retrieval model;
and the video generation module is used for generating video data by combining the voice information and the picture information and/or the video information.
Furthermore, the character analysis module comprises a vocabulary statistics unit and a feature extraction unit;
the word counting unit is used for identifying noun words in the character information and counting the occurrence frequency of each noun word;
the feature extraction unit is used for acquiring a plurality of key noun vocabularies as character features according to the occurrence frequency of each noun vocabulary.
The third technical scheme adopted by the invention is as follows:
a video generation apparatus based on speech recognition, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The fourth technical scheme adopted by the invention is as follows:
a storage medium having stored therein processor-executable instructions for performing the method as described above when executed by a processor.
The invention has the following beneficial effects: the invention automatically recognizes and analyzes the voice information input by the user and acquires picture information and/or video information according to the text features obtained by the analysis, so there is no need to manually search for and collect picture or video material; this greatly reduces material-collection time, improves video production efficiency, and achieves rapid video production.
Drawings
FIG. 1 is a flow chart of the steps of a method for video generation based on speech recognition according to the present invention;
FIG. 2 is a block diagram of a video generation system based on speech recognition according to the present invention;
FIG. 3 is a diagram of a playback scene model in an exemplary embodiment;
FIG. 4 is a diagram of another playback scene model in an exemplary embodiment.
Detailed Description
As shown in fig. 1, the present embodiment provides a video generation method based on speech recognition, including the following steps:
s1, acquiring voice information, recognizing the voice information and generating character information;
s2, analyzing the character information to obtain character characteristics;
s3, combining the character features and a preset retrieval model to obtain picture information and/or video information;
and S4, combining the voice information and the picture information and/or the video information to generate video data.
In the method of this embodiment, the voice information may be voice data downloaded from the Internet or voice data spoken by the user; in this embodiment it is the user's spoken input, for example, after a blogger writes a script, he or she reads it aloud to input the voice information. After the voice information is obtained, it is recognized to generate text information. Recognition can be implemented with existing technology, for example the Baidu AI open platform or existing speech-to-text software. The text information is then analyzed to extract the main information, namely the text features. According to the obtained text features, picture information and/or video information is acquired with a preset retrieval model, which may be a web crawler model or an image-text cross-modal retrieval model. The picture information is the pictures corresponding to the text features, and the video information is the videos corresponding to the text features; for example, if a text feature is "bridge", pictures of several bridges or aerial videos of bridges are acquired. Finally, the voice information is combined with the picture information and/or video information to generate video data. Thus the user only needs to input voice information, and the corresponding picture and/or video material is acquired automatically, sparing the user the trouble of collecting and editing material, greatly shortening production time, and achieving rapid video production.
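Steps S1 to S4 can be sketched as a minimal pipeline. Every helper below is a toy stand-in with a hypothetical name: a real system would call a speech recognizer (such as the Baidu AI open platform mentioned above), a part-of-speech tagger, a web crawler, and a rendering engine instead.

```python
from collections import Counter

def recognize_speech(voice):
    # S1: speech -> text (stub: the "voice" is already a transcript here)
    return voice

def extract_keywords(text, top_n=2):
    # S2: keep the most frequent content words as text features
    stop = {"the", "a", "is", "of", "about", "it"}
    words = [w.strip(".,").lower() for w in text.split()]
    return [w for w, _ in Counter(w for w in words if w not in stop).most_common(top_n)]

def crawl_media(features):
    # S3: stub retrieval -- pretend each keyword yields one picture
    return [f"{feat}.jpg" for feat in features]

def render_video(voice, media):
    # S4: stub synthesis -- pair the narration with the ordered frames
    return {"audio": voice, "frames": media}

transcript = recognize_speech("The bridge spans the river. The bridge is long.")
features = extract_keywords(transcript)
print(render_video(transcript, crawl_media(features))["frames"])
```

Each stub corresponds to one numbered step, so swapping in real components does not change the data flow.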
Step S2 specifically comprises steps S21 to S22:
S21, identifying noun words in the text information, and counting the number of occurrences of each noun word;
S22, acquiring a plurality of key noun words as text features according to the number of occurrences of each noun word.
After the noun words in the text information are identified, the number of occurrences of each word is counted, and several key noun words are obtained from the statistics, for example the five most frequent noun words. The obtained key noun words are then used as text features. For example, if the input text is a commentary on the movie "A Chinese Odyssey" and the recognized keywords are "A Chinese Odyssey", "Zhizunbao", "Zixia Fairy", and "Stephen Chow", then the corresponding pictures or videos, such as clips or stills from "A Chinese Odyssey", can be retrieved from the obtained keywords.
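The frequency-based keyword selection described here can be sketched as follows. The noun list is assumed to come from an upstream part-of-speech tagger, which the sketch does not include.

```python
from collections import Counter

def top_keywords(nouns, top_n=5):
    """Count how often each noun occurs and keep the most frequent ones
    as the text features (the example above keeps the top five)."""
    return [word for word, _ in Counter(nouns).most_common(top_n)]

# Toy input: nouns assumed to be pre-extracted by a part-of-speech tagger
nouns = ["bridge", "river", "bridge", "city", "bridge", "river", "sunset"]
print(top_keywords(nouns, top_n=3))
```

`Counter.most_common` sorts by count and, for ties, preserves first-seen order, which gives a stable ranking of the features.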
The preset retrieval model is a web crawler model, and step S3 specifically comprises: scanning and retrieving in the network by combining the text features with the web crawler model, and acquiring picture information and/or video information corresponding to the text features.
The preset retrieval model may be a web crawler model or an image-text cross-modal retrieval model. When an image-text cross-modal retrieval model is used, an image-text database must be built in advance, and the final picture information is obtained by comparing the similarity matrix of images and text; this model matches accurately, but it requires building a database whose resources are relatively limited. This scheme therefore uses a web crawler model, which retrieves the picture information and/or video information corresponding to the text features directly from the network. The web crawler model is implemented with an existing model, and no special model structure is required.
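A minimal illustration of the crawler side, using only the Python standard library: the HTTP fetch is omitted and a hypothetical page body is fed in directly. A production crawler would also handle fetching, pagination, politeness, and deduplication.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class ImageLinkParser(HTMLParser):
    """Collect absolute image URLs from one fetched HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Resolve relative links against the page URL
                self.image_urls.append(urljoin(self.base_url, src))

# Feed a hypothetical page body instead of performing a real HTTP fetch
page = '<html><body><img src="/photos/bridge1.jpg"><img src="bridge2.png"></body></html>'
parser = ImageLinkParser("https://example.com/gallery")
parser.feed(page)
print(parser.image_urls)
```

The collected URLs would then be downloaded and passed to the layout step described below.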
Step S4 specifically comprises steps S41 to S42:
S41, laying out the retrieved picture information and/or video information;
S42, synthesizing the voice information with the picture information and/or video information into video data by using a preset rendering engine.
The picture information and/or video information is laid out either manually by the user or automatically by the system. With manual layout, the playing order and playing duration of each picture or video can be adjusted by hand to generate a continuous picture sequence. With automatic layout, the system orders the pictures or videos automatically. Finally, an AI rendering engine synthesizes the voice information with the picture information and/or video information into video data; for example, a generated news video plays the narration of the news script on its audio track while the corresponding pictures and/or videos play continuously on its video track.
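One plausible way to realize this synthesis step outside a proprietary rendering engine is ffmpeg's concat demuxer. The sketch below only assembles the command line (the file names are placeholders) and assumes ffmpeg is installed separately; it does not execute anything.

```python
def build_ffmpeg_command(image_list_file, audio_file, output_file):
    """Assemble (but do not run) an ffmpeg invocation that concatenates an
    ordered list of stills and muxes in a narration track."""
    return [
        "ffmpeg",
        "-f", "concat", "-safe", "0", "-i", image_list_file,  # ordered stills
        "-i", audio_file,                                     # narration audio
        "-c:v", "libx264", "-pix_fmt", "yuv420p",             # widely playable video
        "-c:a", "aac",
        "-shortest",                                          # stop at the shorter stream
        output_file,
    ]

print(" ".join(build_ffmpeg_command("slides.txt", "narration.wav", "out.mp4")))
```

Here `slides.txt` would be the concat-demuxer list produced by the layout step, one `file`/`duration` pair per picture.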
Step S42 specifically comprises steps A1 to A2:
A1, acquiring a playing scene model by combining the text features with a preset model database;
and A2, synthesizing the voice information, the playing scene model, and the picture information and/or video information into video data by using the preset rendering engine.
To enrich playback, a playing scene model is provided. The playing scene model is a virtual broadcast scene, comprising for example a virtual host and a virtual broadcast background, and the corresponding playing scene model is obtained according to the text features. Referring to fig. 3, when a user produces a movie-commentary video, the recognized keywords are matched to obtain a broadcast background associated with movies: in it the virtual host dresses casually, the preset gestures are lively, and a virtual video window plays the picture data and/or video data composed by the user. Referring to fig. 4, when the user produces a current-affairs commentary video, the recognized keywords are matched to obtain a broadcast background associated with current affairs: in it the virtual host dresses more formally and the preset gestures are more restrained. The model database contains several pre-designed playing scene models; providing different playing scene models for different types of video content gives viewers different visual experiences and makes the video content richer.
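The keyword-to-scene lookup can be sketched as a simple mapping. The topic keys and model attributes below are invented for illustration, not taken from the patent's model database.

```python
# Invented topic keys and attributes, standing in for the pre-designed
# playing scene models in the model database.
SCENE_MODELS = {
    "movie": {"host_outfit": "casual", "host_manner": "lively", "background": "cinema"},
    "news": {"host_outfit": "formal", "host_manner": "measured", "background": "studio"},
}

def pick_scene_model(keywords, default="news"):
    """Return the first scene model whose topic key matches a keyword,
    falling back to a default model."""
    for kw in keywords:
        if kw in SCENE_MODELS:
            return SCENE_MODELS[kw]
    return SCENE_MODELS[default]

print(pick_scene_model(["movie", "bridge"]))
```

A larger database would replace the dictionary with indexed records, but the selection logic stays the same: match text features to a topic, then load that topic's scene assets.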
Further, as a preferred embodiment, the method also comprises a subtitle generation step, which specifically comprises the following steps:
dividing the text information into a plurality of subtitles in a preset manner, numbering and ordering each subtitle, and playing the subtitles in that order;
and recognizing the words of the voice information during video playback and controlling the display time of each subtitle according to the recognized words, so that the voice in the video is synchronized with the subtitles.
In this embodiment, the text information is first divided into several subtitle segments. The division may follow punctuation, for example starting a new subtitle at each comma or period, or follow a character count, for example fixing the number of characters per subtitle. After division, the subtitles are ordered and displayed in sequence. The display duration of each subtitle is controlled as follows: the words of the voice information are recognized during playback and matched against the words in each subtitle, which yields the display duration of each subtitle and keeps the subtitles synchronized with the video.
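The two division strategies (punctuation-based, then a character-count cap) can be combined in a short splitter; `max_chars = 20` is an arbitrary choice for illustration.

```python
import re

def split_subtitles(text, max_chars=20):
    """Split transcript text into numbered subtitle segments: first at
    punctuation marks, then cap each segment at max_chars characters."""
    segments = []
    for clause in re.split(r"[,.;!?，。；！？]", text):
        clause = clause.strip()
        while len(clause) > max_chars:
            segments.append(clause[:max_chars].strip())
            clause = clause[max_chars:].strip()
        if clause:
            segments.append(clause)
    return list(enumerate(segments, start=1))  # numbered, ordered subtitles

for idx, seg in split_subtitles("The bridge opened in 2019. It spans the river, linking two districts."):
    print(idx, seg)
```

A real splitter would avoid cutting mid-word; the character cap here is the simplest form of the fixed-count rule described above.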
As shown in fig. 2, the present embodiment further provides a video generation system based on speech recognition, including:
a voice conversion module for acquiring voice information, recognizing the voice information, and generating text information;
a text analysis module for analyzing the text information to obtain text features;
a picture acquisition module for acquiring picture information and/or video information by combining the text features with a preset retrieval model;
and a video generation module for generating video data by combining the voice information with the picture information and/or video information.
Further, as a preferred embodiment, the text analysis module comprises a vocabulary statistics unit and a feature extraction unit;
the vocabulary statistics unit is used to identify noun words in the text information and count the number of occurrences of each noun word;
the feature extraction unit is used to acquire a plurality of key noun words as text features according to the number of occurrences of each noun word.
The video generation system based on speech recognition of the embodiment can execute the video generation method based on speech recognition provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The embodiment further provides a video generation device based on speech recognition, including:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method described above.
The video generation device based on speech recognition of the embodiment can execute the video generation method based on speech recognition provided by the embodiment of the method of the invention, can execute any combination of the implementation steps of the embodiment of the method, and has corresponding functions and beneficial effects of the method.
The present embodiments also provide a storage medium having stored therein processor-executable instructions, which when executed by a processor, are configured to perform the method as described above.
The storage medium of this embodiment can execute the video generation method based on speech recognition provided by the method embodiments of the present invention, can execute any combination of the implementation steps of the method embodiments, and has corresponding functions and advantages of the method.
While the preferred embodiments of the present invention have been illustrated and described, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (10)

1. A video generation method based on voice recognition, characterized by comprising the following steps:
acquiring voice information, and generating text information after recognizing the voice information;
analyzing the text information to obtain text features;
acquiring picture information and/or video information by combining the text features with a preset retrieval model;
and generating video data by combining the voice information with the picture information and/or video information.
2. The video generation method based on voice recognition according to claim 1, wherein the step of analyzing the text information to obtain text features specifically comprises the following steps:
identifying noun words in the text information, and counting the number of occurrences of each noun word;
and acquiring a plurality of key noun words as text features according to the number of occurrences of each noun word.
3. The video generation method based on voice recognition according to claim 2, wherein the preset retrieval model is a web crawler model, and acquiring picture information and/or video information by combining the text features with the preset retrieval model specifically comprises:
scanning and retrieving in the network by combining the text features with the web crawler model, and acquiring picture information and/or video information corresponding to the text features.
4. The video generation method based on voice recognition according to claim 1, wherein the step of generating video data by combining the voice information with the picture information and/or video information specifically comprises the following steps:
laying out the retrieved picture information and/or video information;
and synthesizing the voice information with the picture information and/or video information into video data by using a preset rendering engine.
5. The video generation method based on voice recognition according to claim 4, wherein the step of synthesizing the voice information with the picture information and/or video information into video data by using a preset rendering engine specifically comprises:
acquiring a playing scene model by combining the text features with a preset model database;
and synthesizing the voice information, the playing scene model, and the picture information and/or video information into video data by using the preset rendering engine.
6. The video generation method based on voice recognition according to claim 1, further comprising a subtitle generation step, which specifically comprises the following steps:
dividing the text information into a plurality of subtitles in a preset manner, numbering and ordering each subtitle, and playing the subtitles in that order;
and recognizing the words of the voice information during video playback and controlling the display time of each subtitle according to the recognized words, so that the voice in the video is synchronized with the subtitles.
7. A video generation system based on voice recognition, characterized by comprising:
a voice conversion module for acquiring voice information, recognizing the voice information, and generating text information;
a text analysis module for analyzing the text information to obtain text features;
a picture acquisition module for acquiring picture information and/or video information by combining the text features with a preset retrieval model;
and a video generation module for generating video data by combining the voice information with the picture information and/or video information.
8. The video generation system based on voice recognition according to claim 7, wherein the text analysis module comprises a vocabulary statistics unit and a feature extraction unit;
the vocabulary statistics unit is used to identify noun words in the text information and count the number of occurrences of each noun word;
the feature extraction unit is used to acquire a plurality of key noun words as text features according to the number of occurrences of each noun word.
9. A video generation apparatus based on speech recognition, comprising:
at least one processor;
at least one memory for storing at least one program;
wherein the at least one program, when executed by the at least one processor, causes the at least one processor to implement the video generation method based on voice recognition according to any one of claims 1 to 6.
10. A storage medium having stored therein processor-executable instructions which, when executed by a processor, perform the method according to any one of claims 1 to 6.
CN201910846382.2A 2019-09-09 2019-09-09 Video generation method, system, device and storage medium based on voice recognition Pending CN110781328A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910846382.2A CN110781328A (en) 2019-09-09 2019-09-09 Video generation method, system, device and storage medium based on voice recognition

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910846382.2A CN110781328A (en) 2019-09-09 2019-09-09 Video generation method, system, device and storage medium based on voice recognition

Publications (1)

Publication Number Publication Date
CN110781328A true CN110781328A (en) 2020-02-11

Family

ID=69383403

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910846382.2A Pending CN110781328A (en) 2019-09-09 2019-09-09 Video generation method, system, device and storage medium based on voice recognition

Country Status (1)

Country Link
CN (1) CN110781328A (en)


Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111225237B (en) * 2020-04-23 2020-08-21 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
CN111225237A (en) * 2020-04-23 2020-06-02 腾讯科技(深圳)有限公司 Sound and picture matching method of video, related device and storage medium
US11972778B2 (en) 2020-04-23 2024-04-30 Tencent Technology (Shenzhen) Company Limited Sound-picture matching method of video, related apparatus, and storage medium
CN111586494B (en) * 2020-04-30 2022-03-11 腾讯科技(深圳)有限公司 Intelligent strip splitting method based on audio and video separation
CN111586494A (en) * 2020-04-30 2020-08-25 杭州慧川智能科技有限公司 Intelligent strip splitting method based on audio and video separation
CN112423094A (en) * 2020-10-30 2021-02-26 广州佰锐网络科技有限公司 Double-recording service broadcasting method and device and storage medium
CN112927566A (en) * 2021-01-27 2021-06-08 读书郎教育科技有限公司 System and method for student to rephrase story content
CN112800263A (en) * 2021-02-03 2021-05-14 上海艾麒信息科技股份有限公司 Video synthesis system, method and medium based on artificial intelligence
CN113329190A (en) * 2021-05-27 2021-08-31 武汉连岳传媒有限公司 Animation design video production analysis management method, equipment, system and computer storage medium
CN113497899A (en) * 2021-06-22 2021-10-12 深圳市大头兄弟科技有限公司 Character and picture matching method, device and equipment and storage medium
CN113438543B (en) * 2021-06-22 2023-02-03 深圳市大头兄弟科技有限公司 Matching method, device and equipment for converting document into video and storage medium
CN113438543A (en) * 2021-06-22 2021-09-24 深圳市大头兄弟科技有限公司 Matching method, device and equipment for converting document into video and storage medium
CN113992940A (en) * 2021-12-27 2022-01-28 北京美摄网络科技有限公司 Web end character video editing method, system, electronic equipment and storage medium
CN113992940B (en) * 2021-12-27 2022-03-29 北京美摄网络科技有限公司 Web end character video editing method, system, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110781328A (en) Video generation method, system, device and storage medium based on voice recognition
US10277946B2 (en) Methods and systems for aggregation and organization of multimedia data acquired from a plurality of sources
Pavel et al. Sceneskim: Searching and browsing movies using synchronized captions, scripts and plot summaries
CN105704538A (en) Method and system for generating audio and video subtitles
CN112632326B (en) Video production method and device based on video script semantic recognition
JP2004152063A (en) Structuring method, structuring device and structuring program of multimedia contents, and providing method thereof
JPWO2005069171A1 (en) Document association apparatus and document association method
JP6280312B2 (en) Minutes recording device, minutes recording method and program
CN110691271A (en) News video generation method, system, device and storage medium
CN113035199B (en) Audio processing method, device, equipment and readable storage medium
CN112399269B (en) Video segmentation method, device, equipment and storage medium
JP4192703B2 (en) Content processing apparatus, content processing method, and program
CN111279333B (en) Language-based search of digital content in a network
US20200151220A1 (en) Interactive representation of content for relevance detection and review
CN110740275A (en) nonlinear editing systems
CN110781346A (en) News production method, system, device and storage medium based on virtual image
CN116361510A (en) Method and device for automatically extracting and retrieving scenario segment video established by utilizing film and television works and scenario
US20220121712A1 (en) Interactive representation of content for relevance detection and review
CN109376145B (en) Method and device for establishing movie and television dialogue database and storage medium
KR20060100646A (en) Method and system for searching the position of an image thing
KR20220135901A (en) Devices, methods and programs for providing customized educational content
JP6603929B1 (en) Movie editing server and program
KR101783872B1 (en) Video Search System and Method thereof
JP2019197210A (en) Speech recognition error correction support device and its program
KR100348901B1 (en) Segmentation of acoustic scences in audio/video materials

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220929

Address after: Room 1602, 16th Floor, Building 18, Yard 6, Wenhuayuan West Road, Beijing Economic and Technological Development Zone, Daxing District, Beijing 100176

Applicant after: Beijing Lajin Zhongbo Technology Co.,Ltd.

Address before: 310000 room 650, building 3, No. 16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province

Applicant before: Tianmai Juyuan (Hangzhou) Media Technology Co.,Ltd.