CN110691258A - Program material production method and device, computer storage medium and electronic equipment


Info

Publication number: CN110691258A (application number CN201911045013.XA)
Authority: CN (China)
Prior art keywords: audio, audio file, voiceprint, determining, file
Priority date / Filing date: 2019-10-30
Publication date: 2020-01-14
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 黄建新, 崔建伟, 蔡贺, 张歆, 黄伟峰, 朱米春, 杜伟, 王一韩, 闫磊, 钱岳
Current Assignee: Central Platform (China Central TV Station)
Original Assignee: Central Platform
Application filed by Central Platform

Classifications

    • H04N21/234 — Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs (under H04N21/23, server-side processing of content)
    • H04N21/233 — Processing of audio elementary streams (under H04N21/23, server-side processing of content)
    • H04N21/439 — Processing of audio elementary streams (under H04N21/43, client-side processing of content)
    • H04N21/44 — Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs (under H04N21/43, client-side processing of content)
    • G10L15/26 — Speech to text systems (under G10L15/00, speech recognition)
    • G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates (under G10L17/06, decision making techniques for speaker identification or verification)

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

A program material production method and apparatus, a computer storage medium, and an electronic device. The method includes: determining an audio file of a program, the program including at least one role; determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information; matching the time-coded text with the role information; determining material content from the text and the role information; and clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material. With this scheme, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.

Description

Program material production method and apparatus, computer storage medium and electronic device
Technical Field
The present application relates to program production technologies, and in particular to a program material production method and apparatus, a computer storage medium, and an electronic device.
Background
At present, interview-style programs typically consist of a discussion or conversation between a host and several guests, and during production at a television station the conversation content must be recorded. In post-production, editors need to know what was said and distinguish which guest said it, so that important or valuable content can be selected for editing.
In the existing workflow, after the conversation has been recorded, all of it is transcribed and the different speakers are labeled manually; the editor then reviews the text to develop ideas, decides which speakers' remarks to adopt as material for post-production editing, and manually locates the corresponding positions in a non-linear editing system to perform the clipping that produces the program. The whole process is clearly cumbersome and labor-intensive; selecting the material generally takes several times the duration of the program itself.
Disclosure of Invention
The embodiments of the present application provide a program material production method and apparatus, a computer storage medium, and an electronic device to solve the above technical problems.
According to a first aspect of the embodiments of the present application, there is provided a program material production method, including:
determining an audio file of a program, the program including at least one role;
determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information;
matching the time-coded text with the role information;
determining material content from the text and the role information;
and clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
According to a second aspect of the embodiments of the present application, there is provided a program material production apparatus, including:
a file determining module, configured to determine an audio file of a program, the program including at least one role;
a role determining module, configured to determine role information for each speech segment from the audio file;
a text transcription module, configured to transcribe the audio file to obtain text with time code information;
a matching module, configured to match the time-coded text with the role information;
a material selection module, configured to determine material content from the text and the role information;
and a clipping module, configured to clip the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
According to a third aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the program material production method as described above.
According to a fourth aspect of the embodiments of the present application, there is provided an electronic device including one or more processors, and a memory for storing one or more programs; when executed by the one or more processors, the one or more programs implement the program material production method described above.
By adopting the program material production method and apparatus, the computer storage medium, and the electronic device, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flow chart illustrating an implementation of the program material production method according to the first embodiment of the present application;
fig. 2 is a schematic structural diagram of a program material production apparatus according to a second embodiment of the present application;
fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
Detailed Description
To address the problems in the prior art, the embodiments of the present application provide a technical scheme that uses intelligent speech and voiceprint recognition technology to realize speech transcription and role identification for interview programs, simplifying the selection and production of video program content material and improving program production efficiency.
The scheme in the embodiments of the present application can be implemented in various computer languages, for example the object-oriented programming language Java or the scripting language JavaScript.
To make the technical solutions and advantages of the embodiments clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Clearly, the described embodiments are only some of the embodiments of the present application, not an exhaustive list. It should be noted that, where there is no conflict, the embodiments of the present application and the features in them may be combined with each other.
Example one
Fig. 1 shows a schematic flow chart of an implementation of the program material production method in an embodiment of the present application.
As shown in the figure, the program material production method includes:
step 101, determining an audio file of a program; the program includes at least one role;
step 102, determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information;
step 103, matching the time-coded text with the role information;
step 104, determining material content from the text and the role information;
and step 105, clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
In one embodiment, determining the audio file of a program includes: recording the program live to obtain its audio file.
In one embodiment, determining the audio file of a program includes: extracting the audio file from the video file of the program.
The program may include one or more roles, and the audio file of the program may accordingly contain audio of one or more roles. The audio of each role may consist of one or more audio segments (or speech segments).
In the embodiments of the present application, the role information of each speech segment is determined from the audio file, and the audio file is transcribed into its corresponding text, which carries time code information.
Matching the text with the role information means determining the role corresponding to each sentence or passage of text, for example: the first sentence was said by role A, the second sentence by role B, and so on.
The embodiments of the present application can then determine the material content from the text and the role information corresponding to it; the material content may be a single passage of text or several passages.
Because each passage of text carries time code information, the material content carries time code information as well, so the video file corresponding to the audio file can be clipped according to the time codes of the material content to obtain the program material. For example: if the text corresponding to the audio file consists of five passages and the material content is determined to be the 1st, 3rd, and 4th passages, the video segments at the time codes of those three passages are cut from the video file.
In particular, the audio file may belong to a first program, while the material obtained by the final clipping may be used for a second program.
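As one way to realize this clipping step, the video can be cut at a material passage's time codes with an external tool such as ffmpeg. The following is a minimal Java sketch that shells out to ffmpeg in stream-copy mode; it assumes ffmpeg is installed on the system, and the file names and time codes are illustrative only.

```java
// A minimal sketch of clipping material by time code with ffmpeg, assuming
// ffmpeg is installed; file names and time codes are illustrative only.
import java.io.IOException;

public class TimecodeClipper {

    /** Cuts the span [start, end] out of the source video without re-encoding. */
    static void clip(String source, String start, String end, String output)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "ffmpeg", "-i", source,       // recorded program video
                "-ss", start, "-to", end,     // clip boundaries (HH:MM:SS.mmm)
                "-c", "copy",                 // stream copy: no re-encoding
                output);
        pb.inheritIO();
        int exit = pb.start().waitFor();
        if (exit != 0) throw new IOException("ffmpeg exited with code " + exit);
    }

    public static void main(String[] args) throws Exception {
        // One selected passage of material content (illustrative values).
        clip("program.mp4", "00:01:05.900", "00:01:10.080", "material_1.mp4");
    }
}
```

Stream copy avoids re-encoding, so each clip can be produced almost instantly regardless of program length.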
By adopting the program material production method provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
In one embodiment, determining the role information of each speech segment from the audio file includes:
extracting the voiceprint vector feature of each audio segment in the audio file;
matching the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and determining, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature in the library.
In a specific implementation, the role corresponding to each speech segment in the audio file may be determined from the voiceprint features of that segment. Specifically, the voiceprint feature of each audio segment can be extracted from the audio file and matched against the voiceprint features in a pre-established voiceprint library; the role corresponding to a library feature whose similarity exceeds a preset threshold is then taken as the role of the extracted feature.
In a specific implementation, the pre-established voiceprint library may contain two attributes, voiceprint feature and role, with a one-to-one correspondence between them.
The voiceprint feature may be a vector feature (specifically, an i-vector feature); extracting the vector feature from the audio can be done with existing algorithms, which are not described further here.
Likewise, matching a voiceprint feature against the features in the voiceprint library can be implemented with existing feature-similarity algorithms, whose details are not repeated here.
In a specific implementation, when the similarity between the vector feature of an audio segment and a vector feature in the voiceprint library is greater than a preset similarity threshold, the role information of the audio segment is determined to be the role information corresponding to that library feature. The similarity threshold can be set according to actual needs.
In a specific implementation, the role corresponding to the library feature with the greatest similarity can be selected as the role of the audio segment.
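To make the matching step concrete, the following is a minimal Java sketch that compares the i-vector of an audio segment against a voiceprint library using cosine similarity and picks the most similar role above a threshold. The 0.75 threshold, the three-dimensional vectors, and the choice of cosine similarity are illustrative assumptions; the application itself leaves the similarity algorithm and threshold open.

```java
import java.util.Map;

public class VoiceprintMatcher {

    /** Cosine similarity between two equal-length feature vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    /**
     * Returns the role whose library vector is most similar to the probe,
     * or null if no similarity exceeds the (illustrative) threshold.
     */
    static String matchRole(double[] probe, Map<String, double[]> library, double threshold) {
        String bestRole = null;
        double bestSim = threshold;            // only accept matches above the threshold
        for (Map.Entry<String, double[]> e : library.entrySet()) {
            double sim = cosine(probe, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                bestRole = e.getKey();
            }
        }
        return bestRole;
    }

    public static void main(String[] args) {
        Map<String, double[]> lib = Map.of(
            "host",    new double[] {0.9, 0.1, 0.3},
            "guest A", new double[] {0.2, 0.8, 0.5});
        double[] probe = {0.85, 0.15, 0.32};   // i-vector of one audio segment
        System.out.println(matchRole(probe, lib, 0.75));  // -> host
    }
}
```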
In one embodiment, extracting the voiceprint vector feature of each audio segment in the audio file includes:
splitting the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment containing a plurality of second audio segments;
extracting the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and taking the vector feature of that arbitrary part as the vector feature of the whole first audio segment.
In a specific implementation, the sentence-end positions of the audio file may be determined from the sentence structure of the text transcribed from the audio file; specifically, punctuation marks such as commas and periods can be used to mark sentence-end positions.
In a specific implementation, the audio pause positions of the audio file may be determined from the noise or the energy of the audio; specifically, when the energy of the audio falls below a preset energy threshold, that position may be taken as an audio pause position.
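As an illustration of the energy-based pause detection just described, the following Java sketch scans fixed-length frames of PCM samples and records frames whose root-mean-square energy falls below a threshold as candidate pause positions. The 16 kHz sample rate, 10 ms frames, and 0.05 threshold are assumptions for demonstration only.

```java
import java.util.ArrayList;
import java.util.List;

public class PauseDetector {

    /**
     * Returns the starting sample index of each low-energy frame; these can
     * serve as candidate audio pause positions for splitting the file.
     * frameLen and threshold are illustrative, not fixed by the method.
     */
    static List<Integer> findPauses(float[] samples, int frameLen, double threshold) {
        List<Integer> pauses = new ArrayList<>();
        for (int start = 0; start + frameLen <= samples.length; start += frameLen) {
            double sumSq = 0;
            for (int i = start; i < start + frameLen; i++) {
                sumSq += samples[i] * samples[i];
            }
            double rms = Math.sqrt(sumSq / frameLen);   // frame energy (RMS)
            if (rms < threshold) {
                pauses.add(start);                      // energy below threshold -> pause
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        float[] samples = new float[3200];                    // 0.2 s at 16 kHz
        for (int i = 1600; i < 3200; i++) samples[i] = 0.5f;  // second half is "speech"
        System.out.println(findPauses(samples, 160, 0.05));   // pauses in the silent first half
    }
}
```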
In one embodiment, extracting the voiceprint vector feature of each audio segment in the audio file may include: splitting the audio file into a plurality of first audio segments according to the sentence-end positions of the audio file, each first audio segment containing a plurality of second audio segments; extracting the voiceprint vector feature of an arbitrary part of the audio in each first audio segment; and taking that vector feature as the vector feature of the first audio segment.
In one embodiment, the audio file may instead be split into first audio segments according to its audio pause positions, with the vector feature of an arbitrary part of each first audio segment again taken as the vector feature of the whole segment.
In one embodiment, the audio file may be split into first audio segments according to both its sentence-end positions and its audio pause positions, with feature extraction proceeding in the same way.
A first audio segment is a longer segment obtained by splitting at the sentence-end positions and/or audio pause positions of the audio file; a second audio segment is a shorter segment within a first audio segment, and several second audio segments make up a first audio segment. For example: suppose the text of the audio file is "I have a beautiful home, and I love my home." From the sentence-end positions, the first audio segments may be "I have a beautiful home" and "I love my home", while the second audio segments may be shorter pieces such as "I", "have", "a", "beautiful home", or "I", "love", "my home".
Because the embodiments of the present application can extract the voiceprint feature of only an arbitrary part of a first audio segment and use it as the feature of the whole segment, the amount of voiceprint extraction and matching computation is greatly reduced, which improves program production efficiency, as sketched below.
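A minimal sketch of that shortcut, assuming a 16 kHz sample rate, an illustrative three-second cap, and a stand-in feature extractor (a real system would call an i-vector extractor from a speaker-recognition toolkit here):

```java
import java.util.Arrays;
import java.util.function.Function;

public class SegmentFeatureShortcut {

    static final int SAMPLE_RATE = 16_000;   // assumed sample rate
    static final int MAX_SECONDS = 3;        // illustrative cap per segment

    /**
     * Uses the voiceprint of at most the first MAX_SECONDS of a first audio
     * segment as the voiceprint of the whole segment.
     */
    static double[] segmentVector(float[] segment, Function<float[], double[]> extractor) {
        int take = Math.min(segment.length, MAX_SECONDS * SAMPLE_RATE);
        float[] part = Arrays.copyOfRange(segment, 0, take);
        return extractor.apply(part);        // far less audio to process
    }

    public static void main(String[] args) {
        // Stand-in "extractor": returns a 1-dimensional vector (mean amplitude).
        Function<float[], double[]> extractor = part -> {
            double mean = 0;
            for (float s : part) mean += s;
            return new double[] { mean / part.length };
        };
        float[] tenSecondSegment = new float[10 * SAMPLE_RATE];
        Arrays.fill(tenSecondSegment, 0.25f);
        System.out.println(Arrays.toString(segmentVector(tenSecondSegment, extractor)));
    }
}
```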
In one embodiment, establishing the voiceprint library includes:
collecting an arbitrary audio clip of each of a plurality of roles;
labeling each audio clip with its role, and extracting the voiceprint feature of each audio clip;
and storing the voiceprint features together with the corresponding role information to obtain the voiceprint library.
In a specific implementation, the voiceprint library can be established before the voiceprint features of audio segments are matched against it: collect an arbitrary audio clip of each role, label the clip with its role, extract its voiceprint feature, and store the roles and voiceprint features in one-to-one correspondence to obtain the voiceprint library.
In a specific implementation, the roles may be those appearing in the audio file, or all roles that may appear in any program.
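The enrollment just described can be sketched in a few lines of Java: one labeled clip per role goes through a feature extractor, and the resulting vector is stored against the role name. The extractor below is a deliberately trivial stand-in (it uses the raw samples as the "feature vector"); a real library would store i-vectors produced by a speaker-recognition toolkit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class VoiceprintLibraryBuilder {

    /** Builds a role -> voiceprint map from one labeled clip per role. */
    static Map<String, double[]> build(Map<String, float[]> labeledClips,
                                       Function<float[], double[]> extractor) {
        Map<String, double[]> library = new HashMap<>();
        for (Map.Entry<String, float[]> e : labeledClips.entrySet()) {
            // Label the clip's role and store its voiceprint feature.
            library.put(e.getKey(), extractor.apply(e.getValue()));
        }
        return library;
    }

    public static void main(String[] args) {
        // Hypothetical labeled clips, e.g. the host saying "Mr. Wang".
        Map<String, float[]> clips = Map.of(
            "host",    new float[] {0.1f, 0.2f, 0.1f},
            "guest A", new float[] {0.7f, 0.6f, 0.8f});
        // Stand-in extractor: copies the raw samples into a vector.
        Function<float[], double[]> extractor = c -> {
            double[] v = new double[c.length];
            for (int i = 0; i < c.length; i++) v[i] = c[i];
            return v;
        };
        System.out.println(build(clips, extractor).keySet());  // [host, guest A]
    }
}
```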
In one embodiment, transcribing the audio file into text with time code information includes:
determining the manuscript corresponding to the audio file;
inputting the audio file and its manuscript into a pre-trained speech recognition deep neural network model;
and having the speech recognition deep neural network model output every word of the manuscript with a timestamp.
Typically, before a program is recorded there is a manuscript of the program, which may include the program name, the program format, the performers, and the specific program content organized in chronological order. In a specific implementation of the embodiments of the present application, information such as the program name, format, and performers may never be spoken aloud, so the audio file described here may correspond only to the specific program content organized in chronological order.
In a specific implementation, a large number of samples can be collected in advance and used to train the speech recognition deep neural network model. When a caption file is to be generated, one only needs to input the audio file and its manuscript into the pre-trained model, which automatically outputs the text content with a timestamp for each word of the manuscript.
Outputting the timestamped text with a pre-trained speech recognition deep neural network model greatly accelerates caption-file generation; the approach is highly reproducible and the model can be reused.
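Training and running such a model is far beyond a short listing, so the Java sketch below only fixes the input/output contract this step implies: audio plus manuscript in, per-word timestamps out. The type and method names (TimedWord, SpeechAligner, align) are assumptions for illustration, not an API disclosed by this application.

```java
import java.nio.file.Path;
import java.util.List;

/** One word of the manuscript with its position on the audio time axis. */
record TimedWord(String word, double startSeconds, double endSeconds) {}

/**
 * The contract implied by the transcription step: the audio file and its
 * manuscript go in, and every word of the manuscript comes back with a
 * timestamp. A real implementation would wrap a trained acoustic model
 * and an alignment decoder.
 */
interface SpeechAligner {
    List<TimedWord> align(Path audioFile, String manuscript);
}
```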
In one embodiment, the speech recognition deep neural network model outputting every word of the manuscript with a timestamp includes:
the speech recognition deep neural network model recognizing each frame of speech in the audio file as a state sequence;
obtaining a plurality of phonemes from the state sequences of the frames;
generating one or more words from the plurality of phonemes;
matching the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and determining the timestamp of each word from that relative time position.
In a specific implementation, each frame of speech may be recognized as a state, the states of consecutive frames may be combined into phonemes, and phonemes may in turn be combined into words.
Since speech is a continuous audio stream, it consists of mostly stable states mixed with partly dynamically changing states. Recognizing each frame of the audio file as a state and decoding the file with existing techniques such as Viterbi decoding yields a state sequence, which may correspond to a plurality of phonemes.
Human language generally comprises three elements: speech sounds, vocabulary, and grammar; basic vocabulary and grammatical structure determine the basic character of each language. Speech can be understood as the acoustic form of a language, i.e. the sounds a person utters when speaking. Sound has three basic properties, loudness, pitch, and timbre; the phonemes described in the embodiments of the present application can be understood as the smallest phonetic units divided from the timbre point of view.
Phonemes can in turn be divided into vowel phonemes and consonant phonemes according to whether the airflow is obstructed during pronunciation, for example vowels such as a, o, and e, and consonants such as b, p, and f.
Generally, in Chinese, 2 to 4 phonemes form a syllable (e.g. mei), and one syllable corresponds to one Chinese character; that is, 2 to 4 phonemes can form a character/word (e.g. the three phonemes m, e, i form the character "mei", 美).
An audio file plays along a time axis. After the one or more words are obtained, they can be matched against the per-frame speech content to obtain the relative position of each word's speech segment on the time axis of the audio file; the timestamp of each word is then determined from that relative position.
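These last two steps can be made concrete with a small Java sketch that converts a word's frame span into seconds on the time axis and formats the result as a time code. The 10 ms frame shift is an assumption (a common value in speech recognition); the application does not fix a frame rate.

```java
public class FrameToTimestamp {

    static final double FRAME_SHIFT_SECONDS = 0.010;   // assumed 10 ms frame shift

    /** Maps a word's frame span to its relative position on the time axis. */
    static double[] frameSpanToSeconds(int startFrame, int endFrame) {
        return new double[] {
            startFrame * FRAME_SHIFT_SECONDS,
            endFrame * FRAME_SHIFT_SECONDS
        };
    }

    /** Formats seconds as the HH:MM:SS,mmm time codes used in example five below. */
    static String toTimeCode(double seconds) {
        long ms = Math.round(seconds * 1000);
        return String.format("%02d:%02d:%02d,%03d",
                ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000);
    }

    public static void main(String[] args) {
        // A word span aligned to frames 6590..7008 of the audio file.
        double[] span = frameSpanToSeconds(6590, 7008);
        System.out.println(toTimeCode(span[0]) + " --> " + toTimeCode(span[1]));
        // prints 00:01:05,900 --> 00:01:10,080 (matching example five below)
    }
}
```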
Example two
Based on the same inventive concept, the embodiments of the present application provide a program material production apparatus. Since the principle by which the apparatus solves the technical problem is similar to that of the program material production method, repeated parts are not described again.
Fig. 2 is a schematic structural diagram of a program material production apparatus according to a second embodiment of the present application.
As shown in the figure, the program material producing apparatus includes:
a file determining module 201, configured to determine an audio file of a program, the program including at least one role;
a role determining module 202, configured to determine role information for each speech segment from the audio file;
a text transcription module 203, configured to transcribe the audio file to obtain text with time code information;
a matching module 204, configured to match the time-coded text with the role information;
a material selection module 205, configured to determine material content from the text and the role information;
and a clipping module 206, configured to clip the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
By adopting the program material production apparatus provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
In one embodiment, the role determining module includes:
a feature extraction unit, configured to extract the voiceprint vector feature of each audio segment in the audio file;
a feature matching unit, configured to match the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and a role determining unit, configured to determine, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature.
In one embodiment, the feature extraction unit includes:
an audio splitting subunit, configured to split the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment containing a plurality of second audio segments;
a feature extraction subunit, configured to extract the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and a feature determining subunit, configured to take the vector feature of that arbitrary part as the vector feature of the first audio segment.
In one embodiment, the apparatus further includes:
a voiceprint library establishing module, configured to collect an arbitrary audio clip of each of a plurality of roles, label each audio clip with its role, extract the voiceprint feature of each audio clip, and store the voiceprint features together with the corresponding role information to obtain the voiceprint library.
In one embodiment, the text transcription module includes:
a manuscript determining unit, configured to determine the manuscript corresponding to the audio file;
and a transcription unit, configured to input the audio file and its manuscript into a pre-trained speech recognition deep neural network model, the model outputting every word of the manuscript with a timestamp.
In one embodiment, the transcription unit includes:
a first processing subunit, configured to recognize each frame of speech of the audio file as a state sequence;
a second processing subunit, configured to obtain a plurality of phonemes from the state sequences of the frames;
a third processing subunit, configured to generate one or more words from the plurality of phonemes;
a fourth processing subunit, configured to match the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and a fifth processing subunit, configured to determine the timestamp of each word from that relative time position.
EXAMPLE III
Based on the same inventive concept, embodiments of the present application further provide a computer storage medium, which is described below.
The computer storage medium has a computer program stored thereon; when the program is executed by a processor, it implements the steps of the program material production method described in the first embodiment.
By adopting the computer storage medium provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
Example four
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, which is described below.
Fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
As shown in the figure, the electronic device includes a memory 301 for storing one or more programs, and one or more processors 302; when the one or more programs are executed by the one or more processors, the program material production method described in the first embodiment is implemented.
By adopting the electronic device provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
EXAMPLE five
To facilitate understanding, the embodiments of the present application are described below with a specific example.
Assume a television station is producing an episode of a program. The following is the conversation between the host and several guests, recorded live (both video and audio):
"the king teacher listens to say that you have recently stepped on taiwan's land. "
"not recently, I have gone 1993. "
"but recently one more pass. "
"has just recently gone. I ask me a next brother first, you are familiar with taiwan, is bar? What are you feeling the most lovely things in taiwan? "
"I feel the human feelings, namely, the people can see the street at any time. "
"I also do. "
"very good. Although the people go to Taibei twice in the past, the people go to Hualian at most, and the people go to Tainan, enter the Taizhong, throw to south and throw to Niumtan. Especially, in the small towns in the hong Kong, lunch I eat is bought to be broken at the spot, scraping wood is made into a small shape, the wood is lovely, the wood is particularly good, and I can not know what wood is, and the wood is similar to the wood. I feel particularly that the people sell things, and the attitude is good and friendly to people. I did not like the word 'warm' in the past. "
"too flaring". "
"incite, sour. "
But I feel very warm when finishing the small towns of the deer harbor and the lunchman, and I like the word. "
After the recording finishes, the audio file of the conversation is obtained.
According to the embodiments of the present application, a short audio clip spoken by each person is first collected from the recorded audio file; its voiceprint feature is extracted and labeled with the corresponding role information. For example: extract the voiceprint feature of the clip "Mr. Wang" spoken by the host, label it as corresponding to the host, and store it in the voiceprint library; extract the voiceprint feature of the clip "I first went in 1993" spoken by guest A, label it as corresponding to guest A, and store it in the voiceprint library. In the end, a voiceprint feature is obtained for each person.
Then the voiceprint features of all the audio in the file are extracted and matched against the established voiceprint library, so that all the audio in the file is assigned role information. Alternatively, the audio file can be segmented sentence by sentence, with an arbitrary part of each sentence's audio matched against the voiceprint library to obtain the role of each sentence, which reduces the matching computation. For example: for the sentence "Yes, I have just been there again. Let me first ask this brother of mine: you are familiar with Taiwan, right? What do you feel is the most lovely thing about Taiwan?", matching just the audio of "I have just been there again" or "right?" against the voiceprint library determines the voiceprint of the whole sentence, i.e. who said it.
Determining the role information of the audio file through this voiceprint matching, transcribing the audio file into time-coded text, and matching the role information with the text yields the following:
00:01:05,900 00:01:10,080
Host: "Mr. Wang, I hear you recently set foot on Taiwan's soil again."
00:01:12,750 00:01:20,240
Mr. Wang: "Not just recently; I first went in 1993."
00:01:22,991 00:01:26,203
Host: "But you went once more recently."
00:01:26,901 00:01:32,856
Mr. Wang: "Yes, I have just been there again. Let me first ask this brother of mine: you are familiar with Taiwan, right? What do you feel is the most lovely thing about Taiwan?"
00:01:32,905 00:01:36,401
Mr. Wen: "I feel it is the human warmth, which you can see on the street at any time."
00:01:36,405 00:01:38,561
Host: "So do I."
00:01:39,012 00:01:52,871
Mr. Wang: "Very good. Although I had been to Taipei twice before, at most I had gone as far as Hualien; this time I went to Tainan, into Taichung, and on to Nantou and Niumtan. Especially in the small town of Lukang, the lunch I ate was bought and made on the spot, and wood was carved into little shapes, very lovely and particularly well made, though I could not tell what kind of wood it was. I especially felt that the people selling things there were kind and friendly to everyone. I did not like the word 'warm' in the past."
00:01:52,998 00:01:53,805
Mr. Wen: "Too sentimental."
00:01:53,908 00:01:54,674
Host: "Sentimental, corny."
00:01:54,785 00:01:58,609
Mr. Wang: "But after visiting small towns like Lukang, I felt very warm, and now I like that word."
With the roles distinguished in this way, the different viewpoints of the different guests can be seen at a glance, making it convenient for program producers to select material and conceive programs from this content.
The program producer can select content by role, text, and so on in a non-linear editing system to decide which valuable content to use as material for subsequent programs. After the material content is determined, the corresponding positions in the recorded video file can be located and cut according to the time codes of the selected material.
For example: the program producer selects the following material content:
00:01:05,900 00:01:10,080
Host: "Mr. Wang, I hear you recently set foot on Taiwan's soil again."
00:01:26,901 00:01:32,856
Mr. Wang: "Yes, I have just been there again. Let me first ask this brother of mine: you are familiar with Taiwan, right? What do you feel is the most lovely thing about Taiwan?"
00:01:32,905 00:01:36,401
Mr. Wen: "I feel it is the human warmth, which you can see on the street at any time."
00:01:39,012 00:01:52,871
Mr. Wang: "Very good. Although I had been to Taipei twice before, at most I had gone as far as Hualien; this time I went to Tainan, into Taichung, and on to Nantou and Niumtan. Especially in the small town of Lukang, the lunch I ate was bought and made on the spot, and wood was carved into little shapes, very lovely and particularly well made, though I could not tell what kind of wood it was. I especially felt that the people selling things there were kind and friendly to everyone. I did not like the word 'warm' in the past."
00:01:54,785 00:01:58,609
Mr. Wang: "But after visiting small towns like Lukang, I felt very warm, and now I like that word."
Then, according to the time code of each piece of material content, the corresponding position in the video file is located. For example: locate the period 00:01:05,900 to 00:01:10,080 (the shot of the host saying that line) and cut out that video segment; locate the period 00:01:26,901 to 00:01:32,856 (the shot of Mr. Wang saying that sentence) and cut out that segment.
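To locate those positions programmatically, the SRT-style time codes in the transcript above can be parsed into seconds. The following Java sketch shows the parsing, with the sample values taken from this example.

```java
public class TimeCodeParser {

    /** Parses a time code like "00:01:05,900" into seconds. */
    static double parse(String tc) {
        String[] hms = tc.split("[:,]");     // hours, minutes, seconds, millis
        return Integer.parseInt(hms[0]) * 3600
             + Integer.parseInt(hms[1]) * 60
             + Integer.parseInt(hms[2])
             + Integer.parseInt(hms[3]) / 1000.0;
    }

    public static void main(String[] args) {
        double start = parse("00:01:05,900");
        double end   = parse("00:01:10,080");
        // The host's first line occupies this span of the recorded video.
        System.out.printf("clip from %.3f s to %.3f s (%.3f s long)%n",
                start, end, end - start);    // 65.900 s to 70.080 s, 4.180 s
    }
}
```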
Finally, the several video segments obtained in this way are used as the program material to generate a program file.
In a specific implementation, the program file can then be reviewed manually: the program material is checked, and once the final version is confirmed, the program is broadcast at the corresponding time according to the program schedule.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A program material production method, comprising:
determining an audio file of a program, the program including at least one role;
determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information;
matching the time-coded text with the role information;
determining material content from the text and the role information;
and clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
2. The method of claim 1, wherein determining the role information of each speech segment from the audio file comprises:
extracting the voiceprint vector feature of each audio segment in the audio file;
matching the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and determining, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature.
3. The method of claim 2, wherein extracting the voiceprint vector feature of each audio segment in the audio file comprises:
splitting the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment comprising a plurality of second audio segments;
extracting the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and taking the vector feature of that arbitrary part as the vector feature of the first audio segment.
4. The method according to claim 2, wherein establishing the voiceprint library comprises:
collecting an arbitrary audio clip of each of a plurality of roles;
labeling each audio clip with its role, and extracting the voiceprint feature of each audio clip;
and storing the voiceprint features and the corresponding role information to obtain the voiceprint library.
5. The method of claim 1, wherein transcribing the audio file into text with time code information comprises:
determining a manuscript corresponding to the audio file;
inputting the audio file and its manuscript into a pre-trained speech recognition deep neural network model;
and outputting, by the speech recognition deep neural network model, every word of the manuscript with a timestamp.
6. The method of claim 5, wherein the speech recognition deep neural network model outputting every word of the manuscript with a timestamp comprises:
recognizing, by the speech recognition deep neural network model, each frame of speech of the audio file as a state sequence;
obtaining a plurality of phonemes from the state sequences of the frames;
generating one or more words from the plurality of phonemes;
matching the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and determining the timestamp of each word from that relative time position.
7. A program material production apparatus, comprising:
a file determining module, configured to determine an audio file of a program, the program including at least one role;
a role determining module, configured to determine role information for each speech segment from the audio file;
a text transcription module, configured to transcribe the audio file to obtain text with time code information;
a matching module, configured to match the time-coded text with the role information;
a material selection module, configured to determine material content from the text and the role information;
and a clipping module, configured to clip the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
8. The apparatus of claim 7, wherein the role determining module comprises:
a feature extraction unit, configured to extract the voiceprint vector feature of each audio segment in the audio file;
a feature matching unit, configured to match the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and a role determining unit, configured to determine, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature.
9. The apparatus of claim 8, wherein the feature extraction unit comprises:
an audio splitting subunit, configured to split the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment comprising a plurality of second audio segments;
a feature extraction subunit, configured to extract the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and a feature determining subunit, configured to take the vector feature of that arbitrary part as the vector feature of the first audio segment.
10. The apparatus of claim 8, further comprising:
a voiceprint library establishing module, configured to collect an arbitrary audio clip of each of a plurality of roles, label each audio clip with its role, extract the voiceprint feature of each audio clip, and store the voiceprint features and the corresponding role information to obtain the voiceprint library.
11. The apparatus of claim 7, wherein the text transcription module comprises:
a manuscript determining unit, configured to determine the manuscript corresponding to the audio file;
and a transcription unit, configured to input the audio file and its manuscript into a pre-trained speech recognition deep neural network model, the model outputting every word of the manuscript with a timestamp.
12. The apparatus of claim 11, wherein the transcription unit comprises:
a first processing subunit, configured to recognize each frame of speech of the audio file as a state sequence;
a second processing subunit, configured to obtain a plurality of phonemes from the state sequences of the frames;
a third processing subunit, configured to generate one or more words from the plurality of phonemes;
a fourth processing subunit, configured to match the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and a fifth processing subunit, configured to determine the timestamp of each word from that relative time position.
13. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
14. An electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method of any of claims 1 to 6.
Application CN201911045013.XA, filed 2019-10-30 (priority date 2019-10-30): Program material production method and device, computer storage medium and electronic equipment. Status: Pending. Publication: CN110691258A.

Priority Applications (1)

Application Number: CN201911045013.XA — Priority date: 2019-10-30 — Filing date: 2019-10-30 — Title: Program material production method and device, computer storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number: CN201911045013.XA — Priority date: 2019-10-30 — Filing date: 2019-10-30 — Title: Program material production method and device, computer storage medium and electronic equipment

Publications (1)

Publication Number: CN110691258A — Publication Date: 2020-01-14

Family

ID=69114876

Family Applications (1)

Application Number: CN201911045013.XA — Status: Pending — Publication: CN110691258A — Title: Program material production method and device, computer storage medium and electronic equipment

Country Status (1)

Country: CN — Publication: CN110691258A

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060179403A1 (en) * 2005-02-10 2006-08-10 Transcript Associates, Inc. Media editing system
US20110239107A1 (en) * 2010-03-29 2011-09-29 Phillips Michael E Transcript editor
CN107124647A (en) * 2017-05-27 2017-09-01 深圳市酷开网络科技有限公司 A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN109326310A (en) * 2017-07-31 2019-02-12 西梅科技(北京)有限公司 A kind of method, apparatus and electronic equipment of automatic editing
CN110166818A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Wait match generation method, computer equipment and the storage medium of audio-video
CN110121103A (en) * 2019-05-06 2019-08-13 郭凌含 The automatic editing synthetic method of video and device
CN110166816A (en) * 2019-05-29 2019-08-23 上海乂学教育科技有限公司 The video editing method and system based on speech recognition for artificial intelligence education
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571061A (en) * 2020-04-28 2021-10-29 阿里巴巴集团控股有限公司 System, method, device and equipment for editing voice transcription text
CN111901549A (en) * 2020-08-07 2020-11-06 杭州当虹科技股份有限公司 Auxiliary field recording cataloguing method based on voice recognition technology
WO2022160749A1 (en) * 2021-01-29 2022-08-04 深圳壹秘科技有限公司 Role separation method for speech processing device, and speech processing device
CN113269854A (en) * 2021-07-16 2021-08-17 成都索贝视频云计算有限公司 Method for intelligently generating interview-type comprehensive programs
CN114465737A (en) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN115209218A (en) * 2022-06-27 2022-10-18 联想(北京)有限公司 Video information processing method, electronic equipment and storage medium
CN116600166A (en) * 2023-05-26 2023-08-15 武汉星巡智能科技有限公司 Video real-time editing method, device and equipment based on audio analysis
CN116600166B (en) * 2023-05-26 2024-03-12 武汉星巡智能科技有限公司 Video real-time editing method, device and equipment based on audio analysis

Similar Documents

Publication Publication Date Title
CN110691258A (en) Program material manufacturing method and device, computer storage medium and electronic equipment
CN106331893B (en) Real-time caption presentation method and system
CN109949783B (en) Song synthesis method and system
CN105244022B (en) Audio-video method for generating captions and device
CN101739870B (en) Interactive language learning system and method
CN106531185B (en) voice evaluation method and system based on voice similarity
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
CN102568478B (en) Video play control method and system based on voice recognition
CN109285537B (en) Acoustic model establishing method, acoustic model establishing device, acoustic model synthesizing method, acoustic model synthesizing device, acoustic model synthesizing equipment and storage medium
CN105304080A (en) Speech synthesis device and speech synthesis method
CN110740275B (en) Nonlinear editing system
CN111785275A (en) Voice recognition method and device
Qian et al. A two-pass framework of mispronunciation detection and diagnosis for computer-aided pronunciation training
CN110781649A (en) Subtitle editing method and device, computer storage medium and electronic equipment
Wang et al. Comic-guided speech synthesis
CN111739536A (en) Audio processing method and device
CN112667787A (en) Intelligent response method, system and storage medium based on phonetics label
CN111933121B (en) Acoustic model training method and device
Santos et al. CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech
CN110310620B (en) Speech fusion method based on native pronunciation reinforcement learning
KR101920653B1 (en) Method and program for edcating language by making comparison sound
CN112634861A (en) Data processing method and device, electronic equipment and readable storage medium
CN108182946B (en) Vocal music mode selection method and device based on voiceprint recognition
CN115424616A (en) Audio data screening method, device, equipment and computer readable medium
CN113628609A (en) Automatic audio content generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200114)