CN110691258A - Program material production method and device, computer storage medium and electronic equipment


Info

Publication number: CN110691258A (application number CN201911045013.XA)
Authority: CN (China)
Prior art keywords: audio, audio file, voiceprint, determining, file
Priority date / Filing date: 2019-10-30
Publication date: 2020-01-14
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 黄建新, 崔建伟, 蔡贺, 张歆, 黄伟峰, 朱米春, 杜伟, 王一韩, 闫磊, 钱岳
Current Assignee: Central Platform (China Central TV Station)
Original Assignee: Central Platform
Application filed by Central Platform

Classifications

    • H04N21/234 — Processing of video elementary streams, e.g. splicing of video streams or manipulating encoded video stream scene graphs (under H04N21/23, server-side processing of content)
    • H04N21/233 — Processing of audio elementary streams (under H04N21/23, server-side processing of content)
    • H04N21/439 — Processing of audio elementary streams (under H04N21/43, client-side processing of content)
    • H04N21/44 — Processing of video elementary streams, e.g. splicing a video clip retrieved from local storage with an incoming video stream or rendering scenes according to encoded video stream scene graphs (under H04N21/43, client-side processing of content)
    • G10L15/26 — Speech to text systems (under G10L15/00, speech recognition)
    • G10L17/08 — Use of distortion metrics or a particular distance between probe pattern and reference templates (under G10L17/06, decision making techniques for speaker identification or verification)

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Business, Economics & Management (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Television Signal Processing For Recording (AREA)

Abstract

A program material production method and apparatus, a computer storage medium, and an electronic device. The method includes: determining an audio file of a program, the program including at least one role; determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information; matching the time-coded text with the role information; determining material content from the text and the role information; and clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material. With this scheme, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.

Description

Program material production method and apparatus, computer storage medium and electronic device
Technical Field
The present application relates to program production technologies, and in particular to a program material production method and apparatus, a computer storage medium, and an electronic device.
Background
At present, interview-style programs typically consist of a discussion or conversation between a host and several guests, and during production at a television station the conversation content must be recorded. In post-production, editors need to know what was said and distinguish which guest said it, so that important or valuable content can be selected for editing.
In the existing workflow, after the conversation has been recorded, all of it is transcribed and the different speakers are labeled manually; the editor then reviews the text to develop ideas, decides which speakers' remarks to adopt as material for post-production editing, and manually locates the corresponding positions in a non-linear editing system to perform the clipping that produces the program. The whole process is clearly cumbersome and labor-intensive; selecting the material generally takes several times the duration of the program itself.
Disclosure of Invention
The embodiments of the present application provide a program material production method and apparatus, a computer storage medium, and an electronic device to solve the above technical problems.
According to a first aspect of the embodiments of the present application, there is provided a program material production method, including:
determining an audio file of a program, the program including at least one role;
determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information;
matching the time-coded text with the role information;
determining material content from the text and the role information;
and clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
According to a second aspect of the embodiments of the present application, there is provided a program material production apparatus, including:
a file determining module, configured to determine an audio file of a program, the program including at least one role;
a role determining module, configured to determine role information for each speech segment from the audio file;
a text transcription module, configured to transcribe the audio file to obtain text with time code information;
a matching module, configured to match the time-coded text with the role information;
a material selection module, configured to determine material content from the text and the role information;
and a clipping module, configured to clip the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
According to a third aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the program material production method as described above.
According to a fourth aspect of the embodiments of the present application, there is provided an electronic device including one or more processors, and a memory for storing one or more programs; when executed by the one or more processors, the one or more programs implement the program material production method described above.
By adopting the program material production method and apparatus, the computer storage medium, and the electronic device, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flow chart illustrating an implementation of the program material production method according to the first embodiment of the present application;
fig. 2 is a schematic structural diagram of a program material production apparatus according to a second embodiment of the present application;
fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
Detailed Description
To address the problems in the prior art, the embodiments of the present application provide a technical scheme that uses intelligent speech and voiceprint recognition technology to realize speech transcription and role identification for interview programs, simplifying the selection and production of video program content material and improving program production efficiency.
The scheme in the embodiments of the present application can be implemented in various computer languages, for example the object-oriented programming language Java or the scripting language JavaScript.
To make the technical solutions and advantages of the embodiments clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings. Clearly, the described embodiments are only some of the embodiments of the present application, not an exhaustive list. It should be noted that, where there is no conflict, the embodiments of the present application and the features in them may be combined with each other.
Example one
Fig. 1 shows a schematic flow chart of an implementation of the program material production method in an embodiment of the present application.
As shown in the figure, the program material production method includes:
step 101, determining an audio file of a program; the program includes at least one role;
step 102, determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information;
step 103, matching the time-coded text with the role information;
step 104, determining material content from the text and the role information;
and step 105, clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
In one embodiment, determining the audio file of a program includes: recording the program live to obtain its audio file.
In one embodiment, determining the audio file of a program includes: extracting the audio file from the video file of the program.
The program may include one or more roles, and the audio file of the program may accordingly contain audio of one or more roles. The audio of each role may consist of one or more audio segments (or speech segments).
In the embodiments of the present application, the role information of each speech segment is determined from the audio file, and the audio file is transcribed into its corresponding text, which carries time code information.
Matching the text with the role information means determining the role corresponding to each sentence or passage of text, for example: the first sentence was said by role A, the second sentence by role B, and so on.
The embodiments of the present application can then determine the material content from the text and the role information corresponding to it; the material content may be a single passage of text or several passages.
Because each passage of text carries time code information, the material content carries time code information as well, so the video file corresponding to the audio file can be clipped according to the time codes of the material content to obtain the program material. For example: if the text corresponding to the audio file consists of five passages and the material content is determined to be the 1st, 3rd, and 4th passages, the video segments at the time codes of those three passages are cut from the video file.
In particular, the audio file may belong to a first program, while the material obtained by the final clipping may be used for a second program.
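As one way to realize this clipping step, the video can be cut at a material passage's time codes with an external tool such as ffmpeg. The following is a minimal Java sketch that shells out to ffmpeg in stream-copy mode; it assumes ffmpeg is installed on the system, and the file names and time codes are illustrative only.

```java
// A minimal sketch of clipping material by time code with ffmpeg, assuming
// ffmpeg is installed; file names and time codes are illustrative only.
import java.io.IOException;

public class TimecodeClipper {

    /** Cuts the span [start, end] out of the source video without re-encoding. */
    static void clip(String source, String start, String end, String output)
            throws IOException, InterruptedException {
        ProcessBuilder pb = new ProcessBuilder(
                "ffmpeg", "-i", source,       // recorded program video
                "-ss", start, "-to", end,     // clip boundaries (HH:MM:SS.mmm)
                "-c", "copy",                 // stream copy: no re-encoding
                output);
        pb.inheritIO();
        int exit = pb.start().waitFor();
        if (exit != 0) throw new IOException("ffmpeg exited with code " + exit);
    }

    public static void main(String[] args) throws Exception {
        // One selected passage of material content (illustrative values).
        clip("program.mp4", "00:01:05.900", "00:01:10.080", "material_1.mp4");
    }
}
```

Stream copy avoids re-encoding, so each clip can be produced almost instantly regardless of program length.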
By adopting the program material production method provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
In one embodiment, determining the role information of each speech segment from the audio file includes:
extracting the voiceprint vector feature of each audio segment in the audio file;
matching the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and determining, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature in the library.
In a specific implementation, the role corresponding to each speech segment in the audio file may be determined from the voiceprint features of that segment. Specifically, the voiceprint feature of each audio segment can be extracted from the audio file and matched against the voiceprint features in a pre-established voiceprint library; the role corresponding to a library feature whose similarity exceeds a preset threshold is then taken as the role of the extracted feature.
In a specific implementation, the pre-established voiceprint library may contain two attributes, voiceprint feature and role, with a one-to-one correspondence between them.
The voiceprint feature may be a vector feature (specifically, an i-vector feature); extracting the vector feature from the audio can be done with existing algorithms, which are not described further here.
Likewise, matching a voiceprint feature against the features in the voiceprint library can be implemented with existing feature-similarity algorithms, whose details are not repeated here.
In a specific implementation, when the similarity between the vector feature of an audio segment and a vector feature in the voiceprint library is greater than a preset similarity threshold, the role information of the audio segment is determined to be the role information corresponding to that library feature. The similarity threshold can be set according to actual needs.
In a specific implementation, the role corresponding to the library feature with the greatest similarity can be selected as the role of the audio segment.
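To make the matching step concrete, the following is a minimal Java sketch that compares the i-vector of an audio segment against a voiceprint library using cosine similarity and picks the most similar role above a threshold. The 0.75 threshold, the three-dimensional vectors, and the choice of cosine similarity are illustrative assumptions; the application itself leaves the similarity algorithm and threshold open.

```java
import java.util.Map;

public class VoiceprintMatcher {

    /** Cosine similarity between two equal-length feature vectors. */
    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return dot / (Math.sqrt(na) * Math.sqrt(nb) + 1e-12);
    }

    /**
     * Returns the role whose library vector is most similar to the probe,
     * or null if no similarity exceeds the (illustrative) threshold.
     */
    static String matchRole(double[] probe, Map<String, double[]> library, double threshold) {
        String bestRole = null;
        double bestSim = threshold;            // only accept matches above the threshold
        for (Map.Entry<String, double[]> e : library.entrySet()) {
            double sim = cosine(probe, e.getValue());
            if (sim > bestSim) {
                bestSim = sim;
                bestRole = e.getKey();
            }
        }
        return bestRole;
    }

    public static void main(String[] args) {
        Map<String, double[]> lib = Map.of(
            "host",    new double[] {0.9, 0.1, 0.3},
            "guest A", new double[] {0.2, 0.8, 0.5});
        double[] probe = {0.85, 0.15, 0.32};   // i-vector of one audio segment
        System.out.println(matchRole(probe, lib, 0.75));  // -> host
    }
}
```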
In one embodiment, extracting the voiceprint vector feature of each audio segment in the audio file includes:
splitting the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment containing a plurality of second audio segments;
extracting the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and taking the vector feature of that arbitrary part as the vector feature of the whole first audio segment.
In a specific implementation, the sentence-end positions of the audio file may be determined from the sentence structure of the text transcribed from the audio file; specifically, punctuation marks such as commas and periods can be used to mark sentence-end positions.
In a specific implementation, the audio pause positions of the audio file may be determined from the noise or the energy of the audio; specifically, when the energy of the audio falls below a preset energy threshold, that position may be taken as an audio pause position.
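As an illustration of the energy-based pause detection just described, the following Java sketch scans fixed-length frames of PCM samples and records frames whose root-mean-square energy falls below a threshold as candidate pause positions. The 16 kHz sample rate, 10 ms frames, and 0.05 threshold are assumptions for demonstration only.

```java
import java.util.ArrayList;
import java.util.List;

public class PauseDetector {

    /**
     * Returns the starting sample index of each low-energy frame; these can
     * serve as candidate audio pause positions for splitting the file.
     * frameLen and threshold are illustrative, not fixed by the method.
     */
    static List<Integer> findPauses(float[] samples, int frameLen, double threshold) {
        List<Integer> pauses = new ArrayList<>();
        for (int start = 0; start + frameLen <= samples.length; start += frameLen) {
            double sumSq = 0;
            for (int i = start; i < start + frameLen; i++) {
                sumSq += samples[i] * samples[i];
            }
            double rms = Math.sqrt(sumSq / frameLen);   // frame energy (RMS)
            if (rms < threshold) {
                pauses.add(start);                      // energy below threshold -> pause
            }
        }
        return pauses;
    }

    public static void main(String[] args) {
        float[] samples = new float[3200];                    // 0.2 s at 16 kHz
        for (int i = 1600; i < 3200; i++) samples[i] = 0.5f;  // second half is "speech"
        System.out.println(findPauses(samples, 160, 0.05));   // pauses in the silent first half
    }
}
```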
In one embodiment, extracting the voiceprint vector feature of each audio segment in the audio file may include: splitting the audio file into a plurality of first audio segments according to the sentence-end positions of the audio file, each first audio segment containing a plurality of second audio segments; extracting the voiceprint vector feature of an arbitrary part of the audio in each first audio segment; and taking that vector feature as the vector feature of the first audio segment.
In one embodiment, the audio file may instead be split into first audio segments according to its audio pause positions, with the vector feature of an arbitrary part of each first audio segment again taken as the vector feature of the whole segment.
In one embodiment, the audio file may be split into first audio segments according to both its sentence-end positions and its audio pause positions, with feature extraction proceeding in the same way.
A first audio segment is a longer segment obtained by splitting at the sentence-end positions and/or audio pause positions of the audio file; a second audio segment is a shorter segment within a first audio segment, and several second audio segments make up a first audio segment. For example: suppose the text of the audio file is "I have a beautiful home, and I love my home." From the sentence-end positions, the first audio segments may be "I have a beautiful home" and "I love my home", while the second audio segments may be shorter pieces such as "I", "have", "a", "beautiful home", or "I", "love", "my home".
Because the embodiments of the present application can extract the voiceprint feature of only an arbitrary part of a first audio segment and use it as the feature of the whole segment, the amount of voiceprint extraction and matching computation is greatly reduced, which improves program production efficiency, as sketched below.
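A minimal sketch of that shortcut, assuming a 16 kHz sample rate, an illustrative three-second cap, and a stand-in feature extractor (a real system would call an i-vector extractor from a speaker-recognition toolkit here):

```java
import java.util.Arrays;
import java.util.function.Function;

public class SegmentFeatureShortcut {

    static final int SAMPLE_RATE = 16_000;   // assumed sample rate
    static final int MAX_SECONDS = 3;        // illustrative cap per segment

    /**
     * Uses the voiceprint of at most the first MAX_SECONDS of a first audio
     * segment as the voiceprint of the whole segment.
     */
    static double[] segmentVector(float[] segment, Function<float[], double[]> extractor) {
        int take = Math.min(segment.length, MAX_SECONDS * SAMPLE_RATE);
        float[] part = Arrays.copyOfRange(segment, 0, take);
        return extractor.apply(part);        // far less audio to process
    }

    public static void main(String[] args) {
        // Stand-in "extractor": returns a 1-dimensional vector (mean amplitude).
        Function<float[], double[]> extractor = part -> {
            double mean = 0;
            for (float s : part) mean += s;
            return new double[] { mean / part.length };
        };
        float[] tenSecondSegment = new float[10 * SAMPLE_RATE];
        Arrays.fill(tenSecondSegment, 0.25f);
        System.out.println(Arrays.toString(segmentVector(tenSecondSegment, extractor)));
    }
}
```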
In one embodiment, establishing the voiceprint library includes:
collecting an arbitrary audio clip of each of a plurality of roles;
labeling each audio clip with its role, and extracting the voiceprint feature of each audio clip;
and storing the voiceprint features together with the corresponding role information to obtain the voiceprint library.
In a specific implementation, the voiceprint library can be established before the voiceprint features of audio segments are matched against it: collect an arbitrary audio clip of each role, label the clip with its role, extract its voiceprint feature, and store the roles and voiceprint features in one-to-one correspondence to obtain the voiceprint library.
In a specific implementation, the roles may be those appearing in the audio file, or all roles that may appear in any program.
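The enrollment just described can be sketched in a few lines of Java: one labeled clip per role goes through a feature extractor, and the resulting vector is stored against the role name. The extractor below is a deliberately trivial stand-in (it uses the raw samples as the "feature vector"); a real library would store i-vectors produced by a speaker-recognition toolkit.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class VoiceprintLibraryBuilder {

    /** Builds a role -> voiceprint map from one labeled clip per role. */
    static Map<String, double[]> build(Map<String, float[]> labeledClips,
                                       Function<float[], double[]> extractor) {
        Map<String, double[]> library = new HashMap<>();
        for (Map.Entry<String, float[]> e : labeledClips.entrySet()) {
            // Label the clip's role and store its voiceprint feature.
            library.put(e.getKey(), extractor.apply(e.getValue()));
        }
        return library;
    }

    public static void main(String[] args) {
        // Hypothetical labeled clips, e.g. the host saying "Mr. Wang".
        Map<String, float[]> clips = Map.of(
            "host",    new float[] {0.1f, 0.2f, 0.1f},
            "guest A", new float[] {0.7f, 0.6f, 0.8f});
        // Stand-in extractor: copies the raw samples into a vector.
        Function<float[], double[]> extractor = c -> {
            double[] v = new double[c.length];
            for (int i = 0; i < c.length; i++) v[i] = c[i];
            return v;
        };
        System.out.println(build(clips, extractor).keySet());  // [host, guest A]
    }
}
```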
In one embodiment, transcribing the audio file into text with time code information includes:
determining the manuscript corresponding to the audio file;
inputting the audio file and its manuscript into a pre-trained speech recognition deep neural network model;
and having the speech recognition deep neural network model output every word of the manuscript with a timestamp.
Typically, before a program is recorded there is a manuscript of the program, which may include the program name, the program format, the performers, and the specific program content organized in chronological order. In a specific implementation of the embodiments of the present application, information such as the program name, format, and performers may never be spoken aloud, so the audio file described here may correspond only to the specific program content organized in chronological order.
In a specific implementation, a large number of samples can be collected in advance and used to train the speech recognition deep neural network model. When a caption file is to be generated, one only needs to input the audio file and its manuscript into the pre-trained model, which automatically outputs the text content with a timestamp for each word of the manuscript.
Outputting the timestamped text with a pre-trained speech recognition deep neural network model greatly accelerates caption-file generation; the approach is highly reproducible and the model can be reused.
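Training and running such a model is far beyond a short listing, so the Java sketch below only fixes the input/output contract this step implies: audio plus manuscript in, per-word timestamps out. The type and method names (TimedWord, SpeechAligner, align) are assumptions for illustration, not an API disclosed by this application.

```java
import java.nio.file.Path;
import java.util.List;

/** One word of the manuscript with its position on the audio time axis. */
record TimedWord(String word, double startSeconds, double endSeconds) {}

/**
 * The contract implied by the transcription step: the audio file and its
 * manuscript go in, and every word of the manuscript comes back with a
 * timestamp. A real implementation would wrap a trained acoustic model
 * and an alignment decoder.
 */
interface SpeechAligner {
    List<TimedWord> align(Path audioFile, String manuscript);
}
```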
In one embodiment, the speech recognition deep neural network model outputting every word of the manuscript with a timestamp includes:
the speech recognition deep neural network model recognizing each frame of speech in the audio file as a state sequence;
obtaining a plurality of phonemes from the state sequences of the frames;
generating one or more words from the plurality of phonemes;
matching the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and determining the timestamp of each word from that relative time position.
In a specific implementation, each frame of speech may be recognized as a state, the states of consecutive frames may be combined into phonemes, and phonemes may in turn be combined into words.
Since speech is a continuous audio stream, it consists of mostly stable states mixed with partly dynamically changing states. Recognizing each frame of the audio file as a state and decoding the file with existing techniques such as Viterbi decoding yields a state sequence, which may correspond to a plurality of phonemes.
Human language generally comprises three elements: speech sounds, vocabulary, and grammar; basic vocabulary and grammatical structure determine the basic character of each language. Speech can be understood as the acoustic form of a language, i.e. the sounds a person utters when speaking. Sound has three basic properties, loudness, pitch, and timbre; the phonemes described in the embodiments of the present application can be understood as the smallest phonetic units divided from the timbre point of view.
Phonemes can in turn be divided into vowel phonemes and consonant phonemes according to whether the airflow is obstructed during pronunciation, for example vowels such as a, o, and e, and consonants such as b, p, and f.
Generally, in Chinese, 2 to 4 phonemes form a syllable (e.g. mei), and one syllable corresponds to one Chinese character; that is, 2 to 4 phonemes can form a character/word (e.g. the three phonemes m, e, i form the character "mei", 美).
An audio file plays along a time axis. After the one or more words are obtained, they can be matched against the per-frame speech content to obtain the relative position of each word's speech segment on the time axis of the audio file; the timestamp of each word is then determined from that relative position.
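These last two steps can be made concrete with a small Java sketch that converts a word's frame span into seconds on the time axis and formats the result as a time code. The 10 ms frame shift is an assumption (a common value in speech recognition); the application does not fix a frame rate.

```java
public class FrameToTimestamp {

    static final double FRAME_SHIFT_SECONDS = 0.010;   // assumed 10 ms frame shift

    /** Maps a word's frame span to its relative position on the time axis. */
    static double[] frameSpanToSeconds(int startFrame, int endFrame) {
        return new double[] {
            startFrame * FRAME_SHIFT_SECONDS,
            endFrame * FRAME_SHIFT_SECONDS
        };
    }

    /** Formats seconds as the HH:MM:SS,mmm time codes used in example five below. */
    static String toTimeCode(double seconds) {
        long ms = Math.round(seconds * 1000);
        return String.format("%02d:%02d:%02d,%03d",
                ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000);
    }

    public static void main(String[] args) {
        // A word span aligned to frames 6590..7008 of the audio file.
        double[] span = frameSpanToSeconds(6590, 7008);
        System.out.println(toTimeCode(span[0]) + " --> " + toTimeCode(span[1]));
        // prints 00:01:05,900 --> 00:01:10,080 (matching example five below)
    }
}
```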
Example two
Based on the same inventive concept, the embodiments of the present application provide a program material production apparatus. Since the principle by which the apparatus solves the technical problem is similar to that of the program material production method, repeated parts are not described again.
Fig. 2 is a schematic structural diagram of a program material production apparatus according to a second embodiment of the present application.
As shown in the figure, the program material producing apparatus includes:
a file determining module 201, configured to determine an audio file of a program, the program including at least one role;
a role determining module 202, configured to determine role information for each speech segment from the audio file;
a text transcription module 203, configured to transcribe the audio file to obtain text with time code information;
a matching module 204, configured to match the time-coded text with the role information;
a material selection module 205, configured to determine material content from the text and the role information;
and a clipping module 206, configured to clip the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
By adopting the program material production apparatus provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
In one embodiment, the role determining module includes:
a feature extraction unit, configured to extract the voiceprint vector feature of each audio segment in the audio file;
a feature matching unit, configured to match the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and a role determining unit, configured to determine, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature.
In one embodiment, the feature extraction unit includes:
an audio splitting subunit, configured to split the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment containing a plurality of second audio segments;
a feature extraction subunit, configured to extract the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and a feature determining subunit, configured to take the vector feature of that arbitrary part as the vector feature of the first audio segment.
In one embodiment, the apparatus further includes:
a voiceprint library establishing module, configured to collect an arbitrary audio clip of each of a plurality of roles, label each audio clip with its role, extract the voiceprint feature of each audio clip, and store the voiceprint features together with the corresponding role information to obtain the voiceprint library.
In one embodiment, the text transcription module includes:
a manuscript determining unit, configured to determine the manuscript corresponding to the audio file;
and a transcription unit, configured to input the audio file and its manuscript into a pre-trained speech recognition deep neural network model, the model outputting every word of the manuscript with a timestamp.
In one embodiment, the transcription unit includes:
a first processing subunit, configured to recognize each frame of speech of the audio file as a state sequence;
a second processing subunit, configured to obtain a plurality of phonemes from the state sequences of the frames;
a third processing subunit, configured to generate one or more words from the plurality of phonemes;
a fourth processing subunit, configured to match the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and a fifth processing subunit, configured to determine the timestamp of each word from that relative time position.
EXAMPLE III
Based on the same inventive concept, embodiments of the present application further provide a computer storage medium, which is described below.
The computer storage medium has a computer program stored thereon; when the program is executed by a processor, it implements the steps of the program material production method described in the first embodiment.
By adopting the computer storage medium provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
Example four
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, which is described below.
Fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
As shown in the figure, the electronic device includes a memory 301 for storing one or more programs, and one or more processors 302; when the one or more programs are executed by the one or more processors, the program material production method described in the first embodiment is implemented.
By adopting the electronic device provided by the embodiments of the present application, automatic role identification and material clipping of interview-style television programs can be realized on the basis of voiceprint recognition; production efficiency is greatly improved over the traditional process, and the material selection and production of video programs are simplified.
EXAMPLE five
To facilitate understanding, the embodiments of the present application are described below with a specific example.
Assume a television station is producing an episode of a program. The following is the conversation between the host and several guests, recorded live (both video and audio):
"the king teacher listens to say that you have recently stepped on taiwan's land. "
"not recently, I have gone 1993. "
"but recently one more pass. "
"has just recently gone. I ask me a next brother first, you are familiar with taiwan, is bar? What are you feeling the most lovely things in taiwan? "
"I feel the human feelings, namely, the people can see the street at any time. "
"I also do. "
"very good. Although the people go to Taibei twice in the past, the people go to Hualian at most, and the people go to Tainan, enter the Taizhong, throw to south and throw to Niumtan. Especially, in the small towns in the hong Kong, lunch I eat is bought to be broken at the spot, scraping wood is made into a small shape, the wood is lovely, the wood is particularly good, and I can not know what wood is, and the wood is similar to the wood. I feel particularly that the people sell things, and the attitude is good and friendly to people. I did not like the word 'warm' in the past. "
"too flaring". "
"incite, sour. "
But I feel very warm when finishing the small towns of the deer harbor and the lunchman, and I like the word. "
After the recording finishes, the audio file of the conversation is obtained.
According to the embodiments of the present application, a short audio clip spoken by each person is first collected from the recorded audio file; its voiceprint feature is extracted and labeled with the corresponding role information. For example: extract the voiceprint feature of the clip "Mr. Wang" spoken by the host, label it as corresponding to the host, and store it in the voiceprint library; extract the voiceprint feature of the clip "I first went in 1993" spoken by guest A, label it as corresponding to guest A, and store it in the voiceprint library. In the end, a voiceprint feature is obtained for each person.
Then the voiceprint features of all the audio in the file are extracted and matched against the established voiceprint library, so that all the audio in the file is assigned role information. Alternatively, the audio file can be segmented sentence by sentence, with an arbitrary part of each sentence's audio matched against the voiceprint library to obtain the role of each sentence, which reduces the matching computation. For example: for the sentence "Yes, I have just been there again. Let me first ask this brother of mine: you are familiar with Taiwan, right? What do you feel is the most lovely thing about Taiwan?", matching just the audio of "I have just been there again" or "right?" against the voiceprint library determines the voiceprint of the whole sentence, i.e. who said it.
Determining the role information of the audio file through this voiceprint matching, transcribing the audio file into time-coded text, and matching the role information with the text yields the following:
00:01:05,900 00:01:10,080
Host: "Mr. Wang, I hear you recently set foot on Taiwan's soil again."
00:01:12,750 00:01:20,240
Mr. Wang: "Not just recently; I first went in 1993."
00:01:22,991 00:01:26,203
Host: "But you went once more recently."
00:01:26,901 00:01:32,856
Mr. Wang: "Yes, I have just been there again. Let me first ask this brother of mine: you are familiar with Taiwan, right? What do you feel is the most lovely thing about Taiwan?"
00:01:32,905 00:01:36,401
Mr. Wen: "I feel it is the human warmth, which you can see on the street at any time."
00:01:36,405 00:01:38,561
Host: "So do I."
00:01:39,012 00:01:52,871
Mr. Wang: "Very good. Although I had been to Taipei twice before, at most I had gone as far as Hualien; this time I went to Tainan, into Taichung, and on to Nantou and Niumtan. Especially in the small town of Lukang, the lunch I ate was bought and made on the spot, and wood was carved into little shapes, very lovely and particularly well made, though I could not tell what kind of wood it was. I especially felt that the people selling things there were kind and friendly to everyone. I did not like the word 'warm' in the past."
00:01:52,998 00:01:53,805
Mr. Wen: "Too sentimental."
00:01:53,908 00:01:54,674
Host: "Sentimental, corny."
00:01:54,785 00:01:58,609
Mr. Wang: "But after visiting small towns like Lukang, I felt very warm, and now I like that word."
With the roles distinguished in this way, the different viewpoints of the different guests can be seen at a glance, making it convenient for program producers to select material and conceive programs from this content.
The program producer can select content by role, text, and so on in a non-linear editing system to decide which valuable content to use as material for subsequent programs. After the material content is determined, the corresponding positions in the recorded video file can be located and cut according to the time codes of the selected material.
For example: the program producer selects the following material content:
00:01:05,900 00:01:10,080
Host: "Mr. Wang, I hear you recently set foot on Taiwan's soil again."
00:01:26,901 00:01:32,856
Mr. Wang: "Yes, I have just been there again. Let me first ask this brother of mine: you are familiar with Taiwan, right? What do you feel is the most lovely thing about Taiwan?"
00:01:32,905 00:01:36,401
Mr. Wen: "I feel it is the human warmth, which you can see on the street at any time."
00:01:39,012 00:01:52,871
Mr. Wang: "Very good. Although I had been to Taipei twice before, at most I had gone as far as Hualien; this time I went to Tainan, into Taichung, and on to Nantou and Niumtan. Especially in the small town of Lukang, the lunch I ate was bought and made on the spot, and wood was carved into little shapes, very lovely and particularly well made, though I could not tell what kind of wood it was. I especially felt that the people selling things there were kind and friendly to everyone. I did not like the word 'warm' in the past."
00:01:54,785 00:01:58,609
Mr. Wang: "But after visiting small towns like Lukang, I felt very warm, and now I like that word."
Then, according to the time code of each piece of material content, the corresponding position in the video file is located. For example: locate the period 00:01:05,900 to 00:01:10,080 (the shot of the host saying that line) and cut out that video segment; locate the period 00:01:26,901 to 00:01:32,856 (the shot of Mr. Wang saying that sentence) and cut out that segment.
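To locate those positions programmatically, the SRT-style time codes in the transcript above can be parsed into seconds. The following Java sketch shows the parsing, with the sample values taken from this example.

```java
public class TimeCodeParser {

    /** Parses a time code like "00:01:05,900" into seconds. */
    static double parse(String tc) {
        String[] hms = tc.split("[:,]");     // hours, minutes, seconds, millis
        return Integer.parseInt(hms[0]) * 3600
             + Integer.parseInt(hms[1]) * 60
             + Integer.parseInt(hms[2])
             + Integer.parseInt(hms[3]) / 1000.0;
    }

    public static void main(String[] args) {
        double start = parse("00:01:05,900");
        double end   = parse("00:01:10,080");
        // The host's first line occupies this span of the recorded video.
        System.out.printf("clip from %.3f s to %.3f s (%.3f s long)%n",
                start, end, end - start);    // 65.900 s to 70.080 s, 4.180 s
    }
}
```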
Finally, the several video segments obtained in this way are used as the program material to generate a program file.
In a specific implementation, the program file can then be reviewed manually: the program material is checked, and once the final version is confirmed, the program is broadcast at the corresponding time according to the program schedule.
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.

Claims (14)

1. A program material production method, comprising:
determining an audio file of a program, the program including at least one role;
determining role information for each speech segment from the audio file, and transcribing the audio file to obtain text with time code information;
matching the time-coded text with the role information;
determining material content from the text and the role information;
and clipping the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
2. The method of claim 1, wherein determining the role information of each speech segment from the audio file comprises:
extracting the voiceprint vector feature of each audio segment in the audio file;
matching the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and determining, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature.
3. The method of claim 2, wherein extracting the voiceprint vector feature of each audio segment in the audio file comprises:
splitting the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment comprising a plurality of second audio segments;
extracting the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and taking the vector feature of that arbitrary part as the vector feature of the first audio segment.
4. The method according to claim 2, wherein establishing the voiceprint library comprises:
collecting an arbitrary audio clip of each of a plurality of roles;
labeling each audio clip with its role, and extracting the voiceprint feature of each audio clip;
and storing the voiceprint features and the corresponding role information to obtain the voiceprint library.
5. The method of claim 1, wherein transcribing the audio file into text with time code information comprises:
determining a manuscript corresponding to the audio file;
inputting the audio file and its manuscript into a pre-trained speech recognition deep neural network model;
and outputting, by the speech recognition deep neural network model, every word of the manuscript with a timestamp.
6. The method of claim 5, wherein the speech recognition deep neural network model outputting every word of the manuscript with a timestamp comprises:
recognizing, by the speech recognition deep neural network model, each frame of speech of the audio file as a state sequence;
obtaining a plurality of phonemes from the state sequences of the frames;
generating one or more words from the plurality of phonemes;
matching the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and determining the timestamp of each word from that relative time position.
7. A program material production apparatus, comprising:
a file determining module, configured to determine an audio file of a program, the program including at least one role;
a role determining module, configured to determine role information for each speech segment from the audio file;
a text transcription module, configured to transcribe the audio file to obtain text with time code information;
a matching module, configured to match the time-coded text with the role information;
a material selection module, configured to determine material content from the text and the role information;
and a clipping module, configured to clip the video file corresponding to the audio file according to the time code information of the material content to obtain program material.
8. The apparatus of claim 7, wherein the role determining module comprises:
a feature extraction unit, configured to extract the voiceprint vector feature of each audio segment in the audio file;
a feature matching unit, configured to match the vector feature against a pre-established voiceprint library, the voiceprint library containing correspondences between vector features and role information;
and a role determining unit, configured to determine, according to the similarity between the vector feature of the audio segment and a vector feature in the voiceprint library, the role information of the audio segment to be the role information corresponding to that vector feature.
9. The apparatus of claim 8, wherein the feature extraction unit comprises:
an audio splitting subunit, configured to split the audio file into a plurality of first audio segments according to the sentence-end positions and/or audio pause positions of the audio file, each first audio segment comprising a plurality of second audio segments;
a feature extraction subunit, configured to extract the voiceprint vector feature of an arbitrary part of the audio in each first audio segment;
and a feature determining subunit, configured to take the vector feature of that arbitrary part as the vector feature of the first audio segment.
10. The apparatus of claim 8, further comprising:
a voiceprint library establishing module, configured to collect an arbitrary audio clip of each of a plurality of roles, label each audio clip with its role, extract the voiceprint feature of each audio clip, and store the voiceprint features and the corresponding role information to obtain the voiceprint library.
11. The apparatus of claim 7, wherein the text transcription module comprises:
a manuscript determining unit, configured to determine the manuscript corresponding to the audio file;
and a transcription unit, configured to input the audio file and its manuscript into a pre-trained speech recognition deep neural network model, the model outputting every word of the manuscript with a timestamp.
12. The apparatus of claim 11, wherein the transcription unit comprises:
a first processing subunit, configured to recognize each frame of speech of the audio file as a state sequence;
a second processing subunit, configured to obtain a plurality of phonemes from the state sequences of the frames;
a third processing subunit, configured to generate one or more words from the plurality of phonemes;
a fourth processing subunit, configured to match the one or more words with the per-frame speech content to obtain the relative time position, on a time axis, of the speech segment corresponding to each word;
and a fifth processing subunit, configured to determine the timestamp of each word from that relative time position.
13. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 6.
14. An electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method of any of claims 1 to 6.
Application CN201911045013.XA, filed 2019-10-30 (priority date 2019-10-30): Program material production method and device, computer storage medium and electronic equipment. Status: Pending. Publication: CN110691258A.

Priority Applications (1)

Application Number: CN201911045013.XA — Priority date: 2019-10-30 — Filing date: 2019-10-30 — Title: Program material production method and device, computer storage medium and electronic equipment

Applications Claiming Priority (1)

Application Number: CN201911045013.XA — Priority date: 2019-10-30 — Filing date: 2019-10-30 — Title: Program material production method and device, computer storage medium and electronic equipment

Publications (1)

Publication Number: CN110691258A — Publication Date: 2020-01-14

Family

ID=69114876

Family Applications (1)

Application Number: CN201911045013.XA — Status: Pending — Publication: CN110691258A — Title: Program material production method and device, computer storage medium and electronic equipment

Country Status (1)

Country: CN — Publication: CN110691258A

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060179403A1 (en) * 2005-02-10 2006-08-10 Transcript Associates, Inc. Media editing system
US20110239107A1 (en) * 2010-03-29 2011-09-29 Phillips Michael E Transcript editor
CN107124647A (en) * 2017-05-27 2017-09-01 深圳市酷开网络科技有限公司 A kind of panoramic video automatically generates the method and device of subtitle file when recording
CN109326310A (en) * 2017-07-31 2019-02-12 西梅科技(北京)有限公司 A kind of method, apparatus and electronic equipment of automatic editing
CN110166818A (en) * 2018-11-30 2019-08-23 腾讯科技(深圳)有限公司 Wait match generation method, computer equipment and the storage medium of audio-video
CN110121103A (en) * 2019-05-06 2019-08-13 郭凌含 The automatic editing synthetic method of video and device
CN110166816A (en) * 2019-05-29 2019-08-23 上海乂学教育科技有限公司 The video editing method and system based on speech recognition for artificial intelligence education
CN110335612A (en) * 2019-07-11 2019-10-15 招商局金融科技有限公司 Minutes generation method, device and storage medium based on speech recognition

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113571061A (en) * 2020-04-28 2021-10-29 阿里巴巴集团控股有限公司 System, method, device and equipment for editing voice transcription text
CN111901549A (en) * 2020-08-07 2020-11-06 杭州当虹科技股份有限公司 Auxiliary field recording cataloguing method based on voice recognition technology
WO2022160749A1 (en) * 2021-01-29 2022-08-04 深圳壹秘科技有限公司 Role separation method for speech processing device, and speech processing device
CN113269854A (en) * 2021-07-16 2021-08-17 成都索贝视频云计算有限公司 Method for intelligently generating interview-type comprehensive programs
CN114465737A (en) * 2022-04-13 2022-05-10 腾讯科技(深圳)有限公司 Data processing method and device, computer equipment and storage medium
CN115209218A (en) * 2022-06-27 2022-10-18 联想(北京)有限公司 Video information processing method, electronic equipment and storage medium
CN116600166A (en) * 2023-05-26 2023-08-15 武汉星巡智能科技有限公司 Video real-time editing method, device and equipment based on audio analysis
CN116600166B (en) * 2023-05-26 2024-03-12 武汉星巡智能科技有限公司 Video real-time editing method, device and equipment based on audio analysis

Similar Documents

Publication Publication Date Title
CN110691258A (en) Program material manufacturing method and device, computer storage medium and electronic equipment
CN106331893B (en) Real-time caption presentation method and system
CN109949783B (en) Song synthesis method and system
CN105244022B (en) Audio-video method for generating captions and device
CN101739870B (en) Interactive language learning system and method
CN106531185B (en) voice evaluation method and system based on voice similarity
CN102779508B (en) Sound bank generates Apparatus for () and method therefor, speech synthesis system and method thereof
CN102568478B (en) Video play control method and system based on voice recognition
CN109285537B (en) Acoustic model establishing method, acoustic model establishing device, acoustic model synthesizing method, acoustic model synthesizing device, acoustic model synthesizing equipment and storage medium
CN105304080A (en) Speech synthesis device and speech synthesis method
CN110740275B (en) Nonlinear editing system
CN111785275A (en) Voice recognition method and device
Qian et al. A two-pass framework of mispronunciation detection and diagnosis for computer-aided pronunciation training
CN110781649A (en) Subtitle editing method and device, computer storage medium and electronic equipment
Wang et al. Comic-guided speech synthesis
CN111739536A (en) Audio processing method and device
CN112667787A (en) Intelligent response method, system and storage medium based on phonetics label
CN111933121B (en) Acoustic model training method and device
Santos et al. CORAA NURC-SP Minimal Corpus: a manually annotated corpus of Brazilian Portuguese spontaneous speech
CN110310620B (en) Speech fusion method based on native pronunciation reinforcement learning
KR101920653B1 (en) Method and program for edcating language by making comparison sound
CN112634861A (en) Data processing method and device, electronic equipment and readable storage medium
CN108182946B (en) Vocal music mode selection method and device based on voiceprint recognition
CN115424616A (en) Audio data screening method, device, equipment and computer readable medium
CN113628609A (en) Automatic audio content generation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200114)