WO2017054488A1 - Television play control method, server and television play control system - Google Patents


Info

Publication number
WO2017054488A1
WO2017054488A1 · PCT/CN2016/084461 · CN2016084461W
Authority
WO
WIPO (PCT)
Prior art keywords
audio data
audio
data
television terminal
time segment
Prior art date
Application number
PCT/CN2016/084461
Other languages
French (fr)
Chinese (zh)
Inventor
戚炎兴 (Qi Yanxing)
Original Assignee
深圳TCL新技术有限公司 (Shenzhen TCL New Technology Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳TCL新技术有限公司 (Shenzhen TCL New Technology Co., Ltd.)
Publication of WO2017054488A1 publication Critical patent/WO2017054488A1/en

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/20: Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N 21/23: Processing of content or additional data; Elementary server operations; Server middleware
    • H04N 21/233: Processing of audio elementary streams
    • H04N 21/235: Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H04N 21/40: Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N 21/47: End-user applications
    • H04N 21/485: End-user interface for client configuration
    • H04N 21/4852: End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
    • H04N 21/4856: End-user interface for client configuration for language selection, e.g. for the menu or subtitles
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/85: Assembly of content; Generation of multimedia applications
    • H04N 21/854: Content authoring
    • H04N 21/8547: Content authoring involving timestamps for synchronizing content

Definitions

  • The present invention relates to the field of television technologies, and in particular to a television broadcast control method, a server, and a television broadcast control system.
  • Most video files provide only one audio track, yet two or more subtitle tracks.
  • The user can therefore only listen to the default audio provided in the video file; when the user does not understand its language, the character dialogue and plot can only be followed by reading the subtitles, which degrades the audiovisual experience.
  • The main objective of the present invention is to provide a television broadcast control method, a server and a television broadcast control system that supply audio the user can understand, according to the language requirements of different users, so that the user no longer has to rely on subtitles to follow the character dialogue and plot, thereby improving the experience of watching television.
  • The present invention provides a television broadcast control method, which includes the following steps.
  • The present invention further provides a server, where the server includes:
  • a first receiving module configured to receive first audio data and subtitle data sent by the television terminal;
  • a generating processing module configured to perform recognition processing on the first audio data and the subtitle data, and to generate a role list and sample audio parameters;
  • a synthesis processing module configured to send the role list and sample audio parameters to the television terminal and, upon receiving the user setting parameters reported by the television terminal according to the role list and the sample audio parameters, to synthesize the first audio data into second audio data;
  • a first sending module configured to send the second audio data to the television terminal, so that the second audio data and the subtitle data are played at the television terminal.
  • The present invention further provides a television broadcast control system comprising a television terminal and a server as described above, the television terminal comprising:
  • a second sending module configured to send first audio data and subtitle data to the server;
  • a second receiving module configured to receive the role list and sample audio parameters generated by the server after the first audio data and the subtitle data have been recognized and processed;
  • a feedback module configured to generate user setting parameters according to the role list and the sample audio parameters, and to feed the user setting parameters back to the server;
  • an acquiring module configured to acquire the second audio data that the server synthesizes from the first audio data upon receiving the user setting parameters;
  • a synchronous play module configured to synchronously play the second audio data, the video data, and the subtitle data;
  • wherein the television terminal extracts the video data, the first audio data, and the subtitle data from a video file.
  • In the television broadcast control method, the server and the television broadcast control system provided by the present invention, the server first receives the first audio data and the subtitle data sent by the television terminal and performs recognition processing to generate a role list and sample audio parameters; it then sends the role list and the sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, synthesizes the first audio data into second audio data according to those parameters; finally, it sends the second audio data to the television terminal so that the second audio data and the subtitle data are played at the television terminal.
  • In this way, audio that the user can understand is provided, the user's personalized requirements for the character dialogue are satisfied, and the user no longer has to rely on subtitles alone to follow the dialogue and plot, which improves the experience of watching television.
  • FIG. 1 is a schematic flowchart of an embodiment of a television broadcast control method according to the present invention;
  • FIG. 2 is a schematic diagram of the refinement of the step in FIG. 1 of recognizing the first audio data and generating a role list and sample audio parameters;
  • FIG. 3 is a waveform diagram of a subtitle timestamp and the first audio data;
  • FIG. 4 is a schematic diagram of the refinement of the step in FIG. 2 of performing spectrum analysis on the first audio data in the time segments and classifying it to generate a role list;
  • FIG. 5 is a schematic diagram of the refinement of the step in FIG. 2 of generating sample audio parameters corresponding to the role list by using speech synthesis technology;
  • FIG. 6 is a schematic diagram of the refinement of the step in FIG. 1 of sending the role list and sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, synthesizing the first audio data into the second audio data according to those parameters;
  • FIG. 7 is a schematic diagram of a synthesized waveform of the second audio data;
  • FIG. 8 is a schematic diagram of the functional modules of an embodiment of a server according to the present invention;
  • FIG. 9 is a schematic diagram of the refinement of the functional modules of the generation processing module in FIG. 8;
  • FIG. 10 is a schematic diagram of the refinement of the functional modules of the categorizing unit in FIG. 9;
  • FIG. 11 is a schematic diagram of the refinement of the functional modules of the generating unit in FIG. 9;
  • FIG. 12 is a schematic diagram of the refinement of the functional modules of the synthesis processing module in FIG. 8;
  • FIG. 13 is a schematic diagram of the functional modules of an embodiment of a television broadcast control system according to the present invention;
  • FIG. 14 is a schematic diagram of the refinement of the functional modules of the television terminal in FIG. 13.
  • The television broadcast control method includes the following steps:
  • Step S10: the server receives the first audio data and the subtitle data sent by the television terminal.
  • The audio and video playback of the television is completed by the television terminal in cooperation with the server.
  • The television terminal collates and transmits the audio, subtitle and other data, and provides a user interface for the user to set parameters.
  • The server receives the audio, subtitle and other data sent by the television terminal, processes the audio and subtitle data to synthesize new audio data, and transmits it to the television terminal for playback.
  • When the user turns on the dubbing setting function through the remote controller of the television terminal, the television terminal quickly decodes the video file, extracts the audio track selected by the user (or the default audio) and the subtitles selected by the user (or the default subtitles), and packages the audio data and the subtitle data for sending to the server.
  • Step S20: recognition processing is performed on the first audio data and the subtitle data, and a role list and sample audio parameters are generated.
  • The server performs recognition processing on the first audio data and the subtitle data to generate a role list and sample audio parameters.
  • To generate the role list, a predetermined number of timestamps may first be selected, for example three; the audio data within those three timestamps is then recognized and analyzed separately, and voices that sound similar across the timestamps are classified as one role, namely role 1, role 2, and so on.
  • The specific classification method can distinguish the voices statistically according to the audio spectrum.
  • The sample audio is preset, fixed audio; it can be audio of different genders, with sample audio corresponding to different frequency ranges for each gender.
  • Examples include male high-pitch audio, male mid-pitch audio, male low-pitch audio, female high-pitch audio, female mid-pitch audio, female low-pitch audio, and so on.
  • The frequency ranges of the audio can be subdivided further and are not limited to the high, mid and low ranges in this embodiment.
  • The sample audio can also be the audio of a famous or professional voice actor.
  • Step S30: the role list and the sample audio parameters are sent to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, the first audio data is synthesized into second audio data.
  • The server sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the television screen on which the user enters selections from the role list and the sample audio parameters to generate the user setting parameters, and the television terminal then feeds the user setting parameters back to the server.
  • The server synthesizes the first audio data into the second audio data according to the user setting parameters. The synthesis requires a text-to-speech engine and a vocal cancellation program: the text-to-speech engine generates new audio data corresponding to the subtitle data (according to the user setting parameters), the vocal cancellation program removes the vocals from the first audio data, and the new audio data and the vocal-removed first audio data are then combined into the second audio data.
  • Step S40: the second audio data is sent to the television terminal, so that the second audio data and the subtitle data are played at the television terminal.
  • The server sends the synthesized second audio data corresponding to the role list to the television terminal. It can be understood that, in addition to extracting the audio data and subtitle data from the video file, the television terminal also extracts the video data; when the television terminal receives the second audio data, it synchronizes the video data with the second audio data and the subtitle data and plays them.
  • In the television broadcast control method provided by the present invention, the server first receives the first audio data and the subtitle data sent by the television terminal and performs recognition processing to generate a role list and sample audio parameters; it then sends the role list and the sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, synthesizes the first audio data into second audio data according to those parameters; finally, it sends the second audio data to the television terminal so that the second audio data and the subtitle data are played at the television terminal.
  • In this way, audio that the user can understand is provided, the user's personalized requirements for the character dialogue are satisfied, and the user no longer has to rely on subtitles alone to follow the dialogue and plot, which improves the experience of watching television.
  • The step S20 includes:
  • Step S201: the server extracts subtitle timestamps from the subtitle data;
  • Step S202: the time segments in which the first audio data appears are found according to the subtitle timestamps.
  • The server extracts the subtitle timestamps from the subtitle data, finds the time segments in which character dubbing appears according to those timestamps, calls the speech recognition module to perform recognition processing, and selects by statistics the audio data that appears most frequently within the time segments for the user to choose from.
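The timestamp extraction in steps S201 and S202 can be sketched as follows. The patent does not name a subtitle format, so SRT-style "HH:MM:SS,mmm --> HH:MM:SS,mmm" cues are assumed here purely for illustration:

```python
import re

# Assumed SRT-style cue timing line, e.g. "00:00:01,000 --> 00:00:03,500"
TIME_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3})\s*-->\s*(\d{2}):(\d{2}):(\d{2}),(\d{3})"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def extract_segments(srt_text):
    """Return (start, end) time segments in which dialogue, and hence
    character dubbing, appears, taken from the subtitle timestamps."""
    segments = []
    for match in TIME_RE.finditer(srt_text):
        start = to_seconds(*match.groups()[:4])
        end = to_seconds(*match.groups()[4:])
        segments.append((start, end))
    return segments

srt = """1
00:00:01,000 --> 00:00:03,500
Hello there.

2
00:01:10,250 --> 00:01:12,000
Who are you?"""

print(extract_segments(srt))  # [(1.0, 3.5), (70.25, 72.0)]
```

Only the audio samples inside these segments would then be passed to the speech recognition module.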
  • Step S203: spectrum analysis is performed on the first audio data within the time segments, and the data is classified to generate a role list.
  • The step S203 may specifically include:
  • Step S2031: the first audio data within the first time segment and the second time segment is acquired respectively;
  • Step S2032: it is determined whether the spectrum range and the spectrum amplitude of the first audio data in the first time segment are consistent with those in the second time segment.
  • The server acquires the first audio data within the first time segment and the second time segment respectively, analyzes the spectrum range and spectrum amplitude of the first audio data in each segment, and determines whether the spectrum range and spectrum amplitude of the first audio data in the first time segment are consistent with those of the first audio data in the second time segment.
  • Step S2033: if yes, the first audio data within the first time segment and the second time segment is classified into the same role;
  • Step S2034: if no, the first audio data within the first time segment and the second time segment is classified into different roles.
  • If the server determines that the spectra are consistent, the first audio data in the two time segments is classified into the same role; otherwise, it is classified into different roles.
  • Whether the spectrum range and spectrum amplitude of the first audio data in two time segments are consistent can be judged by similarity: if the similarity between them is greater than or equal to 90%, they are determined to be consistent.
  • The value of the similarity threshold is not limited to this embodiment and may be chosen reasonably according to actual needs.
  • The audio spectrum of one time segment is used as a reference and defined as role 1, and is then compared with the audio spectra of subsequent time segments. If the features of the two spectra are judged to be close, the audio in both time segments is classified as role 1; if the features do not match, the audio in the later time segment is classified as role 2, and so on until the audio spectra of all time segments have been recognized. Finally, the number of occurrences of each role is counted, and the roles with the most occurrences are taken as the main personas.
  • This embodiment mainly analyzes the audio spectrum of the audio data within the timestamps, because each persona's voice differs in spectrum: for example, the spectrum of a male voice is concentrated mainly in the mid and low frequency regions, while that of a female voice is concentrated in the mid and high frequency regions. In addition, between personas, the spectral amplitudes at individual frequency points also differ. The voices of different personas can therefore be distinguished by combining the spectrum range and the spectrum amplitude.
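The greedy reference-and-compare classification described above can be sketched as follows. The 90% threshold comes from the text; the particular similarity measure and the toy per-band spectra are assumptions made for illustration:

```python
def spectra_match(spec_a, spec_b, threshold=0.9):
    """Compare two magnitude spectra (lists of per-band amplitudes).
    Consistent with the text, two segments belong to the same role when
    their similarity is at least 90%. The measure used here
    (1 - normalised absolute difference) is only one possible choice."""
    diff = sum(abs(a - b) for a, b in zip(spec_a, spec_b))
    total = sum(abs(a) + abs(b) for a, b in zip(spec_a, spec_b)) or 1.0
    return (1.0 - diff / total) >= threshold

def classify_roles(segment_spectra):
    """Greedy classification matching steps S2031-S2034: the first
    segment's spectrum defines role 1; a later segment joins an existing
    role when its spectrum matches, otherwise it opens a new role."""
    roles = []    # one reference spectrum per role
    labels = []
    for spec in segment_spectra:
        for i, ref in enumerate(roles):
            if spectra_match(spec, ref):
                labels.append(i + 1)
                break
        else:
            roles.append(spec)
            labels.append(len(roles))
    return labels

# Toy per-band amplitudes: low-frequency-heavy (male-like) versus
# high-frequency-heavy (female-like) spectra.
male = [8.0, 6.0, 2.0, 1.0]
female = [1.0, 2.0, 6.0, 8.0]
print(classify_roles([male, female, male]))  # [1, 2, 1]
```

Counting how often each label occurs then yields the main personas, as the text describes.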
  • Step S204: sample audio parameters corresponding to the role list are generated by using speech synthesis technology.
  • A subtitle timestamp indicates that the audio data in the corresponding time segment contains persona audio, and speech recognition is performed on the audio data in that time segment to recognize the audio of one of the personas.
  • The sample audio is preset, fixed audio; it may be audio of different genders, with sample audio corresponding to different frequency ranges for each gender. For example, a prompt such as "Do you want to use this voice as the character's dubbing?" may be played, and male high-pitch, male mid-pitch, male low-pitch, female high-pitch, female mid-pitch, female low-pitch and other sample audio provided. Of course, in other embodiments the frequency ranges of the audio may be subdivided further and are not limited to the high, mid and low ranges in this embodiment.
  • The role list and sample parameters are presented in a pop-up user interface for the user to select from, where the role list is the result of the role classification described above, and the sample parameters comprise the timestamp parameters in each role classification and the sample audio the user can preview.
  • The timestamp parameters allow the user to preview the original dubbing as well as the sample audio.
  • The step S204 includes:
  • Step S2041: for each role in the role list, a predetermined number of subtitle timestamps is extracted from the subtitle data;
  • Step S2042: the text-to-speech engine generates a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, which are sent to the television terminal for preview and selection.
  • The user can preview both the original voice of each character and the selected sample audio.
  • The television terminal transmits the corresponding parameters to the server, and the server uses the text-to-speech engine to generate a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, which are sent to the television terminal for preview and selection.
  • For example, the server provides three timestamps for each role classification and at the same time sends the generated sample audio to the television terminal.
  • The user may select any of the provided timestamps for a role classification to play the audio of the corresponding time at the television terminal, so that the user recognizes the person that the role classification represents.
  • The user can also audition the sample audio produced by the text-to-speech engine in order to select and confirm the appropriate sample audio parameters.
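After previewing, the terminal reports the user's choices back to the server. For illustration, the user setting parameters might take the following shape; the field names and preset sample-audio labels are assumptions, not from the patent:

```python
# Preset sample-audio labels the server might offer (illustrative names
# for the male/female high/mid/low-pitch samples described above).
SAMPLE_AUDIO = {
    "male_high", "male_mid", "male_low",
    "female_high", "female_mid", "female_low",
}

def validate_settings(settings):
    """Check that every recognised role picked one of the preset
    sample audios before synthesis begins."""
    return all(v["sample_audio"] in SAMPLE_AUDIO for v in settings.values())

# One entry per role in the role list, as the terminal might report it.
user_settings = {
    "role_1": {"sample_audio": "female_mid", "language": "en"},
    "role_2": {"sample_audio": "male_low", "language": "en"},
}
print(validate_settings(user_settings))  # True
```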
  • The step S30 includes:
  • Step S301: the generated role list and sample audio parameters are sent to the television terminal;
  • Step S302: the user setting parameters fed back by the television terminal are received.
  • The server sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the television screen on which the user enters selections from the role list and the sample audio parameters to generate the user setting parameters, and then feeds the user setting parameters back to the server.
  • Step S303: audio filtering is performed on the first audio data, and the second audio data corresponding to the role list is synthesized by the text-to-speech engine in combination with the user setting parameters.
  • New audio data corresponding to the subtitle data may be generated by the text-to-speech engine (according to the user's setting parameters), vocal cancellation may be performed on the first audio data, and the new audio data and the vocal-removed first audio data may then be synthesized into the second audio data corresponding to the role list.
  • The existing vocal elimination method mainly exploits the fact that the vocals are usually identical in the left and right channels, and subtracts the channels from each other to remove their common part; however, this method not only causes a large loss to the background sound (especially in the low-frequency part), but also fails to eliminate the vocals well when they differ between the two channels.
  • In this embodiment a bandpass filter is therefore used: within the pass band of the filter only the amplitude of the original voice is reduced, which does not affect the intelligibility of the synthesized audio, so the low-frequency and high-frequency portions are better preserved.
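A minimal illustration of this band-limited vocal reduction: within an assumed vocal band the amplitude is only reduced, leaving low and high frequencies intact. The 300-3400 Hz band and the 0.25 gain are illustrative choices, not values from the patent, and a direct DFT is used for clarity rather than speed:

```python
import cmath
import math

def attenuate_band(samples, rate, lo, hi, gain=0.25):
    """Reduce, rather than fully remove, the amplitude of the assumed
    vocal band, keeping low and high frequencies intact. A direct DFT
    is used for clarity; real code would use an FFT or a time-domain
    band filter."""
    n = len(samples)
    spec = [sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n)) for k in range(n)]
    for k in range(n):
        freq = k * rate / n
        freq = min(freq, rate - freq)   # mirror for the conjugate bins
        if lo <= freq <= hi:
            spec[k] *= gain
    # inverse DFT back to the time domain
    return [sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n)
                for k in range(n)).real / n for t in range(n)]

# Toy check: a 1000 Hz tone (inside the assumed vocal band) is reduced,
# while a 100 Hz tone (background bass) passes through.
rate, n = 8000, 80
vocal = [math.sin(2 * math.pi * 1000 * t / rate) for t in range(n)]
bass = [math.sin(2 * math.pi * 100 * t / rate) for t in range(n)]
print(max(abs(x) for x in attenuate_band(vocal, rate, 300, 3400)))  # ~0.25
print(max(abs(x) for x in attenuate_band(bass, rate, 300, 3400)))   # ~1.0
```

Because the band is attenuated rather than subtracted channel-against-channel, the low and high frequencies of the background sound survive, which is the advantage the text claims over left/right cancellation.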
  • The server 1 includes:
  • The first receiving module 10 is configured to receive the first audio data and the subtitle data sent by the television terminal.
  • The audio and video playback of the television is completed by the television terminal in cooperation with the server 1.
  • The television terminal collates and transmits the audio and subtitle data, and provides a user interface for the user to set parameters.
  • The server 1 receives the audio, subtitle and other data sent by the television terminal, processes the audio and subtitle data to synthesize new audio data, and transmits it to the television terminal for playback.
  • When the user turns on the dubbing setting function through the remote controller of the television terminal, the television terminal quickly decodes the video file, extracts the audio track selected by the user (or the default audio) and the subtitles selected by the user (or the default subtitles), and packages the audio data and the subtitle data for sending to the server 1.
  • The generating processing module 20 is configured to perform recognition processing on the first audio data and the subtitle data to generate a role list and sample audio parameters.
  • The server 1 performs recognition processing on the first audio data and the subtitle data to generate a role list and sample audio parameters.
  • To generate the role list, a predetermined number of timestamps may first be selected, for example three; the audio data within those three timestamps is then recognized and analyzed separately, and voices that sound similar across the timestamps are classified as one role, namely role 1, role 2, and so on.
  • The specific classification method can distinguish the voices statistically according to the audio spectrum.
  • The sample audio is preset, fixed audio; it can be audio of different genders, with sample audio corresponding to different frequency ranges for each gender.
  • Examples include male high-pitch audio, male mid-pitch audio, male low-pitch audio, female high-pitch audio, female mid-pitch audio, female low-pitch audio, and so on.
  • The frequency ranges of the audio can be subdivided further and are not limited to the high, mid and low ranges in this embodiment.
  • The sample audio can also be the audio of a famous or professional voice actor.
  • The synthesis processing module 30 is configured to send the role list and sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal according to the role list and the sample audio parameters, to synthesize the first audio data into second audio data.
  • The server 1 sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the television screen on which the user enters selections from the role list and the sample audio parameters to generate the user setting parameters, and then feeds the user setting parameters back to the server 1.
  • The server 1 synthesizes the first audio data into the second audio data according to the user setting parameters. The synthesis requires a text-to-speech engine and a vocal cancellation program: the text-to-speech engine generates new audio data corresponding to the subtitle data (according to the user setting parameters), the vocal cancellation program removes the vocals from the first audio data, and the new audio data and the vocal-removed first audio data are then combined into the second audio data.
  • The first sending module 40 is configured to send the second audio data to the television terminal, so that the second audio data and the subtitle data are played at the television terminal.
  • The server 1 sends the synthesized second audio data corresponding to the role list to the television terminal. It can be understood that, in addition to extracting the audio data and subtitle data from the video file, the television terminal also extracts the video data; when the television terminal receives the second audio data, it synchronizes the video data with the second audio data and the subtitle data and plays them.
  • In the server 1 provided by the present invention, the first audio data and the subtitle data sent by the television terminal are first received and recognition processing is performed to generate a role list and sample audio parameters; the role list and the sample audio parameters are then sent to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, the first audio data is synthesized into second audio data according to those parameters; finally, the second audio data is sent to the television terminal so that the second audio data and the subtitle data are played at the television terminal.
  • In this way, audio that the user can understand is provided, the user's personalized requirements for the character dialogue are satisfied, and the user no longer has to rely on subtitles alone to follow the dialogue and plot, which improves the experience of watching television.
  • The generation processing module 20 includes:
  • an obtaining unit 201 configured to extract subtitle timestamps from the subtitle data;
  • a searching unit 202 configured to search, according to the subtitle timestamps, for the time segments in which the first audio data appears.
  • The server 1 extracts the subtitle timestamps from the subtitle data, finds the time segments in which character dubbing appears according to those timestamps, calls the speech recognition module to perform recognition processing, and selects by statistics the audio data that appears most frequently within the time segments for the user to choose from.
  • The categorizing unit 203 is configured to perform spectrum analysis on the first audio data within the time segments and to classify the data to generate a role list.
  • The categorizing unit 203 includes:
  • an obtaining subunit 2031 configured to respectively acquire the first audio data within the first time segment and the second time segment;
  • a determining subunit 2032 configured to determine whether the spectrum range and the spectrum amplitude of the first audio data in the first time segment are consistent with those in the second time segment.
  • The server acquires the first audio data within the first time segment and the second time segment respectively, analyzes the spectrum range and spectrum amplitude of the first audio data in each segment, and determines whether the spectrum range and spectrum amplitude of the first audio data in the first time segment are consistent with those of the first audio data in the second time segment.
  • a first categorization subunit 2033 configured to classify the first audio data within the first time segment and the second time segment into the same role when their spectrum ranges and spectrum amplitudes are determined to be consistent;
  • a second categorization subunit 2034 configured to classify the first audio data within the first time segment and the second time segment into different roles when their spectrum ranges and/or spectrum amplitudes are determined to be inconsistent.
  • the generating unit 204 is configured to generate a sample audio parameter corresponding to the character list by using a voice synthesis technology.
  • the generating unit 204 includes:
  • an extracting subunit 2041 configured to extract, for each of the roles in the role list, a predetermined number of subtitle timestamps from the subtitle data;
  • a generating subunit 2042 configured to generate, through the text-to-speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, to send to the television terminal for preview selection.
  • the synthesis processing module 30 includes:
  • the sending unit 301 is configured to send the generated role list and sample audio parameters to the television terminal;
  • the receiving unit 302 is configured to receive the user setting parameters fed back by the television terminal according to the role list and the sample audio parameters.
  • the server 1 sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the television screen on which the user inputs and selects from the role list and the sample audio parameters to generate the user setting parameters, and then feeds the user setting parameters back to the server 1.
  • the synthesizing unit 303 is configured to perform audio filtering on the first audio data, and to synthesize the second audio data corresponding to the role list through the text-to-speech engine in combination with the user setting parameters.
  • new audio data corresponding to the subtitle data may be generated by the text-to-speech engine (varying according to the user's setting parameters), the first audio data is subjected to vocal cancellation according to the vocal cancellation program, and the new audio data and the vocal-cancelled first audio data are then synthesized into the second audio data corresponding to the role list.
  • the existing vocal cancellation method mainly exploits the fact that vocals are pronounced identically in the left and right channels, subtracting one channel from the other to remove the common part of the two channels; however, this method not only causes a large loss to the background sound (especially in the low-frequency part), but also fails to cancel the vocals well when the vocals differ between the two channels.
  • a band-pass filter is therefore used: within the filter's pass band, it is only necessary to reduce the amplitude of the original speech enough that it does not affect the intelligibility of the synthesized audio, so the low-frequency and high-frequency portions are better preserved.
  • the present invention also provides a television broadcast control system 100.
  • the television broadcast control system 100 includes a television terminal 2 and a server 1 as described above.
  • the television terminal 2 includes:
  • a second sending module 50 configured to send first audio data and caption data to the server 1;
  • the second receiving module 60 is configured to receive a role list and sample audio parameters generated by the server 1 after the first audio data and the caption data are identified and processed;
  • the feedback module 70 is configured to generate a user setting parameter according to the role list and the sample audio parameter, and feed back the user setting parameter to the server 1;
  • the obtaining module 80 is configured to acquire the second audio data synthesized from the first audio data by the server 1 upon receiving the user setting parameters;
  • the synchronous play module 90 is configured to synchronously play the second audio data, the video data, and the caption data.
  • when receiving the second audio data corresponding to the role list synthesized by the server 1, the television terminal 2 synchronizes the second audio data with the video data and the caption data and then plays them, so that the audio of the video file is pre-processed by the server 1 and synthesized into a language the user can understand, which enhances the user's viewing experience; in addition, a variety of character audio options can be provided to the user, further enhancing the user experience.


Abstract

Disclosed is a television play control method, comprising the following steps: a server receiving first audio data and subtitle data transmitted by a television terminal; performing an identification process on the first audio data and the subtitle data to generate a role list and a sample audio parameter; transmitting the role list and the sample audio parameter to the television terminal, and when receiving a user setting parameter fed back by the television terminal according to the role list and the sample audio parameter, synthesizing the first audio data into second audio data; and transmitting the second audio data to the television terminal, so as to control the playing of the second audio data and the subtitle data on the television terminal. Also disclosed are a server and a television play control system. The present invention can correspondingly provide audio which can be understood by a user according to language requirements of different users, so as to avoid the defect that character dialogues and the plots can only be understood by means of subtitles, thereby improving the user experience in watching television.

Description

Television broadcast control method, server and television broadcast control system

Technical Field

The present invention relates to the field of television technologies, and in particular, to a television broadcast control method, a server, and a television broadcast control system.

Background

When playing a video file, current television terminals usually switch character dubbing and subtitles according to the audio tracks and subtitle data in the video file, so that different users can select a language they understand for playback. However, this playback method has at least the following drawback:

Most video files may provide only one audio language while providing two or more subtitle languages. In that case, the user can only listen to the default audio provided in the video file; when the user does not understand the default language, the character dialogue and plot can only be followed by reading the subtitles. This degrades the user's audiovisual experience.
Summary of the Invention

The main objective of the present invention is to provide a television broadcast control method, a server, and a television broadcast control system, which are designed to provide audio that can be understood by the user according to the language requirements of different users, so as to avoid the defect that character dialogue and plot can only be understood by means of subtitles, thereby improving the user's television viewing experience.

To achieve the above objective, the present invention provides a television broadcast control method, which includes the following steps:

the server receives first audio data and subtitle data sent by a television terminal;

performing recognition processing on the first audio data and the subtitle data to generate a role list and sample audio parameters;

sending the role list and sample audio parameters to the television terminal and, upon receiving user setting parameters fed back by the television terminal according to the role list and sample audio parameters, synthesizing the first audio data into second audio data;

sending the second audio data to the television terminal to control the second audio data and the subtitle data to be played on the television terminal.
In addition, to achieve the above objective, the present invention further provides a server, which includes:

a first receiving module configured to receive first audio data and subtitle data sent by a television terminal;

a generation processing module configured to perform recognition processing on the first audio data and the subtitle data to generate a role list and sample audio parameters;

a synthesis processing module configured to send the role list and sample audio parameters to the television terminal and, upon receiving user setting parameters fed back by the television terminal according to the role list and sample audio parameters, synthesize the first audio data into second audio data;

a first sending module configured to send the second audio data to the television terminal to control the second audio data and the subtitle data to be played on the television terminal.
In addition, to achieve the above objective, the present invention further provides a television broadcast control system, which includes a television terminal and the server described above, the television terminal including:

a second sending module configured to send the first audio data and subtitle data to the server;

a second receiving module configured to receive the role list and sample audio parameters generated by the server after recognition processing of the first audio data and subtitle data;

a feedback module configured to generate user setting parameters according to the role list and sample audio parameters, and feed the user setting parameters back to the server;

an acquisition module configured to acquire the second audio data synthesized from the first audio data by the server upon receiving the user setting parameters;

a synchronous playback module configured to play the second audio data, the video data, and the subtitle data synchronously;

wherein the television terminal extracts the video data, the first audio data, and the subtitle data from a video file.
With the television broadcast control method, server, and television broadcast control system provided by the present invention, the server first receives the first audio data and subtitle data sent by the television terminal and performs recognition processing to generate a role list and sample audio parameters; it then sends the role list and sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, synthesizes the first audio data into second audio data according to the user setting parameters; finally, it sends the second audio data to the television terminal to control the second audio data and the subtitle data to be played on the television terminal. In this way, audio that the user can understand can be provided according to the language requirements of different users, and the user's personalized requirements for character dialogue can also be met, thereby avoiding the defect that character dialogue and plot can only be followed through subtitles, and improving the user's television viewing experience.
Brief Description of the Drawings

FIG. 1 is a schematic flowchart of an embodiment of a television broadcast control method according to the present invention;

FIG. 2 is a schematic flowchart detailing the step in FIG. 1 of performing recognition processing on the first audio data to generate a role list and sample audio parameters;

FIG. 3 is a schematic waveform diagram of subtitle timestamps and the first audio data;

FIG. 4 is a schematic flowchart detailing the step in FIG. 2 of performing spectrum analysis on the first audio data within the time segments and categorizing it to generate a role list;

FIG. 5 is a schematic flowchart detailing the step in FIG. 2 of generating sample audio parameters corresponding to the role list by using speech synthesis technology;

FIG. 6 is a schematic flowchart detailing the step in FIG. 1 of sending the role list and sample audio parameters to the television terminal and, upon receiving user setting parameters fed back by the television terminal, synthesizing the first audio data into second audio data according to the user setting parameters;

FIG. 7 is a schematic diagram of the synthesized waveform of the second audio data;

FIG. 8 is a schematic diagram of the functional modules of an embodiment of a server according to the present invention;

FIG. 9 is a schematic diagram of the detailed functional modules of the generation processing module in FIG. 8;

FIG. 10 is a schematic diagram of the detailed functional modules of the categorizing unit in FIG. 9;

FIG. 11 is a schematic diagram of the detailed functional modules of the generating unit in FIG. 9;

FIG. 12 is a schematic diagram of the detailed functional modules of the synthesis processing module in FIG. 8;

FIG. 13 is a schematic diagram of the functional modules of an embodiment of a television broadcast control system according to the present invention;

FIG. 14 is a schematic diagram of the detailed functional modules of the television terminal in FIG. 13.

The implementation, functional features, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed Description

It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The present invention provides a television broadcast control method. Referring to FIG. 1, in an embodiment, the television broadcast control method includes the following steps:

Step S10: the server receives first audio data and subtitle data sent by a television terminal.

In this embodiment, audio and video playback on the television is accomplished by the television terminal in cooperation with the server. The television terminal handles the collation and transmission of audio, subtitle, and other data, and provides a user interface for the user to set parameters. The server receives the audio, subtitle, and other data sent by the television terminal and processes the audio and subtitle data to synthesize new audio, which is transmitted to the television terminal for playback.

In this embodiment, when the user enables the dubbing setting function through the remote control of the television terminal, the television terminal quickly decodes the video file, extracts the audio selected by the user (or the default audio) and the subtitles selected by the user (or the default subtitles), and packages the audio data and subtitle data for sending to the server.

Step S20: perform recognition processing on the first audio data and the subtitle data to generate a role list and sample audio parameters.

The server performs recognition processing on the first audio data and subtitle data to generate the role list and sample audio parameters. To generate the role list, a predetermined number of timestamps may first be selected, e.g., three timestamps; the audio data within those timestamps is then recognized and analyzed, and voices with similar pronunciation across the timestamps are grouped into the same class, i.e., different roles such as role 1 and role 2. The specific classification can be performed statistically according to the audio spectrum. The sample audio is preset fixed audio, which may include audio of different genders and sample audio for different pitch ranges of each gender. For example, a specific utterance such as "Do you want to use this voice as the dubbing for the character?" may be provided as male high-pitched, male mid-pitched, male low-pitched, female high-pitched, female mid-pitched, and female low-pitched sample audio. Of course, in other embodiments, the frequency range of the audio may be subdivided further and is not limited to the high, mid, and low ranges of this embodiment. In addition, the sample audio may also be the audio of famous or professional voice actors.
Step S30: send the role list and sample audio parameters to the television terminal and, upon receiving user setting parameters fed back by the television terminal, synthesize the first audio data into second audio data.

In this embodiment, the server sends the generated role list and sample audio parameters to the television terminal. The television terminal presents a user interface on the television screen on which the user inputs and selects from the role list and sample audio parameters to generate the user setting parameters; the television terminal then feeds the user setting parameters back to the server. The server synthesizes the first audio data into the second audio data according to the user setting parameters. The synthesis of the second audio data requires a text-to-speech engine and a vocal cancellation program: the text-to-speech engine generates new audio data corresponding to the subtitle data (varying according to the user setting parameters), the vocal cancellation program removes the vocals from the first audio data, and the new audio data is then combined with the vocal-cancelled first audio data to form the second audio data.

Step S40: send the second audio data to the television terminal to control the second audio data and the subtitle data to be played on the television terminal.

In this embodiment, the server sends the synthesized second audio data corresponding to the role list to the television terminal. It can be understood that, in addition to extracting the audio data and subtitle data from the video file, the television terminal also extracts the video data from the video file; upon receiving the second audio data, the television terminal synchronizes the video data with the second audio data and the subtitle data, and finally plays them.

With the television broadcast control method provided by the present invention, the server first receives the first audio data and subtitle data sent by the television terminal and performs recognition processing to generate a role list and sample audio parameters; it then sends the role list and sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, synthesizes the first audio data into second audio data according to the user setting parameters; finally, it sends the second audio data to the television terminal to control the second audio data and the subtitle data to be played on the television terminal. In this way, audio that the user can understand can be provided according to the language requirements of different users, and the user's personalized requirements for character dialogue can also be met, thereby avoiding the defect that character dialogue and plot can only be followed through subtitles, and improving the user's television viewing experience.
In an embodiment, as shown in FIG. 2, based on the embodiment shown in FIG. 1, step S20 includes:

Step S201: the server extracts subtitle timestamps from the subtitle data.

Step S202: find the time segments in which the first audio data appears according to the subtitle timestamps.

In this embodiment, referring to FIG. 3, the server extracts the subtitle timestamps from the subtitle data, finds the time segments in which character dubbing appears according to the subtitle timestamps, and calls the speech recognition module to perform recognition, counting the several most frequently occurring voices within those time segments for the user to select from.
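The timestamp extraction in steps S201 and S202 can be sketched roughly as follows. This is a minimal illustration assuming the subtitle data is in SubRip (.srt) form; the application itself does not fix a subtitle format, so the regular expression and millisecond conversion here are illustrative.

```python
import re

# Matches SubRip-style timestamp lines, e.g. "00:01:02,100 --> 00:01:05,100"
TS_LINE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)

def to_ms(h, m, s, ms):
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def extract_time_segments(subtitle_text):
    """Return (start_ms, end_ms) pairs: the time segments in which
    character dubbing appears, according to the subtitle timestamps."""
    segments = []
    for match in TS_LINE.finditer(subtitle_text):
        g = match.groups()
        segments.append((to_ms(*g[0:4]), to_ms(*g[4:8])))
    return segments

srt = """1
00:01:02,100 --> 00:01:05,100
Hello there.

2
00:01:07,000 --> 00:01:09,500
General Kenobi."""
print(extract_time_segments(srt))  # [(62100, 65100), (67000, 69500)]
```

Only the audio samples falling inside these segments then need to be handed to the spectrum analysis and speech recognition steps.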
It can be understood that there are many roles in television dubbing. The main characters usually have many lines, while there may also be many lines from roles that appear infrequently; if the user had to make a choice for every one of them, the user's operational burden would increase.

Step S203: perform spectrum analysis on the first audio data within the time segments, and categorize it to generate a role list.

In this embodiment, by performing spectrum analysis on the first audio data within the time segments, audio with similar spectra is identified using the spectrum range and spectrum amplitude and grouped into the same class to generate the role list.
In an optional embodiment, as shown in FIG. 4, step S203 may specifically include:

Step S2031: respectively acquire the first audio data in the first time segment and the second time segment.

Step S2032: determine whether the spectrum range and spectrum amplitude of the first audio data in the first time segment and the second time segment are consistent.

In this embodiment, taking two time segments, e.g., a first time segment and a second time segment, as an example, the server respectively acquires the first audio data in the two segments, analyzes the spectrum range and spectrum amplitude of the first audio data in each segment, and determines whether the spectrum range and spectrum amplitude of the first audio data in the first time segment are consistent with those of the first audio data in the second time segment.

Step S2033: if so, classify the first audio data in the first time segment and the second time segment as the same role.

Step S2034: if not, classify the first audio data in the first time segment and the second time segment as different roles.

In this embodiment, if the server determines they are consistent, the first audio data in the first and second time segments is classified as the same role; if they are inconsistent, it is classified as different roles.

It can be understood that, when judging whether the spectrum range and spectrum amplitude of the first audio data in two time segments are consistent, they may be deemed consistent when their similarity is greater than or equal to 90%. Of course, in other embodiments, the similarity threshold is not limited to this embodiment and may be chosen reasonably according to actual needs.

The audio spectrum of one of the time segments is taken as a reference and defined as role 1, and is then compared with the audio spectra of the subsequent time segments. If the features of two spectra are judged to be close, the audio in both time segments is classified as role 1; if they do not match, the audio in the later time segment is classified as role 2, and so on until the audio spectra of all time segments have been identified. Finally, the number of occurrences of each role is counted; the roles that occur more often are the main characters.

In this embodiment, the specific content of the audio data need not be recognized, because the content of the audio data is already provided in the subtitle data; this embodiment mainly performs audio spectrum analysis on the audio data within the timestamps. The pronunciation of each character differs in spectrum: for example, the spectrum of a male voice is concentrated mainly in the low-to-mid frequency region, while that of a female voice is concentrated in the mid-to-high frequency region. In addition, the spectral amplitudes at individual frequency points also differ between characters. Therefore, combining the spectrum range and spectrum amplitude suffices to distinguish the characters' speech.
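A rough sketch of the spectrum-based role classification described above, assuming uncompressed mono samples. The coarse band-energy signature and cosine-similarity comparison stand in for the "spectrum range and spectrum amplitude" features, and the 0.9 threshold mirrors the 90% similarity criterion of this embodiment; none of these specific choices are prescribed by the application.

```python
import numpy as np

def spectral_signature(samples, bands=32):
    """Average magnitude spectrum folded into a few coarse bands:
    a crude stand-in for the spectrum range and amplitude features."""
    spectrum = np.abs(np.fft.rfft(samples))
    edges = np.linspace(0, len(spectrum), bands + 1, dtype=int)
    return np.array([spectrum[a:b].mean() for a, b in zip(edges[:-1], edges[1:])])

def same_role(sig_a, sig_b, threshold=0.9):
    """Classify two segments as the same role when the cosine similarity
    of their signatures reaches the threshold (90% in this embodiment)."""
    cos = np.dot(sig_a, sig_b) / (np.linalg.norm(sig_a) * np.linalg.norm(sig_b))
    return bool(cos >= threshold)

# Synthetic stand-ins for dubbing segments: two low-pitched voices and one
# high-pitched voice, one second each at a 16 kHz sample rate.
rate = 16000
t = np.arange(rate) / rate
low_voice_1 = np.sin(2 * np.pi * 150 * t) + 0.3 * np.sin(2 * np.pi * 300 * t)
low_voice_2 = np.sin(2 * np.pi * 160 * t) + 0.3 * np.sin(2 * np.pi * 320 * t)
high_voice = np.sin(2 * np.pi * 2400 * t)

a, b, c = (spectral_signature(x) for x in (low_voice_1, low_voice_2, high_voice))
print(same_role(a, b))  # similar low-pitched voices: same role
print(same_role(a, c))  # low vs. high pitch: different roles
```

In a full implementation each signature would be computed from the first audio data inside one subtitle time segment, and a running list of reference signatures (role 1, role 2, ...) would be kept as described above.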
Step S204: generate sample audio parameters corresponding to the role list by using speech synthesis technology.

In this embodiment, taking the read timestamp 00:01:02:100-00:01:05:100 as an example, this timestamp indicates that the audio data within this time segment contains a character's voice; performing speech recognition on the audio data of this time segment identifies the audio of one of the characters.

In this embodiment, the sample audio is preset fixed audio, which may include audio of different genders and sample audio for different pitch ranges of each gender. For example, a specific utterance such as "Do you want to use this voice as the dubbing for the character?" may be provided as male high-pitched, male mid-pitched, male low-pitched, female high-pitched, female mid-pitched, and female low-pitched sample audio. Of course, in other embodiments, the frequency range of the audio may be subdivided further and is not limited to the high, mid, and low ranges of this embodiment. In this embodiment, after receiving the role list and sample audio parameters sent by the server, the television terminal pops up a user interface for the user to make selections, where the role list is the role classification result described above, and the sample parameters are the timestamp parameters in each role class together with the sample audio available for the user to preview. Through the timestamp parameters, the user can preview the original dubbing as well as the sample audio.
In an embodiment, as shown in FIG. 5, based on the embodiment shown in FIG. 2, step S204 includes:

Step S2041: for each role in the role list, extract a predetermined number of subtitle timestamps from the subtitle data.

Step S2042: generate, through the text-to-speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, for sending to the television terminal for preview selection.

In this embodiment, the user can preview the character's original dubbing and the selectable sample audio. Upon receiving the user setting parameters, i.e., the sample audio selected by the user, the television terminal transmits the corresponding parameters to the server, and the server uses the text-to-speech engine to generate a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, for sending to the television terminal for preview selection.

For example, in the character audio recognition process, three role classes with similar pronunciation are obtained statistically. The server provides three timestamps for each role class and simultaneously sends the generated sample audio to the television terminal. The user can then use the timestamps provided for each role class to play the audio of the corresponding times on the television terminal, so as to identify the character that the role class represents. In addition, the user can preview and audition the sample audio produced by the text-to-speech engine to select and confirm suitable sample audio parameters.
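The role list and sample audio parameters exchanged in this example, three preview timestamps per role class plus the preset voice options, might look roughly like the following. All field names and values are hypothetical, since the application does not define a concrete message format.

```python
# Illustrative payload the server might send to the television terminal.
# Every key and value below is a hypothetical stand-in.
preview_payload = {
    "roles": [
        {"role": "role 1", "occurrences": 57,
         "preview_timestamps": ["00:01:02,100", "00:04:11,300", "00:09:48,000"]},
        {"role": "role 2", "occurrences": 34,
         "preview_timestamps": ["00:02:20,400", "00:05:02,250", "00:11:07,900"]},
        {"role": "role 3", "occurrences": 9,
         "preview_timestamps": ["00:03:15,000", "00:08:40,500", "00:12:33,100"]},
    ],
    "sample_audio": [
        {"id": "male_high"}, {"id": "male_mid"}, {"id": "male_low"},
        {"id": "female_high"}, {"id": "female_mid"}, {"id": "female_low"},
    ],
}

def user_setting_parameters(selection):
    """Map each role class to the sample audio chosen by the user;
    this is the 'user setting parameters' fed back to the server."""
    valid = {s["id"] for s in preview_payload["sample_audio"]}
    assert set(selection.values()) <= valid
    return [{"role": r, "voice": v} for r, v in selection.items()]

print(user_setting_parameters({"role 1": "male_low", "role 2": "female_mid"}))
```

Roles the user leaves unset could simply keep the original dubbing, which is consistent with only the main characters needing a selection.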
在一实施例中,如图6所示,在上述图1所示的基础上,所述步骤S30包括:In an embodiment, as shown in FIG. 6, on the basis of the foregoing FIG. 1, the step S30 includes:
步骤S301,将生成的所述角色列表和样例音频参数发送至所述电视终端;Step S301, the generated role list and sample audio parameters are sent to the television terminal;
步骤S302,接收所述电视终端反馈的用户设置参数;Step S302, receiving user setting parameters fed back by the television terminal;
本实施例中，服务器将生成的所述角色列表和样例音频参数发送至所述电视终端，电视终端通过在电视屏幕上呈现出用户界面，以供用户在用户界面上输入并选择角色列表和样例音频参数，从而生成所述用户设置参数，然后所述电视终端将所述用户设置参数反馈给所述服务器。In this embodiment, the server sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the television screen for the user to enter and select from the role list and the sample audio parameters, thereby generating the user setting parameters, which the television terminal then feeds back to the server.
步骤S303,对所述第一音频数据进行音频过滤,通过文本语音引擎并结合所述用户设置参数,合成与所述角色列表对应的所述第二音频数据。Step S303, performing audio filtering on the first audio data, and synthesizing the second audio data corresponding to the role list by using a text speech engine and combining the user setting parameters.
本实施例中，参照图7，可以根据所述文本语音引擎产生对应所述字幕数据的新的音频数据(具体根据用户的设置参数而不同)，并根据所述人声消除程序将所述第一音频数据进行人声消除，然后将所述新的音频数据与经人声消除的第一音频数据合成为与所述角色列表对应的第二音频数据。In this embodiment, referring to FIG. 7, new audio data corresponding to the subtitle data may be generated by the text-to-speech engine (varying with the user's setting parameters), the first audio data may be subjected to vocal cancellation by the vocal-cancellation program, and the new audio data may then be combined with the vocal-cancelled first audio data into the second audio data corresponding to the role list.
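The synthesis step described above can be sketched roughly as follows. This is a minimal illustration, not the patent's implementation: `tts_engine` and `remove_vocals` are hypothetical stand-ins for the text-to-speech engine and the vocal-cancellation program, and the `(start_sec, end_sec, text)` cue format is an assumption.

```python
import numpy as np

def synthesize_second_audio(first_audio, sr, cues, tts_engine, remove_vocals):
    """Overlay TTS speech on the vocal-suppressed original track.

    `cues` is a list of (start_sec, end_sec, text) subtitle cues;
    `tts_engine(text, sr)` and `remove_vocals(segment, sr)` are
    hypothetical callables standing in for the text-to-speech engine
    and the vocal-cancellation program.
    """
    out = first_audio.copy()
    for start, end, text in cues:
        a, b = int(start * sr), int(end * sr)
        out[a:b] = remove_vocals(out[a:b], sr)   # suppress original dialogue
        speech = tts_engine(text, sr)            # new dialogue audio
        n = min(len(speech), b - a)
        out[a:a + n] += speech[:n]               # mix within the cue window only
    return out
```

Only samples inside each cue window are touched, matching the statement that audio outside the timestamp ranges is unaffected.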
其中，现有人声消除方法主要是利用左右两个声道中人声发音相同的特点，将左右两个声道进行相减，从而去除两个声道中相同的部分，但这种方法不仅对背景声造成较大的损失（特别在低频部分），而且在人声发音在两个声道中不相同时，无法很好地消除人声。本申请采用带通滤波器的方法，在带通滤波器的频带范围内，只需达到降低原发音的幅度，不影响合成音频的辨别即可，从而可以较好地保留低频及高频部分。此外，对时间戳范围外的音频数据也没有造成任何影响。Existing vocal-cancellation methods mainly exploit the fact that the vocals sound the same in the left and right channels: subtracting one channel from the other removes their common component. However, this approach causes a considerable loss of background sound (especially in the low-frequency band), and it cannot remove the vocals well when they differ between the two channels. The present application instead uses a band-pass filter: within the filter's pass band, it only needs to lower the amplitude of the original dialogue enough that it does not interfere with the intelligibility of the synthesized audio, so the low-frequency and high-frequency components are well preserved. Moreover, audio data outside the timestamp ranges is left entirely unaffected.
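A band-pass attenuation of the kind described can be sketched with a simple FFT mask. The band limits (300–3400 Hz, a common telephony voice band) and the attenuation gain below are illustrative assumptions; the patent does not specify concrete values:

```python
import numpy as np

def attenuate_voice_band(segment, sr, low_hz=300.0, high_hz=3400.0, gain=0.2):
    """Reduce the amplitude of the (assumed) dialogue band only.

    Unlike channel subtraction, frequencies below `low_hz` and above
    `high_hz` pass through untouched, so bass and treble survive.
    The gain lowers, rather than zeroes, the band: the original voice
    only needs to be quiet enough not to mask the synthesized speech.
    """
    spectrum = np.fft.rfft(segment)
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    spectrum[band] *= gain
    return np.fft.irfft(spectrum, n=len(segment))
```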
本发明还提供一种服务器1,参照图8,在一实施例中,所述服务器1包括:The present invention also provides a server 1. Referring to FIG. 8, in an embodiment, the server 1 includes:
第一接收模块10,用于接收电视终端发送的第一音频数据和字幕数据; The first receiving module 10 is configured to receive first audio data and subtitle data sent by the television terminal;
本实施例中，电视的音频和视频播放，由电视终端与服务器1协作完成，所述电视终端完成音频、字幕等数据的整理与传输，并提供用户界面以供用户进行参数设置。而服务器1接收电视终端发送的音频、字幕等数据，并完成音频、字幕数据的处理，以合成音频数据后传输给电视终端进行显示。In this embodiment, the audio and video playback of the television is accomplished by the television terminal in cooperation with the server 1. The television terminal handles the collation and transmission of audio, subtitle, and related data, and provides a user interface for the user to set parameters. The server 1 receives the audio and subtitle data sent by the television terminal, processes the audio and subtitle data, synthesizes the audio data, and transmits the result to the television terminal for presentation.
本实施例中，用户通过电视终端的遥控器开启配音设置功能时，电视终端则对视频文件进行快速解码，提取出用户选择的音频或默认音频及用户选择的字幕或默认字幕，并将音频数据及字幕数据打包发送到服务器1。In this embodiment, when the user enables the dubbing setting function via the television terminal's remote control, the television terminal quickly decodes the video file, extracts the user-selected audio (or the default audio) and the user-selected subtitles (or the default subtitles), and packages the audio data and subtitle data for transmission to the server 1.
生成处理模块20,用于对所述第一音频数据和字幕数据进行识别处理,生成角色列表和样例音频参数;The generating processing module 20 is configured to perform recognition processing on the first audio data and the caption data to generate a role list and sample audio parameters;
所述服务器1对所述第一音频数据和字幕数据进行识别处理，生成角色列表和样例音频参数。其中，所述角色列表的生成，可以先选取预定数量的时间戳，如选取三段时间戳，然后分别对所述三段时间戳内的音频数据进行识别分析，并将各时间戳内发音类似的语音归为一类即角色1、角色2等不同的角色。具体归类方法，可以根据音频频谱来进行区分统计。而样例音频是预设的固定音频，可为不同性别的音频，以及不同性别不同频率对应的样例音频，例如，选取一段特定的音频“是否选择本段语音为人物角色的配音？”，并分别提供男声高音音频、男声中音音频、男声低音音频、女声高音音频、女声中音音频、女声低音音频等样例音频，当然，在其他实施例中，还可以将音频的频率范围进一步细分，并不局限于本实施例中的高、中、低三种音频范围。此外，所述样例音频还可以是著名配音人员或专业配音人员的音频。The server 1 performs recognition processing on the first audio data and the subtitle data to generate a role list and sample audio parameters. To generate the role list, a predetermined number of timestamps may first be selected, e.g. three timestamp segments; the audio data within those segments is then recognized and analyzed, and voices with similar pronunciation across the segments are grouped into one class each, i.e. distinct roles such as role 1, role 2, and so on. A concrete grouping method is to distinguish and tally the voices by their audio spectra. The sample audio consists of preset fixed clips, which may cover different genders and, for each gender, different pitch ranges. For example, a specific utterance such as "Do you want to use this voice as the dubbing for the character?" may be offered as male high-pitch, male mid-pitch, male low-pitch, female high-pitch, female mid-pitch, and female low-pitch sample audio. Of course, in other embodiments the frequency range of the audio may be subdivided further and is not limited to the high, middle, and low ranges of this embodiment. In addition, the sample audio may also be recordings of famous or professional voice actors.
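The preset sample audio described here (two genders, each in high, middle, and low registers) can be represented as a small parameter table. The field names and pitch values below are illustrative assumptions, not taken from the patent:

```python
# Illustrative preset table for sample audio parameters.
# The pitch values are assumed typical speaking fundamentals, not patent values.
SAMPLE_AUDIO_PRESETS = [
    {"id": 0, "gender": "male",   "register": "high", "pitch_hz": 180},
    {"id": 1, "gender": "male",   "register": "mid",  "pitch_hz": 140},
    {"id": 2, "gender": "male",   "register": "low",  "pitch_hz": 100},
    {"id": 3, "gender": "female", "register": "high", "pitch_hz": 280},
    {"id": 4, "gender": "female", "register": "mid",  "pitch_hz": 220},
    {"id": 5, "gender": "female", "register": "low",  "pitch_hz": 180},
]

def presets_for(gender):
    """Return the sample-audio options offered for one gender."""
    return [p for p in SAMPLE_AUDIO_PRESETS if p["gender"] == gender]
```

Other embodiments could simply extend this table with finer registers or named voice-actor presets.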
合成处理模块30，用于将所述角色列表和样例音频参数发送至所述电视终端，并在接收到所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数时，将所述第一音频数据合成为第二音频数据；a synthesis processing module 30, configured to send the role list and sample audio parameters to the television terminal, and to synthesize the first audio data into second audio data upon receiving the user setting parameters fed back by the television terminal according to the role list and sample audio parameters;
本实施例中，服务器1将生成的所述角色列表和样例音频参数发送至所述电视终端，所述电视终端在电视屏幕上呈现出用户界面，以供用户在用户界面上输入并选择角色列表和样例音频参数，从而生成所述用户设置参数，然后所述电视终端将所述用户设置参数反馈给所述服务器1。所述服务器1根据所述用户设置参数将所述第一音频数据合成为第二音频数据，其中，所述第二音频数据的合成过程，需要用到文本语音引擎和人声消除程序，具体可以根据所述文本语音引擎产生对应所述字幕数据的新的音频数据（具体根据用户设置参数而不同），并根据所述人声消除程序将所述第一音频数据进行人声消除，然后将所述新的音频数据与经人声消除的第一音频数据合成为第二音频数据。In this embodiment, the server 1 sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the television screen for the user to enter and select from the role list and the sample audio parameters, thereby generating the user setting parameters, which the television terminal then feeds back to the server 1. The server 1 synthesizes the first audio data into the second audio data according to the user setting parameters. This synthesis requires a text-to-speech engine and a vocal-cancellation program: new audio data corresponding to the subtitle data is generated by the text-to-speech engine (varying with the user setting parameters), the first audio data is subjected to vocal cancellation by the vocal-cancellation program, and the new audio data is then combined with the vocal-cancelled first audio data into the second audio data.
第一发送模块40,用于将所述第二音频数据发送至所述电视终端,以控制所述第二音频数据以及所述字幕数据在所述电视终端进行播放。The first sending module 40 is configured to send the second audio data to the television terminal to control the second audio data and the caption data to be played in the television terminal.
本实施例中，所述服务器1将合成的与所述角色列表对应的所述第二音频数据，发送至所述电视终端。可以理解的是，所述电视终端除了从所述视频文件中提取出音频数据、字幕数据外，还会从所述视频文件中提取出视频数据，此时，电视终端在接收到所述第二音频数据时，会将所述视频数据与所述第二音频数据以及字幕数据进行同步处理，最后进行播放。In this embodiment, the server 1 sends the synthesized second audio data corresponding to the role list to the television terminal. It can be understood that, in addition to extracting the audio data and subtitle data from the video file, the television terminal also extracts the video data from it; upon receiving the second audio data, the television terminal synchronizes the video data with the second audio data and the subtitle data, and finally plays them.
本发明提供的服务器1，首先通过接收电视终端发送的第一音频数据和字幕数据，并进行识别处理，以生成角色列表和样例音频参数，然后将所述角色列表和样例音频参数发送至所述电视终端，在接收到所述电视终端反馈的用户设置参数时，根据所述用户设置参数将所述第一音频数据合成为第二音频数据，最终将所述第二音频数据发送至所述电视终端，以控制所述第二音频数据以及所述字幕数据在所述电视终端进行播放。这样，可以根据不同用户的语言需求，对应提供可被用户理解的音频，还可以满足用户对人物对白的个性化要求，从而可以避免只能借助字幕来了解人物对白及剧情的缺陷，进而提高用户观看电视的体验感。The server 1 provided by the present invention first receives the first audio data and subtitle data sent by the television terminal and performs recognition processing on them to generate a role list and sample audio parameters; it then sends the role list and sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal, synthesizes the first audio data into second audio data according to those parameters; finally, it sends the second audio data to the television terminal, so as to control the playback of the second audio data and the subtitle data on the television terminal. In this way, audio understandable to each user can be provided according to that user's language needs, and the user's personalized preferences for character dialogue can be satisfied, avoiding the drawback of having to rely on subtitles alone to follow the dialogue and plot, and thereby improving the user's television-viewing experience.
在一实施例中,如图9所示,在上述图8所示的基础上,所述生成处理模块20包括:In an embodiment, as shown in FIG. 9, on the basis of the foregoing FIG. 8, the generation processing module 20 includes:
获取单元201,用于从所述字幕数据中提取出字幕时间戳;An obtaining unit 201, configured to extract a subtitle timestamp from the subtitle data;
查找单元202,用于根据所述字幕时间戳,查找出所述第一音频数据出现的时间片段;The searching unit 202 is configured to search, according to the subtitle timestamp, a time segment in which the first audio data appears;
本实施例中，参照图3，服务器1从所述字幕数据中提取出字幕时间戳，并根据所述字幕时间戳查找出角色配音出现的时间片段，并调用语音识别模块进行识别处理，统计出所述时间片段内出现频次较高的若干个音频数据供用户进行选择。In this embodiment, referring to FIG. 3, the server 1 extracts subtitle timestamps from the subtitle data, finds the time segments in which character dubbing appears according to those timestamps, and invokes the speech recognition module to perform recognition processing, tallying the several most frequently occurring voices within those time segments for the user to choose from.
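Extracting subtitle timestamps and the corresponding dubbing time segments can be sketched as follows, assuming SRT-style cues; the patent does not fix a particular subtitle format:

```python
import re

# Assumed SubRip (SRT) cue syntax: "HH:MM:SS,mmm --> HH:MM:SS,mmm".
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})")

def _to_sec(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def extract_segments(subtitle_text):
    """Return (start_sec, end_sec) pairs where dialogue audio appears."""
    segments = []
    for match in CUE_RE.finditer(subtitle_text):
        g = match.groups()
        segments.append((_to_sec(*g[:4]), _to_sec(*g[4:])))
    return segments
```

The resulting segments are exactly the windows the server would pass to speech recognition and, later, to vocal cancellation.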
可以理解的是，电视配音中存在很多个角色，而主要人物角色的配音通常较多，而那些出现频次较低的配音可能也较多，如果都由用户来选择，则会增加用户的操作负担。Understandably, a TV soundtrack contains many characters: the main characters usually have many dubbed lines, but there may also be many voices that appear only infrequently. If the user had to make a choice for every one of them, the user's operation burden would increase.
归类单元203,用于对所述时间片段内的所述第一音频数据进行频谱分析,并进行归类生成角色列表;The categorizing unit 203 is configured to perform spectrum analysis on the first audio data in the time segment, and perform categorization to generate a role list.
本实施例中，通过对所述时间片段内的第一音频数据进行频谱分析，利用频谱范围及频谱幅度，找出频谱接近的音频，并归为同一类而生成角色列表。In this embodiment, spectrum analysis is performed on the first audio data within the time segments; using the spectrum range and spectrum amplitude, audio with similar spectra is identified and grouped into the same class, thereby generating the role list.
在一可选实施例中,参照图10,所述归类单元203包括:In an optional embodiment, referring to FIG. 10, the categorizing unit 203 includes:
获取子单元2031,用于分别获取第一时间片段和第二时间片段内的第一音频数据;The obtaining subunit 2031 is configured to respectively acquire first audio data in the first time segment and the second time segment;
判断子单元2032,用于判断所述第一时间片段和第二时间片段内的第一音频数据的频谱范围及频谱幅度是否一致;a determining subunit 2032, configured to determine whether a spectrum range and a spectrum amplitude of the first audio data in the first time segment and the second time segment are consistent;
本实施例中，以两个时间片段如第一时间片段和第二时间片段为例，服务器分别获取第一时间片段和第二时间片段内的第一音频数据，并分别对第一时间片段和第二时间片段内的第一音频数据的频谱范围及频谱幅度进行分析，判断第一时间片段内的第一音频数据的频谱范围及频谱幅度是否与第二时间片段内的第一音频数据的频谱范围及频谱幅度一致。In this embodiment, taking two time segments, e.g. a first time segment and a second time segment, as an example, the server obtains the first audio data within the first and second time segments respectively, analyzes the spectrum range and spectrum amplitude of the first audio data in each segment, and determines whether the spectrum range and spectrum amplitude of the first audio data in the first time segment are consistent with those of the first audio data in the second time segment.
第一归类子单元2033，用于在判断所述第一时间片段和第二时间片段内的第一音频数据的频谱范围及频谱幅度一致时，则将所述第一时间片段和第二时间片段内的第一音频数据归类为同一角色；a first categorizing subunit 2033, configured to classify the first audio data in the first time segment and the second time segment as the same role when the spectrum ranges and spectrum amplitudes of the first audio data in the two segments are determined to be consistent;
第二归类子单元2034，用于在判断所述第一时间片段和第二时间片段内的第一音频数据的频谱范围和/或频谱幅度不一致时，则将所述第一时间片段和第二时间片段内的第一音频数据归类为不同角色。a second categorizing subunit 2034, configured to classify the first audio data in the first time segment and the second time segment as different roles when the spectrum ranges and/or spectrum amplitudes of the first audio data in the two segments are determined to be inconsistent.
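The consistency check performed by the two categorizing subunits can be sketched as follows. The spectral signature and the tolerance `tol` are illustrative assumptions, since the patent only says the spectrum range and amplitude must be "consistent" without quantifying it:

```python
import numpy as np

def spectral_signature(segment, sr):
    """Crude (low edge, high edge, peak amplitude) signature of a segment."""
    spec = np.abs(np.fft.rfft(segment))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / sr)
    active = freqs[spec > 0.1 * spec.max()]   # bins carrying significant energy
    return active.min(), active.max(), spec.max()

def same_role(seg_a, seg_b, sr, tol=0.25):
    """Treat two segments as the same role when range and amplitude agree.

    `tol` is an assumed relative tolerance for the comparison.
    """
    lo_a, hi_a, amp_a = spectral_signature(seg_a, sr)
    lo_b, hi_b, amp_b = spectral_signature(seg_b, sr)
    close = lambda x, y: abs(x - y) <= tol * max(abs(x), abs(y), 1e-9)
    return close(lo_a, lo_b) and close(hi_a, hi_b) and close(amp_a, amp_b)
```

Segments that pass the check would be merged under one role entry; segments that fail would open a new role in the list.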
生成单元204,用于利用语音合成技术,生成与所述角色列表对应的样例音频参数。The generating unit 204 is configured to generate a sample audio parameter corresponding to the character list by using a voice synthesis technology.
在一实施例中,如图11所示,在上述图9所示的基础上,所述生成单元204包括:In an embodiment, as shown in FIG. 11, on the basis of the foregoing FIG. 9, the generating unit 204 includes:
提取子单元2041，用于针对所述角色列表中的每个角色，从所述字幕数据中提取出预定数量的字幕时间戳；an extracting subunit 2041, configured to extract, for each role in the role list, a predetermined number of subtitle timestamps from the subtitle data;
生成子单元2042,用于通过文本语音引擎,对应所述预定数量的字幕时间戳生成预定数量的样例音频参数,以发至所述电视终端进行预览选择。The generating subunit 2042 is configured to generate, by the text speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, to send to the television terminal for preview selection.
在一实施例中,如图12所示,在上述图8所示的基础上,所述合成处理模块30包括:In an embodiment, as shown in FIG. 12, on the basis of the foregoing FIG. 8, the synthesis processing module 30 includes:
发送单元301,用于将生成的所述角色列表和样例音频参数发送至所述电视终端;The sending unit 301 is configured to send the generated role list and sample audio parameters to the television terminal;
接收单元302,用于接收所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数;The receiving unit 302 is configured to receive user setting parameters that are feedback by the television terminal according to the role list and the sample audio parameters.
本实施例中，服务器1将生成的所述角色列表和样例音频参数发送至所述电视终端，电视终端通过在电视屏幕上呈现出用户界面，以供用户在用户界面上输入并选择角色列表和样例音频参数，从而生成所述用户设置参数，然后所述电视终端将所述用户设置参数反馈给所述服务器1。In this embodiment, the server 1 sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the television screen for the user to enter and select from the role list and the sample audio parameters, thereby generating the user setting parameters, which the television terminal then feeds back to the server 1.
合成单元303,用于对所述第一音频数据进行音频过滤,通过文本语音引擎并结合所述用户设置参数,合成与所述角色列表对应的所述第二音频数据。The synthesizing unit 303 is configured to perform audio filtering on the first audio data, and synthesize the second audio data corresponding to the role list by using a text speech engine and combining the user setting parameters.
本实施例中，参照图7，具体可以根据所述文本语音引擎产生对应所述字幕数据的新的音频数据（具体根据用户的设置参数而不同），并根据所述人声消除程序将所述第一音频数据进行人声消除，然后将所述新的音频数据与经人声消除的第一音频数据合成为与所述角色列表对应的第二音频数据。In this embodiment, referring to FIG. 7, new audio data corresponding to the subtitle data may be generated by the text-to-speech engine (varying with the user's setting parameters), the first audio data may be subjected to vocal cancellation by the vocal-cancellation program, and the new audio data may then be combined with the vocal-cancelled first audio data into the second audio data corresponding to the role list.
其中，现有人声消除方法主要是利用左右两个声道中人声发音相同的特点，将左右两个声道进行相减，从而去除两个声道中相同的部分，但这种方法不仅对背景声造成较大的损失（特别在低频部分），而且在人声发音在两个声道中不相同时，无法很好地消除人声。本申请采用带通滤波器的方法，在带通滤波器的频带范围内，只需达到降低原发音的幅度，不影响合成音频的辨别即可，从而可以较好地保留低频及高频部分。此外，对时间戳范围外的音频数据也没有造成任何影响。Existing vocal-cancellation methods mainly exploit the fact that the vocals sound the same in the left and right channels: subtracting one channel from the other removes their common component. However, this approach causes a considerable loss of background sound (especially in the low-frequency band), and it cannot remove the vocals well when they differ between the two channels. The present application instead uses a band-pass filter: within the filter's pass band, it only needs to lower the amplitude of the original dialogue enough that it does not interfere with the intelligibility of the synthesized audio, so the low-frequency and high-frequency components are well preserved. Moreover, audio data outside the timestamp ranges is left entirely unaffected.
本发明还提供一种电视播放控制系统100，参照图13，在一实施例中，所述电视播放控制系统100包括电视终端2以及如上所述的服务器1，参照图14，所述电视终端2包括：The present invention also provides a television play control system 100. Referring to FIG. 13, in an embodiment, the television play control system 100 includes a television terminal 2 and the server 1 described above. Referring to FIG. 14, the television terminal 2 includes:
第二发送模块50,用于向服务器1发送第一音频数据和字幕数据;a second sending module 50, configured to send first audio data and caption data to the server 1;
第二接收模块60,用于接收所述服务器1对所述第一音频数据和字幕数据进行识别处理后,生成的角色列表和样例音频参数;The second receiving module 60 is configured to receive a role list and sample audio parameters generated by the server 1 after the first audio data and the caption data are identified and processed;
反馈模块70,用于根据所述角色列表和样例音频参数生成用户设置参数,并将所述用户设置参数反馈给所述服务器1;The feedback module 70 is configured to generate a user setting parameter according to the role list and the sample audio parameter, and feed back the user setting parameter to the server 1;
获取模块80，用于获取所述服务器1在接收到所述用户设置参数时，将所述第一音频数据合成的第二音频数据；an obtaining module 80, configured to obtain the second audio data that the server 1 synthesizes from the first audio data upon receiving the user setting parameters;
同步播放模块90,用于将所述第二音频数据、视频数据以及字幕数据进行同步播放。The synchronous play module 90 is configured to synchronously play the second audio data, the video data, and the caption data.
本实施例中，电视终端2在接收到所述服务器1合成的与所述角色列表对应的所述第二音频数据时，将所述第二音频数据与视频数据以及字幕数据进行同步处理后，最终进行播放，这样，通过服务器1对视频文件的音频进行预处理，合成用户可以理解的语言，可以增强用户的观看体验；此外，还可以为用户提供多种角色音频的选择，从而进一步增强了用户体验感。In this embodiment, upon receiving the second audio data corresponding to the role list synthesized by the server 1, the television terminal 2 synchronizes the second audio data with the video data and subtitle data and finally plays them. In this way, the server 1 preprocesses the audio of the video file and synthesizes it into a language the user can understand, which enhances the viewing experience; in addition, the user is offered a choice among multiple character voices, further enhancing the user experience.
本实施例原理,请参照上述各实施例,在此不再赘述。For the principle of the embodiment, please refer to the foregoing embodiments, and details are not described herein again.
以上仅为本发明的优选实施例，并非因此限制本发明的专利范围，凡是利用本发明说明书及附图内容所作的等效结构或等效流程变换，或直接或间接运用在其他相关的技术领域，均同理包括在本发明的专利保护范围内。The above are merely preferred embodiments of the present invention and do not therefore limit its patent scope; any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present invention, whether applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of the present invention.

Claims (14)

  1. 一种电视播放控制方法,其特征在于,所述电视播放控制方法包括以下步骤:A television broadcast control method, characterized in that the television broadcast control method comprises the following steps:
    服务器接收电视终端发送的第一音频数据和字幕数据; Receiving, by the server, first audio data and subtitle data sent by the television terminal;
    对所述第一音频数据和字幕数据进行识别处理,生成角色列表和样例音频参数; Performing recognition processing on the first audio data and the caption data to generate a character list and sample audio parameters;
    将所述角色列表和样例音频参数发送至所述电视终端，并在接收到所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数时，将所述第一音频数据合成为第二音频数据;Transmitting the role list and sample audio parameters to the television terminal, and synthesizing the first audio data into second audio data upon receiving the user setting parameters fed back by the television terminal according to the role list and sample audio parameters;
    将所述第二音频数据发送至所述电视终端,以控制所述第二音频数据以及所述字幕数据在所述电视终端进行播放;Transmitting the second audio data to the television terminal to control the second audio data and the subtitle data to be played at the television terminal;
    其中,所述对所述第一音频数据和字幕数据进行识别处理,生成角色列表和样例音频参数的步骤包括:The step of performing the identification process on the first audio data and the caption data to generate a role list and sample audio parameters includes:
    所述服务器从所述字幕数据中提取出字幕时间戳;The server extracts a subtitle timestamp from the subtitle data;
    根据所述字幕时间戳,查找出所述第一音频数据出现的时间片段;And finding, according to the subtitle timestamp, a time segment in which the first audio data appears;
    对所述时间片段内的所述第一音频数据进行频谱分析,并进行归类生成角色列表;Performing spectrum analysis on the first audio data in the time segment, and performing categorization to generate a role list;
    利用语音合成技术,生成与所述角色列表对应的样例音频参数;Generating a sample audio parameter corresponding to the character list by using a speech synthesis technology;
    其中,所述电视终端从视频文件中提取出所述第一音频数据和所述字幕数据,并将所述第一音频数据和字幕数据发送至所述服务器。The television terminal extracts the first audio data and the subtitle data from a video file, and sends the first audio data and subtitle data to the server.
  2. 如权利要求1所述的电视播放控制方法,其特征在于,所述利用语音合成技术,生成与所述角色列表对应的样例音频参数的步骤包括:The television broadcast control method according to claim 1, wherein the step of generating a sample audio parameter corresponding to the character list by using a speech synthesis technology comprises:
    针对所述角色列表中的每个角色,从所述字幕数据中提取出预定数量的字幕时间戳;Extracting a predetermined number of subtitle timestamps from the subtitle data for each character in the role list;
    通过文本语音引擎,对应所述预定数量的字幕时间戳生成预定数量的样例音频参数,以发至所述电视终端进行预览选择。A predetermined number of sample audio parameters are generated by the text-to-speech engine corresponding to the predetermined number of subtitle timestamps for transmission to the television terminal for preview selection.
  3. 如权利要求1所述的电视播放控制方法，其特征在于，所述将所述角色列表和样例音频参数发送至所述电视终端，并在接收到所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数时，将所述第一音频数据合成为第二音频数据的步骤包括：The television play control method according to claim 1, wherein the step of sending the role list and sample audio parameters to the television terminal and, upon receiving the user setting parameters fed back by the television terminal according to the role list and sample audio parameters, synthesizing the first audio data into second audio data comprises:
    将生成的所述角色列表和样例音频参数发送至所述电视终端;Sending the generated role list and sample audio parameters to the television terminal;
    接收所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数;Receiving user setting parameters that are feedback by the television terminal according to the role list and the sample audio parameters;
    对所述第一音频数据进行音频过滤,通过文本语音引擎并结合所述用户设置参数,合成与所述角色列表对应的所述第二音频数据;Performing audio filtering on the first audio data, synthesizing the second audio data corresponding to the role list by using a text speech engine and combining the user setting parameters;
    其中,所述电视终端接收用户通过用户界面选择的角色列表和样例音频参数,以生成所述用户设置参数,并将所述用户设置参数反馈给所述服务器。The television terminal receives a role list and a sample audio parameter selected by the user through the user interface to generate the user setting parameter, and feeds back the user setting parameter to the server.
  4. 如权利要求1所述的电视播放控制方法,其特征在于,所述对所述时间片段内的所述第一音频数据进行频谱分析,并进行归类生成角色列表的步骤包括:The television broadcast control method according to claim 1, wherein the step of performing spectrum analysis on the first audio data in the time segment and performing categorization to generate a role list comprises:
    分别获取第一时间片段和第二时间片段内的第一音频数据;Acquiring first audio data in the first time segment and the second time segment, respectively;
    判断所述第一时间片段和第二时间片段内的第一音频数据的频谱范围及频谱幅度是否一致;Determining whether a spectrum range and a spectrum amplitude of the first audio data in the first time segment and the second time segment are consistent;
    若是,则将所述第一时间片段和第二时间片段内的第一音频数据归类为同一角色;If yes, classifying the first audio data in the first time segment and the second time segment into the same role;
    若否,则将所述第一时间片段和第二时间片段内的第一音频数据归类为不同角色。If not, the first audio data in the first time segment and the second time segment are classified into different roles.
  5. 一种服务器,其特征在于,所述服务器包括:A server, wherein the server comprises:
    第一接收模块,用于接收电视终端发送的第一音频数据和字幕数据; a first receiving module, configured to receive first audio data and subtitle data sent by the television terminal;
    生成处理模块,用于对所述第一音频数据和字幕数据进行识别处理,生成角色列表和样例音频参数; a generating processing module, configured to perform recognition processing on the first audio data and the caption data, and generate a role list and sample audio parameters;
    合成处理模块，用于将所述角色列表和样例音频参数发送至所述电视终端，并在接收到所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数时，将所述第一音频数据合成为第二音频数据；a synthesis processing module, configured to send the role list and sample audio parameters to the television terminal, and to synthesize the first audio data into second audio data upon receiving the user setting parameters fed back by the television terminal according to the role list and sample audio parameters;
    第一发送模块,用于将所述第二音频数据发送至所述电视终端,以控制所述第二音频数据以及所述字幕数据在所述电视终端进行播放。And a first sending module, configured to send the second audio data to the television terminal to control the second audio data and the subtitle data to be played at the television terminal.
  6. 如权利要求5所述的服务器,其特征在于,所述生成处理模块包括:The server according to claim 5, wherein the generation processing module comprises:
    获取单元,用于从所述字幕数据中提取出字幕时间戳;An obtaining unit, configured to extract a subtitle timestamp from the subtitle data;
    查找单元,用于根据所述字幕时间戳,查找出所述第一音频数据出现的时间片段;a searching unit, configured to find a time segment in which the first audio data appears according to the subtitle time stamp;
    归类单元,用于对所述时间片段内的所述第一音频数据进行频谱分析,并进行归类生成角色列表;a categorizing unit, configured to perform spectrum analysis on the first audio data in the time segment, and perform categorization to generate a role list;
    生成单元,用于利用语音合成技术,生成与所述角色列表对应的样例音频参数;a generating unit, configured to generate a sample audio parameter corresponding to the role list by using a voice synthesis technology;
    其中,所述电视终端从视频文件中提取出所述第一音频数据和所述字幕数据,并将所述第一音频数据和字幕数据发送至所述服务器。The television terminal extracts the first audio data and the subtitle data from a video file, and sends the first audio data and subtitle data to the server.
  7. 如权利要求6所述的服务器,其特征在于,所述生成单元包括:The server according to claim 6, wherein the generating unit comprises:
    提取子单元，用于针对所述角色列表中的每个角色，从所述字幕数据中提取出预定数量的字幕时间戳；an extracting subunit, configured to extract, for each role in the role list, a predetermined number of subtitle timestamps from the subtitle data;
    生成子单元,用于通过文本语音引擎,对应所述预定数量的字幕时间戳生成预定数量的样例音频参数,以发至所述电视终端进行预览选择。And generating a subunit, configured to generate, by the text and speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, to send to the television terminal for preview selection.
  8. 如权利要求6所述的服务器,其特征在于,所述合成处理模块包括:The server according to claim 6, wherein said synthesis processing module comprises:
    发送单元,用于将生成的所述角色列表和样例音频参数发送至所述电视终端;a sending unit, configured to send the generated role list and sample audio parameters to the television terminal;
    接收单元,用于接收所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数;a receiving unit, configured to receive user setting parameters that are feedback by the television terminal according to the role list and sample audio parameters;
    合成单元,用于对所述第一音频数据进行音频过滤,通过文本语音引擎并结合所述用户设置参数,合成与所述角色列表对应的所述第二音频数据;a synthesizing unit, configured to perform audio filtering on the first audio data, and synthesize the second audio data corresponding to the role list by using a text speech engine and combining the user setting parameters;
    其中,所述电视终端接收用户通过用户界面选择的角色列表和样例音频参数,以生成所述用户设置参数,并将所述用户设置参数反馈给所述服务器。The television terminal receives a role list and a sample audio parameter selected by the user through the user interface to generate the user setting parameter, and feeds back the user setting parameter to the server.
  9. 如权利要求6所述的服务器,其特征在于,所述归类单元包括:The server according to claim 6, wherein said categorizing unit comprises:
    获取子单元,用于分别获取第一时间片段和第二时间片段内的第一音频数据;Obtaining a subunit, configured to respectively acquire first audio data in the first time segment and the second time segment;
    判断子单元,用于判断所述第一时间片段和第二时间片段内的第一音频数据的频谱范围及频谱幅度是否一致;a determining subunit, configured to determine whether a spectrum range and a spectrum amplitude of the first audio data in the first time segment and the second time segment are consistent;
    第一归类子单元，用于在判断所述第一时间片段和第二时间片段内的第一音频数据的频谱范围及频谱幅度一致时，将所述第一时间片段和第二时间片段内的第一音频数据归类为同一角色；a first categorizing subunit, configured to classify the first audio data in the first time segment and the second time segment as the same role when the spectrum ranges and spectrum amplitudes of the first audio data in the two segments are determined to be consistent;
    第二归类子单元，用于在判断所述第一时间片段和第二时间片段内的第一音频数据的频谱范围和/或频谱幅度不一致时，将所述第一时间片段和第二时间片段内的第一音频数据归类为不同角色。a second categorizing subunit, configured to classify the first audio data in the first time segment and the second time segment as different roles when the spectrum ranges and/or spectrum amplitudes of the first audio data in the two segments are determined to be inconsistent.
  10. 一种电视播放控制系统,其特征在于,所述电视播放控制系统包括电视终端以及服务器,所述电视终端包括:A television broadcast control system, comprising: a television terminal and a server, the television terminal comprising:
    第二发送模块,用于向服务器发送第一音频数据和字幕数据;a second sending module, configured to send first audio data and subtitle data to the server;
    第二接收模块,用于接收所述服务器对所述第一音频数据和字幕数据进行识别处理后,生成的角色列表和样例音频参数;a second receiving module, configured to receive a role list and sample audio parameters generated by the server after the first audio data and the caption data are identified and processed;
    反馈模块,用于根据所述角色列表和样例音频参数生成用户设置参数,并将所述用户设置参数反馈给所述服务器;a feedback module, configured to generate a user setting parameter according to the role list and the sample audio parameter, and feed back the user setting parameter to the server;
    获取模块，用于获取所述服务器在接收到所述用户设置参数时，将所述第一音频数据合成的第二音频数据；an obtaining module, configured to obtain the second audio data that the server synthesizes from the first audio data upon receiving the user setting parameters;
    同步播放模块,用于将所述第二音频数据、视频数据以及字幕数据进行同步播放;a synchronous play module, configured to synchronously play the second audio data, the video data, and the caption data;
    其中,所述电视终端从视频文件中提取出所述视频数据、所述第一音频数据以及所述字幕数据;The television terminal extracts the video data, the first audio data, and the caption data from a video file;
    所述服务器包括:The server includes:
    第一接收模块,用于接收电视终端发送的第一音频数据和字幕数据; a first receiving module, configured to receive first audio data and subtitle data sent by the television terminal;
    生成处理模块,用于对所述第一音频数据和字幕数据进行识别处理,生成角色列表和样例音频参数; a generating processing module, configured to perform recognition processing on the first audio data and the caption data, and generate a role list and sample audio parameters;
    合成处理模块,用于将所述角色列表和样例音频参数发送至所述电视终端,并在接收到所述电视终端根据所述角色列表和样例音频参数反馈的用户设置参数时,将所述第一音频数据合成为第二音频数据;a synthesis processing module, configured to send the role list and sample audio parameters to the television terminal, and when receiving the user setting parameter that is reported by the television terminal according to the role list and the sample audio parameter, The first audio data is synthesized into the second audio data;
    第一发送模块,用于将所述第二音频数据发送至所述电视终端,以控制所述第二音频数据以及所述字幕数据在所述电视终端进行播放。And a first sending module, configured to send the second audio data to the television terminal to control the second audio data and the subtitle data to be played at the television terminal.
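The terminal-server exchange recited in claim 10 can be illustrated with a short sketch. This is not the patent's implementation: every class, method, and field name below is hypothetical, and real audio decoding, speech synthesis, and playback are replaced by plain Python values so the round trip (send audio and subtitles, receive role list and samples, feed back settings, receive second audio, play synchronously) is visible.

```python
# Minimal sketch (all names hypothetical) of the claim-10 message exchange
# between the television terminal and the server.

class Server:
    def identify(self, first_audio, subtitles):
        """Recognition processing: build a role list and sample audio parameters."""
        roles = sorted({cue["role"] for cue in subtitles})
        samples = {role: {"pitch": 1.0, "rate": 1.0} for role in roles}
        return roles, samples

    def synthesize(self, first_audio, user_settings):
        """Synthesize the second audio data from the first, per user settings."""
        return [{"chunk": chunk, "voice": user_settings} for chunk in first_audio]

class TelevisionTerminal:
    def __init__(self, server):
        self.server = server

    def play(self, video, first_audio, subtitles):
        roles, samples = self.server.identify(first_audio, subtitles)      # send + receive
        user_settings = {role: samples[role] for role in roles}            # feedback module
        second_audio = self.server.synthesize(first_audio, user_settings)  # acquiring module
        return {"video": video, "audio": second_audio, "subtitles": subtitles}
```

In this sketch the terminal accepts the server's defaults unchanged; in the claimed system the user setting parameters come from a user-interface selection.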
  11. The television play control system according to claim 10, characterized in that the generation processing module comprises:
    an acquiring unit, configured to extract subtitle timestamps from the subtitle data;
    a searching unit, configured to find, according to the subtitle timestamps, the time segments in which the first audio data appears;
    a classifying unit, configured to perform spectrum analysis on the first audio data within the time segments and classify it to generate a role list;
    a generating unit, configured to generate, by using speech synthesis technology, sample audio parameters corresponding to the role list.
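The first two stages of claim 11, extracting subtitle timestamps and turning them into the time segments where speech occurs, can be sketched as follows. The SRT-style `HH:MM:SS,mmm --> HH:MM:SS,mmm` cue format is an assumption for illustration; the patent does not specify a subtitle format.

```python
# Hypothetical sketch: parse subtitle timestamps and derive the time segments
# in which the first audio data appears.
import re

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")

def to_ms(h, m, s, ms):
    """Convert one parsed timestamp to milliseconds."""
    return ((int(h) * 60 + int(m)) * 60 + int(s)) * 1000 + int(ms)

def speech_segments(subtitle_text):
    """Return (start_ms, end_ms) pairs, one per subtitle cue line."""
    segments = []
    for line in subtitle_text.splitlines():
        stamps = TIMESTAMP.findall(line)
        if len(stamps) == 2:  # a "start --> end" cue line
            segments.append((to_ms(*stamps[0]), to_ms(*stamps[1])))
    return segments
```

Each returned pair marks a stretch of the first audio data that the classifying unit would then analyze.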
  12. The television play control system according to claim 11, characterized in that the generating unit comprises:
    an extracting subunit, configured to extract, for each role in the role list, a predetermined number of subtitle timestamps from the subtitle data;
    a generating subunit, configured to generate, through a text-to-speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, to be sent to the television terminal for preview and selection.
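A sketch of the claim-12 subunits: a predetermined number of subtitle cues is taken per role, and a text-to-speech engine renders one preview sample per cue. `fake_tts` is a stand-in for a real TTS engine, and all names are illustrative rather than from the patent.

```python
# Hypothetical sketch: generate a fixed number of preview samples per role.

def fake_tts(text, voice="default"):
    """Stand-in TTS: returns a sample-audio-parameter record, not real audio."""
    return {"voice": voice, "text": text, "duration_ms": 80 * len(text)}

def preview_samples(cues_by_role, count=3, tts=fake_tts):
    """Generate up to `count` sample audio parameters per role for preview selection."""
    return {
        role: [tts(text) for _, text in cues[:count]]
        for role, cues in cues_by_role.items()
    }
```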
  13. The television play control system according to claim 11, characterized in that the synthesis processing module comprises:
    a sending unit, configured to send the generated role list and sample audio parameters to the television terminal;
    a receiving unit, configured to receive the user setting parameters fed back by the television terminal according to the role list and sample audio parameters;
    a synthesizing unit, configured to perform audio filtering on the first audio data and to synthesize, through a text-to-speech engine in combination with the user setting parameters, the second audio data corresponding to the role list;
    wherein the television terminal receives the role list and sample audio parameters selected by the user through a user interface, so as to generate the user setting parameters, and feeds the user setting parameters back to the server.
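At sample level, the synthesizing unit of claim 13 can be pictured as: within each dialogue segment the original voice is audio-filtered (here simply muted) and the TTS-rendered replacement is mixed in. Index-based segments and the function name are assumptions for illustration; real systems would filter rather than zero the original track.

```python
# Hypothetical sketch: filter the original dialogue out of each segment of the
# first audio data, then mix in the synthesized replacement voice.

def synthesize_second_audio(first_audio, segments, replacements):
    """first_audio: list of samples; segments: (start, end) index pairs;
    replacements: one list of synthesized samples per segment, same order."""
    second = list(first_audio)
    for (start, end), chunk in zip(segments, replacements):
        for i in range(start, min(end, len(second))):
            second[i] = 0.0                       # filter out the original voice
        for offset, sample in enumerate(chunk):
            if start + offset < len(second):
                second[start + offset] += sample  # mix in the synthesized voice
    return second
```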
  14. The television play control system according to claim 11, characterized in that the classifying unit comprises:
    an acquiring subunit, configured to respectively acquire the first audio data within a first time segment and a second time segment;
    a judging subunit, configured to judge whether the spectrum range and spectrum amplitude of the first audio data within the first time segment and the second time segment are consistent;
    a first classifying subunit, configured to classify the first audio data within the first time segment and the second time segment as the same role when the spectrum range and spectrum amplitude are judged to be consistent;
    a second classifying subunit, configured to classify the first audio data within the first time segment and the second time segment as different roles when the spectrum range and/or spectrum amplitude are judged to be inconsistent.
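The judging and classifying subunits of claim 14 can be sketched as: two time segments are assigned the same role only when both the spectrum range (the span of active frequency bins) and the spectrum amplitude (the peak magnitude) agree. The naive DFT, the noise floor, and the amplitude tolerance are illustrative assumptions, not values from the patent.

```python
# Hypothetical sketch: spectrum-based same-role judgment for two segments.
import cmath

def dft_magnitudes(samples):
    """Naive O(n^2) DFT; returns the magnitude of each positive-frequency bin."""
    n = len(samples)
    return [abs(sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t in range(n)))
            for k in range(1, n // 2)]

def spectral_features(samples, floor=0.05):
    """(range, amplitude): span of non-negligible bins and the peak magnitude."""
    mags = dft_magnitudes(samples)
    peak = max(mags)
    active = [k for k, m in enumerate(mags) if m > floor * peak]
    return (min(active), max(active)), peak

def same_role(segment_a, segment_b, amp_tol=0.5):
    """Same role iff the spectrum ranges match and amplitudes agree within amp_tol."""
    range_a, amp_a = spectral_features(segment_a)
    range_b, amp_b = spectral_features(segment_b)
    return range_a == range_b and abs(amp_a - amp_b) <= amp_tol * max(amp_a, amp_b)
```

A low-pitched and a high-pitched segment land in different frequency bins, so their spectrum ranges differ and they are classified as different roles.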
PCT/CN2016/084461 2015-09-29 2016-06-02 Television play control method, server and television play control system WO2017054488A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201510633934.3 2015-09-29
CN201510633934.3A CN105227966A (en) 2015-09-29 2015-09-29 Television play control method, server and television play control system

Publications (1)

Publication Number Publication Date
WO2017054488A1 2017-04-06

Family

ID=54996603

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2016/084461 WO2017054488A1 (en) 2015-09-29 2016-06-02 Television play control method, server and television play control system

Country Status (2)

Country Link
CN (1) CN105227966A (en)
WO (1) WO2017054488A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112714348A (en) * 2020-12-28 2021-04-27 深圳市亿联智能有限公司 Intelligent audio and video synchronization method

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105227966A (en) 2015-09-29 2016-01-06 深圳Tcl新技术有限公司 Television play control method, server and television play control system
CN107659850B (en) * 2016-11-24 2019-09-17 腾讯科技(北京)有限公司 Media information processing method and device
CN107172449A (en) * 2017-06-19 2017-09-15 微鲸科技有限公司 Multi-medium play method, device and multimedia storage method
CN107484016A (en) * 2017-09-05 2017-12-15 深圳Tcl新技术有限公司 Video dubs switching method, television set and computer-readable recording medium
CN109242802B (en) * 2018-09-28 2021-06-15 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and computer readable medium
CN110366032B (en) * 2019-08-09 2020-12-15 腾讯科技(深圳)有限公司 Video data processing method and device and video playing method and device
CN113766288B (en) * 2021-08-04 2023-05-23 深圳Tcl新技术有限公司 Electric quantity prompting method, device and computer readable storage medium
CN114554285B (en) * 2022-02-25 2024-08-02 京东方科技集团股份有限公司 Video interpolation processing method, video interpolation processing device and readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1774715A (en) * 2003-04-14 2006-05-17 皇家飞利浦电子股份有限公司 System and method for performing automatic dubbing on an audio-visual stream
CN101189657A (en) * 2005-05-31 2008-05-28 皇家飞利浦电子股份有限公司 A method and a device for performing an automatic dubbing on a multimedia signal
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus
US20120105719A1 (en) * 2010-10-29 2012-05-03 Lsi Corporation Speech substitution of a real-time multimedia presentation
US20120259630A1 (en) * 2011-04-11 2012-10-11 Samsung Electronics Co., Ltd. Display apparatus and voice conversion method thereof
CN105227966A (en) * 2015-09-29 2016-01-06 深圳Tcl新技术有限公司 Television play control method, server and television play control system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1534595A (en) * 2003-03-28 2004-10-06 中颖电子(上海)有限公司 Speech sound change over synthesis device and its method



Also Published As

Publication number Publication date
CN105227966A (en) 2016-01-06

Similar Documents

Publication Publication Date Title
WO2017054488A1 (en) Television play control method, server and television play control system
WO2014107101A1 (en) Display apparatus and method for controlling the same
WO2014003283A1 (en) Display apparatus, method for controlling display apparatus, and interactive system
WO2017143692A1 (en) Smart television and voice control method therefor
WO2018043991A1 (en) Speech recognition method and apparatus based on speaker recognition
WO2017177524A1 (en) Audio and video playing synchronization method and device
WO2017160073A1 (en) Method and device for accelerated playback, transmission and storage of media files
WO2014107097A1 (en) Display apparatus and method for controlling the display apparatus
WO2019080406A1 (en) Television voice interaction method, voice interaction control device and storage medium
WO2018032680A1 (en) Method and system for playing audio and video
WO2018006489A1 (en) Terminal voice interaction method and device
WO2014107102A1 (en) Display apparatus and method of controlling display apparatus
WO2017045441A1 (en) Smart television-based audio playback method and apparatus
WO2016032021A1 (en) Apparatus and method for recognizing voice commands
WO2016091011A1 (en) Subtitle switching method and device
WO2017005066A1 (en) Method and apparatus for recording audio and video synchronization timestamp
WO2019051902A1 (en) Terminal control method, air conditioner and computer-readable storage medium
WO2018028124A1 (en) Television set and signal source switching method thereof
WO2021261830A1 (en) Video quality assessment method and apparatus
WO2019114127A1 (en) Voice output method and device for air conditioner
WO2017020649A1 (en) Audio/video playback control method and device thereof
WO2019085543A1 (en) Television system and television control method
WO2018233221A1 (en) Multi-window sound output method, television, and computer-readable storage medium
WO2017121066A1 (en) Application program display method and system
WO2016095280A1 (en) Karaoke scoring method and device

Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 16850113; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 16850113; Country of ref document: EP; Kind code of ref document: A1)