CN105227966A - Television playback control method, server and television playback control system - Google Patents

Television playback control method, server and television playback control system

Info

Publication number
CN105227966A
CN105227966A (application CN201510633934.3A)
Authority
CN
China
Prior art keywords
data
audio data
television terminal
audio
role list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510633934.3A
Other languages
Chinese (zh)
Inventor
戚炎兴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen TCL New Technology Co Ltd
Original Assignee
Shenzhen TCL New Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen TCL New Technology Co Ltd
Priority to CN201510633934.3A
Publication of CN105227966A
Priority to PCT/CN2016/084461 (published as WO2017054488A1)
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/233Processing of audio elementary streams
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/20Servers specifically adapted for the distribution of content, e.g. VOD servers; Operations thereof
    • H04N21/23Processing of content or additional data; Elementary server operations; Server middleware
    • H04N21/235Processing of additional data, e.g. scrambling of additional data or processing content descriptors
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4852End-user interface for client configuration for modifying audio parameters, e.g. switching between mono and stereo
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/47End-user applications
    • H04N21/485End-user interface for client configuration
    • H04N21/4856End-user interface for client configuration for language selection, e.g. for the menu or subtitles
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content

Abstract

The invention discloses a television playback control method comprising the following steps: a server receives first audio data and subtitle data sent by a television terminal; recognition processing is performed on the first audio data and subtitle data to generate a role list and sample audio parameters; the role list and sample audio parameters are sent to the television terminal, and when user-set parameters fed back by the television terminal according to the role list and sample audio parameters are received, the first audio data is synthesized into second audio data; the second audio data is sent to the television terminal, so that the second audio data and the subtitle data are played on the television terminal. The invention also discloses a server and a television playback control system. According to the language needs of different users, the invention can provide audio that the user understands, avoiding the drawback of having to follow character dialogue and the plot through subtitles alone, thereby improving the user's TV viewing experience.

Description

Television playback control method, server and television playback control system
Technical field
The present invention relates to television technology, and in particular to a television playback control method, a server, and a television playback control system.
Background
When playing a video file, current television terminals usually switch the character dubbing and subtitles according to the audio tracks and subtitle data in the video file, so that different users can conveniently select a language they understand for playback. However, this video playback approach has at least the following drawback:
Most video files may provide only one dubbing language while providing two or more subtitle languages. In this case, the user can only listen to the default dubbing provided in the video file, and when the user does not understand that default language, he or she can only follow the character dialogue and plot by watching the subtitles. This degrades the user's audiovisual experience.
The foregoing is provided only to assist in understanding the technical solution of the present invention and is not an admission that it constitutes prior art.
Summary of the invention
The main purpose of the present invention is to provide a television playback control method, a server, and a television playback control system that, according to the language needs of different users, provide audio the user understands, so as to avoid the drawback of having to follow character dialogue and the plot through subtitles alone and thereby improve the user's TV viewing experience.
To achieve the above object, the invention provides a television playback control method comprising the following steps:
a server receives first audio data and subtitle data sent by a television terminal;
recognition processing is performed on the first audio data and subtitle data to generate a role list and sample audio parameters;
the role list and sample audio parameters are sent to the television terminal, and when user-set parameters fed back by the television terminal according to the role list and sample audio parameters are received, the first audio data is synthesized into second audio data;
the second audio data is sent to the television terminal, so that the second audio data and the subtitle data are played on the television terminal.
Preferably, the step of performing recognition processing on the first audio data and subtitle data to generate the role list and sample audio parameters comprises:
the server extracts subtitle timestamps from the subtitle data;
according to the subtitle timestamps, the time segments in which the first audio data occurs are found;
spectrum analysis is performed on the first audio data within the time segments, and classification is carried out to generate the role list;
speech synthesis technology is used to generate the sample audio parameters corresponding to the role list;
wherein the television terminal extracts the first audio data and the subtitle data from a video file and sends the first audio data and subtitle data to the server.
Preferably, the step of using speech synthesis technology to generate the sample audio parameters corresponding to the role list comprises:
for each role in the role list, a predetermined number of subtitle timestamps is extracted from the subtitle data;
a text-to-speech engine generates a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, which are sent to the television terminal for preview and selection.
Preferably, the step of sending the role list and sample audio parameters to the television terminal and, when the user-set parameters fed back by the television terminal according to the role list and sample audio parameters are received, synthesizing the first audio data into second audio data comprises:
the generated role list and sample audio parameters are sent to the television terminal;
the user-set parameters fed back by the television terminal according to the role list and sample audio parameters are received;
audio filtering is performed on the first audio data, and the second audio data corresponding to the role list is synthesized by a text-to-speech engine in combination with the user-set parameters;
wherein the television terminal receives the role list entries and sample audio parameters selected by the user through a user interface, so as to generate the user-set parameters, and feeds the user-set parameters back to the server.
Preferably, the step of performing spectrum analysis on the first audio data within the time segments and carrying out classification to generate the role list comprises:
the first audio data in a first time segment and in a second time segment is obtained respectively;
whether the spectral range and spectral amplitude of the first audio data in the first time segment and in the second time segment are consistent is judged;
if so, the first audio data in the first and second time segments is classified as the same role;
if not, the first audio data in the first and second time segments is classified as different roles.
In addition, to achieve the above object, the present invention also provides a server, the server comprising:
A first receiving module, for receiving the first audio data and subtitle data sent by a television terminal;
A generation processing module, for performing recognition processing on the first audio data and subtitle data to generate a role list and sample audio parameters;
A synthesis processing module, for sending the role list and sample audio parameters to the television terminal and, when user-set parameters fed back by the television terminal according to the role list and sample audio parameters are received, synthesizing the first audio data into second audio data;
A first sending module, for sending the second audio data to the television terminal, so that the second audio data and the subtitle data are played on the television terminal.
Preferably, the generation processing module comprises:
An acquisition unit, for extracting subtitle timestamps from the subtitle data;
A search unit, for finding, according to the subtitle timestamps, the time segments in which the first audio data occurs;
A classification unit, for performing spectrum analysis on the first audio data within the time segments and carrying out classification to generate the role list;
A generation unit, for generating, using speech synthesis technology, the sample audio parameters corresponding to the role list;
wherein the television terminal extracts the first audio data and the subtitle data from a video file and sends the first audio data and subtitle data to the server.
Preferably, the generation unit comprises:
An extraction subunit, for extracting, for each role in the role list, a predetermined number of subtitle timestamps from the subtitle data;
A generation subunit, for generating, by a text-to-speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, which are sent to the television terminal for preview and selection.
Preferably, the synthesis processing module comprises:
A sending unit, for sending the generated role list and sample audio parameters to the television terminal;
A receiving unit, for receiving the user-set parameters fed back by the television terminal according to the role list and sample audio parameters;
A synthesis unit, for performing audio filtering on the first audio data and synthesizing, by a text-to-speech engine in combination with the user-set parameters, the second audio data corresponding to the role list;
wherein the television terminal receives the role list entries and sample audio parameters selected by the user through a user interface, so as to generate the user-set parameters, and feeds the user-set parameters back to the server.
Preferably, the classification unit comprises:
An obtaining subunit, for obtaining the first audio data in a first time segment and in a second time segment respectively;
A judging subunit, for judging whether the spectral range and spectral amplitude of the first audio data in the first time segment and in the second time segment are consistent;
A first classification subunit, for classifying the first audio data in the first and second time segments as the same role when the spectral range and spectral amplitude of the first audio data in the two segments are judged consistent;
A second classification subunit, for classifying the first audio data in the first and second time segments as different roles when the spectral range and/or spectral amplitude of the first audio data in the two segments are judged inconsistent.
In addition, to achieve the above object, the present invention also provides a television playback control system comprising a television terminal and the server described above, the television terminal comprising:
A second sending module, for sending the first audio data and subtitle data to the server;
A second receiving module, for receiving the role list and sample audio parameters generated after the server performs recognition processing on the first audio data and subtitle data;
A feedback module, for generating user-set parameters according to the role list and sample audio parameters and feeding the user-set parameters back to the server;
An acquisition module, for obtaining the second audio data synthesized from the first audio data by the server upon receiving the user-set parameters;
A synchronous playback module, for synchronously playing the second audio data, video data, and subtitle data;
wherein the television terminal extracts the video data, the first audio data, and the subtitle data from a video file.
In the television playback control method, server, and television playback control system provided by the invention, the server first receives the first audio data and subtitle data sent by the television terminal and performs recognition processing on them to generate the role list and sample audio parameters; it then sends the role list and sample audio parameters to the television terminal and, upon receiving the user-set parameters fed back by the television terminal, synthesizes the first audio data into the second audio data according to those parameters; finally, it sends the second audio data to the television terminal so that the second audio data and subtitle data are played on the television terminal. In this way, audio the user understands can be provided according to the language needs of different users, and the user's individual preferences for character dialogue can be met, avoiding the drawback of having to follow character dialogue and the plot through subtitles alone and thereby improving the user's TV viewing experience.
Brief description of the drawings
Fig. 1 is a flowchart of an embodiment of the television playback control method of the present invention;
Fig. 2 is a detailed flowchart of the step in Fig. 1 of performing recognition processing on the first audio data and generating the role list and sample audio parameters;
Fig. 3 is a waveform diagram of the subtitle timestamps and the first audio data;
Fig. 4 is a detailed flowchart of the step in Fig. 2 of performing spectrum analysis on the first audio data within the time segments and carrying out classification to generate the role list;
Fig. 5 is a detailed flowchart of the step in Fig. 2 of using speech synthesis technology to generate the sample audio parameters corresponding to the role list;
Fig. 6 is a detailed flowchart of the step in Fig. 1 of sending the role list and sample audio parameters to the television terminal and, upon receiving the user-set parameters fed back by the television terminal, synthesizing the first audio data into second audio data according to those parameters;
Fig. 7 is a diagram of the synthesized waveform of the second audio data;
Fig. 8 is a functional block diagram of an embodiment of the server of the present invention;
Fig. 9 is a detailed functional block diagram of the generation processing module in Fig. 8;
Fig. 10 is a detailed functional block diagram of the classification unit in Fig. 9;
Fig. 11 is a detailed functional block diagram of the generation unit in Fig. 9;
Fig. 12 is a detailed functional block diagram of the synthesis processing module in Fig. 8;
Fig. 13 is a functional block diagram of an embodiment of the television playback control system of the present invention;
Fig. 14 is a detailed functional block diagram of the television terminal in Fig. 13.
The realization of the objects, functional characteristics, and advantages of the present invention will be further described with reference to the accompanying drawings in conjunction with the embodiments.
Detailed description of the embodiments
It should be understood that the specific embodiments described herein are intended only to explain the present invention and not to limit it.
The present invention provides a television playback control method. Referring to Fig. 1, in one embodiment the television playback control method comprises the following steps:
Step S10: the server receives the first audio data and subtitle data sent by the television terminal.
In this embodiment, audio and video playback on the television is accomplished cooperatively by the television terminal and the server. The television terminal handles the preparation and transmission of the audio, subtitle, and other data and provides a user interface for the user to set parameters; the server receives the audio, subtitle, and other data sent by the television terminal, processes the audio and subtitle data, and transfers the synthesized audio data back to the television terminal for playback.
In this embodiment, when the user opens the dubbing-settings function through the television terminal's remote controller, the television terminal rapidly decodes the video file, extracts the audio selected by the user (or the default audio) and the subtitles selected by the user (or the default subtitles), and packs the audio data and subtitle data to send to the server.
Step S20: recognition processing is performed on the first audio data and subtitle data to generate the role list and sample audio parameters.
The server performs recognition processing on the first audio data and subtitle data to generate the role list and sample audio parameters. To generate the role list, a predetermined number of timestamps can first be chosen, e.g. three timestamp segments; recognition analysis is then performed on the audio data within each of the three segments, and voices with similar pronunciation across the timestamps are grouped into one class, giving different roles such as role 1 and role 2. As a concrete classification method, differentiation statistics can be performed on the audio spectra. The sample audio is preset fixed audio, which may be audio of different genders, with sample audio for each gender at different frequencies: for example, a specific clip such as "Select this voice as the character's dubbing?" may be chosen and provided as high-, mid-, and low-pitched male voices and high-, mid-, and low-pitched female voices. Of course, in other embodiments the audio frequency ranges can be subdivided further and are not limited to the high, mid, and low ranges of this embodiment. In addition, the sample audio can also be the voice of a famous or professional dubbing artist.
Step S30: the role list and sample audio parameters are sent to the television terminal, and when the user-set parameters fed back by the television terminal are received, the first audio data is synthesized into second audio data.
In this embodiment, the server sends the generated role list and sample audio parameters to the television terminal. The television terminal presents a user interface on the TV screen, on which the user inputs and selects from the role list and sample audio parameters, thereby generating the user-set parameters; the television terminal then feeds the user-set parameters back to the server. The server synthesizes the first audio data into the second audio data according to the user-set parameters. The synthesis of the second audio data uses a text-to-speech engine and a voice removal routine: the text-to-speech engine produces new audio data corresponding to the subtitle data (varying with the user-set parameters), the voice removal routine removes the voices from the first audio data, and the new audio data is then combined with the voice-removed first audio data into the second audio data.
Step S40: the second audio data is sent to the television terminal, so that the second audio data and the subtitle data are played on the television terminal.
In this embodiment, the server sends the synthesized second audio data corresponding to the role list to the television terminal. It will be understood that, besides extracting the audio data and subtitle data from the video file, the television terminal can also extract video data from it; upon receiving the second audio data, the television terminal can then synchronize the video data with the second audio data and subtitle data and play them.
In the television playback control method provided by the invention, the server first receives the first audio data and subtitle data sent by the television terminal and performs recognition processing on them to generate the role list and sample audio parameters; it then sends the role list and sample audio parameters to the television terminal and, upon receiving the user-set parameters fed back by the television terminal, synthesizes the first audio data into the second audio data according to those parameters; finally, it sends the second audio data to the television terminal so that the second audio data and subtitle data are played on the television terminal. In this way, audio the user understands can be provided according to the language needs of different users, and the user's individual preferences for character dubbing can be met, avoiding the drawback of having to follow character dialogue and the plot through subtitles alone and thereby improving the user's TV viewing experience.
In one embodiment, as shown in Fig. 2, on the basis shown in Fig. 1 above, step S20 comprises:
Step S201: the server extracts the subtitle timestamps from the subtitle data.
Step S202: according to the subtitle timestamps, the time segments in which the first audio data occurs are found.
In this embodiment, referring to Fig. 3, the server extracts the subtitle timestamps from the subtitle data, finds according to them the time segments in which role dubbing occurs, and calls a voice recognition module to perform recognition processing, tallying the several most frequently occurring voices in those time segments for the user to choose from.
It will be understood that TV dubbing involves many roles: the main characters are dubbed most, and the less frequent roles may also be numerous. If the user had to select among all of them, the user's operating burden would increase.
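As an illustration of steps S201 and S202, the following minimal sketch, assuming SRT-style subtitle data (the embodiment does not fix a subtitle format), extracts the subtitle timestamps and turns each cue into a time segment in which dialogue occurs:

```python
import re

# Matches SRT-style cue times such as "00:01:02,100 --> 00:01:05,100".
CUE_RE = re.compile(
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})\s*-->\s*"
    r"(\d{2}):(\d{2}):(\d{2})[,.](\d{3})"
)

def to_seconds(h, m, s, ms):
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000.0

def extract_segments(subtitle_text):
    """Return (start, end) time segments, one per subtitle cue.

    Each segment is a span in which character dialogue occurs and in
    which the first audio data should therefore be analyzed.
    """
    segments = []
    for match in CUE_RE.finditer(subtitle_text):
        start = to_seconds(*match.groups()[:4])
        end = to_seconds(*match.groups()[4:])
        segments.append((start, end))
    return segments
```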
Step S203: spectrum analysis is performed on the first audio data within the time segments, and classification is carried out to generate the role list.
In this embodiment, spectrum analysis is performed on the first audio data within the time segments, and the spectral range and spectral amplitude are used to find audio with similar spectra, which is grouped into the same class to generate the role list.
In one embodiment, as shown in Fig. 4, step S203 can specifically comprise:
Step S2031: the first audio data in a first time segment and in a second time segment is obtained respectively.
Step S2032: whether the spectral range and spectral amplitude of the first audio data in the first time segment and in the second time segment are consistent is judged.
In this embodiment, taking two time segments, a first time segment and a second time segment, as an example, the server obtains the first audio data in each of the two segments, analyzes the spectral range and spectral amplitude of the first audio data in each, and judges whether the spectral range and spectral amplitude of the first audio data in the first time segment are consistent with those of the first audio data in the second time segment.
Step S2033: if so, the first audio data in the first and second time segments is classified as the same role.
Step S2034: if not, the first audio data in the first and second time segments is classified as different roles.
In this embodiment, if the server judges them consistent, the first audio data in the first and second time segments is classified as the same role; if inconsistent, it is classified as different roles.
It will be understood that, in judging whether the spectral range and spectral amplitude of the first audio data in two time segments are consistent, the two may be judged consistent when their similarity is greater than or equal to 90%, for example. Of course, in other embodiments the similarity value is not limited to that of this embodiment and can be chosen reasonably according to actual needs.
The audio spectrum of one time segment is taken as a reference and defined as role 1, and it is then compared with the audio spectrum in each subsequent time segment: if the features of the two spectra are judged close, the audio in the two segments is classified as role 1; if the features of the two spectra do not match, the audio in the subsequent segment is classified as role 2, and so on until the audio spectra in all time segments have been identified. Finally, the number of occurrences of each role is tallied; the roles that occur more often are the main characters.
In this embodiment, there is no need to recognize the specific content of the audio data, because that content is already provided in the subtitle data; this embodiment mainly analyzes the audio spectra of the audio data within the timestamps. The pronunciation of different characters differs in spectrum: the spectrum of a male voice is concentrated mainly in the low-frequency region, while that of a female voice is concentrated in the mid-to-high-frequency region. In addition, the spectral amplitudes at individual frequency points differ between the voices of different roles. The spectral range and spectral amplitude can therefore be combined to distinguish the voices of different roles.
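The greedy, reference-based classification of steps S2031 to S2034 could be sketched as below. This is an assumption-laden illustration: a banded FFT magnitude spectrum stands in for the "spectral range and spectral amplitude" features, and cosine similarity against a 0.9 threshold stands in for the consistency judgment; the embodiment prescribes neither.

```python
import numpy as np

def segment_spectrum(samples, n_bands=32):
    """Average magnitude spectrum of one time segment, folded into bands."""
    mag = np.abs(np.fft.rfft(samples))
    bands = np.array_split(mag, n_bands)
    return np.array([b.mean() for b in bands])

def classify_roles(segment_audio, threshold=0.9):
    """Greedily group dialogue time segments into roles.

    segment_audio: list of 1-D float sample arrays, one per segment.
    Returns role labels (0, 1, ...) aligned with the input; counting
    label occurrences then identifies the main characters.
    """
    references = []  # one reference spectrum per role (role 1, role 2, ...)
    labels = []
    for samples in segment_audio:
        spec = segment_spectrum(samples)
        spec = spec / (np.linalg.norm(spec) + 1e-12)
        for role, ref in enumerate(references):
            if float(np.dot(spec, ref)) >= threshold:  # "consistent"
                labels.append(role)
                break
        else:  # no existing role matches, so a new role is created
            references.append(spec)
            labels.append(len(references) - 1)
    return labels
```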
Step S204: speech synthesis technology is used to generate the sample audio parameters corresponding to the role list.
In this embodiment, taking a read timestamp of 00:01:02:100 ~ 00:01:05:100 as an example, this timestamp indicates that the audio data within this time segment contains a character's voice; performing speech recognition on the audio data of this segment can identify the voice of one of the characters.
In this embodiment, the sample audio is preset fixed audio of different genders and frequencies, with high-, mid-, and low-pitched male and female voices reading a specific clip such as "Select this voice as the character's dubbing?"; in other embodiments the frequency ranges can be subdivided further and are not limited to the high, mid, and low ranges of this embodiment. After the server sends the role list and sample audio parameters to the television terminal, a user interface pops up for the user to make selections: the role list contains the role classification results tallied above, and the sample parameters refer to the timestamp parameters within each role class and the sample audio available for the user to preview. Through the timestamp parameters, the user can preview the original dubbing as well as the sample audio.
In one embodiment, as shown in Fig. 5, on the basis shown in Fig. 2 above, step S204 comprises:
Step S2041: for each role in the role list, a predetermined number of subtitle timestamps is extracted from the subtitle data.
Step S2042: a text-to-speech engine generates a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, which are sent to the television terminal for preview and selection.
In this embodiment, the user can preview a role's original dubbing and the selectable sample audio. When the television terminal receives the user-set parameters and the user's sample audio selection, it sends the corresponding parameters to the server; the server then uses the text-to-speech engine to generate the predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps and sends them to the television terminal for preview and selection.
For example, suppose the character audio recognition process tallies 3 role classes with similar pronunciation. The server provides 3 timestamps for each role class and at the same time sends the generated sample audio to the television terminal. The user can then, for each role class, select one of the provided timestamps so that the television terminal plays the audio of the corresponding time, enabling the user to identify which character the role class represents. In addition, the user can preview and audition the sample audio produced by the text-to-speech engine in order to select and confirm suitable sample audio parameters.
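For concreteness, the role list and sample audio parameters exchanged here, and the user-set parameters fed back, might be structured as in the following sketch; every field name is hypothetical, since the embodiment does not define a message format:

```python
# What the server might send to the television terminal (hypothetical layout).
role_list_message = {
    "roles": [
        {
            "role_id": 1,
            # Three subtitle timestamps per role class, so the user can
            # preview the original dubbing of that role on the terminal.
            "preview_timestamps": ["00:01:02,100", "00:05:11,040", "00:12:30,500"],
            # Selectable TTS sample audio: high/mid/low male and female voices.
            "sample_audio": ["male_high", "male_mid", "male_low",
                             "female_high", "female_mid", "female_low"],
        },
        # ... one entry per role class ...
    ],
}

# What the terminal might feed back once the user has chosen.
user_set_parameters = {
    "selections": [
        {"role_id": 1, "sample_audio": "female_mid"},
    ],
}
```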
In one embodiment, as shown in Fig. 6, on the basis shown in Fig. 1 above, step S30 comprises:
Step S301: the generated role list and sample audio parameters are sent to the television terminal.
Step S302: the user-set parameters fed back by the television terminal are received.
In this embodiment, the server sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the TV screen, on which the user inputs and selects from the role list and sample audio parameters, thereby generating the user-set parameters; the television terminal then feeds the user-set parameters back to the server.
Step S303: audio filtering is performed on the first audio data, and the second audio data corresponding to the role list is synthesized by the text-to-speech engine in combination with the user-set parameters.
In this embodiment, referring to Fig. 7, the text-to-speech engine produces new audio data corresponding to the subtitle data (varying with the user's parameter settings), the voice removal routine removes the voices from the first audio data, and the new audio data is then combined with the voice-removed first audio data into the second audio data corresponding to the role list.
Existing voice removal methods mainly exploit the fact that the voice is pronounced identically in the left and right channels: subtracting the two channels removes the parts they have in common. However, this not only causes considerable loss to the background sound (especially in the low-frequency part), but also fails to remove the voice well when its pronunciation differs between the two channels. The present application instead adopts a band-pass filter approach: within the frequency band of the band-pass filter, it is only necessary to reduce the amplitude of the original voice so far that it does not interfere with distinguishing the synthesized audio, which better preserves the low- and high-frequency parts. Moreover, the audio data outside the timestamps is not affected at all.
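The band-pass attenuation and the subsequent mixing of step S303 could look like the following sketch, assuming a speech band of roughly 300-3400 Hz and an attenuation factor chosen by ear (neither value is specified in the embodiment):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def attenuate_voice(audio, rate, segments, band=(300.0, 3400.0), keep=0.15):
    """Reduce the original dialogue only inside its frequency band and
    only within the subtitle time segments, so the background's low and
    high frequencies and all audio outside the timestamps pass through
    unchanged."""
    sos = butter(4, band, btype="bandpass", fs=rate, output="sos")
    out = np.asarray(audio, dtype=float).copy()
    for start, end in segments:
        i, j = int(start * rate), int(end * rate)
        voice_band = sosfiltfilt(sos, out[i:j])
        # Subtract most of the voice-band signal instead of subtracting
        # whole channels, so the background sound survives.
        out[i:j] -= (1.0 - keep) * voice_band
    return out

def synthesize_second_audio(first_audio, rate, segments, tts_clips):
    """Overlay the TTS dubbing (one clip per dialogue segment) onto the
    voice-attenuated first audio data to form the second audio data."""
    out = attenuate_voice(first_audio, rate, segments)
    for (start, _end), clip in zip(segments, tts_clips):
        i = int(start * rate)
        n = min(len(clip), len(out) - i)  # guard against overrun
        out[i:i + n] += clip[:n]
    return out
```

This mirrors the design choice described above: rather than cancelling the voice by channel subtraction, only the amplitude of the speech band inside the dialogue segments is reduced, which preserves the low- and high-frequency background.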
The present invention also provides a server 1. Referring to Fig. 8, in one embodiment the server 1 comprises:
First receiving module 10, for receiving the first audio data and subtitle data sent by a television terminal;
In this embodiment, audio and video playback on the television is accomplished cooperatively by the television terminal and the server 1. The television terminal handles the preparation and transmission of the audio, subtitle, and other data and provides a user interface for the user to set parameters; the server 1 receives the audio, subtitle, and other data sent by the television terminal, processes the audio and subtitle data, and transfers the synthesized audio data back to the television terminal for playback.
In this embodiment, when the user opens the dubbing-settings function through the television terminal's remote controller, the television terminal rapidly decodes the video file, extracts the audio selected by the user (or the default audio) and the subtitles selected by the user (or the default subtitles), and packs the audio data and subtitle data to send to the server 1.
Generation processing module 20, for performing recognition processing on the first audio data and subtitle data to generate a role list and sample audio parameters;
The server 1 performs recognition processing on the first audio data and subtitle data to generate the role list and sample audio parameters. To generate the role list, a predetermined number of timestamps can first be chosen, e.g. three timestamp segments; recognition analysis is then performed on the audio data within each of the three segments, and voices with similar pronunciation across the timestamps are grouped into one class, giving different roles such as role 1 and role 2. As a concrete classification method, differentiation statistics can be performed on the audio spectra. The sample audio is preset fixed audio, which may be audio of different genders, with sample audio for each gender at different frequencies: for example, a specific clip such as "Select this voice as the character's dubbing?" may be chosen and provided as high-, mid-, and low-pitched male voices and high-, mid-, and low-pitched female voices. Of course, in other embodiments the audio frequency ranges can be subdivided further and are not limited to the high, mid, and low ranges of this embodiment. In addition, the sample audio can also be the voice of a famous or professional dubbing artist.
Synthesis processing module 30, for sending the role list and sample audio parameters to the television terminal and, when user-set parameters fed back by the television terminal according to the role list and sample audio parameters are received, synthesizing the first audio data into second audio data;
In this embodiment, the server 1 sends the generated role list and sample audio parameters to the television terminal. The television terminal presents a user interface on the TV screen, on which the user inputs and selects from the role list and sample audio parameters, thereby generating the user-set parameters; the television terminal then feeds the user-set parameters back to the server 1. The server 1 synthesizes the first audio data into the second audio data according to the user-set parameters. The synthesis of the second audio data uses a text-to-speech engine and a voice removal routine: the text-to-speech engine produces new audio data corresponding to the subtitle data (varying with the user-set parameters), the voice removal routine removes the voices from the first audio data, and the new audio data is then combined with the voice-removed first audio data into the second audio data.
First sending module 40, for sending the second audio data to the television terminal, so that the second audio data and the subtitle data are played on the television terminal.
In this embodiment, the server 1 sends the synthesized second audio data corresponding to the role list to the television terminal. It will be understood that, besides extracting the audio data and subtitle data from the video file, the television terminal can also extract video data from it; upon receiving the second audio data, the television terminal can then synchronize the video data with the second audio data and subtitle data and play them.
In the server 1 provided by the invention, the first audio data and subtitle data sent by the television terminal are first received and recognition processing is performed on them to generate the role list and sample audio parameters; the role list and sample audio parameters are then sent to the television terminal and, upon receiving the user-set parameters fed back by the television terminal, the server 1 synthesizes the first audio data into the second audio data according to those parameters; finally, the second audio data is sent to the television terminal so that the second audio data and subtitle data are played there. In this way, audio the user understands can be provided according to the language needs of different users, and the user's individual preferences for character dubbing can be met, avoiding the drawback of having to follow character dialogue and the plot through subtitles alone and thereby improving the user's TV viewing experience.
In one embodiment, as shown in Fig. 9, on the basis shown in Fig. 8 above, the generation processing module 20 comprises:
Acquisition unit 201, for extracting the subtitle timestamps from the subtitle data;
Search unit 202, for finding, according to the subtitle timestamps, the time segments in which the first audio data occurs;
In this embodiment, referring to Fig. 3, the server 1 extracts the subtitle timestamps from the subtitle data, finds according to them the time segments in which role dubbing occurs, and calls a voice recognition module to perform recognition processing, tallying the several most frequently occurring voices in those time segments for the user to choose from.
It will be understood that TV dubbing involves many roles: the main characters are dubbed most, and the less frequent roles may also be numerous. If the user had to select among all of them, the user's operating burden would increase.
Classification unit 203, for performing spectrum analysis on the first audio data within the time segments and carrying out classification to generate the role list;
In this embodiment, spectrum analysis is performed on the first audio data within the time segments, and the spectral range and spectral amplitude are used to find audio with similar spectra, which is grouped into the same class to generate the role list.
In one embodiment, referring to Fig. 10, the classification unit 203 comprises:
Obtaining subunit 2031, for obtaining the first audio data in a first time segment and in a second time segment respectively;
Judging subunit 2032, for judging whether the spectral range and spectral amplitude of the first audio data in the first time segment and in the second time segment are consistent;
In this embodiment, taking two time segments, a first time segment and a second time segment, as an example, the server obtains the first audio data in each of the two segments, analyzes the spectral range and spectral amplitude of the first audio data in each, and judges whether the spectral range and spectral amplitude of the first audio data in the first time segment are consistent with those of the first audio data in the second time segment.
First classification subunit 2033, for classifying the first audio data in the first and second time segments as the same role when the spectral range and spectral amplitude of the first audio data in the two segments are judged consistent;
Second classification subunit 2034, for classifying the first audio data in the first and second time segments as different roles when the spectral range and/or spectral amplitude of the first audio data in the two segments are judged inconsistent.
In this embodiment, if the server judges them consistent, the first audio data in the first and second time segments is classified as the same role; if inconsistent, it is classified as different roles.
It will be understood that, in judging whether the spectral range and spectral amplitude of the first audio data in two time segments are consistent, the two may be judged consistent when their similarity is greater than or equal to 90%, for example. Of course, in other embodiments the similarity value is not limited to that of this embodiment and can be chosen reasonably according to actual needs.
The audio spectrum of one time segment is taken as a reference and defined as role 1, and it is then compared with the audio spectrum in each subsequent time segment: if the features of the two spectra are judged close, the audio in the two segments is classified as role 1; if the features of the two spectra do not match, the audio in the subsequent segment is classified as role 2, and so on until the audio spectra in all time segments have been identified. Finally, the number of occurrences of each role is tallied; the roles that occur more often are the main characters.
In this embodiment, there is no need to recognize the specific content of the audio data, because that content is already provided in the subtitle data; this embodiment mainly analyzes the audio spectra of the audio data within the timestamps. The pronunciation of different characters differs in spectrum: the spectrum of a male voice is concentrated mainly in the low-frequency region, while that of a female voice is concentrated in the mid-to-high-frequency region. In addition, the spectral amplitudes at individual frequency points differ between the voices of different roles. The spectral range and spectral amplitude can therefore be combined to distinguish the voices of different roles.
Generation unit 204, for generating, using speech synthesis technology, the sample audio parameters corresponding to the role list.
In this embodiment, taking a read timestamp of 00:01:02:100 ~ 00:01:05:100 as an example, this timestamp indicates that the audio data within this time segment contains a character's voice; performing speech recognition on the audio data of this segment can identify the voice of one of the characters.
In this embodiment, the sample audio is preset fixed audio of different genders and frequencies, with high-, mid-, and low-pitched male and female voices as described above; in other embodiments the frequency ranges can be subdivided further and are not limited to the high, mid, and low ranges of this embodiment. After the server 1 sends the role list and sample audio parameters to the television terminal, a user interface pops up for the user to make selections: the role list contains the role classification results tallied above, and the sample parameters refer to the timestamp parameters within each role class and the sample audio available for the user to preview. Through the timestamp parameters, the user can preview the original dubbing as well as the sample audio.
In one embodiment, as shown in Fig. 11, on the basis shown in Fig. 9 above, the generation unit 204 comprises:
Extraction subunit 2041, for extracting, for each role in the role list, a predetermined number of subtitle timestamps from the subtitle data;
Generation subunit 2042, for generating, by the text-to-speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps, which are sent to the television terminal for preview and selection.
In this embodiment, the user can preview a role's original dubbing and the selectable sample audio. When the television terminal receives the user-set parameters and the user's sample audio selection, it sends the corresponding parameters to the server 1; the server 1 uses the text-to-speech engine to generate the predetermined number of sample audio parameters corresponding to the predetermined number of subtitle timestamps and sends them to the television terminal for preview and selection.
For example, suppose the character audio recognition process tallies 3 role classes with similar pronunciation. The server 1 provides 3 timestamps for each role class and at the same time sends the generated sample audio to the television terminal. The user can then, for each role class, select one of the provided timestamps so that the television terminal plays the audio of the corresponding time, enabling the user to identify which character the role class represents. In addition, the user can preview and audition the sample audio produced by the text-to-speech engine in order to select and confirm suitable sample audio parameters.
In one embodiment, as shown in Fig. 12, on the basis shown in Fig. 8 above, the synthesis processing module 30 comprises:
Sending unit 301, for sending the generated role list and sample audio parameters to the television terminal;
Receiving unit 302, for receiving the user-set parameters fed back by the television terminal according to the role list and sample audio parameters;
In this embodiment, the server 1 sends the generated role list and sample audio parameters to the television terminal; the television terminal presents a user interface on the TV screen, on which the user inputs and selects from the role list and sample audio parameters, thereby generating the user-set parameters; the television terminal then feeds the user-set parameters back to the server 1.
Synthesis unit 303, for performing audio filtering on the first audio data and synthesizing, by the text-to-speech engine in combination with the user-set parameters, the second audio data corresponding to the role list.
In this embodiment, referring to Fig. 7, the text-to-speech engine produces new audio data corresponding to the subtitle data (varying with the user's parameter settings), the voice removal routine removes the voices from the first audio data, and the new audio data is then combined with the voice-removed first audio data into the second audio data corresponding to the role list.
Existing voice removal methods mainly exploit the fact that the voice is pronounced identically in the left and right channels: subtracting the two channels removes the parts they have in common. However, this not only causes considerable loss to the background sound (especially in the low-frequency part), but also fails to remove the voice well when its pronunciation differs between the two channels. The present application instead adopts a band-pass filter approach: within the frequency band of the band-pass filter, it is only necessary to reduce the amplitude of the original voice so far that it does not interfere with distinguishing the synthesized audio, which better preserves the low- and high-frequency parts. Moreover, the audio data outside the timestamps is not affected at all.
The present invention also provides one to televise control system 100, with reference to Figure 13, in one embodiment, described in control system 100 of televising comprise television terminal 2 and server 1 as above, with reference to Figure 14, described television terminal 2 comprises:
Second sending module 50, for sending the first voice data and caption data to server 1;
In the present embodiment, the Voice & Video of TV is play, and cooperated with server 1 by television terminal 2, described television terminal 2 completes arrangement and the transmission of the data such as audio frequency, captions, and provides user interface to carry out optimum configurations for user.And server 1 receives the data such as audio frequency, captions that television terminal 2 sends, and complete the process of audio frequency, caption data, show to be transferred to television terminal 2 after Composite tone data.
In the present embodiment, user by the remote controller of television terminal 2 open dub function is set time, television terminal 2 carries out fast decoding to video file, obtain video data, the first voice data and caption data, therefrom extract audio frequency that user selects or the captions that default audio and user select or default subtitle, and voice data and caption data are packaged into are sent to server 1.
Second receiver module 60, for receiving after described server 1 carries out identifying processing to described first voice data and caption data, the character list of generation and sample audio frequency parameter;
In the present embodiment, the generation of described character list, the timestamp of predetermined quantity first can be chosen by server 1, as chosen three sections of timestamps, then respectively discriminance analysis is carried out to the voice data in described three sections of timestamps, and the similar voice that pronounce in each timestamp are classified as a class and the different role such as role 1, role 2.Concrete classifying method, can carry out differentiation statistics according to audible spectrum.And sample audio frequency is default fixed-audio, can be the audio frequency of different sexes, and the sample audio frequency that different sexes different frequency is corresponding, such as, choose one section of specific audio frequency " selecting this section of voice to be dubbing of character? " and male voice treble audio is provided respectively, male voice middle pitch audio frequency, male voice audio bass, female voice treble audio, female voice middle pitch audio frequency, the sample audio frequency such as female voice audio bass, certainly, in other embodiments, the frequency range of audio frequency can also be segmented further, be not limited to the height in the present embodiment, in, low three kinds of audioranges.In addition, described sample audio frequency can also be the fixed-audio that the famous personnel of dubbing or specialty dub personnel.
Feedback module 70, for generating user parameter settings according to the character list and sample audio parameters, and feeding the user parameter settings back to the server 1;
In this embodiment, when the television terminal 2 receives the character list and sample audio parameters generated by the server 1, it presents a user interface on the television screen on which the user selects entries from the character list and the sample audio parameters, thereby generating the user parameter settings.
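For illustration only, the resulting user parameter settings fed back to the server 1 might resemble the structure below; every field and sample name is a hypothetical choice, not a format defined by the patent.

```python
import json

# Hypothetical settings assembled from the user's on-screen choices.
user_settings = {
    "video_id": "example-movie-001",
    "selections": [
        {"role": "Role 1", "sample_audio": "female_mid"},
        {"role": "Role 2", "sample_audio": "male_low"},
    ],
}
payload = json.dumps(user_settings)  # fed back to the server 1
```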
Acquisition module 80, for obtaining, when the server 1 receives the user parameter settings, the second audio data synthesized by the server 1 from the first audio data;
In this embodiment, the server 1 synthesizes the first audio data into the second audio data according to the user parameter settings. Specifically, the text-to-speech engine produces new audio data corresponding to the caption data (varying with the user parameter settings), the voice elimination program removes the original voices from the first audio data, and the new audio data is then mixed with the voice-eliminated first audio data to form the second audio data.
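A minimal sketch of this synthesis step follows; `tts_synthesize` and `eliminate_voice` are placeholders standing in for the text-to-speech engine and the voice elimination program, and aligning each dubbed line with its caption timestamp is an assumption about the mixing scheme.

```python
import numpy as np

def synthesize_second_audio(first_audio, sr, captions, voice_for_role,
                            tts_synthesize, eliminate_voice):
    """Mix TTS dubbing into the voice-suppressed original track.
    `captions` holds cues like {"start": s, "text": ..., "role": ...}."""
    second = eliminate_voice(first_audio, sr).copy()
    for cue in captions:
        voice = tts_synthesize(cue["text"], voice_for_role[cue["role"]], sr)
        start = int(cue["start"] * sr)
        end = min(start + len(voice), len(second))
        second[start:end] += voice[: end - start]
    peak = float(np.max(np.abs(second)))  # guard against clipping
    return second / peak if peak > 1.0 else second
```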
Synchronous playing module 90, for synchronously playing the second audio data, the video data, and the caption data;
In this embodiment, when the television terminal 2 receives the second audio data synthesized by the server 1 in correspondence with the character list, it synchronizes the second audio data with the video data and caption data and then plays them. In this way, the server 1 pre-processes the audio of the video file and synthesizes speech in a language the user understands, which can enhance the viewing experience; in addition, the user can choose among voices for the various roles, which further improves the user experience.
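The synchronization might be driven off a single clock, as in the minimal sketch below; `render_frame`, `play_audio_chunk`, and `draw_caption` are hypothetical output callbacks, since the patent does not describe the terminal's playback internals (caption clearing at a cue's end time is omitted for brevity).

```python
import time

def play_synchronized(video_frames, audio_chunks, captions,
                      render_frame, play_audio_chunk, draw_caption):
    """Emit all three streams against one wall-clock start time so the
    second audio data, video data and caption data stay aligned."""
    start = time.monotonic()
    events = sorted(
        [(f["pts"], render_frame, f) for f in video_frames]
        + [(c["pts"], play_audio_chunk, c) for c in audio_chunks]
        + [(s["start"], draw_caption, s) for s in captions],
        key=lambda e: e[0],
    )
    for pts, emit, payload in events:
        delay = pts - (time.monotonic() - start)
        if delay > 0:
            time.sleep(delay)
        emit(payload)
```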
The above are only preferred embodiments of the present invention and do not thereby limit the scope of the claims of the present invention; any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included within the patent protection scope of the present invention.

Claims (11)

1. A television play control method, characterized in that the television play control method comprises the following steps:
a server receiving first audio data and caption data sent by a television terminal;
performing recognition processing on the first audio data and caption data to generate a character list and sample audio parameters;
sending the character list and sample audio parameters to the television terminal, and, upon receiving user parameter settings fed back by the television terminal according to the character list and sample audio parameters, synthesizing the first audio data into second audio data;
sending the second audio data to the television terminal, so as to control the second audio data and the caption data to be played at the television terminal.
2. The television play control method as claimed in claim 1, characterized in that the step of performing recognition processing on the first audio data and caption data to generate a character list and sample audio parameters comprises:
the server extracting caption timestamps from the caption data;
finding, according to the caption timestamps, the time segments in which the first audio data occurs;
performing spectrum analysis on the first audio data within the time segments, and classifying the results to generate the character list;
generating, using speech synthesis technology, the sample audio parameters corresponding to the character list;
wherein the television terminal extracts the first audio data and the caption data from a video file, and sends the first audio data and caption data to the server.
3. The television play control method as claimed in claim 2, characterized in that the step of generating, using speech synthesis technology, the sample audio parameters corresponding to the character list comprises:
for each role in the character list, extracting a predetermined number of caption timestamps from the caption data;
generating, by a text-to-speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of caption timestamps, to be sent to the television terminal for preview and selection.
4. The television play control method as claimed in claim 2, characterized in that the step of sending the character list and sample audio parameters to the television terminal and, upon receiving the user parameter settings fed back by the television terminal according to the character list and sample audio parameters, synthesizing the first audio data into second audio data comprises:
sending the generated character list and sample audio parameters to the television terminal;
receiving the user parameter settings fed back by the television terminal according to the character list and sample audio parameters;
filtering the first audio data, and synthesizing, by a text-to-speech engine in combination with the user parameter settings, the second audio data corresponding to the character list;
wherein the television terminal receives the character list entries and sample audio parameters selected by the user through a user interface to generate the user parameter settings, and feeds the user parameter settings back to the server.
5. The television play control method as claimed in claim 2, characterized in that the step of performing spectrum analysis on the first audio data within the time segments and classifying the results to generate the character list comprises:
obtaining the first audio data in a first time segment and in a second time segment respectively;
judging whether the spectral range and spectral amplitude of the first audio data in the first time segment and the second time segment are consistent;
if so, classifying the first audio data in the first time segment and the second time segment as the same role;
if not, classifying the first audio data in the first time segment and the second time segment as different roles.
6. A server, characterized in that the server comprises:
a first receiver module, for receiving first audio data and caption data sent by a television terminal;
a generating and processing module, for performing recognition processing on the first audio data and caption data to generate a character list and sample audio parameters;
a synthesis processing module, for sending the character list and sample audio parameters to the television terminal, and, upon receiving user parameter settings fed back by the television terminal according to the character list and sample audio parameters, synthesizing the first audio data into second audio data;
a first sending module, for sending the second audio data to the television terminal, so as to control the second audio data and the caption data to be played at the television terminal.
7. The server as claimed in claim 6, characterized in that the generating and processing module comprises:
an acquiring unit, for extracting caption timestamps from the caption data;
a searching unit, for finding, according to the caption timestamps, the time segments in which the first audio data occurs;
a classifying unit, for performing spectrum analysis on the first audio data within the time segments, and classifying the results to generate the character list;
a generating unit, for generating, using speech synthesis technology, the sample audio parameters corresponding to the character list;
wherein the television terminal extracts the first audio data and the caption data from a video file, and sends the first audio data and caption data to the server.
8. The server as claimed in claim 7, characterized in that the generating unit comprises:
an extracting subunit, for extracting, for each role in the character list, a predetermined number of caption timestamps from the caption data;
a generating subunit, for generating, by a text-to-speech engine, a predetermined number of sample audio parameters corresponding to the predetermined number of caption timestamps, to be sent to the television terminal for preview and selection.
9. The server as claimed in claim 7, characterized in that the synthesis processing module comprises:
a transmitting unit, for sending the generated character list and sample audio parameters to the television terminal;
a receiving unit, for receiving the user parameter settings fed back by the television terminal according to the character list and sample audio parameters;
a synthesis unit, for filtering the first audio data and synthesizing, by a text-to-speech engine in combination with the user parameter settings, the second audio data corresponding to the character list;
wherein the television terminal receives the character list entries and sample audio parameters selected by the user through a user interface to generate the user parameter settings, and feeds the user parameter settings back to the server.
10. The server as claimed in claim 7, characterized in that the classifying unit comprises:
an obtaining subunit, for obtaining the first audio data in a first time segment and in a second time segment respectively;
a judging subunit, for judging whether the spectral range and spectral amplitude of the first audio data in the first time segment and the second time segment are consistent;
a first classifying subunit, for classifying the first audio data in the first time segment and the second time segment as the same role when the spectral range and spectral amplitude are judged to be consistent;
a second classifying subunit, for classifying the first audio data in the first time segment and the second time segment as different roles when the spectral range and/or spectral amplitude are judged to be inconsistent.
11. A television play control system, characterized in that the television play control system comprises a television terminal and the server as claimed in any one of claims 6 to 10, the television terminal comprising:
a second sending module, for sending first audio data and caption data to the server;
a second receiver module, for receiving the character list and sample audio parameters generated after the server performs recognition processing on the first audio data and caption data;
a feedback module, for generating user parameter settings according to the character list and sample audio parameters, and feeding the user parameter settings back to the server;
an acquisition module, for obtaining, when the server receives the user parameter settings, the second audio data synthesized by the server from the first audio data;
a synchronous playing module, for synchronously playing the second audio data, video data, and caption data;
wherein the television terminal extracts the video data, the first audio data, and the caption data from the video file.
CN201510633934.3A 2015-09-29 2015-09-29 Television play control method, server and television play control system Pending CN105227966A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201510633934.3A CN105227966A (en) 2015-09-29 2015-09-29 Television play control method, server and television play control system
PCT/CN2016/084461 WO2017054488A1 (en) 2015-09-29 2016-06-02 Television play control method, server and television play control system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510633934.3A CN105227966A (en) 2015-09-29 2015-09-29 Television play control method, server and television play control system

Publications (1)

Publication Number Publication Date
CN105227966A true CN105227966A (en) 2016-01-06

Family

ID=54996603

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510633934.3A Pending CN105227966A (en) Television play control method, server and television play control system

Country Status (2)

Country Link
CN (1) CN105227966A (en)
WO (1) WO2017054488A1 (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112714348A (en) * 2020-12-28 2021-04-27 深圳市亿联智能有限公司 Intelligent audio and video synchronization method

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1616272A1 (en) * 2003-04-14 2006-01-18 Koninklijke Philips Electronics N.V. System and method for performing automatic dubbing on an audio-visual stream
RU2007146365A (en) * 2005-05-31 2009-07-20 Конинклейке Филипс Электроникс Н.В. (De) METHOD AND DEVICE FOR PERFORMING AUTOMATIC DUPLICATION OF A MULTIMEDIA SIGNAL
US20120105719A1 (en) * 2010-10-29 2012-05-03 Lsi Corporation Speech substitution of a real-time multimedia presentation
US8949123B2 (en) * 2011-04-11 2015-02-03 Samsung Electronics Co., Ltd. Display apparatus and voice conversion method thereof
CN105227966A (en) * 2015-09-29 2016-01-06 深圳Tcl新技术有限公司 To televise control method, server and control system of televising

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1534595A (en) * 2003-03-28 2004-10-06 中颖电子(上海)有限公司 Speech sound change over synthesis device and its method
CN101359473A (en) * 2007-07-30 2009-02-04 国际商业机器公司 Auto speech conversion method and apparatus

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017054488A1 (en) * 2015-09-29 2017-04-06 深圳Tcl新技术有限公司 Television play control method, server and television play control system
CN107659850A (en) * 2016-11-24 2018-02-02 腾讯科技(北京)有限公司 Media information processing method and device
CN107172449A (en) * 2017-06-19 2017-09-15 微鲸科技有限公司 Multi-medium play method, device and multimedia storage method
CN107484016A (en) * 2017-09-05 2017-12-15 深圳Tcl新技术有限公司 Video dubs switching method, television set and computer-readable recording medium
CN109242802A (en) * 2018-09-28 2019-01-18 Oppo广东移动通信有限公司 Image processing method, device, electronic equipment and computer-readable medium
CN109242802B (en) * 2018-09-28 2021-06-15 Oppo广东移动通信有限公司 Image processing method, image processing device, electronic equipment and computer readable medium
CN110366032A (en) * 2019-08-09 2019-10-22 腾讯科技(深圳)有限公司 Video data handling procedure, device and video broadcasting method, device
CN113766288A (en) * 2021-08-04 2021-12-07 深圳Tcl新技术有限公司 Electric quantity prompting method and device and computer readable storage medium

Also Published As

Publication number Publication date
WO2017054488A1 (en) 2017-04-06

Similar Documents

Publication Publication Date Title
CN105227966A (en) Television play control method, server and television play control system
US9942599B2 (en) Methods and apparatus to synchronize second screen content with audio/video programming using closed captioning data
CN103607609B (en) The method for switching languages and device of a kind of TV channel
EP1246166B1 (en) Speech recognition based captioning system
CN101616264B (en) Method and system for cataloging news video
CN106340291A (en) Bilingual subtitle production method and system
CN107958668B (en) Voice control broadcasting method and voice control broadcasting system of smart television
CN110881115B (en) Strip splitting method and system for conference video
MX2014014741A (en) Methods and apparatus for identifying media.
CN103916704A (en) Dialog-type interface apparatus and method for controlling the same
CN107609034A (en) A kind of audio frequency playing method of intelligent sound box, audio playing apparatus and storage medium
CN107360507A (en) A kind of play parameter Automatic adjustment method, intelligent sound box and storage medium
CN111432140B (en) Method for splitting television news into strips by using artificial neural network
CN112019871B (en) Live E-commerce content intelligent management platform based on big data
CN103607635A (en) Method, device and terminal for caption identification
CN110691204A (en) Audio and video processing method and device, electronic equipment and storage medium
CN110691271A (en) News video generation method, system, device and storage medium
US11164347B2 (en) Information processing apparatus, information processing method, and program
CN113115103A (en) System and method for realizing real-time audio-to-text conversion in network live broadcast
CN104602099B (en) A kind of method and device of Subtitle Demonstration
KR20080052304A (en) The method and apparatus for making an answer
JP7137825B2 (en) Video information provision system
KR20130128211A (en) Audio contents interlocking data providing apparatus, system the same and method thereof
US20160163354A1 (en) Programme Control
CN106782598A (en) Television image and peripheral hardware synchronous sound control method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20160106