WO2016037440A1 - Video voice conversion method and device and server


Info

Publication number
WO2016037440A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2014/094217
Other languages
French (fr)
Chinese (zh)
Inventor
秦铎浩
沈国龙
Original Assignee
Baidu Online Network Technology (Beijing) Co., Ltd.
Application filed by Baidu Online Network Technology (Beijing) Co., Ltd.
Publication of WO2016037440A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling


Abstract

A video voice conversion method, device, and server relate to the technical field of multimedia processing and are used to reduce the cost of translating the speech in a video and to improve translation efficiency and accuracy. The method comprises: extracting a speech signal of a source language from a video and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal (101); for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model (102); and merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language (103).

Description

Video voice conversion method, device and server
This application claims priority to Chinese Patent Application No. 201410461061.8, filed on September 11, 2014 by Baidu Online Network Technology (Beijing) Co., Ltd. and entitled "Video Voice Conversion Method, Device and Server", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of multimedia processing technologies, and in particular to a video voice conversion method, device, and server.
Background
Foreign-language videos, such as Hollywood movies and foreign-language tutorial videos, are frequently encountered in daily life. Viewers who are not proficient in the foreign language need auxiliary translated subtitles when watching such videos, but in many cases foreign-language videos have no subtitles; if the viewer cannot understand the foreign language, such a video is of no use to the viewer.
In the prior art, three approaches are mainly used to make foreign-language videos understandable: the first is to add manually translated subtitles to the foreign-language video in advance; the second is to produce a dubbed version of the video, in which the speech is manually dubbed into the national language; the third is to have simultaneous-interpretation experts manually translate the speech in the video in real time at the playback venue, using shorthand or similar techniques, and convey the translation result.
The drawback of the prior art is that all three approaches rely on manual translation and conversion of the speech, which is costly and inefficient, and whose accuracy is difficult to guarantee.
Summary
The present invention provides a video voice conversion method, device, and server, so as to reduce the cost of translating the speech in a video and improve translation efficiency and accuracy.
In a first aspect, an embodiment of the present invention provides a video voice conversion method, including:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
In a second aspect, an embodiment of the present invention further provides a video voice conversion device, including:
a source speech extraction unit, configured to extract a speech signal of a source language from a video;
a source speech processing unit, configured to segment the source-language speech signal to obtain at least one source-language sub-speech signal;
a target speech conversion unit, configured to convert, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
a speech-video merging unit, configured to merge each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
In a third aspect, an embodiment of the present invention further provides a server, including:
one or more processors;
a memory; and
one or more modules stored in the memory which, when executed by the one or more processors, perform the following operations:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
In the embodiments of the present invention, a speech signal of a source language is extracted from a video and segmented to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, the source-language sub-speech signal is converted into a target-language sub-speech signal according to a pre-established speech model; and the obtained target-language sub-speech signals are then merged with the video to obtain a video containing a speech signal of the target language. The solution thus uses a speech model to automatically translate and convert the speech signal in a video without manual involvement, which reduces cost and improves conversion efficiency, and it avoids the low accuracy of manual translation, so that the accuracy of the result can be better guaranteed.
Brief Description of the Drawings
FIG. 1A is a schematic flowchart of a video voice conversion method according to Embodiment 1 of the present invention;
FIG. 1B is a schematic diagram of segmenting a source-language speech signal according to Embodiment 1 of the present invention;
FIG. 2A is a schematic flowchart of a video voice conversion method according to Embodiment 2 of the present invention;
FIG. 2B is a schematic diagram of an interface for a user to select a target language type according to Embodiment 2 of the present invention;
FIG. 3 is a schematic flowchart of a video voice conversion method according to Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of a video voice conversion device according to Embodiment 4 of the present invention;
FIG. 5 is a schematic diagram of the hardware structure of a server according to Embodiment 5 of the present invention.
Detailed Description
The present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to explain the present invention and not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
FIG. 1A is a flowchart of a video voice conversion method according to Embodiment 1 of the present invention, and FIG. 1B is a schematic diagram of segmenting a source-language speech signal according to Embodiment 1. This embodiment is applicable to cases where a speech signal of a source language in a video needs to be converted into a speech signal of a target language. The method may be performed by a video voice conversion device, which may be provided in a server. The method specifically includes the following operations:
101: extract a speech signal of a source language from the video, and segment the source-language speech signal to obtain at least one source-language sub-speech signal.
Here, when the source-language speech signal in the video is long, segmenting it according to a given method may yield multiple source-language sub-speech signals; when the source-language speech signal in the video is short, segmenting it may yield only one source-language sub-speech signal.
102: for each source-language sub-speech signal, convert the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model.
103: merge each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Specifically, extracting the source-language speech signal from the video in operation 101 may be implemented as follows:
extract the audio signal from the video, and extract the source-language speech signal from the audio signal according to the frequency characteristics of speech. For example, the frequency information of the extracted audio signal is obtained first, and the audio components with frequencies in the range of 300 to 3400 Hz are then extracted as the speech signal.
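The patent does not prescribe how the frequency-based extraction of operation 101 is implemented. As one illustration only, the following Python sketch assumes the video's audio track has already been demultiplexed into a mono NumPy array sampled at fs, and uses a Butterworth band-pass filter to retain the 300-3400 Hz speech band; the filter type and order are assumptions, not part of the disclosure.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def extract_speech_band(audio: np.ndarray, fs: int,
                        low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Keep only the 300-3400 Hz band of a mono audio track, approximating
    the frequency-based speech extraction described for operation 101."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, audio)
```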
Specifically, segmenting the source-language speech signal in operation 101 may be implemented as follows: segment according to the amplitude of the source-language speech signal. For example, the signal between every two time points at which the amplitude is 0 may be divided into one sub-speech signal; as shown in FIG. 1B, the signal between time point 00:01 and time point 00:03:73 is divided into one sub-speech signal. A specific implementation flow may be as follows:
A. find, in the source-language speech signal, the time point of the first signal whose amplitude is 0, and take that time point as the start time point t0;
B. find, in the source-language speech signal, the time point of the first signal whose amplitude is 0 after the current start time point t0, and take that time point as the end time point t1;
C. divide the speech signal between the current start time point t0 and the end time point t1 into one sub-speech signal;
D. determine whether any speech signal remains; if so, continue to find the time point of the first signal whose amplitude is 0 after the current end time point t1 in the source-language speech signal, take that time point as the new start time point t0, and return to step B; otherwise, the flow ends.
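Steps A to D above can be expressed compactly as a loop over the zero-amplitude time points. The sketch below is an assumption-laden illustration: sampled audio rarely contains exact zeros, so a small tolerance eps is introduced here, which the patent itself does not mention.

```python
import numpy as np


def segment_by_zero_amplitude(speech: np.ndarray, fs: int, eps: float = 1e-4):
    """Pair successive zero-amplitude time points (t0, t1) as in steps A-D and
    cut the samples between each pair into one sub-speech signal.
    Returns a list of (start_seconds, end_seconds, samples) tuples."""
    zeros = np.flatnonzero(np.abs(speech) <= eps)  # time points with "amplitude 0"
    segments = []
    i = 0
    while i + 1 < len(zeros):            # step D: loop while signal remains
        t0, t1 = zeros[i], zeros[i + 1]  # steps A/B: start and end time points
        if t1 > t0 + 1:                  # step C: keep the span between t0 and t1
            segments.append((t0 / fs, t1 / fs, speech[t0:t1]))
        i += 2                           # step D: the next zero after t1 becomes t0
    return segments
```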
Preferably, in order to extract a speech signal that is as clean as possible from a noisy speech signal and thereby improve the accuracy of the translation and conversion, after the source-language speech signal is extracted from the video in operation 101 and before it is segmented, the method further includes: denoising the source-language speech signal. Specifically, the denoising may be implemented by a speech enhancement algorithm, including but not limited to: speech enhancement based on spectral subtraction, speech enhancement based on wavelet analysis, speech enhancement based on independent component analysis, speech enhancement based on neural networks, and the like.
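The patent only names families of speech enhancement algorithms. As a minimal sketch of the first of them, the following spectral-subtraction routine estimates the noise magnitude spectrum from the opening portion of the signal and subtracts it frame by frame; the assumption that the first half second is speech-free, as well as the frame length, are illustrative choices not taken from the disclosure.

```python
import numpy as np


def spectral_subtraction(noisy: np.ndarray, fs: int,
                         noise_secs: float = 0.5, frame: int = 512) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum from each frame,
    keeping the noisy phase (a bare-bones spectral-subtraction denoiser)."""
    noise = noisy[: int(noise_secs * fs)]
    noise_frames = [noise[i:i + frame]
                    for i in range(0, len(noise) - frame + 1, frame)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = noisy.astype(np.float64).copy()
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor at zero
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```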
Specifically, converting each source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model in operation 102 may be implemented as follows:
for each source-language sub-speech signal, input the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal; translate the source-language sub-text data into target-language sub-text data; and synthesize the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology. For example, when the source language is English and the target language is Chinese, for each English sub-speech signal, the English sub-speech signal is input into the pre-established speech model to obtain the English sub-text data (English characters) corresponding to that sub-speech signal, the English sub-text data is translated into Chinese sub-text data (Chinese characters), and the Chinese sub-text data is synthesized into a Chinese sub-speech signal using speech synthesis technology.
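The per-segment conversion of operation 102 amounts to a recognize-translate-synthesize chain. In the sketch below, recognize, translate, and synthesize are placeholder callables standing in for the pre-established speech model, the translation step, and the speech synthesiser respectively; the patent does not name any concrete engine, so these interfaces are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class SubSpeech:
    start: float         # segment start time in seconds (retained timestamp)
    end: float           # segment end time in seconds
    samples: np.ndarray  # waveform of the sub-speech signal


def convert_segments(source_segments: List[SubSpeech],
                     recognize: Callable[[np.ndarray], str],   # speech model: audio -> source sub-text
                     translate: Callable[[str], str],          # source sub-text -> target sub-text
                     synthesize: Callable[[str], np.ndarray],  # target sub-text -> target sub-speech
                     ) -> List[SubSpeech]:
    """Operation 102 applied per segment, carrying each segment's timestamp
    over to the converted target-language sub-speech signal."""
    converted = []
    for seg in source_segments:
        source_text = recognize(seg.samples)    # e.g. English characters
        target_text = translate(source_text)    # e.g. Chinese characters
        target_audio = synthesize(target_text)  # target-language sub-speech signal
        converted.append(SubSpeech(seg.start, seg.end, target_audio))
    return converted
```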
The above speech model is a data model obtained through prior data training and used to obtain, from an input speech signal, the text data corresponding to that speech signal. Preferably, speech models may be generated in advance for different domains, for example separately for the military, science and technology, and arts domains; correspondingly, the speech model used in operation 102 may be the model corresponding to the domain to which the current video belongs, thereby improving the accuracy of the resulting text data. For example, if the current video belongs to the military domain, the speech model for the military domain is used; if the current video belongs to the technology domain, the speech model for the technology domain is used; and so on.
Specifically, synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology may be implemented as follows:
process the target-language sub-text data into text data that a computer can understand using natural language processing technology, which may include natural language processing steps such as text normalization, word segmentation, syntactic analysis, and semantic analysis; then perform prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal, the features including at least one of pitch, duration, and intensity, so that the synthesized sub-speech signal can express the intended meaning correctly; finally, use acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features. For example, the acoustic processing technology may be LPC (linear predictive coding), PSOLA (pitch-synchronous overlap-add) synthesis, speech synthesis based on the LMA vocal tract model, and the like.
Further, when the source-language speech signal is segmented in operation 101, the time stamp (including the start time and the end time) of each source-language sub-speech signal is retained, so that each target-language sub-speech signal obtained by the conversion in operation 102 also carries the time stamp of the corresponding source-language sub-speech signal. Correspondingly, merging the obtained target-language sub-speech signals with the video in operation 103 may be implemented as follows: for each target-language sub-speech signal, merge the target-language sub-speech signal into the video at the playback position corresponding to its time stamp. For example, suppose there are three target-language sub-speech signals, whose time stamps are 00:10:00-00:20:00, 00:30:00-00:40:00, and 00:50:00-00:60:00 respectively; then the first target-language sub-speech signal is merged into the video at playback position 00:10:00-00:20:00, the second at playback position 00:30:00-00:40:00, and the third at playback position 00:50:00-00:60:00.
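One way to realize the timestamp-based merge of operation 103 is to lay the converted sub-speech signals onto a silent audio track of the same length as the video and then remultiplex that track with the video stream. The sketch below only builds the audio track; the multiplexing step and the (start_seconds, samples) representation of the converted segments are assumptions made for illustration.

```python
from typing import List, Tuple

import numpy as np


def build_target_track(converted: List[Tuple[float, np.ndarray]],
                       fs: int, video_secs: float) -> np.ndarray:
    """Place each (start_seconds, samples) target-language sub-speech signal
    on a silent track at the playback position given by its retained timestamp."""
    track = np.zeros(int(round(video_secs * fs)), dtype=np.float32)
    for start_sec, samples in converted:
        start = int(round(start_sec * fs))
        clip = samples[: max(0, len(track) - start)]  # drop anything past the video end
        track[start:start + len(clip)] = clip
    return track
```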
In the technical solution of this embodiment, a speech signal of a source language is extracted from a video and segmented to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, the source-language sub-speech signal is converted into a target-language sub-speech signal according to a pre-established speech model; and the obtained target-language sub-speech signals are then merged with the video to obtain a video containing a speech signal of the target language. The solution thus uses a speech model to automatically translate and convert the speech signal in a video without manual involvement, which reduces cost and improves conversion efficiency, and it avoids the low accuracy of manual translation, so that the accuracy of the result can be better guaranteed.
Embodiment 2
FIG. 2A shows a video voice conversion method according to Embodiment 2 of the present invention, and FIG. 2B is a schematic diagram of an interface for a user to select a target language type in Embodiment 2. This embodiment is applicable to cases where the speech signal of the source language in a video is converted into a speech signal of a target language before the video is played. The method may be performed by a video voice conversion device and a video playback device, which may be provided in the same server or in different servers. The method specifically includes the following operations:
201: the video voice conversion device determines, according to setting information, at least one target language to be converted.
202: for each target language to be converted, the video voice conversion device performs the following operations: extract the speech signal of the source language from the video, and segment the source-language speech signal to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, convert the source-language sub-speech signal into a sub-speech signal of the current target language according to a pre-established speech model; merge the obtained sub-speech signals of the current target language with the video to obtain a video containing a speech signal of the current target language, and store the video.
For this operation, reference may be made to the detailed description of Embodiment 1, which is not repeated here.
203: the video playback device receives a video play request, the play request containing a target language type selected by the user or selected automatically.
For an example of the user selecting the target language type, see FIG. 2B: the user may select Mandarin or Sichuan dialect as the target language type in a "simultaneous interpretation" menu.
204: the video playback device obtains, from the video voice conversion device, the video containing the speech signal of the target language corresponding to the target language type in the play request, and sends the obtained video to the terminal device for playback.
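Operations 201 to 204 can be summarized as a pre-conversion loop plus a lookup at play time. In this sketch, convert_video is a hypothetical callable that runs the Embodiment 1 pipeline for one target language and returns the path of the stored result; nothing about the storage layout or request handling is specified in the patent itself.

```python
from typing import Callable, Dict, Iterable


def preconvert(video_id: str, target_languages: Iterable[str],
               convert_video: Callable[[str, str], str]) -> Dict[str, str]:
    """Operations 201/202: run the Embodiment 1 conversion once per configured
    target language and remember where each converted video is stored."""
    return {lang: convert_video(video_id, lang) for lang in target_languages}


def handle_play_request(store: Dict[str, str], requested_language: str) -> str:
    """Operations 203/204: look up the pre-converted video matching the target
    language type in the play request and hand it to the playback device."""
    return store[requested_language]
```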
In the technical solution of this embodiment, before the video is played, for each preset target language, the speech signal of the source language in the video is converted into a speech signal of that target language according to the method of Embodiment 1, so as to obtain a video containing the speech signal of that target language; when a play request containing a user-selected or automatically selected target language type is received, the video containing the speech signal of the target language corresponding to the target language type in the play request is obtained and played. This solution can therefore satisfy the need to play the same video in different languages, and because the translation and conversion of the speech signal in the video is completed before playback, the user does not need to wait for the conversion after submitting the play request, so the system responds to play requests quickly and the user experience is good.
Embodiment 3
FIG. 3 shows a video voice conversion method according to Embodiment 3 of the present invention. This embodiment is applicable to cases where the speech signal of the source language in a video is converted into a speech signal of the target language in real time after a play request is received. The method may be performed by a video voice conversion device and a video playback device, which may be provided in the same server or in different servers. The method specifically includes the following operations:
301: the video playback device receives a video play request, the play request containing a target language type selected by the user or selected automatically.
For an example of the user selecting the target language type, see FIG. 2B: the user may select Mandarin or Sichuan dialect as the target language type in a "simultaneous interpretation" menu.
302: the video voice conversion device performs the following operations: extract the speech signal of the source language from the video, and segment the source-language speech signal to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, convert the source-language sub-speech signal into a sub-speech signal of the target language corresponding to the target language type in the video play request, according to a pre-established speech model; merge the obtained target-language sub-speech signals with the video to obtain a video containing the speech signal of that target language.
For this operation, reference may be made to the detailed description of Embodiment 1, which is not repeated here.
303: the video playback device sends the video containing the speech signal of the target language, obtained by the video voice conversion device, to the terminal device for playback.
In the technical solution of this embodiment, after a video play request is received, the speech signal of the source language in the video is converted into the speech signal of the target language indicated by the video play request according to the method of Embodiment 1, so as to obtain a video containing the speech signal of the target language, which is then played. This solution can therefore satisfy the need to play the same video in different languages, and because the translation and conversion of the speech signal in the video is performed when the play request is received, there is no need to perform the translation and conversion for different target languages in advance or to store the resulting videos, which saves system resources.
Embodiment 4
FIG. 4 is a schematic structural diagram of a video voice conversion device according to Embodiment 4 of the present invention. Specifically, the device includes:
a source speech extraction unit 401, configured to extract a speech signal of a source language from a video;
a source speech processing unit 402, configured to segment the source-language speech signal to obtain at least one source-language sub-speech signal;
a target speech conversion unit 403, configured to convert, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
a speech-video merging unit 404, configured to merge each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Further, the source speech extraction unit 401 is specifically configured to:
extract the audio signal from the video, and extract the source-language speech signal from the audio signal according to the frequency characteristics of speech.
Further, the source speech processing unit 402 is specifically configured to:
segment according to the amplitude of the source-language speech signal.
Further, the source speech processing unit 402 is further configured to:
denoise the source-language speech signal before segmenting it.
Further, the target speech conversion unit 403 is specifically configured to:
for each source-language sub-speech signal, input the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal, translate the source-language sub-text data into target-language sub-text data, and synthesize the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology.
Further, the target speech conversion unit 403 is specifically configured to synthesize the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology in the following manner:
process the target-language sub-text data into text data that a computer can understand using natural language processing technology; perform prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal; and use acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features.
Further, the source speech processing unit 402 retains the time stamp of each source-language sub-speech signal when segmenting the source-language speech signal;
the speech-video merging unit 404 is specifically configured to: for each target-language sub-speech signal, merge the target-language sub-speech signal into the video at the playback position corresponding to the time stamp of that target-language sub-speech signal.
The above video voice conversion device can execute the video voice conversion method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to the execution of the method.
An embodiment of the present invention further provides a server that includes the above video voice conversion device. The server may specifically be a PC (personal computer), a laptop computer, or a similar device.
Embodiment 5
Referring to FIG. 5, a schematic diagram of the hardware structure of a server according to Embodiment 5 of the present invention, the server includes:
one or more processors 510, with one processor 510 taken as an example in FIG. 5;
a memory 520; and one or more modules stored in the memory 520, such as the source speech extraction unit 401, the source speech processing unit 402, the target speech conversion unit 403, and the speech-video merging unit 404 in FIG. 4.
The server may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 in the server may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 5.
As a computer-readable storage medium, the memory 520 may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the video voice conversion method in the embodiments of the present invention (for example, the source speech extraction unit 401, the source speech processing unit 402, the target speech conversion unit 403, and the speech-video merging unit 404 shown in FIG. 4). The processor 510 executes the various functional applications and data processing of the server by running the software programs, instructions, and modules stored in the memory 520, that is, implements the video voice conversion method in the above method embodiments.
The memory 520 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. In addition, the memory 520 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 520 may further include memory remotely located relative to the processor 510, and such remote memory may be connected to the terminal device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal. The output device 540 may include a display device such as a display screen.
When the one or more modules stored in the memory 520 are executed by the one or more processors 510, the following operations are performed:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Further, extracting the speech signal of the source language from the video may preferably include:
extracting the audio signal from the video, and extracting the source-language speech signal from the audio signal according to the frequency characteristics of speech.
Further, segmenting the source-language speech signal may preferably include: segmenting according to the amplitude of the source-language speech signal.
Further, after the source-language speech signal is extracted from the video and before it is segmented, the operations may preferably include: denoising the source-language speech signal.
Further, converting, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model may preferably include:
for each source-language sub-speech signal, inputting the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal, translating the source-language sub-text data into target-language sub-text data, and synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology.
Further, synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology may preferably include:
processing the target-language sub-text data into text data that a computer can understand using natural language processing technology; performing prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal; and using acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features.
Further, the time stamp of each source-language sub-speech signal is retained when the source-language speech signal is segmented, and when each source-language sub-speech signal is converted into a target-language sub-speech signal, the time stamp of the current source-language sub-speech signal is added to the converted corresponding target-language sub-speech signal;
merging the obtained target-language sub-speech signals with the video may preferably include: for each target-language sub-speech signal, merging the target-language sub-speech signal into the video at the playback position corresponding to the time stamp of that target-language sub-speech signal.
Embodiment 6
This embodiment further provides a non-volatile computer storage medium storing one or more modules which, when executed by a server, cause the server to perform the following operations:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Further, extracting the speech signal of the source language from the video may preferably include:
extracting the audio signal from the video, and extracting the source-language speech signal from the audio signal according to the frequency characteristics of speech.
Further, segmenting the source-language speech signal may preferably include: segmenting according to the amplitude of the source-language speech signal.
Further, after the source-language speech signal is extracted from the video and before it is segmented, the operations may preferably include: denoising the source-language speech signal.
Further, converting, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model may preferably include:
for each source-language sub-speech signal, inputting the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal, translating the source-language sub-text data into target-language sub-text data, and synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology.
Further, synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology may preferably include:
processing the target-language sub-text data into text data that a computer can understand using natural language processing technology; performing prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal; and using acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features.
Further, the time stamp of each source-language sub-speech signal is retained when the source-language speech signal is segmented, and when each source-language sub-speech signal is converted into a target-language sub-speech signal, the time stamp of the current source-language sub-speech signal is added to the converted corresponding target-language sub-speech signal;
merging the obtained target-language sub-speech signals with the video may preferably include: for each target-language sub-speech signal, merging the target-language sub-speech signal into the video at the playback position corresponding to the time stamp of that target-language sub-speech signal.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by means of software together with the necessary general-purpose hardware, and may of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
It should be noted that, in the above embodiments of the video voice conversion device, the units and modules included are only divided according to functional logic, but the division is not limited to the above, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only intended to distinguish them from one another and are not intended to limit the protection scope of the present invention. Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, the present invention is not limited to the above embodiments and may include other equivalent embodiments without departing from the inventive concept. The scope of the present invention is determined by the scope of the appended claims.

Claims (15)

  1. 一种视频语音转换方法,其特征在于,包括:A video voice conversion method, comprising:
    提取视频中的源语言的语音信号,将该源语言的语音信号进行分段,得到至少一段源语言的子语音信号;Extracting a voice signal of a source language in the video, and segmenting the voice signal of the source language to obtain a sub-voice signal of at least one source language;
    对于每段源语言的子语音信号,根据预先建立的语音模型将该源语言的子语音信号转换为目标语言的子语音信号;For each sub-speech signal of the source language, converting the sub-speech signal of the source language into a sub-speech signal of the target language according to a pre-established speech model;
    将得到的各段目标语言的子语音信号与所述视频进行合并,得到包含目标语言的语音信号的视频。The obtained sub-speech signals of the target language of each segment are combined with the video to obtain a video containing a speech signal of the target language.
  2. 根据权利要求1所述的方法,其特征在于,所述提取视频中的源语言的语音信号,具体包括:The method according to claim 1, wherein the extracting the voice signal of the source language in the video comprises:
    提取视频中的音频信号,根据语音信号的频率特征从所述音频信号中提取出源语言的语音信号。An audio signal in the video is extracted, and a speech signal of the source language is extracted from the audio signal according to a frequency characteristic of the speech signal.
  3. 根据权利要求1或2所述的方法,其特征在于,所述将该源语言的语音信号进行分段,具体包括:根据该源语言的语音信号的振幅进行分段。The method according to claim 1 or 2, wherein the segmenting the voice signal of the source language comprises: segmenting according to an amplitude of a voice signal of the source language.
  4. 根据权利要求1或2或3所述的方法,其特征在于,在提取视频中的源语言的语音信号之后、将该源语言的语音信号进行分段之前,进一步包括:将该源语言的语音信号进行去噪处理。The method according to claim 1 or 2 or 3, wherein after the voice signal of the source language in the video is extracted and the voice signal of the source language is segmented, the method further comprises: the voice of the source language The signal is denoised.
  5. 根据权利要求1-4中任一所述的方法,其特征在于,所述对于每段源语言的子语音信号,根据预先建立的语音模型将该源语言的子语音信号转换为目标语言的子语音信号,具体包括:The method according to any one of claims 1 to 4, wherein the sub-speech signal of each source language is converted into a sub-speech signal of the source language according to a pre-established speech model. Voice signal, including:
    对于每段源语言的子语音信号,将该段源语言的子语音信号输入预先建立的语音模型,得到该语音模型输出的与该段源语言的子语音信号对应的源语言的子文本数据,将与该段源语言的子语音信号对应的源语言的子文本数据翻译为目标语言的子文本数据,采用语音合成技术将该目标语言的子文本数据合成为目标语言的子语音信号。For each sub-speech signal of the source language, inputting the sub-speech signal of the source language into a pre-established speech model, and obtaining sub-text data of the source language corresponding to the sub-speech signal of the segment source language output by the speech model, The sub-text data of the source language corresponding to the sub-speech signal of the segment source language is translated into the sub-text data of the target language, and the sub-text data of the target language is synthesized into the sub-speech signal of the target language by using a speech synthesis technique.
  6. 根据权利要求5所述的方法,其特征在于,所述采用语音合成技术将该目标语言的子文本数据合成为目标语言的子语音信号,具体包括:The method according to claim 5, wherein the synthesizing the sub-text data of the target language into a sub-speech signal of the target language by using a speech synthesis technique comprises:
    采用自然语言处理技术将该目标语言的子文本数据处理为计算机能够理解的文本数据;对该文本数据进行韵律处理,得到合成后的子语音信号的音段特 征;采用声学处理技术,根据所述计算机能够理解的文本数据得到具有所述音段特征的目标语言的子语音信号。The natural language processing technology is used to process the sub-text data of the target language into text data that can be understood by the computer; the prosody processing is performed on the text data to obtain the segment of the synthesized sub-speech signal. Using an acoustic processing technique, a sub-speech signal of a target language having the segment characteristics is obtained according to text data that the computer can understand.
  7. The method according to any one of claims 1-6, further comprising: retaining a timestamp of each sub-speech signal of the source language when segmenting the speech signal of the source language; and, when converting each sub-speech signal of the source language into a sub-speech signal of the target language, adding the timestamp of the sub-speech signal of the current segment of the source language to the converted sub-speech signal of the corresponding target language;
    wherein the merging of the obtained sub-speech signals of the target language of each segment with the video specifically comprises: for each sub-speech signal of the target language, merging the sub-speech signal of the target language into the video at the playback position corresponding to the timestamp of that sub-speech signal of the target language.
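(Illustrative note: the timestamp-based merging of claim 7 can be pictured as writing each dubbed segment onto a silent track, as long as the original audio, at its original playback position, and then muxing that track with the video as in the earlier ffmpeg sketch. The sample rate and variable names are assumptions for the example.)

    # Sketch of claim 7: place each dubbed segment at its original timestamp.
    import numpy as np

    def assemble_dubbed_track(dubbed_segments, total_seconds: float, rate: int = 16000) -> np.ndarray:
        track = np.zeros(int(total_seconds * rate), dtype=np.float32)
        for target_audio, start_s in dubbed_segments:        # (waveform, timestamp) pairs
            begin = int(start_s * rate)
            end = min(begin + len(target_audio), len(track))
            track[begin:end] = target_audio[: end - begin]   # drop any overflow past the video end
        return track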
  8. A video voice conversion device, comprising:
    a source speech extraction unit, configured to extract a speech signal of a source language from a video;
    a source speech processing unit, configured to segment the speech signal of the source language to obtain at least one sub-speech signal of the source language;
    a target speech conversion unit, configured to convert, for each sub-speech signal of the source language, the sub-speech signal of the source language into a sub-speech signal of the target language according to a pre-established speech model; and
    a speech-video merging unit, configured to merge the obtained sub-speech signals of the target language of each segment with the video to obtain a video containing a speech signal of the target language.
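(Illustrative note: a minimal sketch of how the four units of the device of claim 8 might be wired together in code; every collaborator object and method name is a placeholder introduced only for this example.)

    # Sketch of claim 8: the four units of the device composed into one pipeline.
    class VideoVoiceConversionDevice:
        def __init__(self, extractor, processor, converter, merger):
            self.extractor = extractor    # source speech extraction unit
            self.processor = processor    # source speech processing unit (segmentation)
            self.converter = converter    # target speech conversion unit
            self.merger = merger          # speech-video merging unit

        def convert(self, video_path: str, out_path: str) -> None:
            speech = self.extractor.extract(video_path)
            segments = self.processor.segment(speech)
            dubbed = [self.converter.convert(seg) for seg in segments]
            self.merger.merge(video_path, dubbed, out_path)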
  9. The device according to claim 8, wherein the source speech extraction unit is specifically configured to:
    extract an audio signal from the video, and extract the speech signal of the source language from the audio signal according to a frequency characteristic of speech signals.
  10. The device according to claim 8 or 9, wherein the source speech processing unit is specifically configured to:
    segment the speech signal of the source language according to its amplitude.
  11. The device according to claim 8, 9 or 10, wherein the source speech processing unit is further configured to:
    perform denoising processing on the speech signal of the source language before segmenting the speech signal of the source language.
  12. The device according to any one of claims 8-11, wherein the target speech conversion unit is specifically configured to:
    for each sub-speech signal of the source language: input the sub-speech signal of the source language into a pre-established speech model to obtain sub-text data of the source language, output by the speech model, corresponding to the sub-speech signal of the source language; translate the sub-text data of the source language corresponding to the sub-speech signal of the source language into sub-text data of the target language; and synthesize the sub-text data of the target language into a sub-speech signal of the target language by using a speech synthesis technique.
  13. The device according to claim 12, wherein the target speech conversion unit is specifically configured to synthesize the sub-text data of the target language into a sub-speech signal of the target language by using a speech synthesis technique in the following manner:
    process the sub-text data of the target language into text data that a computer can understand by using natural language processing technology; perform prosody processing on the text data to obtain segment features of the synthesized sub-speech signal; and obtain, by using acoustic processing technology, a sub-speech signal of the target language having the segment features according to the text data that the computer can understand.
  14. The device according to any one of claims 8-13, wherein the source speech processing unit retains a timestamp of each sub-speech signal of the source language when segmenting the speech signal of the source language; and the target speech conversion unit, when converting each sub-speech signal of the source language into a sub-speech signal of the target language, adds the timestamp of the sub-speech signal of the current segment of the source language to the converted sub-speech signal of the corresponding target language;
    wherein the speech-video merging unit is specifically configured to: for each sub-speech signal of the target language, merge the sub-speech signal of the target language into the video at the playback position corresponding to the timestamp of that sub-speech signal of the target language.
  15. A server, comprising:
    one or more processors;
    a memory; and
    one or more modules, wherein the one or more modules are stored in the memory and, when executed by the one or more processors, perform the following operations:
    extracting a speech signal of a source language from a video, and segmenting the speech signal of the source language to obtain at least one sub-speech signal of the source language;
    for each sub-speech signal of the source language, converting the sub-speech signal of the source language into a sub-speech signal of the target language according to a pre-established speech model; and
    merging the obtained sub-speech signals of the target language of each segment with the video to obtain a video containing a speech signal of the target language.
PCT/CN2014/094217 2014-09-11 2014-12-18 Video voice conversion method and device and server WO2016037440A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410461061.8 2014-09-11
CN201410461061.8A CN104252861B (en) 2014-09-11 2014-09-11 Video speech conversion method, device and server

Publications (1)

Publication Number Publication Date
WO2016037440A1 true WO2016037440A1 (en) 2016-03-17

Family

ID=52187705

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/094217 WO2016037440A1 (en) 2014-09-11 2014-12-18 Video voice conversion method and device and server

Country Status (2)

Country Link
CN (1) CN104252861B (en)
WO (1) WO2016037440A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10652622B2 (en) 2017-06-27 2020-05-12 At&T Intellectual Property I, L.P. Method and apparatus for providing content based upon a selected language
CN111639233A (en) * 2020-05-06 2020-09-08 广东小天才科技有限公司 Learning video subtitle adding method and device, terminal equipment and storage medium
CN114630179A (en) * 2022-03-17 2022-06-14 维沃移动通信有限公司 Audio extraction method and electronic equipment
KR102440890B1 (en) * 2021-03-05 2022-09-06 주식회사 한글과컴퓨터 Video automatic dubbing apparatus that automatically dubs the video dubbed with the voice of the first language to the voice of the second language and operating method thereof
US11582527B2 (en) 2018-02-26 2023-02-14 Google Llc Automated voice translation dubbing for prerecorded video
CN111639233B (en) * 2020-05-06 2024-05-17 广东小天才科技有限公司 Learning video subtitle adding method, device, terminal equipment and storage medium

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159870B (en) * 2015-06-26 2018-06-29 徐信 A kind of accurate processing system and method for completing continuous natural-sounding textual
CN105828101B (en) * 2016-03-29 2019-03-08 北京小米移动软件有限公司 Generate the method and device of subtitle file
CN106328176B (en) * 2016-08-15 2019-04-30 广州酷狗计算机科技有限公司 A kind of method and apparatus generating song audio
CN106649295A (en) * 2017-01-04 2017-05-10 携程旅游网络技术(上海)有限公司 Text translation method for mobile terminal
CN107241616B (en) * 2017-06-09 2018-10-26 腾讯科技(深圳)有限公司 video lines extracting method, device and storage medium
CN107688792B (en) * 2017-09-05 2020-06-05 语联网(武汉)信息技术有限公司 Video translation method and system
CN108090051A (en) * 2017-12-20 2018-05-29 深圳市沃特沃德股份有限公司 The interpretation method and translator of continuous long voice document
CN108447486B (en) * 2018-02-28 2021-12-03 科大讯飞股份有限公司 Voice translation method and device
CN109119063B (en) * 2018-08-31 2019-11-22 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN109325147A (en) * 2018-09-30 2019-02-12 联想(北京)有限公司 A kind of information processing method and device
CN110119514A (en) * 2019-04-02 2019-08-13 杭州灵沃盛智能科技有限公司 The instant translation method of information, device and system
CN110232907B (en) * 2019-07-24 2021-11-02 出门问问(苏州)信息科技有限公司 Voice synthesis method and device, readable storage medium and computing equipment
CN110534085B (en) * 2019-08-29 2022-02-25 北京百度网讯科技有限公司 Method and apparatus for generating information
CN110659387A (en) * 2019-09-20 2020-01-07 上海掌门科技有限公司 Method and apparatus for providing video
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN117560459B (en) * 2024-01-11 2024-04-16 深圳市志泽科技有限公司 Audio/video conversion method based on conversion wire

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000358202A (en) * 1999-06-16 2000-12-26 Toshiba Corp Video audio recording and reproducing device and method for generating and recording sub audio data for the device
US20030216922A1 (en) * 2002-05-20 2003-11-20 International Business Machines Corporation Method and apparatus for performing real-time subtitles translation
CN1774715A (en) * 2003-04-14 2006-05-17 皇家飞利浦电子股份有限公司 System and method for performing automatic dubbing on an audio-visual stream
CN201319640Y (en) * 2008-12-01 2009-09-30 深圳市同洲电子股份有限公司 Digital television receiving terminal capable of synchronously translating in real time
CN202026434U (en) * 2011-04-29 2011-11-02 广东九联科技股份有限公司 Voice conversion STB (set top box)
CN103854648A (en) * 2012-12-08 2014-06-11 上海能感物联网有限公司 Chinese and foreign language voiced image data bidirectional reversible voice converting and subtitle labeling method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4087400B2 (en) * 2005-09-15 2008-05-21 株式会社東芝 Spoken dialogue translation apparatus, spoken dialogue translation method, and spoken dialogue translation program
CN102903361A (en) * 2012-10-15 2013-01-30 Itp创新科技有限公司 Instant call translation system and instant call translation method

Also Published As

Publication number Publication date
CN104252861A (en) 2014-12-31
CN104252861B (en) 2018-04-13

Similar Documents

Publication Publication Date Title
WO2016037440A1 (en) Video voice conversion method and device and server
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
CN109119063B (en) Video dubs generation method, device, equipment and storage medium
KR102598824B1 (en) Automated voice translation dubbing for prerecorded videos
CN110517689B (en) Voice data processing method, device and storage medium
US10991380B2 (en) Generating visual closed caption for sign language
CN105704538A (en) Method and system for generating audio and video subtitles
US11545134B1 (en) Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy
WO2018120821A1 (en) Method and device for producing presentation
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
JP2012181358A (en) Text display time determination device, text display system, method, and program
Song et al. Talking face generation with multilingual tts
US9666211B2 (en) Information processing apparatus, information processing method, display control apparatus, and display control method
CN112908292A (en) Text voice synthesis method and device, electronic equipment and storage medium
WO2018120820A1 (en) Presentation production method and apparatus
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN112233661B (en) Video content subtitle generation method, system and equipment based on voice recognition
Sun et al. Pre-avatar: An automatic presentation generation framework leveraging talking avatar
CN113948062A (en) Data conversion method and computer storage medium
US11182417B1 (en) Method and system for facilitating conversion of content based on user preferences
JP2016024378A (en) Information processor, control method and program thereof
Remael et al. From Translation Studies and audiovisual translation to media accessibility
Jiang SDW-ASL: A Dynamic System to Generate Large Scale Dataset for Continuous American Sign Language
US11636131B1 (en) Methods and systems for facilitating conversion of content for transfer and storage of content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14901806

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14901806

Country of ref document: EP

Kind code of ref document: A1