WO2016037440A1 - Video voice conversion method and device and server


Info

Publication number
WO2016037440A1
Authority
WO
WIPO (PCT)
Application number
PCT/CN2014/094217
Other languages
French (fr)
Chinese (zh)
Inventor
秦铎浩
沈国龙
Original Assignee
Baidu Online Network Technology (Beijing) Co., Ltd.
Application filed by Baidu Online Network Technology (Beijing) Co., Ltd.
Publication of WO2016037440A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00 Speech synthesis; Text to speech systems
    • G10L13/02 Methods for producing synthetic speech; Speech synthesisers
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/26 Speech to text systems
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00 Details of television systems
    • H04N5/222 Studio circuitry; Studio devices; Studio equipment
    • H04N5/262 Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects; Cameras specially adapted for the electronic generation of special effects
    • H04N5/278 Subtitling


Abstract

A video voice conversion method, device, and server relate to the technical field of multimedia processing and are used to reduce the cost of translating the speech in a video and to improve translation efficiency and accuracy. The method comprises: extracting a speech signal of a source language from a video and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal (101); for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model (102); and merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language (103).

Description

Video voice conversion method, device and server
This application claims priority to Chinese Patent Application No. 201410461061.8, filed on September 11, 2014 by Baidu Online Network Technology (Beijing) Co., Ltd. and entitled "Video Voice Conversion Method, Device and Server", the entire contents of which are incorporated herein by reference.
Technical Field
Embodiments of the present invention relate to the field of multimedia processing technologies, and in particular to a video voice conversion method, device, and server.
Background
Foreign-language videos, such as Hollywood movies and foreign-language tutorial videos, are frequently encountered in daily life. Viewers who are not proficient in the foreign language need auxiliary translated subtitles when watching such videos, but in many cases foreign-language videos have no subtitles; if the viewer cannot understand the foreign language, such a video is of no use to the viewer.
In the prior art, three approaches are mainly used to make foreign-language videos understandable: the first is to add manually translated subtitles to the foreign-language video in advance; the second is to produce a dubbed version of the video, in which the speech is manually dubbed into the national language; the third is to have simultaneous-interpretation experts manually translate the speech in the video in real time at the playback venue, using shorthand or similar techniques, and convey the translation result.
The drawback of the prior art is that all three approaches rely on manual translation and conversion of the speech, which is costly and inefficient, and whose accuracy is difficult to guarantee.
Summary
The present invention provides a video voice conversion method, device, and server, so as to reduce the cost of translating the speech in a video and improve translation efficiency and accuracy.
In a first aspect, an embodiment of the present invention provides a video voice conversion method, including:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
In a second aspect, an embodiment of the present invention further provides a video voice conversion device, including:
a source speech extraction unit, configured to extract a speech signal of a source language from a video;
a source speech processing unit, configured to segment the source-language speech signal to obtain at least one source-language sub-speech signal;
a target speech conversion unit, configured to convert, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
a speech-video merging unit, configured to merge each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
In a third aspect, an embodiment of the present invention further provides a server, including:
one or more processors;
a memory; and
one or more modules stored in the memory which, when executed by the one or more processors, perform the following operations:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
In the embodiments of the present invention, a speech signal of a source language is extracted from a video and segmented to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, the source-language sub-speech signal is converted into a target-language sub-speech signal according to a pre-established speech model; and the obtained target-language sub-speech signals are then merged with the video to obtain a video containing a speech signal of the target language. The solution thus uses a speech model to automatically translate and convert the speech signal in a video without manual involvement, which reduces cost and improves conversion efficiency, and it avoids the low accuracy of manual translation, so that the accuracy of the result can be better guaranteed.
Brief Description of the Drawings
FIG. 1A is a schematic flowchart of a video voice conversion method according to Embodiment 1 of the present invention;
FIG. 1B is a schematic diagram of segmenting a source-language speech signal according to Embodiment 1 of the present invention;
FIG. 2A is a schematic flowchart of a video voice conversion method according to Embodiment 2 of the present invention;
FIG. 2B is a schematic diagram of an interface for a user to select a target language type according to Embodiment 2 of the present invention;
FIG. 3 is a schematic flowchart of a video voice conversion method according to Embodiment 3 of the present invention;
FIG. 4 is a schematic structural diagram of a video voice conversion device according to Embodiment 4 of the present invention;
FIG. 5 is a schematic diagram of the hardware structure of a server according to Embodiment 5 of the present invention.
Detailed Description
The present invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely intended to explain the present invention and not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.
Embodiment 1
FIG. 1A is a flowchart of a video voice conversion method according to Embodiment 1 of the present invention, and FIG. 1B is a schematic diagram of segmenting a source-language speech signal according to Embodiment 1. This embodiment is applicable to cases where a speech signal of a source language in a video needs to be converted into a speech signal of a target language. The method may be performed by a video voice conversion device, which may be provided in a server. The method specifically includes the following operations:
101: extract a speech signal of a source language from the video, and segment the source-language speech signal to obtain at least one source-language sub-speech signal.
Here, when the source-language speech signal in the video is long, segmenting it according to a given method may yield multiple source-language sub-speech signals; when the source-language speech signal in the video is short, segmenting it may yield only one source-language sub-speech signal.
102: for each source-language sub-speech signal, convert the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model.
103: merge each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Specifically, extracting the source-language speech signal from the video in operation 101 may be implemented as follows:
extract the audio signal from the video, and extract the source-language speech signal from the audio signal according to the frequency characteristics of speech. For example, the frequency information of the extracted audio signal is obtained first, and the audio components with frequencies in the range of 300 to 3400 Hz are then extracted as the speech signal.
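The patent does not prescribe how the frequency-based extraction of operation 101 is implemented. As one illustration only, the following Python sketch assumes the video's audio track has already been demultiplexed into a mono NumPy array sampled at fs, and uses a Butterworth band-pass filter to retain the 300-3400 Hz speech band; the filter type and order are assumptions, not part of the disclosure.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def extract_speech_band(audio: np.ndarray, fs: int,
                        low_hz: float = 300.0, high_hz: float = 3400.0) -> np.ndarray:
    """Keep only the 300-3400 Hz band of a mono audio track, approximating
    the frequency-based speech extraction described for operation 101."""
    sos = butter(4, [low_hz, high_hz], btype="bandpass", fs=fs, output="sos")
    return sosfiltfilt(sos, audio)
```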
Specifically, segmenting the source-language speech signal in operation 101 may be implemented as follows: segment according to the amplitude of the source-language speech signal. For example, the signal between every two time points at which the amplitude is 0 may be divided into one sub-speech signal; as shown in FIG. 1B, the signal between time point 00:01 and time point 00:03:73 is divided into one sub-speech signal. A specific implementation flow may be as follows:
A. find, in the source-language speech signal, the time point of the first signal whose amplitude is 0, and take that time point as the start time point t0;
B. find, in the source-language speech signal, the time point of the first signal whose amplitude is 0 after the current start time point t0, and take that time point as the end time point t1;
C. divide the speech signal between the current start time point t0 and the end time point t1 into one sub-speech signal;
D. determine whether any speech signal remains; if so, continue to find the time point of the first signal whose amplitude is 0 after the current end time point t1 in the source-language speech signal, take that time point as the new start time point t0, and return to step B; otherwise, the flow ends.
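Steps A to D above can be expressed compactly as a loop over the zero-amplitude time points. The sketch below is an assumption-laden illustration: sampled audio rarely contains exact zeros, so a small tolerance eps is introduced here, which the patent itself does not mention.

```python
import numpy as np


def segment_by_zero_amplitude(speech: np.ndarray, fs: int, eps: float = 1e-4):
    """Pair successive zero-amplitude time points (t0, t1) as in steps A-D and
    cut the samples between each pair into one sub-speech signal.
    Returns a list of (start_seconds, end_seconds, samples) tuples."""
    zeros = np.flatnonzero(np.abs(speech) <= eps)  # time points with "amplitude 0"
    segments = []
    i = 0
    while i + 1 < len(zeros):            # step D: loop while signal remains
        t0, t1 = zeros[i], zeros[i + 1]  # steps A/B: start and end time points
        if t1 > t0 + 1:                  # step C: keep the span between t0 and t1
            segments.append((t0 / fs, t1 / fs, speech[t0:t1]))
        i += 2                           # step D: the next zero after t1 becomes t0
    return segments
```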
Preferably, in order to extract a speech signal that is as clean as possible from a noisy speech signal and thereby improve the accuracy of the translation and conversion, after the source-language speech signal is extracted from the video in operation 101 and before it is segmented, the method further includes: denoising the source-language speech signal. Specifically, the denoising may be implemented by a speech enhancement algorithm, including but not limited to: speech enhancement based on spectral subtraction, speech enhancement based on wavelet analysis, speech enhancement based on independent component analysis, speech enhancement based on neural networks, and the like.
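The patent only names families of speech enhancement algorithms. As a minimal sketch of the first of them, the following spectral-subtraction routine estimates the noise magnitude spectrum from the opening portion of the signal and subtracts it frame by frame; the assumption that the first half second is speech-free, as well as the frame length, are illustrative choices not taken from the disclosure.

```python
import numpy as np


def spectral_subtraction(noisy: np.ndarray, fs: int,
                         noise_secs: float = 0.5, frame: int = 512) -> np.ndarray:
    """Subtract an estimated noise magnitude spectrum from each frame,
    keeping the noisy phase (a bare-bones spectral-subtraction denoiser)."""
    noise = noisy[: int(noise_secs * fs)]
    noise_frames = [noise[i:i + frame]
                    for i in range(0, len(noise) - frame + 1, frame)]
    noise_mag = np.mean([np.abs(np.fft.rfft(f)) for f in noise_frames], axis=0)

    out = noisy.astype(np.float64).copy()
    for start in range(0, len(noisy) - frame + 1, frame):
        spec = np.fft.rfft(noisy[start:start + frame])
        mag = np.maximum(np.abs(spec) - noise_mag, 0.0)  # floor at zero
        out[start:start + frame] = np.fft.irfft(mag * np.exp(1j * np.angle(spec)), n=frame)
    return out
```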
Specifically, converting each source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model in operation 102 may be implemented as follows:
for each source-language sub-speech signal, input the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal; translate the source-language sub-text data into target-language sub-text data; and synthesize the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology. For example, when the source language is English and the target language is Chinese, for each English sub-speech signal, the English sub-speech signal is input into the pre-established speech model to obtain the English sub-text data (English characters) corresponding to that sub-speech signal, the English sub-text data is translated into Chinese sub-text data (Chinese characters), and the Chinese sub-text data is synthesized into a Chinese sub-speech signal using speech synthesis technology.
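The per-segment conversion of operation 102 amounts to a recognize-translate-synthesize chain. In the sketch below, recognize, translate, and synthesize are placeholder callables standing in for the pre-established speech model, the translation step, and the speech synthesiser respectively; the patent does not name any concrete engine, so these interfaces are assumptions for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, List

import numpy as np


@dataclass
class SubSpeech:
    start: float         # segment start time in seconds (retained timestamp)
    end: float           # segment end time in seconds
    samples: np.ndarray  # waveform of the sub-speech signal


def convert_segments(source_segments: List[SubSpeech],
                     recognize: Callable[[np.ndarray], str],   # speech model: audio -> source sub-text
                     translate: Callable[[str], str],          # source sub-text -> target sub-text
                     synthesize: Callable[[str], np.ndarray],  # target sub-text -> target sub-speech
                     ) -> List[SubSpeech]:
    """Operation 102 applied per segment, carrying each segment's timestamp
    over to the converted target-language sub-speech signal."""
    converted = []
    for seg in source_segments:
        source_text = recognize(seg.samples)    # e.g. English characters
        target_text = translate(source_text)    # e.g. Chinese characters
        target_audio = synthesize(target_text)  # target-language sub-speech signal
        converted.append(SubSpeech(seg.start, seg.end, target_audio))
    return converted
```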
The above speech model is a data model obtained through prior data training and used to obtain, from an input speech signal, the text data corresponding to that speech signal. Preferably, speech models may be generated in advance for different domains, for example separately for the military, science and technology, and arts domains; correspondingly, the speech model used in operation 102 may be the model corresponding to the domain to which the current video belongs, thereby improving the accuracy of the resulting text data. For example, if the current video belongs to the military domain, the speech model for the military domain is used; if the current video belongs to the technology domain, the speech model for the technology domain is used; and so on.
Specifically, synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology may be implemented as follows:
process the target-language sub-text data into text data that a computer can understand using natural language processing technology, which may include natural language processing steps such as text normalization, word segmentation, syntactic analysis, and semantic analysis; then perform prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal, the features including at least one of pitch, duration, and intensity, so that the synthesized sub-speech signal can express the intended meaning correctly; finally, use acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features. For example, the acoustic processing technology may be LPC (linear predictive coding), PSOLA (pitch-synchronous overlap-add) synthesis, speech synthesis based on the LMA vocal tract model, and the like.
Further, when the source-language speech signal is segmented in operation 101, the time stamp (including the start time and the end time) of each source-language sub-speech signal is retained, so that each target-language sub-speech signal obtained by the conversion in operation 102 also carries the time stamp of the corresponding source-language sub-speech signal. Correspondingly, merging the obtained target-language sub-speech signals with the video in operation 103 may be implemented as follows: for each target-language sub-speech signal, merge the target-language sub-speech signal into the video at the playback position corresponding to its time stamp. For example, suppose there are three target-language sub-speech signals, whose time stamps are 00:10:00-00:20:00, 00:30:00-00:40:00, and 00:50:00-00:60:00 respectively; then the first target-language sub-speech signal is merged into the video at playback position 00:10:00-00:20:00, the second at playback position 00:30:00-00:40:00, and the third at playback position 00:50:00-00:60:00.
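One way to realize the timestamp-based merge of operation 103 is to lay the converted sub-speech signals onto a silent audio track of the same length as the video and then remultiplex that track with the video stream. The sketch below only builds the audio track; the multiplexing step and the (start_seconds, samples) representation of the converted segments are assumptions made for illustration.

```python
from typing import List, Tuple

import numpy as np


def build_target_track(converted: List[Tuple[float, np.ndarray]],
                       fs: int, video_secs: float) -> np.ndarray:
    """Place each (start_seconds, samples) target-language sub-speech signal
    on a silent track at the playback position given by its retained timestamp."""
    track = np.zeros(int(round(video_secs * fs)), dtype=np.float32)
    for start_sec, samples in converted:
        start = int(round(start_sec * fs))
        clip = samples[: max(0, len(track) - start)]  # drop anything past the video end
        track[start:start + len(clip)] = clip
    return track
```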
In the technical solution of this embodiment, a speech signal of a source language is extracted from a video and segmented to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, the source-language sub-speech signal is converted into a target-language sub-speech signal according to a pre-established speech model; and the obtained target-language sub-speech signals are then merged with the video to obtain a video containing a speech signal of the target language. The solution thus uses a speech model to automatically translate and convert the speech signal in a video without manual involvement, which reduces cost and improves conversion efficiency, and it avoids the low accuracy of manual translation, so that the accuracy of the result can be better guaranteed.
Embodiment 2
FIG. 2A shows a video voice conversion method according to Embodiment 2 of the present invention, and FIG. 2B is a schematic diagram of an interface for a user to select a target language type in Embodiment 2. This embodiment is applicable to cases where the speech signal of the source language in a video is converted into a speech signal of a target language before the video is played. The method may be performed by a video voice conversion device and a video playback device, which may be provided in the same server or in different servers. The method specifically includes the following operations:
201: the video voice conversion device determines, according to setting information, at least one target language to be converted.
202: for each target language to be converted, the video voice conversion device performs the following operations: extract the speech signal of the source language from the video, and segment the source-language speech signal to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, convert the source-language sub-speech signal into a sub-speech signal of the current target language according to a pre-established speech model; merge the obtained sub-speech signals of the current target language with the video to obtain a video containing a speech signal of the current target language, and store the video.
For this operation, reference may be made to the detailed description of Embodiment 1, which is not repeated here.
203: the video playback device receives a video play request, the play request containing a target language type selected by the user or selected automatically.
For an example of the user selecting the target language type, see FIG. 2B: the user may select Mandarin or Sichuan dialect as the target language type in a "simultaneous interpretation" menu.
204: the video playback device obtains, from the video voice conversion device, the video containing the speech signal of the target language corresponding to the target language type in the play request, and sends the obtained video to the terminal device for playback.
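Operations 201 to 204 can be summarized as a pre-conversion loop plus a lookup at play time. In this sketch, convert_video is a hypothetical callable that runs the Embodiment 1 pipeline for one target language and returns the path of the stored result; nothing about the storage layout or request handling is specified in the patent itself.

```python
from typing import Callable, Dict, Iterable


def preconvert(video_id: str, target_languages: Iterable[str],
               convert_video: Callable[[str, str], str]) -> Dict[str, str]:
    """Operations 201/202: run the Embodiment 1 conversion once per configured
    target language and remember where each converted video is stored."""
    return {lang: convert_video(video_id, lang) for lang in target_languages}


def handle_play_request(store: Dict[str, str], requested_language: str) -> str:
    """Operations 203/204: look up the pre-converted video matching the target
    language type in the play request and hand it to the playback device."""
    return store[requested_language]
```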
In the technical solution of this embodiment, before the video is played, for each preset target language, the speech signal of the source language in the video is converted into a speech signal of that target language according to the method of Embodiment 1, so as to obtain a video containing the speech signal of that target language; when a play request containing a user-selected or automatically selected target language type is received, the video containing the speech signal of the target language corresponding to the target language type in the play request is obtained and played. This solution can therefore satisfy the need to play the same video in different languages, and because the translation and conversion of the speech signal in the video is completed before playback, the user does not need to wait for the conversion after submitting the play request, so the system responds to play requests quickly and the user experience is good.
Embodiment 3
FIG. 3 shows a video voice conversion method according to Embodiment 3 of the present invention. This embodiment is applicable to cases where the speech signal of the source language in a video is converted into a speech signal of the target language in real time after a play request is received. The method may be performed by a video voice conversion device and a video playback device, which may be provided in the same server or in different servers. The method specifically includes the following operations:
301: the video playback device receives a video play request, the play request containing a target language type selected by the user or selected automatically.
For an example of the user selecting the target language type, see FIG. 2B: the user may select Mandarin or Sichuan dialect as the target language type in a "simultaneous interpretation" menu.
302: the video voice conversion device performs the following operations: extract the speech signal of the source language from the video, and segment the source-language speech signal to obtain at least one source-language sub-speech signal; for each source-language sub-speech signal, convert the source-language sub-speech signal into a sub-speech signal of the target language corresponding to the target language type in the video play request, according to a pre-established speech model; merge the obtained target-language sub-speech signals with the video to obtain a video containing the speech signal of that target language.
For this operation, reference may be made to the detailed description of Embodiment 1, which is not repeated here.
303: the video playback device sends the video containing the speech signal of the target language, obtained by the video voice conversion device, to the terminal device for playback.
In the technical solution of this embodiment, after a video play request is received, the speech signal of the source language in the video is converted into the speech signal of the target language indicated by the video play request according to the method of Embodiment 1, so as to obtain a video containing the speech signal of the target language, which is then played. This solution can therefore satisfy the need to play the same video in different languages, and because the translation and conversion of the speech signal in the video is performed when the play request is received, there is no need to perform the translation and conversion for different target languages in advance or to store the resulting videos, which saves system resources.
Embodiment 4
FIG. 4 is a schematic structural diagram of a video voice conversion device according to Embodiment 4 of the present invention. Specifically, the device includes:
a source speech extraction unit 401, configured to extract a speech signal of a source language from a video;
a source speech processing unit 402, configured to segment the source-language speech signal to obtain at least one source-language sub-speech signal;
a target speech conversion unit 403, configured to convert, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
a speech-video merging unit 404, configured to merge each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Further, the source speech extraction unit 401 is specifically configured to:
extract the audio signal from the video, and extract the source-language speech signal from the audio signal according to the frequency characteristics of speech.
Further, the source speech processing unit 402 is specifically configured to:
segment according to the amplitude of the source-language speech signal.
Further, the source speech processing unit 402 is further configured to:
denoise the source-language speech signal before segmenting it.
Further, the target speech conversion unit 403 is specifically configured to:
for each source-language sub-speech signal, input the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal, translate the source-language sub-text data into target-language sub-text data, and synthesize the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology.
Further, the target speech conversion unit 403 is specifically configured to synthesize the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology in the following manner:
process the target-language sub-text data into text data that a computer can understand using natural language processing technology; perform prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal; and use acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features.
Further, the source speech processing unit 402 retains the time stamp of each source-language sub-speech signal when segmenting the source-language speech signal;
the speech-video merging unit 404 is specifically configured to: for each target-language sub-speech signal, merge the target-language sub-speech signal into the video at the playback position corresponding to the time stamp of that target-language sub-speech signal.
The above video voice conversion device can execute the video voice conversion method provided by the embodiments of the present invention, and has the functional modules and beneficial effects corresponding to the execution of the method.
An embodiment of the present invention further provides a server that includes the above video voice conversion device. The server may specifically be a PC (personal computer), a laptop computer, or a similar device.
Embodiment 5
Referring to FIG. 5, a schematic diagram of the hardware structure of a server according to Embodiment 5 of the present invention, the server includes:
one or more processors 510, with one processor 510 taken as an example in FIG. 5;
a memory 520; and one or more modules stored in the memory 520, such as the source speech extraction unit 401, the source speech processing unit 402, the target speech conversion unit 403, and the speech-video merging unit 404 in FIG. 4.
The server may further include an input device 530 and an output device 540. The processor 510, the memory 520, the input device 530, and the output device 540 in the server may be connected by a bus or in other ways; connection by a bus is taken as an example in FIG. 5.
As a computer-readable storage medium, the memory 520 may be used to store software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the video voice conversion method in the embodiments of the present invention (for example, the source speech extraction unit 401, the source speech processing unit 402, the target speech conversion unit 403, and the speech-video merging unit 404 shown in FIG. 4). The processor 510 executes the various functional applications and data processing of the server by running the software programs, instructions, and modules stored in the memory 520, that is, implements the video voice conversion method in the above method embodiments.
The memory 520 may include a program storage area and a data storage area, where the program storage area may store an operating system and an application required by at least one function, and the data storage area may store data created according to the use of the terminal device, and the like. In addition, the memory 520 may include a high-speed random access memory and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device. In some examples, the memory 520 may further include memory remotely located relative to the processor 510, and such remote memory may be connected to the terminal device through a network. Examples of such networks include, but are not limited to, the Internet, an intranet, a local area network, a mobile communication network, and combinations thereof.
The input device 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal. The output device 540 may include a display device such as a display screen.
When the one or more modules stored in the memory 520 are executed by the one or more processors 510, the following operations are performed:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Further, extracting the speech signal of the source language from the video may preferably include:
extracting the audio signal from the video, and extracting the source-language speech signal from the audio signal according to the frequency characteristics of speech.
Further, segmenting the source-language speech signal may preferably include: segmenting according to the amplitude of the source-language speech signal.
Further, after the source-language speech signal is extracted from the video and before it is segmented, the operations may preferably include: denoising the source-language speech signal.
Further, converting, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model may preferably include:
for each source-language sub-speech signal, inputting the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal, translating the source-language sub-text data into target-language sub-text data, and synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology.
Further, synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology may preferably include:
processing the target-language sub-text data into text data that a computer can understand using natural language processing technology; performing prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal; and using acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features.
Further, the time stamp of each source-language sub-speech signal is retained when the source-language speech signal is segmented, and when each source-language sub-speech signal is converted into a target-language sub-speech signal, the time stamp of the current source-language sub-speech signal is added to the converted corresponding target-language sub-speech signal;
merging the obtained target-language sub-speech signals with the video may preferably include: for each target-language sub-speech signal, merging the target-language sub-speech signal into the video at the playback position corresponding to the time stamp of that target-language sub-speech signal.
Embodiment 6
This embodiment further provides a non-volatile computer storage medium storing one or more modules which, when executed by a server, cause the server to perform the following operations:
extracting a speech signal of a source language from a video, and segmenting the source-language speech signal to obtain at least one source-language sub-speech signal;
for each source-language sub-speech signal, converting the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model; and
merging each obtained target-language sub-speech signal with the video to obtain a video containing a speech signal of the target language.
Further, extracting the speech signal of the source language from the video may preferably include:
extracting the audio signal from the video, and extracting the source-language speech signal from the audio signal according to the frequency characteristics of speech.
Further, segmenting the source-language speech signal may preferably include: segmenting according to the amplitude of the source-language speech signal.
Further, after the source-language speech signal is extracted from the video and before it is segmented, the operations may preferably include: denoising the source-language speech signal.
Further, converting, for each source-language sub-speech signal, the source-language sub-speech signal into a target-language sub-speech signal according to a pre-established speech model may preferably include:
for each source-language sub-speech signal, inputting the source-language sub-speech signal into the pre-established speech model to obtain the source-language sub-text data output by the speech model and corresponding to that sub-speech signal, translating the source-language sub-text data into target-language sub-text data, and synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology.
Further, synthesizing the target-language sub-text data into a target-language sub-speech signal using speech synthesis technology may preferably include:
processing the target-language sub-text data into text data that a computer can understand using natural language processing technology; performing prosody processing on the text data to obtain the segmental features of the synthesized sub-speech signal; and using acoustic processing technology to obtain, from the text data that the computer can understand, a target-language sub-speech signal having the segmental features.
Further, the time stamp of each source-language sub-speech signal is retained when the source-language speech signal is segmented, and when each source-language sub-speech signal is converted into a target-language sub-speech signal, the time stamp of the current source-language sub-speech signal is added to the converted corresponding target-language sub-speech signal;
merging the obtained target-language sub-speech signals with the video may preferably include: for each target-language sub-speech signal, merging the target-language sub-speech signal into the video at the playback position corresponding to the time stamp of that target-language sub-speech signal.
Through the above description of the embodiments, those skilled in the art can clearly understand that the present invention may be implemented by means of software together with the necessary general-purpose hardware, and may of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product may be stored in a computer-readable storage medium, such as a computer floppy disk, a read-only memory (ROM), a random access memory (RAM), a flash memory (FLASH), a hard disk, or an optical disk, and includes a number of instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform the methods described in the embodiments of the present invention.
It should be noted that, in the above embodiments of the video voice conversion device, the units and modules included are only divided according to functional logic, but the division is not limited to the above, as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only intended to distinguish them from one another and are not intended to limit the protection scope of the present invention. Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described herein, and that various obvious changes, readjustments, and substitutions can be made without departing from the protection scope of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, the present invention is not limited to the above embodiments and may include other equivalent embodiments without departing from the inventive concept. The scope of the present invention is determined by the scope of the appended claims.

Claims (15)

  1. 一种视频语音转换方法,其特征在于,包括:A video voice conversion method, comprising:
    提取视频中的源语言的语音信号,将该源语言的语音信号进行分段,得到至少一段源语言的子语音信号;Extracting a voice signal of a source language in the video, and segmenting the voice signal of the source language to obtain a sub-voice signal of at least one source language;
    对于每段源语言的子语音信号,根据预先建立的语音模型将该源语言的子语音信号转换为目标语言的子语音信号;For each sub-speech signal of the source language, converting the sub-speech signal of the source language into a sub-speech signal of the target language according to a pre-established speech model;
    将得到的各段目标语言的子语音信号与所述视频进行合并,得到包含目标语言的语音信号的视频。The obtained sub-speech signals of the target language of each segment are combined with the video to obtain a video containing a speech signal of the target language.
  2. 根据权利要求1所述的方法,其特征在于,所述提取视频中的源语言的语音信号,具体包括:The method according to claim 1, wherein the extracting the voice signal of the source language in the video comprises:
    提取视频中的音频信号,根据语音信号的频率特征从所述音频信号中提取出源语言的语音信号。An audio signal in the video is extracted, and a speech signal of the source language is extracted from the audio signal according to a frequency characteristic of the speech signal.
  3. 根据权利要求1或2所述的方法,其特征在于,所述将该源语言的语音信号进行分段,具体包括:根据该源语言的语音信号的振幅进行分段。The method according to claim 1 or 2, wherein the segmenting the voice signal of the source language comprises: segmenting according to an amplitude of a voice signal of the source language.
  4. 根据权利要求1或2或3所述的方法,其特征在于,在提取视频中的源语言的语音信号之后、将该源语言的语音信号进行分段之前,进一步包括:将该源语言的语音信号进行去噪处理。The method according to claim 1 or 2 or 3, wherein after the voice signal of the source language in the video is extracted and the voice signal of the source language is segmented, the method further comprises: the voice of the source language The signal is denoised.
  5. 根据权利要求1-4中任一所述的方法,其特征在于,所述对于每段源语言的子语音信号,根据预先建立的语音模型将该源语言的子语音信号转换为目标语言的子语音信号,具体包括:The method according to any one of claims 1 to 4, wherein the sub-speech signal of each source language is converted into a sub-speech signal of the source language according to a pre-established speech model. Voice signal, including:
    对于每段源语言的子语音信号,将该段源语言的子语音信号输入预先建立的语音模型,得到该语音模型输出的与该段源语言的子语音信号对应的源语言的子文本数据,将与该段源语言的子语音信号对应的源语言的子文本数据翻译为目标语言的子文本数据,采用语音合成技术将该目标语言的子文本数据合成为目标语言的子语音信号。For each sub-speech signal of the source language, inputting the sub-speech signal of the source language into a pre-established speech model, and obtaining sub-text data of the source language corresponding to the sub-speech signal of the segment source language output by the speech model, The sub-text data of the source language corresponding to the sub-speech signal of the segment source language is translated into the sub-text data of the target language, and the sub-text data of the target language is synthesized into the sub-speech signal of the target language by using a speech synthesis technique.
  6. 根据权利要求5所述的方法,其特征在于,所述采用语音合成技术将该目标语言的子文本数据合成为目标语言的子语音信号,具体包括:The method according to claim 5, wherein the synthesizing the sub-text data of the target language into a sub-speech signal of the target language by using a speech synthesis technique comprises:
    采用自然语言处理技术将该目标语言的子文本数据处理为计算机能够理解的文本数据;对该文本数据进行韵律处理,得到合成后的子语音信号的音段特 征;采用声学处理技术,根据所述计算机能够理解的文本数据得到具有所述音段特征的目标语言的子语音信号。The natural language processing technology is used to process the sub-text data of the target language into text data that can be understood by the computer; the prosody processing is performed on the text data to obtain the segment of the synthesized sub-speech signal. Using an acoustic processing technique, a sub-speech signal of a target language having the segment characteristics is obtained according to text data that the computer can understand.
  7. The method according to any one of claims 1-6, further comprising: retaining a timestamp of each sub-speech signal of the source language when segmenting the speech signal of the source language; and, when converting each sub-speech signal of the source language into a sub-speech signal of the target language, adding the timestamp of the sub-speech signal of the current segment of the source language to the converted sub-speech signal of the corresponding target language;
    wherein the merging of the obtained sub-speech signals of the target language of each segment with the video specifically comprises: for each sub-speech signal of the target language, merging the sub-speech signal of the target language into the video at the playback position corresponding to the timestamp of that sub-speech signal of the target language.
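(Illustrative note: the timestamp-based merging of claim 7 can be pictured as writing each dubbed segment onto a silent track, as long as the original audio, at its original playback position, and then muxing that track with the video as in the earlier ffmpeg sketch. The sample rate and variable names are assumptions for the example.)

    # Sketch of claim 7: place each dubbed segment at its original timestamp.
    import numpy as np

    def assemble_dubbed_track(dubbed_segments, total_seconds: float, rate: int = 16000) -> np.ndarray:
        track = np.zeros(int(total_seconds * rate), dtype=np.float32)
        for target_audio, start_s in dubbed_segments:        # (waveform, timestamp) pairs
            begin = int(start_s * rate)
            end = min(begin + len(target_audio), len(track))
            track[begin:end] = target_audio[: end - begin]   # drop any overflow past the video end
        return track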
  8. A video voice conversion device, comprising:
    a source speech extraction unit, configured to extract a speech signal of a source language from a video;
    a source speech processing unit, configured to segment the speech signal of the source language to obtain at least one sub-speech signal of the source language;
    a target speech conversion unit, configured to convert, for each sub-speech signal of the source language, the sub-speech signal of the source language into a sub-speech signal of the target language according to a pre-established speech model; and
    a speech-video merging unit, configured to merge the obtained sub-speech signals of the target language of each segment with the video to obtain a video containing a speech signal of the target language.
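(Illustrative note: a minimal sketch of how the four units of the device of claim 8 might be wired together in code; every collaborator object and method name is a placeholder introduced only for this example.)

    # Sketch of claim 8: the four units of the device composed into one pipeline.
    class VideoVoiceConversionDevice:
        def __init__(self, extractor, processor, converter, merger):
            self.extractor = extractor    # source speech extraction unit
            self.processor = processor    # source speech processing unit (segmentation)
            self.converter = converter    # target speech conversion unit
            self.merger = merger          # speech-video merging unit

        def convert(self, video_path: str, out_path: str) -> None:
            speech = self.extractor.extract(video_path)
            segments = self.processor.segment(speech)
            dubbed = [self.converter.convert(seg) for seg in segments]
            self.merger.merge(video_path, dubbed, out_path)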
  9. The device according to claim 8, wherein the source speech extraction unit is specifically configured to:
    extract an audio signal from the video, and extract the speech signal of the source language from the audio signal according to a frequency characteristic of speech signals.
  10. The device according to claim 8 or 9, wherein the source speech processing unit is specifically configured to:
    segment the speech signal of the source language according to its amplitude.
  11. The device according to claim 8, 9 or 10, wherein the source speech processing unit is further configured to:
    perform denoising processing on the speech signal of the source language before segmenting the speech signal of the source language.
  12. The device according to any one of claims 8-11, wherein the target speech conversion unit is specifically configured to:
    for each sub-speech signal of the source language: input the sub-speech signal of the source language into a pre-established speech model to obtain sub-text data of the source language, output by the speech model, corresponding to the sub-speech signal of the source language; translate the sub-text data of the source language corresponding to the sub-speech signal of the source language into sub-text data of the target language; and synthesize the sub-text data of the target language into a sub-speech signal of the target language by using a speech synthesis technique.
  13. The device according to claim 12, wherein the target speech conversion unit is specifically configured to synthesize the sub-text data of the target language into a sub-speech signal of the target language by using a speech synthesis technique in the following manner:
    process the sub-text data of the target language into text data that a computer can understand by using natural language processing technology; perform prosody processing on the text data to obtain segment features of the synthesized sub-speech signal; and obtain, by using acoustic processing technology, a sub-speech signal of the target language having the segment features according to the text data that the computer can understand.
  14. The device according to any one of claims 8-13, wherein the source speech processing unit retains a timestamp of each sub-speech signal of the source language when segmenting the speech signal of the source language; and the target speech conversion unit, when converting each sub-speech signal of the source language into a sub-speech signal of the target language, adds the timestamp of the sub-speech signal of the current segment of the source language to the converted sub-speech signal of the corresponding target language;
    wherein the speech-video merging unit is specifically configured to: for each sub-speech signal of the target language, merge the sub-speech signal of the target language into the video at the playback position corresponding to the timestamp of that sub-speech signal of the target language.
  15. A server, comprising:
    one or more processors;
    a memory; and
    one or more modules, wherein the one or more modules are stored in the memory and, when executed by the one or more processors, perform the following operations:
    extracting a speech signal of a source language from a video, and segmenting the speech signal of the source language to obtain at least one sub-speech signal of the source language;
    for each sub-speech signal of the source language, converting the sub-speech signal of the source language into a sub-speech signal of the target language according to a pre-established speech model; and
    merging the obtained sub-speech signals of the target language of each segment with the video to obtain a video containing a speech signal of the target language.
PCT/CN2014/094217 2014-09-11 2014-12-18 Video voice conversion method and device and server WO2016037440A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410461061.8 2014-09-11
CN201410461061.8A CN104252861B (en) 2014-09-11 2014-09-11 Video speech conversion method, device and server

Publications (1)

Publication Number Publication Date
WO2016037440A1 true WO2016037440A1 (en) 2016-03-17

Family

ID=52187705

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/094217 WO2016037440A1 (en) 2014-09-11 2014-12-18 Video voice conversion method and device and server

Country Status (2)

Country Link
CN (1) CN104252861B (en)
WO (1) WO2016037440A1 (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10652622B2 (en) 2017-06-27 2020-05-12 At&T Intellectual Property I, L.P. Method and apparatus for providing content based upon a selected language
CN111639233A (en) * 2020-05-06 2020-09-08 广东小天才科技有限公司 Learning video subtitle adding method and device, terminal equipment and storage medium
CN114630179A (en) * 2022-03-17 2022-06-14 维沃移动通信有限公司 Audio extraction method and electronic equipment
KR102440890B1 (en) * 2021-03-05 2022-09-06 주식회사 한글과컴퓨터 Video automatic dubbing apparatus that automatically dubs the video dubbed with the voice of the first language to the voice of the second language and operating method thereof
US11582527B2 (en) 2018-02-26 2023-02-14 Google Llc Automated voice translation dubbing for prerecorded video
CN111639233B (en) * 2020-05-06 2024-05-17 广东小天才科技有限公司 Learning video subtitle adding method, device, terminal equipment and storage medium

Families Citing this family (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105159870B (en) * 2015-06-26 2018-06-29 徐信 A kind of accurate processing system and method for completing continuous natural-sounding textual
CN105828101B (en) * 2016-03-29 2019-03-08 北京小米移动软件有限公司 Generate the method and device of subtitle file
CN106328176B (en) * 2016-08-15 2019-04-30 广州酷狗计算机科技有限公司 A kind of method and apparatus generating song audio
CN106649295A (en) * 2017-01-04 2017-05-10 携程旅游网络技术(上海)有限公司 Text translation method for mobile terminal
CN107241616B (en) * 2017-06-09 2018-10-26 腾讯科技(深圳)有限公司 video lines extracting method, device and storage medium
CN107688792B (en) * 2017-09-05 2020-06-05 语联网(武汉)信息技术有限公司 Video translation method and system
CN108090051A (en) * 2017-12-20 2018-05-29 深圳市沃特沃德股份有限公司 The interpretation method and translator of continuous long voice document
CN108447486B (en) * 2018-02-28 2021-12-03 科大讯飞股份有限公司 Voice translation method and device
CN109119063B (en) * 2018-08-31 2019-11-22 腾讯科技(深圳)有限公司 Video dubs generation method, device, equipment and storage medium
CN109325147A (en) * 2018-09-30 2019-02-12 联想(北京)有限公司 A kind of information processing method and device
CN110119514A (en) * 2019-04-02 2019-08-13 杭州灵沃盛智能科技有限公司 The instant translation method of information, device and system
CN110232907B (en) * 2019-07-24 2021-11-02 出门问问(苏州)信息科技有限公司 Voice synthesis method and device, readable storage medium and computing equipment
CN110534085B (en) * 2019-08-29 2022-02-25 北京百度网讯科技有限公司 Method and apparatus for generating information
CN110659387A (en) * 2019-09-20 2020-01-07 上海掌门科技有限公司 Method and apparatus for providing video
WO2021109000A1 (en) * 2019-12-03 2021-06-10 深圳市欢太科技有限公司 Data processing method and apparatus, electronic device, and storage medium
CN117560459B (en) * 2024-01-11 2024-04-16 深圳市志泽科技有限公司 Audio/video conversion method based on conversion wire

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2000358202A (en) * 1999-06-16 2000-12-26 Toshiba Corp Video audio recording and reproducing device and method for generating and recording sub audio data for the device
US20030216922A1 (en) * 2002-05-20 2003-11-20 International Business Machines Corporation Method and apparatus for performing real-time subtitles translation
CN1774715A (en) * 2003-04-14 2006-05-17 皇家飞利浦电子股份有限公司 System and method for performing automatic dubbing on an audio-visual stream
CN201319640Y (en) * 2008-12-01 2009-09-30 深圳市同洲电子股份有限公司 Digital television receiving terminal capable of synchronously translating in real time
CN202026434U (en) * 2011-04-29 2011-11-02 广东九联科技股份有限公司 Voice conversion STB (set top box)
CN103854648A (en) * 2012-12-08 2014-06-11 上海能感物联网有限公司 Chinese and foreign language voiced image data bidirectional reversible voice converting and subtitle labeling method

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4087400B2 (en) * 2005-09-15 2008-05-21 株式会社東芝 Spoken dialogue translation apparatus, spoken dialogue translation method, and spoken dialogue translation program
CN102903361A (en) * 2012-10-15 2013-01-30 Itp创新科技有限公司 Instant call translation system and instant call translation method

Also Published As

Publication number Publication date
CN104252861A (en) 2014-12-31
CN104252861B (en) 2018-04-13

Similar Documents

Publication Publication Date Title
WO2016037440A1 (en) Video voice conversion method and device and server
US9552807B2 (en) Method, apparatus and system for regenerating voice intonation in automatically dubbed videos
CN109119063B (en) Video dubs generation method, device, equipment and storage medium
KR102598824B1 (en) Automated voice translation dubbing for prerecorded videos
CN110517689B (en) Voice data processing method, device and storage medium
US10991380B2 (en) Generating visual closed caption for sign language
CN105704538A (en) Method and system for generating audio and video subtitles
US11545134B1 (en) Multilingual speech translation with adaptive speech synthesis and adaptive physiognomy
WO2018120821A1 (en) Method and device for producing presentation
US20240064383A1 (en) Method and Apparatus for Generating Video Corpus, and Related Device
JP2012181358A (en) Text display time determination device, text display system, method, and program
Song et al. Talking face generation with multilingual tts
US9666211B2 (en) Information processing apparatus, information processing method, display control apparatus, and display control method
CN112908292A (en) Text voice synthesis method and device, electronic equipment and storage medium
WO2018120820A1 (en) Presentation production method and apparatus
US20230326369A1 (en) Method and apparatus for generating sign language video, computer device, and storage medium
CN109858005B (en) Method, device, equipment and storage medium for updating document based on voice recognition
CN112233661B (en) Video content subtitle generation method, system and equipment based on voice recognition
Sun et al. Pre-avatar: An automatic presentation generation framework leveraging talking avatar
CN113948062A (en) Data conversion method and computer storage medium
US11182417B1 (en) Method and system for facilitating conversion of content based on user preferences
JP2016024378A (en) Information processor, control method and program thereof
Remael et al. From Translation Studies and audiovisual translation to media accessibility
Jiang SDW-ASL: A Dynamic System to Generate Large Scale Dataset for Continuous American Sign Language
US11636131B1 (en) Methods and systems for facilitating conversion of content for transfer and storage of content

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14901806

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 14901806

Country of ref document: EP

Kind code of ref document: A1