WO2021109000A1 - Data processing method and apparatus, electronic device, and storage medium - Google Patents

Data processing method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2021109000A1
WO2021109000A1 (application PCT/CN2019/122789)
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
data
image data
voice
piece
Prior art date
Application number
PCT/CN2019/122789
Other languages
French (fr)
Chinese (zh)
Inventor
邓淑清
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司, Oppo广东移动通信有限公司 filed Critical 深圳市欢太科技有限公司
Priority to PCT/CN2019/122789 priority Critical patent/WO2021109000A1/en
Priority to CN201980100991.1A priority patent/CN114503192A/en
Publication of WO2021109000A1 publication Critical patent/WO2021109000A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 - Segmentation; Word boundary detection

Definitions

  • This application relates to simultaneous interpretation technology, in particular to a data processing method, device, electronic equipment and storage medium.
  • the embodiment of the application provides a data processing method, including:
  • Each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
  • the dividing the first voice data into at least two pieces of second voice data further includes:
  • a piece of fifth voice data is inserted into each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data;
  • the fifth voice data is composed of voice frames with a set waveform;
  • the inserting corresponding third voice data at each first set position in the first voice data includes:
  • the inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the voice frames constituting the set waveform of the fifth voice data include any one of the following:
  • each piece of second image data has a one-to-one correspondence with each piece of second voice data in the at least two pieces of second voice data;
  • Each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data; the fourth image data is used to play synchronously with the fourth voice data.
  • the inserting corresponding third image data at each second set position in the first image data includes:
  • the corresponding third image data is inserted at each second set position in the first image data.
  • the embodiment of the present application also provides a data processing device, including:
  • a dividing unit configured to divide the first voice data into at least two pieces of second voice data
  • the second acquiring unit is configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data; the third voice data is obtained by translating the corresponding second voice data;
  • the inserting unit is configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data;
  • An embodiment of the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • the processor executes the computer program, the foregoing Steps of any data processing method.
  • the embodiment of the present application also provides a storage medium on which computer instructions are stored, and when the computer instructions are executed by a processor, the steps of any of the foregoing data processing methods are implemented.
  • FIG. 3 is a schematic diagram of another flow chart of a data processing method according to an embodiment of the application.
  • the translation server translates the recognized text to obtain the translation result;
  • the machine simultaneous interpretation server generates the translated synthesized sound according to the translation result, sends it to the PC client, and sends the translated synthesized sound to the viewer's mobile terminal through the mobile terminal distribution server;
  • the audience's mobile terminal will play back the live conference.
  • the playback audio is recorded while the speaker's original sound is played aloud and contains two audio tracks, corresponding to the speaker's voice and the translated synthesized sound, so during playback the speaker's voice and the translated synthesized sound are played synchronously; the two overlap and interfere with each other.
  • the audience involuntarily listens to the audio of the two languages at the same time, which degrades the listening experience; the mixture of the two languages also hinders the audience's understanding of the speech content and reduces the efficiency of information transmission.
  • data processing is performed on the playback audio: the voice data corresponding to the speaker's voice is divided into multiple voice data segments, and the corresponding translated synthesized voice is inserted between these segments. On this basis, the effect of consecutive interpretation is achieved during playback of the live meeting, which helps the audience better understand the speaker's content and improves the efficiency of information transmission.
  • Step 202 Divide the first voice data into at least two segments of second voice data.
  • during natural speech, a short pause will be made between sentences, which appears in the waveform of the first voice data as a silent band whose duration is greater than a set duration threshold. Therefore, by detecting the waveform of the first voice data, the silent bands in the first voice data can be determined, and any position within a silent band can be used as a separation position to divide the first voice data into at least two pieces of second voice data.
  • the silent band can be understood as the band where the average amplitude of the waveform is less than the set average threshold.
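As a rough illustration of this silence-based division (not code from the patent; the frame size, amplitude threshold, and minimum silent-run length are invented for the sketch), a waveform can be split at silent bands like this:

```python
def split_on_silence(samples, frame_size=160, amp_threshold=0.02, min_silence_frames=3):
    """Divide a waveform into voice segments at silent bands.

    A fixed-size frame counts as 'silent' when its average absolute
    amplitude is below amp_threshold; a run of at least
    min_silence_frames silent frames is treated as a silent band, and
    the division is made where the next voiced frame begins. All
    parameter values here are illustrative, not taken from the patent.
    """
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    flags = [sum(abs(s) for s in f) / len(f) < amp_threshold for f in frames]
    segments, current, silent_run = [], [], 0
    for frame, is_silent in zip(frames, flags):
        if is_silent:
            silent_run += 1
        else:
            if silent_run >= min_silence_frames and current:
                segments.append(current)  # cut inside the silent band
                current = []
            silent_run = 0
        current.extend(frame)
    if current:
        segments.append(current)
    return segments
```

In a real system the thresholds would be tuned to the recording conditions, or a voice activity detector would replace the plain amplitude test.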
  • the first voice data can be preprocessed, including filtering, noise reduction, gain amplification, and other preprocessing, and then the preprocessed first voice data can be divided.
  • machine translation technology can be used to translate from the source language to the target language.
  • available machine translation technologies include rule-based machine translation and corpus-based machine translation.
  • Step 204 Insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data.
  • each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
  • for example, if the first voice data is sequentially divided into second voice data 1, second voice data 2, and second voice data 3, corresponding to third voice data 1, third voice data 2, and third voice data 3 respectively, then after the insertion operation is completed, the fourth voice data corresponding to the first voice data is obtained. The composition of the fourth voice data is, in sequence: second voice data 1, third voice data 1, second voice data 2, third voice data 2, second voice data 3, and third voice data 3.
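The interleaving described above can be sketched as follows (illustrative only; segments are represented as plain sample lists, and the pairing between original and translated segments is positional):

```python
def insert_translations(second_segments, third_segments):
    """Build the 'fourth voice data' by appending each translated segment
    (third voice data) right after its source segment (second voice data),
    i.e. at the end position of each piece of second voice data."""
    fourth = []
    for original, translated in zip(second_segments, third_segments):
        fourth.extend(original)    # second voice data i
        fourth.extend(translated)  # third voice data i, at its end position
    return fourth
```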
  • the dividing the first voice data into at least two pieces of second voice data further includes:
  • a piece of fifth voice data is inserted at each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data; the fifth voice data is composed of voice frames with a set waveform.
  • the composition of the obtained sixth voice data is, in sequence: second voice data 1, fifth voice data 1, second voice data 2, fifth voice data 2, second voice data 3, and fifth voice data 3.
  • the waveform of a silent frame is a flat line.
  • the waveform of a white noise frame satisfies that, over a wide frequency range, each frequency band of equal bandwidth contains equal noise energy.
  • the waveforms of these two kinds of speech frames can be clearly distinguished from the speech frames collected while the speaker is speaking.
  • the duration of the fifth voice data should be significantly longer than the duration of any silent band or white noise band in the second voice data. In practical applications, a duration threshold significantly larger than the duration of any silent band or white noise band in the second voice data can be set, and the duration of the fifth voice data can be any duration greater than this threshold.
  • the fifth voice data is inserted into the first voice data so that the insertion position, that is, the first set position, can be located when the third voice data is subsequently inserted into the first voice data; in other words, its purpose is to identify the first set position. Based on this, inserting corresponding third voice data at each first set position in the first voice data includes:
  • inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the duration of all the fifth voice data inserted in the first voice data is the set duration.
  • a duration threshold that is significantly larger than the length of any silent band or white noise band in the second voice data can be set.
  • the set duration of the fifth voice data can be any duration greater than the duration threshold.
  • the corresponding fifth voice data is inserted at the end position of any segment of the second voice data of the first voice data, and the duration of the fifth voice data is the same as the duration of the second voice data.
  • the third voice data is obtained by translating the second voice data, so the duration of the third voice data is usually close to the duration of the second voice data. Therefore, the duration of the corresponding fifth voice data is determined here according to the duration of the second voice data. In this way, when the fifth voice data is replaced with the third voice data, fifth voice data of similar duration can be quickly located in the first voice data according to the duration of the third voice data and replaced with the third voice data of similar duration.
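A minimal sketch of this marker scheme (purely illustrative: it uses a flat, zero-amplitude run as the set waveform and a fixed marker length, details the patent does not pin down at this granularity; it relies on the requirement above that the marker be longer than any silent run inside real speech):

```python
SILENT = 0.0  # one sample of a silent frame: a flat, zero-amplitude waveform

def insert_markers(segments, marker_len):
    # Build the 'sixth voice data': each second-voice segment followed by
    # a run of silent samples acting as the fifth voice data.
    sixth = []
    for seg in segments:
        sixth.extend(seg)
        sixth.extend([SILENT] * marker_len)
    return sixth

def replace_markers(sixth, translations, marker_len):
    # Locate each marker by its flat waveform and splice in the matching
    # translated segment (third voice data). Assumes no zero run of
    # marker_len samples occurs inside genuine speech.
    out, i, t = [], 0, 0
    while i < len(sixth):
        if sixth[i:i + marker_len] == [SILENT] * marker_len and t < len(translations):
            out.extend(translations[t])
            t += 1
            i += marker_len
        else:
            out.append(sixth[i])
            i += 1
    return out
```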
  • Step 301 Acquire first image data matching the first voice data.
  • the first image data may be image data collected synchronously with the first voice data, for example, a conference scene image recorded while the speaker is speaking; or, the first image data may be image data output synchronously while the first voice data is collected, for example, presentation slides (PPT) played synchronously on the conference's large screen while the speaker is speaking.
  • Step 302 Divide the first image data into at least two pieces of second image data; each piece of second image data has a one-to-one correspondence with each piece of second voice data in the at least two pieces of second voice data.
  • the first image data is collected or output synchronously with the first voice data. Therefore, the first image data can be divided into at least two pieces of second image data based on the same time axis as the first voice data.
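One way to picture the shared-time-axis division (a hypothetical helper, not from the patent; the frame rate and boundary times are invented for the example):

```python
def split_frames_by_boundaries(frames, fps, boundaries_s):
    """Divide image frames into pieces of second image data using the same
    time axis as the voice segmentation: boundaries_s lists the end time,
    in seconds, of each voice segment, and fps is the video frame rate."""
    pieces, start = [], 0
    for end_s in boundaries_s:
        end = round(end_s * fps)
        pieces.append(frames[start:end])
        start = end
    if start < len(frames):
        pieces.append(frames[start:])  # trailing frames past the last boundary
    return pieces
```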
  • Step 303 Insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data;
  • Each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data; the fourth image data is used to play synchronously with the fourth voice data.
  • the fourth voice data and the fourth image data can be played synchronously when the live meeting is played back.
  • video data is formed by synthesizing voice data and image data. Therefore, with the above solution, the effect of consecutive interpretation can also be achieved for a video stream.
  • the inserting corresponding third image data at each second set position in the first image data includes:
  • the corresponding third image data is inserted at the end position of any piece of second image data of the first image data, and the duration of the third image data is the same as the duration of the second image data.
  • the second voice data is translated to obtain the third voice data.
  • the duration of the third voice data is usually close to the duration of the second voice data, and the second voice data is played synchronously with the second image data. Therefore, the duration of the third image data played in synchronization with the third voice data is also close to the duration of the second image data.
  • the content of the third image data may be an interpreter animation matching the third voice data.
  • the inserting corresponding third image data at each second set position in the first image data includes:
  • the third image data has the same data content as the corresponding second image data.
  • in this way, when the fourth voice data is played, the image data matching the corresponding second voice data is displayed while the third voice data is played. In practical applications, this is equivalent to reusing the video segment corresponding to the original sound as the video segment for the translated sound.
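The copy-based variant can be sketched as follows (illustrative only; the frame lists stand in for pieces of image data, and the helper name is invented):

```python
def build_fourth_image_data(second_pieces):
    # Third image data copies its matching second image data, so while the
    # translated audio (third voice data) plays, the viewer sees the same
    # frames that accompanied the original speech segment.
    fourth = []
    for piece in second_pieces:
        fourth.extend(piece)        # original segment
        fourth.extend(list(piece))  # copied segment shown during the translation
    return fourth
```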
  • the third voice data corresponding to the translated synthesized voice is inserted into the first voice data at intervals, and playing the obtained fourth voice data helps the audience understand the content of the speaker's speech, thereby improving the efficiency of information transmission.
  • Figure 4 is a schematic diagram of the implementation process of the data processing method provided by an application embodiment of the present application. This application embodiment is applied to a simultaneous interpretation conference scenario and is used to implement consecutive interpretation when the live meeting is played back, as shown in Figure 4:
  • Step 401 The microphone collects the original voice of the speaker to obtain the first voice data.
  • Step 402 Preprocessing the first voice data, including filtering, noise reduction, gain amplification, and so on.
  • Step 404 Perform voice recognition on each piece of second voice data to obtain the corresponding recognized text.
  • for example, the first segment of second voice data is recognized as "Good morning everyone", and the second segment of second voice data is recognized as "It's nice to meet everyone".
  • Step 405 Preprocess the recognized text to eliminate problems such as recognition errors, colloquial expressions, excessive modal words, and repeated content, which cause incoherent sentences in the recognized text.
  • Step 406 Perform machine translation on the recognized text.
  • using machine translation technology, "Good morning everyone" is translated into "Good morning, everyone!", and "It's nice to meet everyone" is translated into "Nice to meet you".
  • Step 407 Based on the translation obtained after the machine translation, a corresponding translation synthesis sound is generated.
  • Step 409 Generate fourth image data matching the fourth voice data.
  • the fourth voice data and the fourth image data are played synchronously, and the effect of consecutive interpretation can be obtained when the live meeting is played back.
  • FIG. 5 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the application. As shown in Figure 5, the data processing device includes:
  • the first dividing unit 52 is configured to divide the first voice data into at least two pieces of second voice data
  • the second acquiring unit 53 is configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data; the third voice data is obtained by translating the corresponding second voice data;
  • Each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
  • the first dividing unit 52 is further configured to:
  • a piece of fifth voice data is inserted into each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data;
  • the fifth voice data is composed of a voice frame of a set waveform composition;
  • the first dividing unit 52 inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the voice frames constituting the set waveform of the fifth voice data include any one of the following:
  • the device further includes:
  • the third acquiring unit is configured to acquire first image data matching the first voice data
  • the second inserting unit is configured to insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data;
  • the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
  • the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
  • the second inserting unit may be implemented by a processor in an electronic device, such as a central processing unit (CPU, Central Processing Unit), a digital signal processor (DSP, Digital Signal Processor), a micro control unit (MCU, Microcontroller Unit), or a field-programmable gate array (FPGA, Field-Programmable Gate Array).
  • when the data processing device provided in the above embodiment performs data processing, the division into the above program modules is only used as an example for illustration. In practical applications, the above processing can be allocated to different program modules as needed; that is, the internal structure of the electronic device can be divided into different program modules to complete all or part of the processing described above.
  • the device provided in the foregoing embodiment and the data processing method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • the data processing device in the above embodiment can be installed in the data processing system shown in FIG. 1, that is, on the machine simultaneous interpretation server of the data processing system in FIG. 1, so that the data processing method of the embodiment of this application is applied in the translation system.
  • FIG. 6 is a schematic diagram of the hardware composition structure of an electronic device according to an embodiment of the application.
  • the electronic device 60 includes a memory 63, a processor 62, and a computer program stored on the memory 63 and capable of running on the processor 62; when the processor 62 executes the program, the method provided by one or more of the foregoing technical solutions is implemented.
  • the electronic device 60 further includes a communication interface 61; various components in the electronic device 60 are coupled together through the bus system 64. It can be understood that the bus system 64 is configured to implement connection and communication between these components. In addition to the data bus, the bus system 64 also includes a power bus, a control bus, and a status signal bus.
  • the memory in the embodiment of FIG. 6 above may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory can be a magnetic disk memory or a magnetic tape memory.
  • the volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memories.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to a processor or implemented by a processor.
  • the processor may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the aforementioned processor may be a general-purpose processor, DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like.
  • the processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • the embodiment of the present application also provides a storage medium, which is specifically a computer storage medium, and more specifically, a computer-readable storage medium.
  • Stored thereon are computer instructions, that is, a computer program; when the computer instructions are executed by a processor, the method provided by one or more of the foregoing technical solutions is implemented.
  • the disclosed method and smart device can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling, or direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units; Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may separately serve as one unit, or two or more units may be integrated into one unit;
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the foregoing program can be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
  • if the above-mentioned integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks, or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A data processing method and apparatus, an electronic device, and a storage medium, the method comprising: acquiring first voice data (201); dividing the first voice data into at least two segments of second voice data (202); acquiring third voice data corresponding to each segment of second voice data in the at least two segments of second voice data (203), the third voice data being obtained by translating the corresponding second voice data; and inserting the corresponding third voice data in each first set position in the first voice data so as to obtain fourth voice data corresponding to the first voice data (204), each first set position corresponding to the end position of each segment of second voice data in the at least two segments of second voice data.

Description

数据处理方法、装置、电子设备和存储介质Data processing method, device, electronic equipment and storage medium 技术领域Technical field
本申请涉及同声传译技术,具体涉及一种数据处理方法、装置、电子设备和存储介质。This application relates to simultaneous interpretation technology, in particular to a data processing method, device, electronic equipment and storage medium.
背景技术Background technique
在同声传译场景下,对会议实况进行回放时,发言者的原声和翻译合成声相互干扰,为观众理解发言内容带来阻碍,影响信息传递效率。In the scenario of simultaneous interpretation, when the live meeting is played back, the original voice of the speaker and the synthesized voice of the translation interfere with each other, which hinders the audience's understanding of the content of the speech and affects the efficiency of information transmission.
发明内容Summary of the invention
为解决相关技术问题,本申请实施例提供了一种数据处理方法、装置、电子设备和存储介质。To solve related technical problems, embodiments of the present application provide a data processing method, device, electronic equipment, and storage medium.
本申请实施例提供了一种数据处理方法,包括:The embodiment of the application provides a data processing method, including:
获取第一语音数据;Obtain the first voice data;
将所述第一语音数据划分成至少两段第二语音数据;Dividing the first voice data into at least two pieces of second voice data;
获取所述至少两段第二语音数据中的每段第二语音数据对应的第三语音数据;所述第三语音数据通过对对应的第二语音数据进行翻译得到;Acquiring third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data; the third voice data is obtained by translating the corresponding second voice data;
在所述第一语音数据中的每个第一设定位置插入对应的第三语音数据,得到所述第一语音数据对应的第四语音数据;其中,Insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data; wherein,
每个第一设定位置分别对应所述至少两段第二语音数据中的每段第二语音数据的结束位置。Each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
其中,上述方案中,所述将所述第一语音数据划分成至少两段第二语音数据,还包括:Wherein, in the above solution, the dividing the first voice data into at least two pieces of second voice data further includes:
在所述第一语音数据中的每个第一设定位置插入一段第五语音数据,得到所述第一语音数据对应的第六语音数据;所述第五语音数据由设定波形的语音帧组成;A piece of fifth voice data is inserted into each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data; the fifth voice data is composed of a voice frame of a set waveform composition;
所述在所述第一语音数据中的每个第一设定位置插入对应的第三语音数据,包括:The inserting corresponding third voice data at each first set position in the first voice data includes:
根据所述设定波形,在所述第六语音数据中定位出每段第二语音数据对应的第五语音数据;Locate the fifth voice data corresponding to each piece of second voice data in the sixth voice data according to the set waveform;
将定位出的每段第二语音数据对应的第五语音数据替换为对应的第三语音数据。Replace the fifth voice data corresponding to each segment of the second voice data that is located with the corresponding third voice data.
上述方案中,所述在所述第一语音数据中的每个第一设定位置插入 一段第五语音数据,包括:In the above solution, the inserting a piece of fifth voice data at each first set position in the first voice data includes:
生成设定时长的第五语音数据;Generate the fifth voice data of the set duration;
在在所述第一语音数据中的每个第一设定位置插入所述设定时长的第五语音数据。Inserting the fifth voice data of the set duration in each first set position in the first voice data.
上述方案中,所述在所述第一语音数据中的每个第一设定位置插入一段第五语音数据,包括:In the above solution, the inserting a piece of fifth voice data at each first set position in the first voice data includes:
确定所述至少两段第二语音数据中的每段第二语音数据的时长;Determining the duration of each piece of second voice data in the at least two pieces of second voice data;
为所述至少两段第二语音数据中的每段第二语音数据生成对应的第五语音数据;生成的每段第五语音数据的时长与对应的第二语音数据的时长相同;Generating corresponding fifth voice data for each piece of second voice data in the at least two pieces of second voice data; the duration of each generated piece of fifth voice data is the same as the duration of the corresponding second voice data;
在所述第一语音数据中的每个第一设定位置插入对应的第五语音数据。Insert corresponding fifth voice data at each first set position in the first voice data.
上述方案中,组成所述第五语音数据的设定波形的语音帧包括以下任意一项:In the above solution, the voice frames constituting the set waveform of the fifth voice data include any one of the following:
静音帧;Mute frame
白噪声帧。White noise frame.
In the above solution, the method further includes:
acquiring first image data matching the first voice data;
dividing the first image data into at least two pieces of second image data, where each piece of second image data corresponds one-to-one to each piece of second voice data in the at least two pieces of second voice data;
inserting corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data; where
each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data, and the fourth image data is played in synchronization with the fourth voice data.
In the above solution, inserting the corresponding third image data at each second set position in the first image data includes:
determining the duration of each piece of second image data in the at least two pieces of second image data;
generating, for each piece of second image data, corresponding third image data, where the duration of each generated piece of third image data is the same as the duration of the corresponding second image data;
inserting the corresponding third image data at each second set position in the first image data.
In the above solution, inserting the corresponding third image data at each second set position in the first image data includes:
copying each piece of second image data in the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data;
inserting the corresponding third image data at each second set position in the first image data.
An embodiment of the present application further provides a data processing apparatus, including:
a first acquiring unit, configured to acquire first voice data;
a dividing unit, configured to divide the first voice data into at least two pieces of second voice data;
a second acquiring unit, configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data;
an inserting unit, configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data; where
each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
An embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of any one of the data processing methods described above.
An embodiment of the present application further provides a storage medium storing computer instructions which, when executed by a processor, implement the steps of any one of the data processing methods described above.
With the data processing method, apparatus, electronic device, and storage medium provided in the embodiments of the present application, after the first voice data corresponding to the original voice is collected, the translated synthesized voice data is inserted into the first voice data at intervals. The resulting fourth voice data, when played back, alternates between the original voice and the translated synthesized voice, thereby achieving consecutive interpretation and avoiding the auditory interference caused to the audience when the original voice and the translated synthesized voice are played simultaneously. In the scenario of replaying a live conference, this helps the audience understand the speaker's content more quickly and accurately, thereby improving the efficiency of information transmission.
Description of the Drawings
FIG. 1 is a schematic diagram of a system architecture to which a data processing method in the related art is applied;
FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 3 is another schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data processing method according to an application embodiment of the present application;
FIG. 5 is a schematic diagram of the composition structure of a data processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the hardware composition structure of an electronic device according to an embodiment of the present application.
Detailed Description
Before the technical solutions of the embodiments of the present application are described in detail, the system to which the data processing method in the related art is applied is briefly described.
FIG. 1 is a schematic diagram of a system architecture to which the data processing method in the related art is applied. As shown in FIG. 1, the system may include: a machine simultaneous interpretation server, a speech recognition server, a translation server, a mobile-terminal delivery server, audience mobile terminals, a personal computer (PC) client, and a display screen.
In practical applications, a speaker can speak at a conference through the PC client and project presentation documents, such as PowerPoint (PPT) files, onto the display screen for the audience. While the speaker is speaking, the PC client collects the speaker's audio and sends it to the machine simultaneous interpretation server. The machine simultaneous interpretation server recognizes the audio data through the speech recognition server to obtain recognized text, and then translates the recognized text through the translation server to obtain a translation result. Based on the translation result, the machine simultaneous interpretation server generates a translated synthesized voice, sends it to the PC client, and delivers it to the audience mobile terminals through the mobile-terminal delivery server, thereby presenting the translation result to users. In this way, the speaker's content is translated into the language required by each user and presented, realizing simultaneous interpretation in the conference.
In the related-art solution, after the conference speech ends, the audience mobile terminal replays the live conference. The replayed audio, recorded while the speaker's original voice was played aloud, contains two audio tracks corresponding to the speaker's voice and the translated synthesized voice. During playback, the speaker's voice and the translated synthesized voice are therefore played simultaneously, overlapping and interfering with each other. On the one hand, the audience involuntarily listens to audio in two languages at once, causing auditory discomfort; on the other hand, the mixture of the two languages hinders the audience's understanding of the speech and reduces the efficiency of information transmission.
In view of this, in various embodiments of the present application, the replayed audio is processed: the voice data corresponding to the speaker's voice is divided into multiple voice data segments, and the corresponding translated synthesized voice is inserted between these segments. On this basis, the effect of consecutive interpretation is achieved when the live conference is replayed, which helps the audience better understand the speaker's content and improves the efficiency of information transmission.
The present application is further described in detail below with reference to the accompanying drawings and specific embodiments.
An embodiment of the present application provides a data processing method. FIG. 2 is a schematic flowchart of the data processing method of the embodiment of the present application. As shown in FIG. 2, the method includes:
Step 201: acquire first voice data.
Here, the first voice data is voice data in the source language, that is, voice data that has not been translated. In practical applications, when a speaker speaks, the speaker's voice can be collected through a voice collection module, such as a microphone, to obtain the first voice data.
Step 202: divide the first voice data into at least two pieces of second voice data.
Usually, when the speaker finishes a sentence, there is a short pause, which appears in the waveform of the first voice data as a silent band whose duration is greater than a set duration threshold. Therefore, by detecting the waveform of the first voice data, the silent bands in the first voice data are determined, and any position within a silent band is used as a separation position to divide the first voice data into at least two pieces of second voice data.
Considering that there is a certain amount of noise at the conference site, a silent band here can be understood as a band in which the mean amplitude of the waveform is less than a set mean threshold. In practical applications, after the first voice data is acquired, it can be preprocessed, including filtering, noise reduction, and gain amplification, before the division operation is performed.
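The pause-based segmentation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes mono PCM samples, frames them, marks a frame silent when its mean absolute amplitude falls below a threshold, and splits at runs of silent frames longer than a minimum length (short intra-sentence silences are kept). All parameter names are illustrative.

```python
def split_on_silence(samples, frame_len, amp_threshold, min_silent_frames):
    """Split a mono sample sequence into speech segments at silent bands.

    A frame is 'silent' when its mean absolute amplitude is below
    amp_threshold; a run of at least min_silent_frames silent frames is
    treated as a sentence pause and used as a separation position.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    silent = [sum(abs(s) for s in f) / len(f) < amp_threshold for f in frames]

    segments, current, run = [], [], 0
    for frame, is_silent in zip(frames, silent):
        if is_silent:
            run += 1
            if run == min_silent_frames:
                if current:
                    segments.append(current)  # close the segment at the pause
                    current = []
                continue
            if run > min_silent_frames:
                continue  # still inside the long pause, drop it
        else:
            run = 0
        current.extend(frame)  # speech, or a short silence within speech
    if current:
        segments.append(current)
    return segments
```

In a real system the threshold would be calibrated against the conference-site noise floor, which is why the text suggests filtering and noise reduction before division.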
Step 203: acquire third voice data corresponding to each of the at least two pieces of second voice data.
Here, the third voice data is obtained by translating the corresponding second voice data. For each piece of second voice data divided from the first voice data, the second voice data is converted into text through speech recognition, the text is then translated from the source language into the target language through machine translation, and speech synthesis is finally performed on the target-language text to obtain the third voice data corresponding to that piece of second voice data.
In practical applications, the translation from the source language to the target language can be performed through machine translation technology, including rule-based machine translation or corpus-based machine translation.
Step 204: insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data.
Here, each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
For example, if the first voice data is divided in sequence into second voice data 1, second voice data 2, and second voice data 3, corresponding to third voice data 1, third voice data 2, and third voice data 3 respectively, then after the insertion operation is completed, the resulting fourth voice data consists, in sequence, of second voice data 1, third voice data 1, second voice data 2, third voice data 2, second voice data 3, and third voice data 3.
In this way, when the live conference is replayed, playing the fourth voice data achieves the effect of consecutive interpretation.
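The interleaving of step 204 reduces to appending each translated segment immediately after its original segment. A minimal sketch, assuming the segments are already paired up in order:

```python
def interleave(second_segments, third_segments):
    """Build the 'fourth voice data' by appending each translated segment
    (third voice data) right after the original segment (second voice
    data) it corresponds to."""
    assert len(second_segments) == len(third_segments)
    fourth = []
    for original, translated in zip(second_segments, third_segments):
        fourth.extend(original)
        fourth.extend(translated)
    return fourth
```

Played back as one stream, the result alternates original speech and translation, which is exactly the consecutive-interpretation effect the text describes.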
In an embodiment, dividing the first voice data into at least two pieces of second voice data further includes:
inserting a piece of fifth voice data at each first set position in the first voice data to obtain sixth voice data corresponding to the first voice data, where the fifth voice data consists of voice frames of a set waveform.
Here, after the first voice data is divided into at least two pieces of second voice data, a piece of fifth voice data is first inserted at each of the above first set positions in the first voice data to obtain the sixth voice data corresponding to the first voice data. For example, if the first voice data is divided in sequence into second voice data 1, second voice data 2, and second voice data 3, the resulting sixth voice data consists, in sequence, of second voice data 1, fifth voice data 1, second voice data 2, fifth voice data 2, second voice data 3, and fifth voice data 3.
Here, the voice frames of the set waveform constituting the fifth voice data include any one of the following:
silence frames;
white noise frames.
The waveform of a silence frame is a straight line, and the waveform of a white noise frame is such that, over a wide frequency range, frequency bands of equal bandwidth contain equal noise energy. The waveforms of both kinds of voice frames can be clearly distinguished from the voice frames collected while the speaker is talking.
Here, the duration of the fifth voice data can be significantly longer than the duration of any silent band or white noise band within the second voice data. In practical applications, a duration threshold significantly larger than the duration of any such band can be set, and the duration of the fifth voice data can be any duration greater than this threshold.
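Generating the two kinds of placeholder frames is straightforward; a hedged sketch, assuming integer PCM samples (the `amplitude` value is illustrative, not from the source):

```python
import random

def make_placeholder(num_samples, kind="silence", amplitude=100):
    """Generate a placeholder segment ('fifth voice data') whose waveform
    is trivially distinguishable from speech: either all-zero samples
    (a silence frame's flat-line waveform) or uniform white noise."""
    if kind == "silence":
        return [0] * num_samples
    if kind == "white_noise":
        return [random.randint(-amplitude, amplitude) for _ in range(num_samples)]
    raise ValueError("kind must be 'silence' or 'white_noise'")
```

Making `num_samples` exceed the longest natural pause in the speech is what lets the placeholder be located unambiguously later.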
Here, the purpose of inserting the fifth voice data into the first voice data is to make it easy to locate the insertion position, that is, the first set position, when the third voice data is subsequently inserted; in other words, the fifth voice data marks the first set position. On this basis, inserting the corresponding third voice data at each first set position in the first voice data includes:
locating, according to the set waveform, the fifth voice data corresponding to each piece of second voice data in the sixth voice data;
replacing the located fifth voice data corresponding to each piece of second voice data with the corresponding third voice data.
Here, when the third voice data is inserted, a voice data segment with the set waveform is located in the sixth voice data; the corresponding third voice data is determined from the piece of second voice data that is adjacent to and precedes that segment; and the segment of the set waveform is then replaced with the determined third voice data, completing the insertion of the third voice data.
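The locate-and-replace step above can be sketched as a single pass over the sixth voice data. This is an illustrative simplification: `is_placeholder` tests one sample against the set waveform (e.g. zero for silence), and `min_run` filters out short silent bands that occur naturally within speech, per the duration-threshold discussion above.

```python
def replace_placeholders(sixth, translated_segments, is_placeholder, min_run):
    """Scan the sixth voice data, and replace each placeholder run (the
    fifth voice data) with the next translated segment (third voice
    data), in order of appearance."""
    out, i, seg_idx = [], 0, 0
    while i < len(sixth):
        if is_placeholder(sixth[i]):
            j = i
            while j < len(sixth) and is_placeholder(sixth[j]):
                j += 1  # extend to the end of the placeholder run
            if j - i >= min_run and seg_idx < len(translated_segments):
                out.extend(translated_segments[seg_idx])  # swap in translation
                seg_idx += 1
            else:
                out.extend(sixth[i:j])  # short pause inside speech: keep it
            i = j
        else:
            out.append(sixth[i])
            i += 1
    return out
```

Because placeholders appear in the same order as the segments they follow, a simple in-order replacement pairs each translation with the correct original segment.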
In an embodiment, inserting a piece of fifth voice data at each first set position in the first voice data includes:
generating fifth voice data of a set duration;
inserting the fifth voice data of the set duration at each first set position in the first voice data.
Here, all pieces of fifth voice data inserted into the first voice data have the set duration. As described above, a duration threshold significantly larger than the duration of any silent band or white noise band within the second voice data can be set, and the set duration of the fifth voice data can be any duration greater than this threshold.
In an embodiment, inserting a piece of fifth voice data at each first set position in the first voice data includes:
determining the duration of each piece of second voice data in the at least two pieces of second voice data;
generating, for each piece of second voice data, corresponding fifth voice data, where the duration of each generated piece of fifth voice data is the same as the duration of the corresponding second voice data;
inserting the corresponding fifth voice data at each first set position in the first voice data.
Here, the corresponding fifth voice data is inserted at the end position of each piece of second voice data in the first voice data, and its duration is the same as that of the second voice data. In practical applications, the third voice data obtained by translating a piece of second voice data usually has a duration close to that of the second voice data. Therefore, the duration of the corresponding fifth voice data is determined here according to the duration of the second voice data, so that when the fifth voice data is replaced with the third voice data, fifth voice data of approximately matching duration can be quickly located according to the duration of the third voice data and replaced with it.
In an embodiment, as shown in FIG. 3, the method further includes:
Step 301: acquire first image data matching the first voice data.
In practical applications, in a conference scenario, the first image data may be image data collected synchronously with the first voice data, for example, conference scene images recorded while the speaker is speaking; or the first image data may be image data output synchronously while the first voice data is collected, for example, a PPT played synchronously on the large conference screen while the speaker is speaking.
Step 302: divide the first image data into at least two pieces of second image data, where each piece of second image data corresponds one-to-one to each piece of second voice data in the at least two pieces of second voice data.
As described above, the first image data is collected or output synchronously with the first voice data; therefore, the first image data can be divided into at least two pieces of second image data based on the same time axis as the first voice data.
Step 303: insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data; where
each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data, and the fourth image data is played in synchronization with the fourth voice data.
For example, if the first image data is divided in sequence into second image data 1, second image data 2, and second image data 3, corresponding to third image data 1, third image data 2, and third image data 3 respectively, then after the insertion operation is completed, the resulting fourth image data consists, in sequence, of second image data 1, third image data 1, second image data 2, third image data 2, second image data 3, and third image data 3.
In this way, when the live conference is replayed, the fourth voice data and the fourth image data can be played synchronously. Since combining voice data and image data forms video data, the above solution can achieve the effect of consecutive interpretation through a video stream.
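Dividing the image data "based on the same time axis" as the audio can be sketched as cutting the frame sequence at the same timestamps where the audio was split. A minimal illustration, assuming a constant frame rate and audio split points expressed in seconds (both assumptions, not stated in the source):

```python
def split_frames_by_boundaries(frames, fps, boundaries_sec):
    """Split a video frame list into segments at the same time positions
    where the audio was split, so each second-image segment lines up
    one-to-one with a second-voice segment."""
    segments, start = [], 0
    for t in boundaries_sec:
        cut = int(round(t * fps))
        segments.append(frames[start:cut])
        start = cut
    segments.append(frames[start:])  # tail after the last boundary
    return segments
```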
In an embodiment, inserting the corresponding third image data at each second set position in the first image data includes:
determining the duration of each piece of second image data in the at least two pieces of second image data;
generating, for each piece of second image data, corresponding third image data, where the duration of each generated piece of third image data is the same as the duration of the corresponding second image data;
inserting the corresponding third image data at each second set position in the first image data.
Here, the corresponding third image data is inserted at the end position of each piece of second image data in the first image data, and its duration is the same as that of the second image data. In practical applications, the third voice data obtained by translating a piece of second voice data usually has a duration close to that of the second voice data, and the second voice data is played synchronously with the second image data; therefore, the duration of the third image data to be played synchronously with the third voice data is also close to the duration of the second image data.
In practical applications, the content of the third image data may be an interpreter animation matching the third voice data.
In an embodiment, inserting the corresponding third image data at each second set position in the first image data includes:
copying each piece of second image data in the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data;
inserting the corresponding third image data at each second set position in the first image data.
Here, through copying, the third image data has the same content as the corresponding second image data. In this way, when the fourth voice data is played and the third voice data within it is reached, image data matching the corresponding second voice data is displayed. In practical applications, this is equivalent to using the video segment corresponding to the original voice as the video segment for the translated voice.
With the data processing method provided in the embodiments of the present application, after the first voice data corresponding to the original voice is collected, the translated synthesized voice data is inserted into the first voice data at intervals. The resulting fourth voice data, when played back, alternates between the original voice and the translated synthesized voice, thereby achieving consecutive interpretation and avoiding the auditory interference caused to the audience when the original voice and the translated synthesized voice are played simultaneously. In the scenario of replaying a live conference, this helps the audience understand the speaker's content more quickly and accurately, thereby improving the efficiency of information transmission.
FIG. 4 is a schematic flowchart of the data processing method provided by an application embodiment of the present application. This application embodiment is applied to a conference scenario with simultaneous interpretation and is used to achieve consecutive interpretation when the live conference is replayed. As shown in FIG. 4:
Step 401: a microphone collects the speaker's original voice to obtain first voice data.
Step 402: the first voice data is preprocessed, including filtering, noise reduction, and gain amplification.
Step 403: according to the sentence pause intervals in the first voice data, the first voice data is divided into at least two pieces of second voice data, and a silence frame of a certain duration is added between every two pieces of second voice data.
Step 404: speech recognition is performed on each piece of second voice data to obtain the corresponding recognized text. For example, the first piece of second voice data is recognized as "大家早上好", and the second piece as "很高兴认识大家".
Step 405: the recognized text is preprocessed to eliminate problems that make its sentences disfluent, such as recognition errors, colloquial expressions, excessive filler words, and repeated content.
Step 406: machine translation is performed on the recognized text. Here, "大家早上好" is translated into "Good morning, everyone!", and "很高兴认识大家" is translated into "Nice to meet you".
Step 407: based on the translation obtained by machine translation, a corresponding translated synthesized voice is generated.
Step 408: the translated synthesized voice is inserted into the corresponding intervals of the first voice data processed in step 403, and the pieces are spliced to obtain fourth voice data.
Step 409: fourth image data matching the fourth voice data is generated. Playing the fourth voice data and the fourth image data synchronously achieves the effect of consecutive interpretation when the live conference is replayed.
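Steps 403 through 408 can be sketched end to end as a pipeline. This is a simplified illustration that collapses the placeholder insertion and replacement of steps 403/408 into a direct splice; `recognize`, `translate`, `synthesize`, and `split` stand in for the speech-recognition, machine-translation, speech-synthesis, and pause-detection services, which are external systems here.

```python
def consecutive_interpretation_pipeline(samples, recognize, translate, synthesize, split):
    """Sketch of steps 403-408: split the original audio at sentence
    pauses, then splice each segment's synthesized translation in right
    after the segment it translates."""
    segments = split(samples)                      # step 403: split at pauses
    fourth = []
    for seg in segments:
        text = recognize(seg)                      # step 404: speech recognition
        translation = translate(text)              # steps 405-406: clean up + translate
        fourth.extend(seg)                         # original segment
        fourth.extend(synthesize(translation))     # steps 407-408: insert translation
    return fourth
```

With stub callables standing in for the real services, the pipeline yields the alternating original/translation stream that step 409 pairs with matching image data.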
To implement the data processing method of the embodiments of the present application, an embodiment of the present application further provides a data processing apparatus, which can be arranged in an electronic device and serve as the machine simultaneous interpretation server in FIG. 1. FIG. 5 is a schematic diagram of the composition structure of the data processing apparatus according to an embodiment of the present application. As shown in FIG. 5, the data processing apparatus includes:
a first acquiring unit 51, configured to acquire first voice data;
a first dividing unit 52, configured to divide the first voice data into at least two pieces of second voice data;
a second acquiring unit 53, configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data;
a first inserting unit 54, configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data; where
each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
In an embodiment, the first dividing unit 52 is further configured to:
insert a piece of fifth voice data at each first set position in the first voice data to obtain sixth voice data corresponding to the first voice data, the fifth voice data consisting of voice frames with a set waveform.
In this embodiment, inserting corresponding third voice data at each first set position in the first voice data includes:
locating, according to the set waveform, the fifth voice data corresponding to each piece of second voice data in the sixth voice data; and
replacing each located piece of fifth voice data with the corresponding third voice data.
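The locate-and-replace step can be sketched with a toy marker value. This is an assumption-laden simplification: here the "set waveform" is a run of zero-valued frames, whereas real speech also contains zero samples, which is precisely why the patent uses a distinctive waveform (silence or white noise of known shape) to make the placeholders unambiguous.

```python
MARKER = 0  # toy "set waveform": a run of zero-valued frames

def replace_placeholders(sixth_voice, translations, marker=MARKER):
    """Scan the sixth voice data for runs of marker frames (the fifth
    voice data) and replace the i-th run with the i-th translation
    (the third voice data)."""
    out, i, t = [], 0, 0
    while i < len(sixth_voice):
        if sixth_voice[i] == marker:
            # Skip the whole placeholder run, whatever its duration.
            while i < len(sixth_voice) and sixth_voice[i] == marker:
                i += 1
            out.extend(translations[t])  # splice in the translation
            t += 1
        else:
            out.append(sixth_voice[i])
            i += 1
    return out

print(replace_placeholders([5, 7, 0, 0, 9, 0], [["hola"], ["mundo"]]))
# [5, 7, 'hola', 9, 'mundo']
```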
In an embodiment, the first dividing unit 52 inserting a piece of fifth voice data at each first set position in the first voice data includes:
generating fifth voice data of a set duration; and
inserting the fifth voice data of the set duration at each first set position in the first voice data.
In an embodiment, the first dividing unit 52 inserting a piece of fifth voice data at each first set position in the first voice data includes:
determining the duration of each piece of the at least two pieces of second voice data;
generating corresponding fifth voice data for each piece of second voice data, where each generated piece of fifth voice data has the same duration as the corresponding second voice data; and
inserting the corresponding fifth voice data at each first set position in the first voice data.
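The duration-matched variant can be sketched as follows, again with segments as plain lists of frames. The marker value `0` is an assumed stand-in for the set waveform:

```python
def make_placeholders(second_segments, marker=0):
    """For each piece of second voice data, generate fifth voice data of
    the same duration, made entirely of marker frames."""
    return [[marker] * len(seg) for seg in second_segments]

def insert_placeholders(second_segments, placeholders):
    """Build the sixth voice data: each second-voice piece immediately
    followed by its equal-length placeholder."""
    sixth = []
    for seg, ph in zip(second_segments, placeholders):
        sixth.extend(seg)
        sixth.extend(ph)
    return sixth

segments = [[3, 1], [4, 1, 5]]
sixth = insert_placeholders(segments, make_placeholders(segments))
print(sixth)
# [3, 1, 0, 0, 4, 1, 5, 0, 0, 0]
```

Matching each placeholder's duration to its segment means the sixth voice data is exactly twice as long as the first voice data, which keeps downstream timing arithmetic simple.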
In an embodiment, the voice frames with the set waveform that constitute the fifth voice data include either of the following:
silence frames;
white noise frames.
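Both frame types are easy to synthesize. A minimal sketch with an assumed frame size of 160 samples (10 ms at a 16 kHz sample rate, a common but here hypothetical choice):

```python
import random

FRAME_SAMPLES = 160  # assumed: 10 ms per frame at 16 kHz

def silence_frame():
    """A silence (mute) frame: all-zero samples."""
    return [0] * FRAME_SAMPLES

def white_noise_frame(amplitude=100):
    """A white-noise frame: uniform random samples around zero,
    at a low amplitude so the placeholder is unobtrusive."""
    return [random.randint(-amplitude, amplitude) for _ in range(FRAME_SAMPLES)]
```

A silence placeholder is inaudible but can be confused with natural pauses; low-amplitude white noise has a flat spectrum that is easy to detect reliably when locating placeholders by waveform.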
In an embodiment, the apparatus further includes:
a third obtaining unit, configured to obtain first image data matching the first voice data;
a second dividing unit, configured to divide the first image data into at least two pieces of second image data, each piece of second image data corresponding one-to-one with a piece of the at least two pieces of second voice data; and
a second inserting unit, configured to insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data,
where each second set position corresponds to the end position of a respective piece of the at least two pieces of second image data, and the fourth image data is used for synchronous playback with the fourth voice data.
In an embodiment, the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
determining the duration of each piece of the at least two pieces of second image data;
generating corresponding third image data for each piece of second image data, where each generated piece of third image data has the same duration as the corresponding second image data; and
inserting the corresponding third image data at each second set position in the first image data.
In an embodiment, the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
copying each piece of the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data; and
inserting the corresponding third image data at each second set position in the first image data.
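The copy-based variant amounts to replaying each video segment while its audio translation plays, so audio and video stay the same length. A rough sketch, with frames as opaque values in lists:

```python
def fill_with_copies(second_image_segments):
    """Build fourth image data by following each piece of second image
    data with a copy of itself: the copy serves as the third image data
    shown while the corresponding audio translation plays."""
    fourth = []
    for seg in second_image_segments:
        fourth.extend(seg)        # the original segment
        fourth.extend(list(seg))  # its copy, inserted at the end position
    return fourth

print(fill_with_copies([["f1", "f2"], ["f3"]]))
# ['f1', 'f2', 'f1', 'f2', 'f3', 'f3']
```

Because each copied segment has exactly the duration of its source segment, this pairs naturally with the audio embodiment in which each piece of fifth voice data matches the duration of its second voice data.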
In practical applications, the first obtaining unit 51, the first dividing unit 52, the second obtaining unit 53, the first inserting unit 54, the third obtaining unit, the second dividing unit, and the second inserting unit may be implemented by a processor in the electronic device, such as a central processing unit (CPU), a digital signal processor (DSP), a microcontroller unit (MCU), or a field-programmable gate array (FPGA).
It should be noted that when the data processing apparatus provided in the above embodiment performs data processing, the division into the above program modules is merely illustrative. In practical applications, the processing can be allocated to different program modules as needed; that is, the internal structure of the electronic device can be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided in the above embodiment and the embodiments of the data processing method belong to the same concept; for its specific implementation, refer to the method embodiments, which are not repeated here.
The data processing apparatus in the above embodiment can be deployed in the data processing system shown in FIG. 1, that is, on the machine simultaneous interpretation server of the data processing system in FIG. 1, so that the data processing method of the embodiments of this application is applied in the translation system.
Based on the hardware implementation of the above apparatus, an embodiment of this application further provides an electronic device. FIG. 6 is a schematic diagram of the hardware structure of the electronic device according to an embodiment of this application. As shown in FIG. 6, the electronic device 60 includes a memory 63, a processor 62, and a computer program stored in the memory 63 and executable on the processor 62; when the processor 62 executes the program, the method provided by one or more of the above technical solutions is implemented.
Specifically, when the processor 62 in the electronic device 60 executes the program, the following is implemented: obtaining first voice data; dividing the first voice data into at least two pieces of second voice data; obtaining third voice data corresponding to each piece of the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data; and inserting corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data, where each first set position corresponds to the end position of a respective piece of the at least two pieces of second voice data.
It should be noted that the specific steps implemented when the processor 62 executes the program have been described in detail in the above embodiments and are not repeated here.
It can be understood that the electronic device 60 further includes a communication interface 61, and the components of the electronic device 60 are coupled together through a bus system 64. The bus system 64 is configured to implement connection and communication between these components; in addition to a data bus, it also includes a power bus, a control bus, and a status signal bus.
It can be understood that the memory in the embodiment of FIG. 6 above may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferroelectric random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memories described in the embodiments of this application are intended to include, but are not limited to, these and any other suitable types of memory.
The methods disclosed in the above embodiments of this application may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above methods can be completed by an integrated hardware logic circuit in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium; the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
An embodiment of this application further provides a storage medium, specifically a computer storage medium, and more specifically a computer-readable storage medium, on which computer instructions (that is, a computer program) are stored; when the computer instructions are executed by a processor, the method provided by one or more of the above technical solutions is implemented.
In the several embodiments provided in this application, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may all be integrated into one second processing unit, each unit may separately serve as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
A person of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware instructed by a program; the program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of this application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
It should be noted that "first", "second", and the like are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
In addition, the technical solutions described in the embodiments of this application can be combined arbitrarily, provided they do not conflict.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in this application, and these shall all be covered within the protection scope of this application.

Claims (11)

  1. A data processing method, comprising:
    obtaining first voice data;
    dividing the first voice data into at least two pieces of second voice data;
    obtaining third voice data corresponding to each piece of the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data; and
    inserting corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data, wherein
    each first set position corresponds to the end position of a respective piece of the at least two pieces of second voice data.
  2. The method according to claim 1, wherein the dividing the first voice data into at least two pieces of second voice data further comprises:
    inserting a piece of fifth voice data at each first set position in the first voice data to obtain sixth voice data corresponding to the first voice data, the fifth voice data consisting of voice frames with a set waveform; and
    wherein the inserting corresponding third voice data at each first set position in the first voice data comprises:
    locating, according to the set waveform, the fifth voice data corresponding to each piece of second voice data in the sixth voice data; and
    replacing each located piece of fifth voice data with the corresponding third voice data.
  3. The method according to claim 2, wherein the inserting a piece of fifth voice data at each first set position in the first voice data comprises:
    generating fifth voice data of a set duration; and
    inserting the fifth voice data of the set duration at each first set position in the first voice data.
  4. The method according to claim 2, wherein the inserting a piece of fifth voice data at each first set position in the first voice data comprises:
    determining the duration of each piece of the at least two pieces of second voice data;
    generating corresponding fifth voice data for each piece of second voice data, wherein each generated piece of fifth voice data has the same duration as the corresponding second voice data; and
    inserting the corresponding fifth voice data at each first set position in the first voice data.
  5. The method according to any one of claims 2 to 4, wherein the voice frames with the set waveform that constitute the fifth voice data comprise either of the following:
    silence frames;
    white noise frames.
  6. The method according to claim 1, further comprising:
    obtaining first image data matching the first voice data;
    dividing the first image data into at least two pieces of second image data, each piece of second image data corresponding one-to-one with a piece of the at least two pieces of second voice data; and
    inserting corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data, wherein
    each second set position corresponds to the end position of a respective piece of the at least two pieces of second image data, and the fourth image data is used for synchronous playback with the fourth voice data.
  7. The method according to claim 6, wherein the inserting corresponding third image data at each second set position in the first image data comprises:
    determining the duration of each piece of the at least two pieces of second image data;
    generating corresponding third image data for each piece of second image data, wherein each generated piece of third image data has the same duration as the corresponding second image data; and
    inserting the corresponding third image data at each second set position in the first image data.
  8. The method according to claim 6, wherein the inserting corresponding third image data at each second set position in the first image data comprises:
    copying each piece of the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data; and
    inserting the corresponding third image data at each second set position in the first image data.
  9. A data processing apparatus, comprising:
    a first obtaining unit, configured to obtain first voice data;
    a first dividing unit, configured to divide the first voice data into at least two pieces of second voice data;
    a second obtaining unit, configured to obtain third voice data corresponding to each piece of the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data; and
    a first inserting unit, configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data, wherein
    each first set position corresponds to the end position of a respective piece of the at least two pieces of second voice data.
  10. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 8.
  11. A storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.
PCT/CN2019/122789 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium WO2021109000A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/122789 WO2021109000A1 (en) 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium
CN201980100991.1A CN114503192A (en) 2019-12-03 2019-12-03 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/122789 WO2021109000A1 (en) 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021109000A1 true WO2021109000A1 (en) 2021-06-10

Family

ID=76221325

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122789 WO2021109000A1 (en) 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114503192A (en)
WO (1) WO2021109000A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
US20150081274A1 (en) * 2013-09-19 2015-03-19 Kabushiki Kaisha Toshiba System and method for translating speech, and non-transitory computer readable medium thereof
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN107910004A (en) * 2017-11-10 2018-04-13 科大讯飞股份有限公司 Voiced translation processing method and processing device
CN108090051A (en) * 2017-12-20 2018-05-29 深圳市沃特沃德股份有限公司 The interpretation method and translator of continuous long voice document
CN110322869A (en) * 2019-05-21 2019-10-11 平安科技(深圳)有限公司 Meeting subangle color phoneme synthesizing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114503192A (en) 2022-05-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19955049

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10.11.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19955049

Country of ref document: EP

Kind code of ref document: A1