WO2021109000A1 - Data processing method and apparatus, electronic device, and storage medium - Google Patents

Data processing method and apparatus, electronic device, and storage medium Download PDF

Info

Publication number
WO2021109000A1
WO2021109000A1 (application PCT/CN2019/122789)
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
data
image data
voice
piece
Prior art date
Application number
PCT/CN2019/122789
Other languages
French (fr)
Chinese (zh)
Inventor
邓淑清
Original Assignee
深圳市欢太科技有限公司
Oppo广东移动通信有限公司
Priority date
Filing date
Publication date
Application filed by 深圳市欢太科技有限公司, Oppo广东移动通信有限公司 filed Critical 深圳市欢太科技有限公司
Priority to PCT/CN2019/122789 priority Critical patent/WO2021109000A1/en
Priority to CN201980100991.1A priority patent/CN114503192A/en
Publication of WO2021109000A1 publication Critical patent/WO2021109000A1/en

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00 - Speech recognition
    • G10L 15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04 - Segmentation; Word boundary detection

Definitions

  • This application relates to simultaneous interpretation technology, in particular to a data processing method, device, electronic equipment and storage medium.
  • the embodiment of the application provides a data processing method, including:
  • Each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
  • the dividing the first voice data into at least two pieces of second voice data further includes:
  • a piece of fifth voice data is inserted into each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data;
  • the fifth voice data is composed of voice frames with a set waveform;
  • the inserting corresponding third voice data at each first set position in the first voice data includes:
  • the inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the voice frames constituting the set waveform of the fifth voice data include any one of the following:
  • each piece of second image data has a one-to-one correspondence with each piece of second voice data in the at least two pieces of second voice data;
  • Each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data; the fourth image data is used to play synchronously with the fourth voice data.
  • the inserting corresponding third image data at each second set position in the first image data includes:
  • the corresponding third image data is inserted at each second set position in the first image data.
  • the embodiment of the present application also provides a data processing device, including:
  • a dividing unit configured to divide the first voice data into at least two pieces of second voice data
  • the second acquiring unit is configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data; the third voice data is obtained by translating the corresponding second voice data;
  • the inserting unit is configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data;
  • An embodiment of the present application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and capable of running on the processor.
  • the processor executes the computer program, the foregoing Steps of any data processing method.
  • the embodiment of the present application also provides a storage medium on which computer instructions are stored, and when the computer instructions are executed by a processor, the steps of any of the foregoing data processing methods are implemented.
  • FIG. 3 is a schematic diagram of another flow chart of a data processing method according to an embodiment of the application.
  • the translation server translates the recognized text to obtain the translation result;
  • the machine simultaneous interpretation server generates the translated synthesized sound according to the translation result, sends it to the PC client, and sends the translated synthesized sound to the viewer's mobile terminal through the mobile terminal distribution server;
  • the audience's mobile terminal will play back the live conference.
  • the playback audio is recorded while the speaker's original sound is played aloud and contains two audio tracks, corresponding to the speaker's voice and the translated synthesized sound, so during playback the speaker's voice and the translated synthesized sound are played synchronously; the two overlap and interfere with each other.
  • the audience involuntarily listens to the audio of the two languages at the same time, which degrades the listening experience; the mixture of the two languages also hinders the audience's understanding of the speech content and reduces the efficiency of information transmission.
  • data processing is performed on the playback audio: the voice data corresponding to the speaker's voice is divided into multiple voice data segments, and the corresponding translated synthesized voice is inserted between these segments. On this basis, the effect of consecutive interpretation is achieved during playback of the live meeting, which helps the audience better understand the speaker's content and improves the efficiency of information transmission.
  • Step 202 Divide the first voice data into at least two segments of second voice data.
  • during natural speech, a short pause will be made between sentences, which appears in the waveform of the first voice data as a silent band whose duration is greater than a set duration threshold. Therefore, by detecting the waveform of the first voice data, the silent bands in the first voice data can be determined, and any position within a silent band can be used as a separation position to divide the first voice data into at least two pieces of second voice data.
  • the silent band can be understood as the band where the average amplitude of the waveform is less than the set average threshold.
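As a rough illustration of this silence-based division (not code from the patent; the frame size, amplitude threshold, and minimum silent-run length are invented for the sketch), a waveform can be split at silent bands like this:

```python
def split_on_silence(samples, frame_size=160, amp_threshold=0.02, min_silence_frames=3):
    """Divide a waveform into voice segments at silent bands.

    A fixed-size frame counts as 'silent' when its average absolute
    amplitude is below amp_threshold; a run of at least
    min_silence_frames silent frames is treated as a silent band, and
    the division is made where the next voiced frame begins. All
    parameter values here are illustrative, not taken from the patent.
    """
    frames = [samples[i:i + frame_size] for i in range(0, len(samples), frame_size)]
    flags = [sum(abs(s) for s in f) / len(f) < amp_threshold for f in frames]
    segments, current, silent_run = [], [], 0
    for frame, is_silent in zip(frames, flags):
        if is_silent:
            silent_run += 1
        else:
            if silent_run >= min_silence_frames and current:
                segments.append(current)  # cut inside the silent band
                current = []
            silent_run = 0
        current.extend(frame)
    if current:
        segments.append(current)
    return segments
```

In a real system the thresholds would be tuned to the recording conditions, or a voice activity detector would replace the plain amplitude test.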
  • the first voice data can be preprocessed, including filtering, noise reduction, gain amplification, and other preprocessing, and then the preprocessed first voice data can be divided.
  • machine translation technology can be used to translate from the source language to the target language.
  • available machine translation technologies include rule-based machine translation and corpus-based machine translation.
  • Step 204 Insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data.
  • each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
  • for example, if the first voice data is sequentially divided into second voice data 1, second voice data 2, and second voice data 3, corresponding to third voice data 1, third voice data 2, and third voice data 3 respectively, then after the insertion operation is completed, the fourth voice data corresponding to the first voice data is obtained. The composition of the fourth voice data is, in sequence: second voice data 1, third voice data 1, second voice data 2, third voice data 2, second voice data 3, and third voice data 3.
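The interleaving described above can be sketched as follows (illustrative only; segments are represented as plain sample lists, and the pairing between original and translated segments is positional):

```python
def insert_translations(second_segments, third_segments):
    """Build the 'fourth voice data' by appending each translated segment
    (third voice data) right after its source segment (second voice data),
    i.e. at the end position of each piece of second voice data."""
    fourth = []
    for original, translated in zip(second_segments, third_segments):
        fourth.extend(original)    # second voice data i
        fourth.extend(translated)  # third voice data i, at its end position
    return fourth
```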
  • the dividing the first voice data into at least two pieces of second voice data further includes:
  • a piece of fifth voice data is inserted at each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data; the fifth voice data is composed of voice frames with a set waveform.
  • the composition of the obtained sixth voice data is, in sequence: second voice data 1, fifth voice data 1, second voice data 2, fifth voice data 2, second voice data 3, and fifth voice data 3.
  • the waveform of a silent frame is a flat line.
  • the waveform of a white noise frame satisfies that, over a wide frequency range, each frequency band of equal bandwidth contains equal noise energy.
  • the waveforms of these two kinds of speech frames can be clearly distinguished from the speech frames collected while the speaker is speaking.
  • the duration of the fifth voice data should be significantly longer than the duration of any silent band or white noise band in the second voice data. In practical applications, a duration threshold significantly larger than the duration of any silent band or white noise band in the second voice data can be set, and the duration of the fifth voice data can be any duration greater than this threshold.
  • the fifth voice data is inserted into the first voice data so that the insertion position, that is, the first set position, can be located when the third voice data is subsequently inserted into the first voice data; in other words, its purpose is to identify the first set position. Based on this, inserting corresponding third voice data at each first set position in the first voice data includes:
  • inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the duration of all the fifth voice data inserted in the first voice data is the set duration.
  • a duration threshold that is significantly larger than the length of any silent band or white noise band in the second voice data can be set.
  • the set duration of the fifth voice data can be any duration greater than the duration threshold.
  • the corresponding fifth voice data is inserted at the end position of any segment of the second voice data of the first voice data, and the duration of the fifth voice data is the same as the duration of the second voice data.
  • the third voice data is obtained by translating the second voice data, so the duration of the third voice data is usually close to the duration of the second voice data. Therefore, the duration of the corresponding fifth voice data is determined here according to the duration of the second voice data. In this way, when the fifth voice data is replaced with the third voice data, fifth voice data of similar duration can be quickly located in the first voice data according to the duration of the third voice data and replaced with the third voice data of similar duration.
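A minimal sketch of this marker scheme (purely illustrative: it uses a flat, zero-amplitude run as the set waveform and a fixed marker length, details the patent does not pin down at this granularity; it relies on the requirement above that the marker be longer than any silent run inside real speech):

```python
SILENT = 0.0  # one sample of a silent frame: a flat, zero-amplitude waveform

def insert_markers(segments, marker_len):
    # Build the 'sixth voice data': each second-voice segment followed by
    # a run of silent samples acting as the fifth voice data.
    sixth = []
    for seg in segments:
        sixth.extend(seg)
        sixth.extend([SILENT] * marker_len)
    return sixth

def replace_markers(sixth, translations, marker_len):
    # Locate each marker by its flat waveform and splice in the matching
    # translated segment (third voice data). Assumes no zero run of
    # marker_len samples occurs inside genuine speech.
    out, i, t = [], 0, 0
    while i < len(sixth):
        if sixth[i:i + marker_len] == [SILENT] * marker_len and t < len(translations):
            out.extend(translations[t])
            t += 1
            i += marker_len
        else:
            out.append(sixth[i])
            i += 1
    return out
```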
  • Step 301 Acquire first image data matching the first voice data.
  • the first image data may be image data collected synchronously with the first voice data, for example, a conference scene image recorded while the speaker is speaking; or, the first image data may be image data output synchronously while the first voice data is collected, for example, presentation slides (PPT) played synchronously on the conference's large screen while the speaker is speaking.
  • Step 302 Divide the first image data into at least two pieces of second image data; each piece of second image data has a one-to-one correspondence with each piece of second voice data in the at least two pieces of second voice data.
  • the first image data is collected or output synchronously with the first voice data. Therefore, the first image data can be divided into at least two pieces of second image data based on the same time axis as the first voice data.
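One way to picture the shared-time-axis division (a hypothetical helper, not from the patent; the frame rate and boundary times are invented for the example):

```python
def split_frames_by_boundaries(frames, fps, boundaries_s):
    """Divide image frames into pieces of second image data using the same
    time axis as the voice segmentation: boundaries_s lists the end time,
    in seconds, of each voice segment, and fps is the video frame rate."""
    pieces, start = [], 0
    for end_s in boundaries_s:
        end = round(end_s * fps)
        pieces.append(frames[start:end])
        start = end
    if start < len(frames):
        pieces.append(frames[start:])  # trailing frames past the last boundary
    return pieces
```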
  • Step 303 Insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data;
  • Each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data; the fourth image data is used to play synchronously with the fourth voice data.
  • the fourth voice data and the fourth image data can be played synchronously when the live meeting is played back.
  • video data is formed by synthesizing voice data and image data. Therefore, with the above solution, the effect of consecutive interpretation can also be achieved for a video stream.
  • the inserting corresponding third image data at each second set position in the first image data includes:
  • the corresponding third image data is inserted at the end position of any piece of second image data of the first image data, and the duration of the third image data is the same as the duration of the second image data.
  • the second voice data is translated to obtain the third voice data.
  • the duration of the third voice data is usually close to the duration of the second voice data, and the second voice data is played synchronously with the second image data. Therefore, the duration of the third image data played in synchronization with the third voice data is also close to the duration of the second image data.
  • the content of the third image data may be an interpreter animation matching the third voice data.
  • the inserting corresponding third image data at each second set position in the first image data includes:
  • the third image data has the same data content as the corresponding second image data.
  • in this way, when the fourth voice data is played, the image data matching the corresponding second voice data is displayed while the third voice data is played. In practical applications, this is equivalent to reusing the video segment corresponding to the original sound as the video segment for the translated sound.
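The copy-based variant can be sketched as follows (illustrative only; the frame lists stand in for pieces of image data, and the helper name is invented):

```python
def build_fourth_image_data(second_pieces):
    # Third image data copies its matching second image data, so while the
    # translated audio (third voice data) plays, the viewer sees the same
    # frames that accompanied the original speech segment.
    fourth = []
    for piece in second_pieces:
        fourth.extend(piece)        # original segment
        fourth.extend(list(piece))  # copied segment shown during the translation
    return fourth
```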
  • the third voice data corresponding to the translated synthesized voice is inserted into the first voice data at intervals, and playing the obtained fourth voice data helps the audience understand the content of the speaker's speech, thereby improving the efficiency of information transmission.
  • Figure 4 is a schematic diagram of the implementation process of the data processing method provided by an application embodiment of the present application. This application embodiment is applied to a simultaneous interpretation conference scenario and is used to implement consecutive interpretation when the live meeting is played back, as shown in Figure 4:
  • Step 401 The microphone collects the original voice of the speaker to obtain the first voice data.
  • Step 402 Preprocessing the first voice data, including filtering, noise reduction, gain amplification, and so on.
  • Step 404 Perform voice recognition on each piece of second voice data to obtain the corresponding recognized text.
  • for example, the first segment of second voice data is recognized as "Good morning everyone", and the second segment of second voice data is recognized as "It's nice to meet everyone".
  • Step 405 Preprocess the recognized text to eliminate problems such as recognition errors, colloquial expressions, excessive modal words, and repeated content, which cause incoherent sentences in the recognized text.
  • Step 406 Perform machine translation on the recognized text.
  • using machine translation technology, "Good morning everyone" is translated into "Good morning, everyone!", and "It's nice to meet everyone" is translated into "Nice to meet you".
  • Step 407 Based on the translation obtained after the machine translation, a corresponding translation synthesis sound is generated.
  • Step 409 Generate fourth image data matching the fourth voice data.
  • the fourth voice data and the fourth image data are played synchronously, and the effect of consecutive interpretation can be obtained when the live meeting is played back.
  • FIG. 5 is a schematic diagram of the composition structure of a data processing device according to an embodiment of the application. As shown in Figure 5, the data processing device includes:
  • the first dividing unit 52 is configured to divide the first voice data into at least two pieces of second voice data
  • the second acquiring unit 53 is configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data; the third voice data is obtained by translating the corresponding second voice data;
  • Each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
  • the first dividing unit 52 is further configured to:
  • a piece of fifth voice data is inserted into each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data;
  • the fifth voice data is composed of a voice frame of a set waveform composition;
  • the first dividing unit 52 inserting a piece of fifth voice data at each first set position in the first voice data includes:
  • the voice frames constituting the set waveform of the fifth voice data include any one of the following:
  • the device further includes:
  • the third acquiring unit is configured to acquire first image data matching the first voice data
  • the second inserting unit is configured to insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data;
  • the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
  • the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
  • the second inserting unit may be implemented by a processor in an electronic device, such as a central processing unit (CPU, Central Processing Unit), a digital signal processor (DSP, Digital Signal Processor), a micro control unit (MCU, Microcontroller Unit), or a field-programmable gate array (FPGA, Field-Programmable Gate Array).
  • when the data processing device provided in the above embodiment performs data processing, the division into the above program modules is only used as an example for illustration. In practical applications, the above processing can be allocated to different program modules as needed; that is, the internal structure of the electronic device can be divided into different program modules to complete all or part of the processing described above.
  • the device provided in the foregoing embodiment and the data processing method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, which will not be repeated here.
  • the data processing device in the above embodiment can be installed in the data processing system shown in FIG. 1, that is, on the machine simultaneous interpretation server of the data processing system in FIG. 1, so that the data processing method of the embodiment of this application is applied in the translation system.
  • FIG. 6 is a schematic diagram of the hardware composition structure of an electronic device according to an embodiment of the application.
  • the electronic device 60 includes a memory 63, a processor 62, and a computer program stored on the memory 63 and capable of running on the processor 62; when the processor 62 executes the program, the method provided by one or more of the foregoing technical solutions is implemented.
  • the electronic device 60 further includes a communication interface 61; various components in the electronic device 60 are coupled together through the bus system 64. It can be understood that the bus system 64 is configured to implement connection and communication between these components. In addition to the data bus, the bus system 64 also includes a power bus, a control bus, and a status signal bus.
  • the memory in the embodiment of FIG. 6 above may be a volatile memory or a non-volatile memory, and may also include both volatile and non-volatile memory.
  • the non-volatile memory can be a read-only memory (ROM, Read-Only Memory), a programmable read-only memory (PROM, Programmable Read-Only Memory), an erasable programmable read-only memory (EPROM, Erasable Programmable Read-Only Memory), an electrically erasable programmable read-only memory (EEPROM, Electrically Erasable Programmable Read-Only Memory), a ferromagnetic random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM, Compact Disc Read-Only Memory); the magnetic surface memory can be a magnetic disk memory or a magnetic tape memory.
  • the volatile memory may be a random access memory (RAM, Random Access Memory), which is used as an external cache.
  • by way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM).
  • the memories described in the embodiments of the present application are intended to include, but are not limited to, these and any other suitable types of memories.
  • the methods disclosed in the foregoing embodiments of the present application may be applied to a processor or implemented by a processor.
  • the processor may be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the above method can be completed by an integrated logic circuit of hardware in the processor or instructions in the form of software.
  • the aforementioned processor may be a general-purpose processor, DSP, or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, and the like.
  • the processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of the present application.
  • the general-purpose processor may be a microprocessor or any conventional processor or the like.
  • the embodiment of the present application also provides a storage medium, which is specifically a computer storage medium, and more specifically, a computer-readable storage medium.
  • Stored thereon are computer instructions, that is, a computer program; when the computer instructions are executed by a processor, the method provided by one or more of the foregoing technical solutions is implemented.
  • the disclosed method and smart device can be implemented in other ways.
  • the device embodiments described above are merely illustrative.
  • the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components can be combined or integrated into another system, or some features can be ignored or not implemented.
  • the coupling, or direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
  • the units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units, that is, they may be located in one place or distributed on multiple network units; Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the embodiments of the present application may all be integrated into one processing unit, or each unit may separately serve as one unit, or two or more units may be integrated into one unit;
  • the above-mentioned integrated unit may be implemented in the form of hardware, or may be implemented in the form of hardware plus software functional units.
  • the foregoing program can be stored in a computer-readable storage medium; when the program is executed, the steps of the foregoing method embodiments are performed. The foregoing storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
  • if the above-mentioned integrated unit of the present application is implemented in the form of a software function module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
  • the computer software product is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the methods described in the various embodiments of the present application.
  • the aforementioned storage media include: removable storage devices, ROM, RAM, magnetic disks, or optical disks and other media that can store program codes.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

A data processing method and apparatus, an electronic device, and a storage medium, the method comprising: acquiring first voice data (201); dividing the first voice data into at least two segments of second voice data (202); acquiring third voice data corresponding to each segment of second voice data in the at least two segments of second voice data (203), the third voice data being obtained by translating the corresponding second voice data; and inserting the corresponding third voice data in each first set position in the first voice data so as to obtain fourth voice data corresponding to the first voice data (204), each first set position corresponding to the end position of each segment of second voice data in the at least two segments of second voice data.

Description

数据处理方法、装置、电子设备和存储介质Data processing method, device, electronic equipment and storage medium 技术领域Technical field
本申请涉及同声传译技术,具体涉及一种数据处理方法、装置、电子设备和存储介质。This application relates to simultaneous interpretation technology, in particular to a data processing method, device, electronic equipment and storage medium.
背景技术Background technique
在同声传译场景下,对会议实况进行回放时,发言者的原声和翻译合成声相互干扰,为观众理解发言内容带来阻碍,影响信息传递效率。In the scenario of simultaneous interpretation, when the live meeting is played back, the original voice of the speaker and the synthesized voice of the translation interfere with each other, which hinders the audience's understanding of the content of the speech and affects the efficiency of information transmission.
发明内容Summary of the invention
为解决相关技术问题,本申请实施例提供了一种数据处理方法、装置、电子设备和存储介质。To solve related technical problems, embodiments of the present application provide a data processing method, device, electronic equipment, and storage medium.
本申请实施例提供了一种数据处理方法,包括:The embodiment of the application provides a data processing method, including:
获取第一语音数据;Obtain the first voice data;
将所述第一语音数据划分成至少两段第二语音数据;Dividing the first voice data into at least two pieces of second voice data;
获取所述至少两段第二语音数据中的每段第二语音数据对应的第三语音数据;所述第三语音数据通过对对应的第二语音数据进行翻译得到;Acquiring third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data; the third voice data is obtained by translating the corresponding second voice data;
在所述第一语音数据中的每个第一设定位置插入对应的第三语音数据,得到所述第一语音数据对应的第四语音数据;其中,Insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data; wherein,
每个第一设定位置分别对应所述至少两段第二语音数据中的每段第二语音数据的结束位置。Each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
其中,上述方案中,所述将所述第一语音数据划分成至少两段第二语音数据,还包括:Wherein, in the above solution, the dividing the first voice data into at least two pieces of second voice data further includes:
在所述第一语音数据中的每个第一设定位置插入一段第五语音数据,得到所述第一语音数据对应的第六语音数据;所述第五语音数据由设定波形的语音帧组成;A piece of fifth voice data is inserted into each first set position in the first voice data to obtain the sixth voice data corresponding to the first voice data; the fifth voice data is composed of a voice frame of a set waveform composition;
所述在所述第一语音数据中的每个第一设定位置插入对应的第三语音数据,包括:The inserting corresponding third voice data at each first set position in the first voice data includes:
根据所述设定波形,在所述第六语音数据中定位出每段第二语音数据对应的第五语音数据;Locate the fifth voice data corresponding to each piece of second voice data in the sixth voice data according to the set waveform;
将定位出的每段第二语音数据对应的第五语音数据替换为对应的第三语音数据。Replace the fifth voice data corresponding to each segment of the second voice data that is located with the corresponding third voice data.
上述方案中,所述在所述第一语音数据中的每个第一设定位置插入 一段第五语音数据,包括:In the above solution, the inserting a piece of fifth voice data at each first set position in the first voice data includes:
生成设定时长的第五语音数据;Generate the fifth voice data of the set duration;
在在所述第一语音数据中的每个第一设定位置插入所述设定时长的第五语音数据。Inserting the fifth voice data of the set duration in each first set position in the first voice data.
上述方案中,所述在所述第一语音数据中的每个第一设定位置插入一段第五语音数据,包括:In the above solution, the inserting a piece of fifth voice data at each first set position in the first voice data includes:
确定所述至少两段第二语音数据中的每段第二语音数据的时长;Determining the duration of each piece of second voice data in the at least two pieces of second voice data;
为所述至少两段第二语音数据中的每段第二语音数据生成对应的第五语音数据;生成的每段第五语音数据的时长与对应的第二语音数据的时长相同;Generating corresponding fifth voice data for each piece of second voice data in the at least two pieces of second voice data; the duration of each generated piece of fifth voice data is the same as the duration of the corresponding second voice data;
在所述第一语音数据中的每个第一设定位置插入对应的第五语音数据。Insert corresponding fifth voice data at each first set position in the first voice data.
上述方案中,组成所述第五语音数据的设定波形的语音帧包括以下任意一项:In the above solution, the voice frames constituting the set waveform of the fifth voice data include any one of the following:
静音帧;Mute frame
白噪声帧。White noise frame.
In the above solution, the method further includes:
acquiring first image data matching the first voice data;
dividing the first image data into at least two pieces of second image data, where each piece of second image data corresponds one-to-one to each piece of second voice data in the at least two pieces of second voice data;
inserting corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data; where
each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data, and the fourth image data is played in synchronization with the fourth voice data.
In the above solution, inserting the corresponding third image data at each second set position in the first image data includes:
determining the duration of each piece of second image data in the at least two pieces of second image data;
generating, for each piece of second image data, corresponding third image data, where the duration of each generated piece of third image data is the same as the duration of the corresponding second image data;
inserting the corresponding third image data at each second set position in the first image data.
In the above solution, inserting the corresponding third image data at each second set position in the first image data includes:
copying each piece of second image data in the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data;
inserting the corresponding third image data at each second set position in the first image data.
An embodiment of the present application further provides a data processing apparatus, including:
a first acquiring unit, configured to acquire first voice data;
a dividing unit, configured to divide the first voice data into at least two pieces of second voice data;
a second acquiring unit, configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data;
an inserting unit, configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data; where
each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
An embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of any one of the data processing methods described above.
An embodiment of the present application further provides a storage medium storing computer instructions which, when executed by a processor, implement the steps of any one of the data processing methods described above.
With the data processing method, apparatus, electronic device, and storage medium provided in the embodiments of the present application, after the first voice data corresponding to the original voice is collected, the translated synthesized voice data is inserted into the first voice data at intervals. The resulting fourth voice data, when played back, alternates between the original voice and the translated synthesized voice, thereby achieving consecutive interpretation and avoiding the auditory interference caused to the audience when the original voice and the translated synthesized voice are played simultaneously. In the scenario of replaying a live conference, this helps the audience understand the speaker's content more quickly and accurately, thereby improving the efficiency of information transmission.
Description of the Drawings
FIG. 1 is a schematic diagram of a system architecture to which a data processing method in the related art is applied;
FIG. 2 is a schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 3 is another schematic flowchart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flowchart of a data processing method according to an application embodiment of the present application;
FIG. 5 is a schematic diagram of the composition structure of a data processing apparatus according to an embodiment of the present application;
FIG. 6 is a schematic diagram of the hardware composition structure of an electronic device according to an embodiment of the present application.
Detailed Description
Before the technical solutions of the embodiments of the present application are described in detail, the system to which the data processing method in the related art is applied is briefly described.
FIG. 1 is a schematic diagram of a system architecture to which the data processing method in the related art is applied. As shown in FIG. 1, the system may include: a machine simultaneous interpretation server, a speech recognition server, a translation server, a mobile-terminal delivery server, audience mobile terminals, a personal computer (PC) client, and a display screen.
In practical applications, a speaker can speak at a conference through the PC client and project presentation documents, such as PowerPoint (PPT) files, onto the display screen for the audience. While the speaker is speaking, the PC client collects the speaker's audio and sends it to the machine simultaneous interpretation server. The machine simultaneous interpretation server recognizes the audio data through the speech recognition server to obtain recognized text, and then translates the recognized text through the translation server to obtain a translation result. Based on the translation result, the machine simultaneous interpretation server generates a translated synthesized voice, sends it to the PC client, and delivers it to the audience mobile terminals through the mobile-terminal delivery server, thereby presenting the translation result to users. In this way, the speaker's content is translated into the language required by each user and presented, realizing simultaneous interpretation in the conference.
In the related-art solution, after the conference speech ends, the audience mobile terminal replays the live conference. The replayed audio, recorded while the speaker's original voice was played aloud, contains two audio tracks corresponding to the speaker's voice and the translated synthesized voice. During playback, the speaker's voice and the translated synthesized voice are therefore played simultaneously, overlapping and interfering with each other. On the one hand, the audience involuntarily listens to audio in two languages at once, causing auditory discomfort; on the other hand, the mixture of the two languages hinders the audience's understanding of the speech and reduces the efficiency of information transmission.
In view of this, in various embodiments of the present application, the replayed audio is processed: the voice data corresponding to the speaker's voice is divided into multiple voice data segments, and the corresponding translated synthesized voice is inserted between these segments. On this basis, the effect of consecutive interpretation is achieved when the live conference is replayed, which helps the audience better understand the speaker's content and improves the efficiency of information transmission.
The present application is further described in detail below with reference to the accompanying drawings and specific embodiments.
An embodiment of the present application provides a data processing method. FIG. 2 is a schematic flowchart of the data processing method of the embodiment of the present application. As shown in FIG. 2, the method includes:
Step 201: acquire first voice data.
Here, the first voice data is voice data in the source language, that is, voice data that has not been translated. In practical applications, when a speaker speaks, the speaker's voice can be collected through a voice collection module, such as a microphone, to obtain the first voice data.
Step 202: divide the first voice data into at least two pieces of second voice data.
Usually, when the speaker finishes a sentence, there is a short pause, which appears in the waveform of the first voice data as a silent band whose duration is greater than a set duration threshold. Therefore, by detecting the waveform of the first voice data, the silent bands in the first voice data are determined, and any position within a silent band is used as a separation position to divide the first voice data into at least two pieces of second voice data.
Considering that there is a certain amount of noise at the conference site, a silent band here can be understood as a band in which the mean amplitude of the waveform is less than a set mean threshold. In practical applications, after the first voice data is acquired, it can be preprocessed, including filtering, noise reduction, and gain amplification, before the division operation is performed.
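The pause-based segmentation described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes mono PCM samples, frames them, marks a frame silent when its mean absolute amplitude falls below a threshold, and splits at runs of silent frames longer than a minimum length (short intra-sentence silences are kept). All parameter names are illustrative.

```python
def split_on_silence(samples, frame_len, amp_threshold, min_silent_frames):
    """Split a mono sample sequence into speech segments at silent bands.

    A frame is 'silent' when its mean absolute amplitude is below
    amp_threshold; a run of at least min_silent_frames silent frames is
    treated as a sentence pause and used as a separation position.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    silent = [sum(abs(s) for s in f) / len(f) < amp_threshold for f in frames]

    segments, current, run = [], [], 0
    for frame, is_silent in zip(frames, silent):
        if is_silent:
            run += 1
            if run == min_silent_frames:
                if current:
                    segments.append(current)  # close the segment at the pause
                    current = []
                continue
            if run > min_silent_frames:
                continue  # still inside the long pause, drop it
        else:
            run = 0
        current.extend(frame)  # speech, or a short silence within speech
    if current:
        segments.append(current)
    return segments
```

In a real system the threshold would be calibrated against the conference-site noise floor, which is why the text suggests filtering and noise reduction before division.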
Step 203: acquire third voice data corresponding to each of the at least two pieces of second voice data.
Here, the third voice data is obtained by translating the corresponding second voice data. For each piece of second voice data divided from the first voice data, the second voice data is converted into text through speech recognition, the text is then translated from the source language into the target language through machine translation, and speech synthesis is finally performed on the target-language text to obtain the third voice data corresponding to that piece of second voice data.
In practical applications, the translation from the source language to the target language can be performed through machine translation technology, including rule-based machine translation or corpus-based machine translation.
Step 204: insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data.
Here, each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
For example, if the first voice data is divided in sequence into second voice data 1, second voice data 2, and second voice data 3, corresponding to third voice data 1, third voice data 2, and third voice data 3 respectively, then after the insertion operation is completed, the resulting fourth voice data consists, in sequence, of second voice data 1, third voice data 1, second voice data 2, third voice data 2, second voice data 3, and third voice data 3.
In this way, when the live conference is replayed, playing the fourth voice data achieves the effect of consecutive interpretation.
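The interleaving of step 204 reduces to appending each translated segment immediately after its original segment. A minimal sketch, assuming the segments are already paired up in order:

```python
def interleave(second_segments, third_segments):
    """Build the 'fourth voice data' by appending each translated segment
    (third voice data) right after the original segment (second voice
    data) it corresponds to."""
    assert len(second_segments) == len(third_segments)
    fourth = []
    for original, translated in zip(second_segments, third_segments):
        fourth.extend(original)
        fourth.extend(translated)
    return fourth
```

Played back as one stream, the result alternates original speech and translation, which is exactly the consecutive-interpretation effect the text describes.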
In an embodiment, dividing the first voice data into at least two pieces of second voice data further includes:
inserting a piece of fifth voice data at each first set position in the first voice data to obtain sixth voice data corresponding to the first voice data, where the fifth voice data consists of voice frames of a set waveform.
Here, after the first voice data is divided into at least two pieces of second voice data, a piece of fifth voice data is first inserted at each of the above first set positions in the first voice data to obtain the sixth voice data corresponding to the first voice data. For example, if the first voice data is divided in sequence into second voice data 1, second voice data 2, and second voice data 3, the resulting sixth voice data consists, in sequence, of second voice data 1, fifth voice data 1, second voice data 2, fifth voice data 2, second voice data 3, and fifth voice data 3.
Here, the voice frames of the set waveform constituting the fifth voice data include any one of the following:
silence frames;
white noise frames.
The waveform of a silence frame is a straight line, and the waveform of a white noise frame is such that, over a wide frequency range, frequency bands of equal bandwidth contain equal noise energy. The waveforms of both kinds of voice frames can be clearly distinguished from the voice frames collected while the speaker is talking.
Here, the duration of the fifth voice data can be significantly longer than the duration of any silent band or white noise band within the second voice data. In practical applications, a duration threshold significantly larger than the duration of any such band can be set, and the duration of the fifth voice data can be any duration greater than this threshold.
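Generating the two kinds of placeholder frames is straightforward; a hedged sketch, assuming integer PCM samples (the `amplitude` value is illustrative, not from the source):

```python
import random

def make_placeholder(num_samples, kind="silence", amplitude=100):
    """Generate a placeholder segment ('fifth voice data') whose waveform
    is trivially distinguishable from speech: either all-zero samples
    (a silence frame's flat-line waveform) or uniform white noise."""
    if kind == "silence":
        return [0] * num_samples
    if kind == "white_noise":
        return [random.randint(-amplitude, amplitude) for _ in range(num_samples)]
    raise ValueError("kind must be 'silence' or 'white_noise'")
```

Making `num_samples` exceed the longest natural pause in the speech is what lets the placeholder be located unambiguously later.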
Here, the purpose of inserting the fifth voice data into the first voice data is to make it easy to locate the insertion position, that is, the first set position, when the third voice data is subsequently inserted; in other words, the fifth voice data marks the first set position. On this basis, inserting the corresponding third voice data at each first set position in the first voice data includes:
locating, according to the set waveform, the fifth voice data corresponding to each piece of second voice data in the sixth voice data;
replacing the located fifth voice data corresponding to each piece of second voice data with the corresponding third voice data.
Here, when the third voice data is inserted, a voice data segment with the set waveform is located in the sixth voice data; the corresponding third voice data is determined from the piece of second voice data that is adjacent to and precedes that segment; and the segment of the set waveform is then replaced with the determined third voice data, completing the insertion of the third voice data.
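The locate-and-replace step above can be sketched as a single pass over the sixth voice data. This is an illustrative simplification: `is_placeholder` tests one sample against the set waveform (e.g. zero for silence), and `min_run` filters out short silent bands that occur naturally within speech, per the duration-threshold discussion above.

```python
def replace_placeholders(sixth, translated_segments, is_placeholder, min_run):
    """Scan the sixth voice data, and replace each placeholder run (the
    fifth voice data) with the next translated segment (third voice
    data), in order of appearance."""
    out, i, seg_idx = [], 0, 0
    while i < len(sixth):
        if is_placeholder(sixth[i]):
            j = i
            while j < len(sixth) and is_placeholder(sixth[j]):
                j += 1  # extend to the end of the placeholder run
            if j - i >= min_run and seg_idx < len(translated_segments):
                out.extend(translated_segments[seg_idx])  # swap in translation
                seg_idx += 1
            else:
                out.extend(sixth[i:j])  # short pause inside speech: keep it
            i = j
        else:
            out.append(sixth[i])
            i += 1
    return out
```

Because placeholders appear in the same order as the segments they follow, a simple in-order replacement pairs each translation with the correct original segment.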
In an embodiment, inserting a piece of fifth voice data at each first set position in the first voice data includes:
generating fifth voice data of a set duration;
inserting the fifth voice data of the set duration at each first set position in the first voice data.
Here, all pieces of fifth voice data inserted into the first voice data have the set duration. As described above, a duration threshold significantly larger than the duration of any silent band or white noise band within the second voice data can be set, and the set duration of the fifth voice data can be any duration greater than this threshold.
In an embodiment, inserting a piece of fifth voice data at each first set position in the first voice data includes:
determining the duration of each piece of second voice data in the at least two pieces of second voice data;
generating, for each piece of second voice data, corresponding fifth voice data, where the duration of each generated piece of fifth voice data is the same as the duration of the corresponding second voice data;
inserting the corresponding fifth voice data at each first set position in the first voice data.
Here, the corresponding fifth voice data is inserted at the end position of each piece of second voice data in the first voice data, and its duration is the same as that of the second voice data. In practical applications, the third voice data obtained by translating a piece of second voice data usually has a duration close to that of the second voice data. Therefore, the duration of the corresponding fifth voice data is determined here according to the duration of the second voice data, so that when the fifth voice data is replaced with the third voice data, fifth voice data of approximately matching duration can be quickly located according to the duration of the third voice data and replaced with it.
In an embodiment, as shown in FIG. 3, the method further includes:
Step 301: acquire first image data matching the first voice data.
In practical applications, in a conference scenario, the first image data may be image data collected synchronously with the first voice data, for example, conference scene images recorded while the speaker is speaking; or the first image data may be image data output synchronously while the first voice data is collected, for example, a PPT played synchronously on the large conference screen while the speaker is speaking.
Step 302: divide the first image data into at least two pieces of second image data, where each piece of second image data corresponds one-to-one to each piece of second voice data in the at least two pieces of second voice data.
As described above, the first image data is collected or output synchronously with the first voice data; therefore, the first image data can be divided into at least two pieces of second image data based on the same time axis as the first voice data.
Step 303: insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data; where
each second set position corresponds to the end position of each piece of second image data in the at least two pieces of second image data, and the fourth image data is played in synchronization with the fourth voice data.
For example, if the first image data is divided in sequence into second image data 1, second image data 2, and second image data 3, corresponding to third image data 1, third image data 2, and third image data 3 respectively, then after the insertion operation is completed, the resulting fourth image data consists, in sequence, of second image data 1, third image data 1, second image data 2, third image data 2, second image data 3, and third image data 3.
In this way, when the live conference is replayed, the fourth voice data and the fourth image data can be played synchronously. Since combining voice data and image data forms video data, the above solution can achieve the effect of consecutive interpretation through a video stream.
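Dividing the image data "based on the same time axis" as the audio can be sketched as cutting the frame sequence at the same timestamps where the audio was split. A minimal illustration, assuming a constant frame rate and audio split points expressed in seconds (both assumptions, not stated in the source):

```python
def split_frames_by_boundaries(frames, fps, boundaries_sec):
    """Split a video frame list into segments at the same time positions
    where the audio was split, so each second-image segment lines up
    one-to-one with a second-voice segment."""
    segments, start = [], 0
    for t in boundaries_sec:
        cut = int(round(t * fps))
        segments.append(frames[start:cut])
        start = cut
    segments.append(frames[start:])  # tail after the last boundary
    return segments
```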
In an embodiment, inserting the corresponding third image data at each second set position in the first image data includes:
determining the duration of each piece of second image data in the at least two pieces of second image data;
generating, for each piece of second image data, corresponding third image data, where the duration of each generated piece of third image data is the same as the duration of the corresponding second image data;
inserting the corresponding third image data at each second set position in the first image data.
Here, the corresponding third image data is inserted at the end position of each piece of second image data in the first image data, and its duration is the same as that of the second image data. In practical applications, the third voice data obtained by translating a piece of second voice data usually has a duration close to that of the second voice data, and the second voice data is played synchronously with the second image data; therefore, the duration of the third image data to be played synchronously with the third voice data is also close to the duration of the second image data.
In practical applications, the content of the third image data may be an interpreter animation matching the third voice data.
In an embodiment, inserting the corresponding third image data at each second set position in the first image data includes:
copying each piece of second image data in the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data;
inserting the corresponding third image data at each second set position in the first image data.
Here, through copying, the third image data has the same content as the corresponding second image data. In this way, when the fourth voice data is played and the third voice data within it is reached, image data matching the corresponding second voice data is displayed. In practical applications, this is equivalent to using the video segment corresponding to the original voice as the video segment for the translated voice.
With the data processing method provided in the embodiments of the present application, after the first voice data corresponding to the original voice is collected, the translated synthesized voice data is inserted into the first voice data at intervals. The resulting fourth voice data, when played back, alternates between the original voice and the translated synthesized voice, thereby achieving consecutive interpretation and avoiding the auditory interference caused to the audience when the original voice and the translated synthesized voice are played simultaneously. In the scenario of replaying a live conference, this helps the audience understand the speaker's content more quickly and accurately, thereby improving the efficiency of information transmission.
FIG. 4 is a schematic flowchart of the data processing method provided by an application embodiment of the present application. This application embodiment is applied to a conference scenario with simultaneous interpretation and is used to achieve consecutive interpretation when the live conference is replayed. As shown in FIG. 4:
Step 401: a microphone collects the speaker's original voice to obtain first voice data.
Step 402: the first voice data is preprocessed, including filtering, noise reduction, and gain amplification.
Step 403: according to the sentence pause intervals in the first voice data, the first voice data is divided into at least two pieces of second voice data, and a silence frame of a certain duration is added between every two pieces of second voice data.
Step 404: speech recognition is performed on each piece of second voice data to obtain the corresponding recognized text. For example, the first piece of second voice data is recognized as "大家早上好", and the second piece as "很高兴认识大家".
Step 405: the recognized text is preprocessed to eliminate problems that make its sentences disfluent, such as recognition errors, colloquial expressions, excessive filler words, and repeated content.
Step 406: machine translation is performed on the recognized text. Here, "大家早上好" is translated into "Good morning, everyone!", and "很高兴认识大家" is translated into "Nice to meet you".
Step 407: based on the translation obtained by machine translation, a corresponding translated synthesized voice is generated.
Step 408: the translated synthesized voice is inserted into the corresponding intervals of the first voice data processed in step 403, and the pieces are spliced to obtain fourth voice data.
Step 409: fourth image data matching the fourth voice data is generated. Playing the fourth voice data and the fourth image data synchronously achieves the effect of consecutive interpretation when the live conference is replayed.
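Steps 403 through 408 can be sketched end to end as a pipeline. This is a simplified illustration that collapses the placeholder insertion and replacement of steps 403/408 into a direct splice; `recognize`, `translate`, `synthesize`, and `split` stand in for the speech-recognition, machine-translation, speech-synthesis, and pause-detection services, which are external systems here.

```python
def consecutive_interpretation_pipeline(samples, recognize, translate, synthesize, split):
    """Sketch of steps 403-408: split the original audio at sentence
    pauses, then splice each segment's synthesized translation in right
    after the segment it translates."""
    segments = split(samples)                      # step 403: split at pauses
    fourth = []
    for seg in segments:
        text = recognize(seg)                      # step 404: speech recognition
        translation = translate(text)              # steps 405-406: clean up + translate
        fourth.extend(seg)                         # original segment
        fourth.extend(synthesize(translation))     # steps 407-408: insert translation
    return fourth
```

With stub callables standing in for the real services, the pipeline yields the alternating original/translation stream that step 409 pairs with matching image data.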
To implement the data processing method of the embodiments of the present application, an embodiment of the present application further provides a data processing apparatus, which can be arranged in an electronic device and serve as the machine simultaneous interpretation server in FIG. 1. FIG. 5 is a schematic diagram of the composition structure of the data processing apparatus according to an embodiment of the present application. As shown in FIG. 5, the data processing apparatus includes:
a first acquiring unit 51, configured to acquire first voice data;
a first dividing unit 52, configured to divide the first voice data into at least two pieces of second voice data;
a second acquiring unit 53, configured to acquire third voice data corresponding to each piece of second voice data in the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data;
a first inserting unit 54, configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data; where
each first set position corresponds to the end position of each piece of second voice data in the at least two pieces of second voice data.
In an embodiment, the first dividing unit 52 is further configured to:
insert a piece of fifth voice data at each first set position in the first voice data to obtain sixth voice data corresponding to the first voice data, the fifth voice data consisting of voice frames with a set waveform.
In this embodiment, inserting corresponding third voice data at each first set position in the first voice data includes:
locating, according to the set waveform, the fifth voice data corresponding to each piece of second voice data in the sixth voice data; and
replacing each located piece of fifth voice data with the corresponding third voice data.
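The locate-and-replace step can be sketched with a toy marker value. This is an assumption-laden simplification: here the "set waveform" is a run of zero-valued frames, whereas real speech also contains zero samples, which is precisely why the patent uses a distinctive waveform (silence or white noise of known shape) to make the placeholders unambiguous.

```python
MARKER = 0  # toy "set waveform": a run of zero-valued frames

def replace_placeholders(sixth_voice, translations, marker=MARKER):
    """Scan the sixth voice data for runs of marker frames (the fifth
    voice data) and replace the i-th run with the i-th translation
    (the third voice data)."""
    out, i, t = [], 0, 0
    while i < len(sixth_voice):
        if sixth_voice[i] == marker:
            # Skip the whole placeholder run, whatever its duration.
            while i < len(sixth_voice) and sixth_voice[i] == marker:
                i += 1
            out.extend(translations[t])  # splice in the translation
            t += 1
        else:
            out.append(sixth_voice[i])
            i += 1
    return out

print(replace_placeholders([5, 7, 0, 0, 9, 0], [["hola"], ["mundo"]]))
# [5, 7, 'hola', 9, 'mundo']
```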
In an embodiment, the first dividing unit 52 inserting a piece of fifth voice data at each first set position in the first voice data includes:
generating fifth voice data of a set duration; and
inserting the fifth voice data of the set duration at each first set position in the first voice data.
In an embodiment, the first dividing unit 52 inserting a piece of fifth voice data at each first set position in the first voice data includes:
determining the duration of each piece of the at least two pieces of second voice data;
generating corresponding fifth voice data for each piece of second voice data, where each generated piece of fifth voice data has the same duration as the corresponding second voice data; and
inserting the corresponding fifth voice data at each first set position in the first voice data.
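The duration-matched variant can be sketched as follows, again with segments as plain lists of frames. The marker value `0` is an assumed stand-in for the set waveform:

```python
def make_placeholders(second_segments, marker=0):
    """For each piece of second voice data, generate fifth voice data of
    the same duration, made entirely of marker frames."""
    return [[marker] * len(seg) for seg in second_segments]

def insert_placeholders(second_segments, placeholders):
    """Build the sixth voice data: each second-voice piece immediately
    followed by its equal-length placeholder."""
    sixth = []
    for seg, ph in zip(second_segments, placeholders):
        sixth.extend(seg)
        sixth.extend(ph)
    return sixth

segments = [[3, 1], [4, 1, 5]]
sixth = insert_placeholders(segments, make_placeholders(segments))
print(sixth)
# [3, 1, 0, 0, 4, 1, 5, 0, 0, 0]
```

Matching each placeholder's duration to its segment means the sixth voice data is exactly twice as long as the first voice data, which keeps downstream timing arithmetic simple.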
In an embodiment, the voice frames with the set waveform that constitute the fifth voice data include either of the following:
silence frames;
white noise frames.
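Both frame types are easy to synthesize. A minimal sketch with an assumed frame size of 160 samples (10 ms at a 16 kHz sample rate, a common but here hypothetical choice):

```python
import random

FRAME_SAMPLES = 160  # assumed: 10 ms per frame at 16 kHz

def silence_frame():
    """A silence (mute) frame: all-zero samples."""
    return [0] * FRAME_SAMPLES

def white_noise_frame(amplitude=100):
    """A white-noise frame: uniform random samples around zero,
    at a low amplitude so the placeholder is unobtrusive."""
    return [random.randint(-amplitude, amplitude) for _ in range(FRAME_SAMPLES)]
```

A silence placeholder is inaudible but can be confused with natural pauses; low-amplitude white noise has a flat spectrum that is easy to detect reliably when locating placeholders by waveform.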
In an embodiment, the apparatus further includes:
a third obtaining unit, configured to obtain first image data matching the first voice data;
a second dividing unit, configured to divide the first image data into at least two pieces of second image data, each piece of second image data corresponding one-to-one with a piece of the at least two pieces of second voice data; and
a second inserting unit, configured to insert corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data,
where each second set position corresponds to the end position of a respective piece of the at least two pieces of second image data, and the fourth image data is used for synchronous playback with the fourth voice data.
In an embodiment, the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
determining the duration of each piece of the at least two pieces of second image data;
generating corresponding third image data for each piece of second image data, where each generated piece of third image data has the same duration as the corresponding second image data; and
inserting the corresponding third image data at each second set position in the first image data.
In an embodiment, the second inserting unit inserting corresponding third image data at each second set position in the first image data includes:
copying each piece of the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data; and
inserting the corresponding third image data at each second set position in the first image data.
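The copy-based variant amounts to replaying each video segment while its audio translation plays, so audio and video stay the same length. A rough sketch, with frames as opaque values in lists:

```python
def fill_with_copies(second_image_segments):
    """Build fourth image data by following each piece of second image
    data with a copy of itself: the copy serves as the third image data
    shown while the corresponding audio translation plays."""
    fourth = []
    for seg in second_image_segments:
        fourth.extend(seg)        # the original segment
        fourth.extend(list(seg))  # its copy, inserted at the end position
    return fourth

print(fill_with_copies([["f1", "f2"], ["f3"]]))
# ['f1', 'f2', 'f1', 'f2', 'f3', 'f3']
```

Because each copied segment has exactly the duration of its source segment, this pairs naturally with the audio embodiment in which each piece of fifth voice data matches the duration of its second voice data.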
In practical applications, the first obtaining unit 51, the first dividing unit 52, the second obtaining unit 53, the first inserting unit 54, the third obtaining unit, the second dividing unit, and the second inserting unit may be implemented by a processor in the electronic device, such as a central processing unit (CPU), a digital signal processor (DSP), a microcontroller unit (MCU), or a field-programmable gate array (FPGA).
It should be noted that when the data processing apparatus provided in the above embodiment performs data processing, the division into the above program modules is merely illustrative. In practical applications, the processing can be allocated to different program modules as needed; that is, the internal structure of the electronic device can be divided into different program modules to complete all or part of the processing described above. In addition, the apparatus provided in the above embodiment and the embodiments of the data processing method belong to the same concept; for its specific implementation, refer to the method embodiments, which are not repeated here.
The data processing apparatus in the above embodiment can be deployed in the data processing system shown in FIG. 1, that is, on the machine simultaneous interpretation server of the data processing system in FIG. 1, so that the data processing method of the embodiments of this application is applied in the translation system.
Based on the hardware implementation of the above apparatus, an embodiment of this application further provides an electronic device. FIG. 6 is a schematic diagram of the hardware structure of the electronic device according to an embodiment of this application. As shown in FIG. 6, the electronic device 60 includes a memory 63, a processor 62, and a computer program stored in the memory 63 and executable on the processor 62; when the processor 62 executes the program, the method provided by one or more of the above technical solutions is implemented.
Specifically, when the processor 62 in the electronic device 60 executes the program, the following is implemented: obtaining first voice data; dividing the first voice data into at least two pieces of second voice data; obtaining third voice data corresponding to each piece of the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data; and inserting corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data, where each first set position corresponds to the end position of a respective piece of the at least two pieces of second voice data.
It should be noted that the specific steps implemented when the processor 62 executes the program have been described in detail in the above embodiments and are not repeated here.
It can be understood that the electronic device 60 further includes a communication interface 61, and the components of the electronic device 60 are coupled together through a bus system 64. The bus system 64 is configured to implement connection and communication between these components; in addition to a data bus, it also includes a power bus, a control bus, and a status signal bus.
It can be understood that the memory in the embodiment of FIG. 6 above may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a ferroelectric random access memory (FRAM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM); the magnetic surface memory may be a disk memory or a tape memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static random access memory (SRAM), synchronous static random access memory (SSRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), SyncLink dynamic random access memory (SLDRAM), and direct Rambus random access memory (DRRAM). The memories described in the embodiments of this application are intended to include, but are not limited to, these and any other suitable types of memory.
The methods disclosed in the above embodiments of this application may be applied to, or implemented by, a processor. The processor may be an integrated circuit chip with signal processing capabilities. During implementation, each step of the above methods can be completed by an integrated hardware logic circuit in the processor or by instructions in the form of software. The processor may be a general-purpose processor, a DSP, another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The processor can implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this application may be directly executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software module may be located in a storage medium; the storage medium is located in the memory, and the processor reads the information in the memory and completes the steps of the above methods in combination with its hardware.
An embodiment of this application further provides a storage medium, specifically a computer storage medium, and more specifically a computer-readable storage medium, on which computer instructions (that is, a computer program) are stored; when the computer instructions are executed by a processor, the method provided by one or more of the above technical solutions is implemented.
In the several embodiments provided in this application, it should be understood that the disclosed method and smart device may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function; in actual implementation there may be other divisions, for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or in other forms.
The units described above as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
In addition, the functional units in the embodiments of this application may all be integrated into one second processing unit, each unit may separately serve as one unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional units.
A person of ordinary skill in the art can understand that all or part of the steps of the above method embodiments can be completed by hardware instructed by a program; the program can be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
Alternatively, if the above integrated unit of this application is implemented in the form of a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the methods described in the embodiments of this application. The storage medium includes various media that can store program code, such as a removable storage device, a ROM, a RAM, a magnetic disk, or an optical disc.
It should be noted that "first", "second", and the like are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence.
In addition, the technical solutions described in the embodiments of this application can be combined arbitrarily, provided they do not conflict.
The above are only specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in this application, and these shall all be covered within the protection scope of this application.

Claims (11)

  1. A data processing method, comprising:
    obtaining first voice data;
    dividing the first voice data into at least two pieces of second voice data;
    obtaining third voice data corresponding to each piece of the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data; and
    inserting corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data, wherein
    each first set position corresponds to the end position of a respective piece of the at least two pieces of second voice data.
  2. The method according to claim 1, wherein the dividing the first voice data into at least two pieces of second voice data further comprises:
    inserting a piece of fifth voice data at each first set position in the first voice data to obtain sixth voice data corresponding to the first voice data, the fifth voice data consisting of voice frames with a set waveform; and
    wherein the inserting corresponding third voice data at each first set position in the first voice data comprises:
    locating, according to the set waveform, the fifth voice data corresponding to each piece of second voice data in the sixth voice data; and
    replacing each located piece of fifth voice data with the corresponding third voice data.
  3. The method according to claim 2, wherein the inserting a piece of fifth voice data at each first set position in the first voice data comprises:
    generating fifth voice data of a set duration; and
    inserting the fifth voice data of the set duration at each first set position in the first voice data.
  4. The method according to claim 2, wherein the inserting a piece of fifth voice data at each first set position in the first voice data comprises:
    determining the duration of each piece of the at least two pieces of second voice data;
    generating corresponding fifth voice data for each piece of second voice data, wherein each generated piece of fifth voice data has the same duration as the corresponding second voice data; and
    inserting the corresponding fifth voice data at each first set position in the first voice data.
  5. The method according to any one of claims 2 to 4, wherein the voice frames with the set waveform that constitute the fifth voice data comprise either of the following:
    silence frames;
    white noise frames.
  6. The method according to claim 1, further comprising:
    obtaining first image data matching the first voice data;
    dividing the first image data into at least two pieces of second image data, each piece of second image data corresponding one-to-one with a piece of the at least two pieces of second voice data; and
    inserting corresponding third image data at each second set position in the first image data to obtain fourth image data corresponding to the first image data, wherein
    each second set position corresponds to the end position of a respective piece of the at least two pieces of second image data, and the fourth image data is used for synchronous playback with the fourth voice data.
  7. The method according to claim 6, wherein the inserting corresponding third image data at each second set position in the first image data comprises:
    determining the duration of each piece of the at least two pieces of second image data;
    generating corresponding third image data for each piece of second image data, wherein each generated piece of third image data has the same duration as the corresponding second image data; and
    inserting the corresponding third image data at each second set position in the first image data.
  8. The method according to claim 6, wherein the inserting corresponding third image data at each second set position in the first image data comprises:
    copying each piece of the at least two pieces of second image data to obtain the third image data corresponding to each piece of second image data; and
    inserting the corresponding third image data at each second set position in the first image data.
  9. A data processing apparatus, comprising:
    a first obtaining unit, configured to obtain first voice data;
    a first dividing unit, configured to divide the first voice data into at least two pieces of second voice data;
    a second obtaining unit, configured to obtain third voice data corresponding to each piece of the at least two pieces of second voice data, the third voice data being obtained by translating the corresponding second voice data; and
    a first inserting unit, configured to insert corresponding third voice data at each first set position in the first voice data to obtain fourth voice data corresponding to the first voice data, wherein
    each first set position corresponds to the end position of a respective piece of the at least two pieces of second voice data.
  10. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 8.
  11. A storage medium having computer instructions stored thereon, wherein the computer instructions, when executed by a processor, implement the steps of the method according to any one of claims 1 to 8.
PCT/CN2019/122789 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium WO2021109000A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2019/122789 WO2021109000A1 (en) 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium
CN201980100991.1A CN114503192A (en) 2019-12-03 2019-12-03 Data processing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/122789 WO2021109000A1 (en) 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium

Publications (1)

Publication Number Publication Date
WO2021109000A1 true WO2021109000A1 (en) 2021-06-10

Family

ID=76221325

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/122789 WO2021109000A1 (en) 2019-12-03 2019-12-03 Data processing method and apparatus, electronic device, and storage medium

Country Status (2)

Country Link
CN (1) CN114503192A (en)
WO (1) WO2021109000A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104252861A (en) * 2014-09-11 2014-12-31 百度在线网络技术(北京)有限公司 Video voice conversion method, video voice conversion device and server
US20150081274A1 (en) * 2013-09-19 2015-03-19 Kabushiki Kaisha Toshiba System and method for translating speech, and non-transitory computer readable medium thereof
CN107305541A (en) * 2016-04-20 2017-10-31 科大讯飞股份有限公司 Speech recognition text segmentation method and device
CN107910004A (en) * 2017-11-10 2018-04-13 科大讯飞股份有限公司 Voiced translation processing method and processing device
CN108090051A (en) * 2017-12-20 2018-05-29 深圳市沃特沃德股份有限公司 The interpretation method and translator of continuous long voice document
CN110322869A (en) * 2019-05-21 2019-10-11 平安科技(深圳)有限公司 Meeting subangle color phoneme synthesizing method, device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN114503192A (en) 2022-05-13


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19955049

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 10.11.2022)

122 Ep: pct application non-entry in european phase

Ref document number: 19955049

Country of ref document: EP

Kind code of ref document: A1