CN117082291A - Call voice synthesis method and device, electronic equipment and storage medium


Info

Publication number
CN117082291A
Authority
CN
China
Prior art keywords
channel audio
audio stream
file
channel
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310946545.0A
Other languages
Chinese (zh)
Inventor
黄海龙
吴凯
祝伟
陈庆
王劲鹏
许永涛
庞亚淳
陈秀红
钟浩钦
罗显捷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Internet Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Internet Co Ltd
Priority to CN202310946545.0A
Publication of CN117082291A
Legal status: Pending


Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00 Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/40 Client devices specifically adapted for the reception of or interaction with content, e.g. set-top-box [STB]; Operations thereof
    • H04N21/43 Processing of content or additional data, e.g. demultiplexing additional data from a digital video stream; Elementary client operations, e.g. monitoring of home network or synchronising decoder's clock; Client middleware
    • H04N21/439 Processing of audio elementary streams
    • H04N21/4396 Processing of audio elementary streams by muting the audio signal
    • H04N21/4398 Processing of audio elementary streams involving reformatting operations of audio signals

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention provides a call voice synthesis method and apparatus, an electronic device and a storage medium, and relates to the technical field of communications. The method comprises the following steps: taking the collected one-way call recording as a first channel audio file; performing audio transcoding on the first channel audio file to obtain a first channel audio stream conforming to a target format; copying a pre-stored target silence frame conforming to the target format to obtain a second channel audio stream matched with the length of the first channel audio stream; and synthesizing the first channel audio stream and the second channel audio stream into a call voice file. The invention can synthesize the call voice file while only storing the target silence frame in advance, so it occupies little storage space, consumes little device energy, uses simple synthesis steps, and achieves a high synthesis success rate and efficiency.

Description

Call voice synthesis method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of communications technologies, and in particular, to a call voice synthesis method and apparatus, an electronic device, and a storage medium.
Background
Currently, the phenomenon of one-way calls exists in voice calls, that is, a user cannot hear the other party's voice, or only hears his or her own voice. For example, this can occur when one party is in a space with poor signal, such as an elevator, when the call is connected; or, when a call is made between numbers of different operators, a connection establishment failure between some network elements may cause the call to become one-way.
In the virtual number service, many users need to record their calls; for example, courier and food-delivery staff must record calls according to enterprise regulations. However, a one-way call causes only one side of the media-layer recording to contain sound. When the dual-channel recording result file is synthesized, it must be synthesized with the calling party on the left channel and the called party on the right channel, so a recording with sound on only one side cannot be directly combined into the required dual-channel result file.
In voice call services, users also need recordings for purposes such as voice sample collection and natural speech processing. However, if a one-way call occurs, the recording success rate is reduced, and recording may even become impossible.
Disclosure of Invention
The invention provides a call voice synthesis method and apparatus, an electronic device and a storage medium, which aim to solve, at least to some extent, one of the technical problems in the related art.
The technical scheme of the invention is as follows:
in order to achieve the above object, an embodiment of a first aspect of the present invention provides a method for synthesizing talking voice, including:
taking the collected one-way call recording as a first channel audio file;
performing audio transcoding on the first channel audio file to obtain a first channel audio stream conforming to a target format;
copying a pre-stored target silence frame conforming to the target format to obtain a second channel audio stream matched with the length of the first channel audio stream;
and synthesizing the first channel audio stream and the second channel audio stream into a call voice file.
To achieve the above object, a second aspect of the present invention provides a call speech synthesis apparatus, including:
the acquisition module is used for taking the collected one-way call recording as a first channel audio file;
the transcoding module is used for carrying out audio transcoding processing on the first channel audio file so as to obtain a first channel audio stream conforming to a target format;
the copying module is used for copying a pre-stored target mute frame conforming to the target format so as to obtain a second channel audio stream matched with the length of the first channel audio stream;
and the synthesis module is used for synthesizing the first channel audio stream and the second channel audio stream into a call voice file.
To achieve the above object, an embodiment of a third aspect of the present invention provides an electronic device, including: a processor; a memory for storing executable instructions of the processor; the processor is configured to execute the instructions to implement a call voice synthesis method according to an embodiment of the first aspect of the present invention.
In order to achieve the above object, a fourth aspect of the present invention provides a computer-readable storage medium storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the call voice synthesis method according to the first aspect of the present invention.
The technical scheme provided by the embodiment of the invention at least has the following beneficial effects:
taking the collected one-way call recording as a first channel audio file; performing audio transcoding on the first channel audio file to obtain a first channel audio stream conforming to a target format; copying a pre-stored target silence frame conforming to the target format to obtain a second channel audio stream matched with the length of the first channel audio stream; and synthesizing the first channel audio stream and the second channel audio stream into a call voice file. The invention can synthesize the call voice file while only storing the target silence frame in advance, so it occupies little storage space, consumes little device energy, uses simple synthesis steps, and achieves a high synthesis success rate and efficiency.
Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.
Drawings
The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:
fig. 1 is a flow chart of a method for synthesizing talking voice according to an embodiment of the present invention;
fig. 2 is a flow chart of another method for synthesizing talking voice according to an embodiment of the present invention;
FIG. 3 is a flow chart of a filter pipeline construction provided by an embodiment of the invention;
FIG. 4 is a flowchart of a filter pipeline process according to an embodiment of the present invention;
FIG. 5 is a schematic diagram of a voice packet according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a call voice synthesizing apparatus according to an embodiment of the present invention;
fig. 7 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.
The following describes a call voice synthesis method and apparatus according to an embodiment of the present invention with reference to the accompanying drawings.
Fig. 1 is a flow chart of a method for synthesizing talking voice according to an embodiment of the present invention.
In the virtual number service, a one-way call causes only one side of the media-layer recording to contain sound. When the dual-channel recording result file is synthesized, it must be synthesized with the calling party on the left channel and the called party on the right channel, so a recording with sound on only one side cannot be directly combined into the required dual-channel result file.
In a voice call service, a recording code is generally an Adaptive Multi-Rate (abbreviated AMR) code format file, and a sound combination result is a waveform audio file format (Waveform Audio File Format, abbreviated WAV) file encoded by uncompressed pulse code modulation (Pulse Code Modulation, abbreviated PCM) or a compressed dynamic image expert compression standard audio layer 3 (Moving Picture Experts Group Audio Layer III, abbreviated MP 3) file, where the recording code and the result code are inconsistent, and the recording and sound combination process needs transcoding. The sound combination requires that the duration of two sound recordings is consistent, and if the duration is inconsistent, the sound combination process is finished by ending the sound recording at one side with the shorter duration when the sound recording at one side reads the tail, so that the sound recording at one side with the longer duration is lost.
At present, the following method is generally adopted to solve the problem of voice mixing of one-way call:
the method comprises the following steps: the method comprises the steps of firstly calculating the recording time length of one side of a call record, secondly generating a mute file with the same time length as the recording code, and finally generating the call voice file by decoding, encoding and synthesizing double channels.
Method two: store mute files of various durations; first obtain the duration T1 of the recording on the recorded side, and then select a mute file with a duration greater than or equal to T1 as the recording of the non-recorded side for sound combination.
In method one, a mute file is redundantly generated for one side, and both audio tracks must be encoded and decoded, so the energy consumption is high and the efficiency is low. In method two, because the call duration is uncertain, many mute files of different durations must be stored, which occupies a large amount of storage space; in addition, when a mute file is too short, voice splicing is also required, so the processing steps are complex and the efficiency is likewise not high.
In view of the above problems, an embodiment of the present invention provides a method for synthesizing talking voice, as shown in fig. 1, including the following steps:
and 101, taking the collected one-way call record as a first channel audio file.
In the embodiment of the invention, during call recording, the calling end and the called end are recorded separately. In the case of a one-way call, only the single-ended call recording can be collected; the other side has no voice data, so no recording can be collected there. The collected one-way call recording (that is, the recording of the recorded side) is taken as the first channel audio file.
Step 102, performing audio transcoding processing on the first channel audio file to obtain a first channel audio stream conforming to the target format.
Optionally, the target format is a preset synthesized-voice encoding format and may be any of various formats, for example a WAV file encoded with uncompressed PCM.
The first channel audio file is transcoded into the first channel audio stream conforming to the target format, so that the performance consumption of encoding and decoding is reduced in the subsequent sound combination process.
Step 103, copying the pre-stored target silence frame conforming to the target format to obtain a second channel audio stream matched with the length of the first channel audio stream.
Alternatively, the target silence frame may be a silence frame file conforming to the target format; a silence file of a given duration is composed of single-frame silence frames and can be obtained by repeatedly writing the single-frame silence frame multiple times.
Alternatively, in the case of determining the length of the first channel audio stream, the second channel audio stream consistent with the length of the first channel audio stream may be generated by repeatedly writing a plurality of target silence frames.
It should be appreciated that the pre-stored target silence frame is in a target format that is consistent with the format of the first channel audio stream for subsequent efficient synthesis without format conversion.
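As a minimal illustration of this step (a sketch, not the patent's implementation), the following assumes the target format is uncompressed 16-bit mono PCM at 8 kHz and that the pre-stored target silence frame is a single 20 ms frame held in memory as a byte array; the frame size and helper name are illustrative assumptions.

import math

# Assumed pre-stored single silence frame: 20 ms of 16-bit mono PCM at 8 kHz,
# i.e. 160 samples x 2 bytes = 320 zero bytes (an illustrative choice).
SILENCE_FRAME = bytes(320)

def build_second_channel(first_channel_pcm: bytes,
                         silence_frame: bytes = SILENCE_FRAME) -> bytes:
    """Repeat the cached single silence frame until the second channel audio
    stream is at least as long as the first channel audio stream, then trim
    it to exactly the same length."""
    repeats = math.ceil(len(first_channel_pcm) / len(silence_frame))
    return (silence_frame * repeats)[:len(first_channel_pcm)]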
Step 104, synthesizing the first channel audio stream and the second channel audio stream into a call voice file.
Optionally, the first channel audio stream and the second channel audio stream are respectively written into the corresponding channels to synthesize a dual-channel data stream, namely the call voice file.
Because the first channel audio stream and the second channel audio stream both accord with the target format, encoding and decoding are not needed when the audio is combined, the audio can be combined rapidly, and the audio combining efficiency is improved.
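Continuing the same assumption of an uncompressed 16-bit PCM target format, the sketch below shows one way this combination step could look: the two equal-length mono streams are interleaved sample by sample (recorded side on the left channel, silence-based stream on the right) and written as a dual-channel WAV file with Python's standard wave module. The sample rate and function name are illustrative assumptions.

import wave

def write_stereo_wav(path: str, left_pcm: bytes, right_pcm: bytes,
                     sample_rate: int = 8000) -> None:
    """Interleave two equal-length 16-bit mono PCM streams into a single
    double-channel data stream and write it with a standard WAV header."""
    assert len(left_pcm) == len(right_pcm), "both channels must match in length"
    interleaved = bytearray()
    for i in range(0, len(left_pcm), 2):   # 2 bytes per 16-bit sample
        interleaved += left_pcm[i:i + 2]   # left channel (first channel audio stream)
        interleaved += right_pcm[i:i + 2]  # right channel (silence-based stream)
    with wave.open(path, "wb") as wav:
        wav.setnchannels(2)                # dual-channel layout
        wav.setsampwidth(2)                # 16-bit samples
        wav.setframerate(sample_rate)
        wav.writeframes(bytes(interleaved))

Because both inputs already conform to the assumed target PCM format, no codec is involved at this stage, which mirrors the point above that no encoding or decoding is needed during combination.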
In this embodiment, the collected one-way call recording is taken as a first channel audio file; audio transcoding is performed on the first channel audio file to obtain a first channel audio stream conforming to the target format; a pre-stored target silence frame conforming to the target format is copied to obtain a second channel audio stream matched with the length of the first channel audio stream; and the first channel audio stream and the second channel audio stream are synthesized into a call voice file. In the embodiment of the invention, the first channel audio file is transcoded to obtain the first channel audio stream, the target silence frame is copied based on the length of the first channel audio stream to obtain the second channel audio stream, and finally the two streams are synthesized into the call voice file. Only the first channel audio file is transcoded, so the processing steps are simple and the energy consumption is low; only the target silence frame is stored, so the storage pressure is low, which further reduces energy consumption; meanwhile, the steps of synthesizing the call voice file are simple, and the success rate and efficiency of sound combination are high.
In order to clearly illustrate the above embodiment, another call voice synthesis method is provided in this embodiment, and fig. 2 is a flow chart of another call voice synthesis method provided in the embodiment of the present invention.
As shown in fig. 2, the method may include the steps of:
step 201, taking the collected one-way call record as a first channel audio file.
In the embodiment of the present invention, the specific process of step 201 may be referred to the related description in step 101 in the above embodiment, which is not repeated here.
Step 202, creating a filter pipeline.
In the embodiment of the invention, the filter pipeline is created based on the requirement of dual-channel sound combination and comprises a transcoding filter and a sound-combining filter; the audio data to be processed are input into the filter pipeline, and the dual-channel combined audio can be output.
Referring to fig. 3, the filter pipeline is a filter chain composed of a plurality of filters. The first, input filter marks the beginning of the whole complex filter and serves as the source of the filter channel, receiving externally input data. The last, output filter marks the end of the whole chain. Both are indispensable, but neither processes audio. The actual audio processing is done by the transcoding filter and the sound-combining filter: the transcoding filter encodes and decodes the input audio into the target format, and the sound-combining filter writes two audio tracks into two channels to synthesize a file.
Optionally, the target format is a preset synthesized-voice encoding format and may be any of various formats, for example a WAV file encoded with uncompressed PCM.
Following the order in which the input filter comes first and the output filter comes last, the transcoding filter that performs the encoding and decoding is executed first and the sound-combining filter is executed next. The input filter, transcoding filter, sound-combining filter and output filter are added and linked in this order to form a filter chain, which is the processing path of the audio.
After the filters are linked, the filter pipeline is initialized, that is, configured. After initialization, the audio data to be processed are input into the filter pipeline for processing and the processed data are output. The data obtained from the filter channel are written into a file, a file header and a file tail are added to the file to obtain the result-encoded audio file, and finally the filter pipeline is released to complete the sound-combination operation.
After the filter pipeline is established, the two input audio tracks are fed into it and one output audio track is synthesized through the transcoding filter and the sound-combining filter. Audio data are transferred through buffers, and once the transcoding filter and the sound-combining filter are linked, the whole process is executed in one flow, so the transfer efficiency is high.
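The patent does not name a concrete media framework, but the input / transcode / sound-combine / output chain described here closely resembles an FFmpeg filter graph. Purely as an analogy under that assumption (the file names are placeholders, and this is not the patent's own implementation), an equivalent single-pass pipeline could be driven as follows:

import subprocess

# Assumed FFmpeg analogy: the join filter maps the first input to the left
# channel and the second input to the right channel in one pass.
subprocess.run([
    "ffmpeg", "-y",
    "-i", "recorded_side.amr",    # main audio A1: the one-way call recording
    "-i", "silent_side.wav",      # secondary audio A2: built from silence frames
    "-filter_complex",
    "[0:a][1:a]join=inputs=2:channel_layout=stereo[out]",
    "-map", "[out]",
    "-c:a", "pcm_s16le",          # target format: uncompressed PCM in a WAV container
    "call_record.wav",
], check=True)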
Step 203, the first channel audio file is input into a filter pipeline, so as to perform audio transcoding processing on the first channel audio file through a transcoding filter in the filter pipeline, so as to obtain a first channel audio stream.
In the embodiment of the invention, in a one-way call the side with a recording is denoted A1 and the side without voice is denoted A2. Because the recording exists in A1, A1 serves as the main audio and A2 as the secondary audio. The main audio A1 is decoded to obtain sampled data, which are then resampled and encoded to obtain the first channel audio stream output by the transcoding filter.
The transcoding filter is one link in the filter pipeline and is used to encode and decode the main audio A1 into the target format; the resulting first channel audio stream conforms to the target format.
Step 204, inputting the pre-stored monaural blank file as a second channel audio file into a filter pipeline, so as to copy the target mute frame based on the length of the first channel audio stream through a transcoding filter in the filter pipeline, and obtain a second channel audio stream corresponding to the second channel audio file.
In the embodiment of the invention, two files are stored in a cache in memory. One is an empty file conforming to the target format, that is, a mono audio file with a duration of 0; the other is a silence frame file, of which only a single frame is stored in the cache, where one frame is a group of voice data within one sampling period. Both files are cached as byte arrays; the cache resides in the service memory and is loaded automatically when the service starts.
Since the target format may be any of several formats, empty files in multiple formats and a single-frame silence frame file in each corresponding format can be cached.
Optionally, for a WAV file with uncompressed PCM encoding, one frame of voice data has a fixed length; for a compressed-format file, a constant-bit-rate silence frame (that is, one with a fixed compression ratio) is defined, and once it is defined, the length of one frame of voice data is likewise determined. For example, at an 8 kHz sampling rate with 16-bit mono PCM and a 20 ms frame, one frame is 8000 × 0.02 × 2 = 320 bytes.
It should be understood that a silent audio file of a given duration is composed of single-frame silence frames; when a plurality of silence frames are required, this can be achieved by repeatedly writing the single-frame silence frame.
Further, a target silence frame is selected from the buffered silence frames of the plurality of formats.
As a possible implementation, based on the target format adopted by the first channel audio stream, a mute frame conforming to the target format is determined from prestored mute frames in a plurality of formats.
As another possible implementation, the format context is read for a pre-stored mono empty file; and determining the mute frames conforming to the target format from a plurality of pre-stored mute frames according to the target format specified in the format context.
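A minimal sketch of the cache described above, assuming the cached entries are plain byte arrays keyed by an encoding-format identifier; the format names, frame sizes and function are illustrative assumptions rather than the patent's data structures.

# Hypothetical cache loaded once when the service starts: for each supported
# target format, one empty (0-duration) mono file body and one single-frame
# silence frame, both stored as byte arrays.
SILENCE_CACHE = {
    "pcm_s16le_8k":  {"empty": b"", "silence_frame": bytes(320)},  # 20 ms @ 8 kHz
    "pcm_s16le_16k": {"empty": b"", "silence_frame": bytes(640)},  # 20 ms @ 16 kHz
}

def select_silence_frame(target_format: str) -> bytes:
    """Pick the cached single-frame silence frame that matches the target
    format adopted by the first channel audio stream."""
    entry = SILENCE_CACHE.get(target_format)
    if entry is None:
        raise ValueError(f"no cached silence frame for format {target_format!r}")
    return entry["silence_frame"]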
Although the secondary audio A2 contains no voice data, it still occupies one channel as one sound channel of the combined audio. Therefore, an empty file with the corresponding encoding is copied from the cache and used in place of the secondary audio A2 as the second channel audio file.
Optionally, referring to fig. 4, in the transcoding filter the main audio is decoded to obtain sampled data, which are then resampled and encoded to obtain a result audio stream (i.e. the first channel audio stream) conforming to the target format. The secondary audio is an empty file: after its format context is read, its sampling rate and format data are obtained; the result audio stream of the main audio is then read to obtain the duration of the main audio, and target silence frames of matching duration are copied from the cache to serve as the output audio stream (i.e. the second channel audio stream) of the secondary audio.
Specifically, the format context of the second channel audio file is read, where the format context includes the sampling rate and the bit rate; the number of silence frames to be repeatedly copied is determined from the audio duration of a single target silence frame at that sampling rate and bit rate and from the duration of the first channel audio stream; and target silence frames of that number are copied from the cache to obtain a second channel audio stream whose duration matches that of the first channel audio stream.
As a possible implementation, the ratio of the duration of the first channel audio stream to the audio duration of a single target silence frame is calculated to obtain the number of silence frames that need to be repeatedly copied.
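As a small numeric sketch of this ratio computation (assuming a fixed 20 ms frame duration, which is typical for constant-bit-rate telephony codecs but not fixed by the patent):

import math

def silence_frame_count(first_stream_duration_s: float,
                        frame_duration_s: float = 0.02) -> int:
    """Number of single-frame silence copies needed so that the second channel
    audio stream is at least as long as the first channel audio stream."""
    return math.ceil(first_stream_duration_s / frame_duration_s)

# Example: a 30-second one-way recording needs 30 / 0.02 = 1500 silence frames.
assert silence_frame_count(30.0) == 1500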
Optionally, each time audio transcoding is performed to obtain a frame in the first channel audio stream, the target silence frame is copied once as a corresponding frame in the second channel audio stream.
It should be noted that the secondary audio A2 does not undergo actual audio transcoding in the transcoding filter; it only serves as the secondary track. The required number of silence frames is obtained from the main audio, and the target silence frame is then copied that many times to obtain the second channel audio stream.
By storing silence frame files in a plurality of formats and selecting the target silence frame that conforms to the target format, no transcoding or other operations are needed when the call voice file is subsequently synthesized, which reduces the encoding and decoding consumption. Moreover, the silence frame files in the several formats are cached once and reused many times, which reduces the generation of redundant files.
Step 205, synthesizing the first channel audio stream and the second channel audio stream into a call voice file through the sound-combining filter in the filter pipeline.
In the sound-combining filter, two audio streams are input and one audio stream is output.
Optionally, the first channel audio stream and the second channel audio stream are synthesized into the call voice file by a sound combining filter in the filter pipeline.
Specifically, referring to fig. 4, according to the call directions to which the first channel audio stream and the second channel audio stream belong, a channel layout is performed to determine channels corresponding to the first channel audio stream and the second channel audio stream; synthesizing a single data stream of two channels based on channels corresponding to the first channel audio stream and the second channel audio stream; and writing the double-channel single data stream into the voice combining file to obtain the call voice file.
As a possible implementation, the channel layout is set so that the left and right channel layouts are determined and the output is dual-channel stereo. The first channel audio stream obtained in step 203 and the second channel audio stream obtained in step 204 are then input into the sound-combining filter in the filter pipeline through a buffer. In the sound-combining filter, the channels corresponding to the first channel audio stream and the second channel audio stream are determined through the channel layout and stored as voice packets as shown in fig. 5: each voice data packet has two channels, the left side representing the left channel, filled with the first channel audio stream, and the right side representing the right channel, filled with the second channel audio stream. The audio data packets are read cyclically to synthesize a double-channel single data stream, and the synthesized single data stream is finally placed into the buffer until the data are retrieved, completing the data output. The output data are obtained and written into the voice combining file, and a file header and a file tail are added to the voice combining file to obtain the call voice file.
Optionally, after the call voice file is obtained, the filter pipeline is released.
In the embodiment of the invention, a filter pipeline is constructed; the main audio is passed through the transcoding filter to obtain the first channel audio stream; based on its length, the corresponding target silence frame is copied from the cache to serve as the second channel audio stream; and the two streams are synthesized by the sound-combining filter to obtain the call voice file. Only the main audio undergoes a transcoding operation, while the secondary audio only requires copying the determined number of silence frames, which reduces encoding and decoding energy consumption; transcoding and sound combination are executed in one flow, so the sound-combination efficiency is very high.
In order to realize the embodiment, the invention also provides a call voice synthesis device.
Fig. 6 is a schematic structural diagram of a call voice synthesis apparatus according to an embodiment of the present invention.
As shown in fig. 6, the talking voice synthesizing apparatus 600 includes: an acquisition module 610, a transcoding module 620, a replication module 630, and a synthesis module 640.
The acquiring module 610 is configured to use the collected unidirectional call recording as a first channel audio file;
a transcoding module 620, configured to perform audio transcoding processing on the first channel audio file to obtain a first channel audio stream conforming to the target format;
the copying module 630 is configured to copy a pre-stored target silence frame that conforms to a target format, so as to obtain a second channel audio stream that matches the length of the first channel audio stream;
and a synthesis module 640, configured to synthesize the first channel audio stream and the second channel audio stream into a call voice file.
Further, in one possible implementation of the embodiment of the present invention, the transcoding module 620 is further configured to:
and inputting the first channel audio file into a filter pipeline to perform audio transcoding processing on the first channel audio file through a transcoding filter in the filter pipeline so as to obtain a first channel audio stream.
Further, in a possible implementation of the embodiment of the present invention, the replication module 630 is further configured to:
and inputting the prestored single-channel blank file into a filter pipeline as a second channel audio file so as to copy the target mute frame based on the length of the first channel audio stream through a transcoding filter in the filter pipeline to obtain a second channel audio stream corresponding to the second channel audio file.
Every time audio transcoding is performed to obtain a frame in the first channel audio stream, the target silence frame is copied once to serve as a corresponding frame in the second channel audio stream.
Taking a pre-stored mono blank file as the second channel audio file, and reading the format context of the second channel audio file, wherein the format context comprises a sampling rate and a bit rate;
determining the number of silence frames to be repeatedly copied according to the audio duration of a single target silence frame at the sampling rate and the bit rate and according to the duration of the first channel audio stream;
and copying the target mute frames conforming to the number of the mute frames from the buffer memory to obtain a second channel audio stream which is matched with the duration of the first channel audio stream.
Further, in one possible implementation manner of the embodiment of the present invention, the pre-stored silence frames are in multiple formats, and the talking voice synthesis apparatus is further configured to:
determining a mute frame conforming to a target format from prestored mute frames in a plurality of formats based on the target format adopted by the first channel audio stream; or,
reading a format context for a pre-stored monaural blank file; and determining the mute frames conforming to the target format from a plurality of pre-stored mute frames according to the target format specified in the format context.
Further, in one possible implementation of the embodiment of the present invention, the synthesis module 640 is further configured to:
and synthesizing the first channel audio stream and the second channel audio stream into a call voice file through a sound combining filter in the filter pipeline.
Carrying out channel layout according to the communication direction of the first channel audio stream and the second channel audio stream so as to determine channels corresponding to the first channel audio stream and the second channel audio stream;
synthesizing a single data stream of two channels based on channels corresponding to the first channel audio stream and the second channel audio stream;
and writing the double-channel single data stream into the voice combining file to obtain the call voice file.
It should be noted that the foregoing explanation of the embodiment of the method for synthesizing talking voice is also applicable to the talking voice synthesizing apparatus of this embodiment, and will not be repeated here.
Based on the foregoing embodiments, the embodiment of the present invention further provides an electronic device, and fig. 7 is a block diagram of an electronic device 700 provided by the embodiment of the present invention.
As shown in fig. 7, the electronic device 700 includes:
a memory 701 and a processor 702, a bus 703 connecting different components (including the memory 701 and the processor 702), the memory 701 storing a computer program, and the processor 702 implementing the call speech synthesis method according to the embodiment of the invention when executing the program.
Bus 703 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 700 typically includes a variety of electronic device readable media. Such media can be any available media that is accessible by electronic device 700 and includes both volatile and nonvolatile media, removable and non-removable media.
Memory 701 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 704 and/or cache memory 705. Electronic device 700 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 706 may be used to read from and write to non-removable, nonvolatile magnetic media (not shown in FIG. 7, commonly referred to as a "hard drive"). Although not shown in fig. 7, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 703 through one or more data medium interfaces. Memory 701 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of embodiments of the invention.
A program/utility 708 having a set (at least one) of program modules 707 may be stored in, for example, memory 701, such program modules 707 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 707 generally perform the functions and/or methods of the embodiments described herein.
The electronic device 700 may also communicate with one or more external devices 709 (e.g., keyboard, pointing device, display 711, etc.), with one or more devices that enable a user to interact with the electronic device 700, and/or with any device (e.g., network card, modem, etc.) that enables the electronic device 700 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 712. Also, the electronic device 700 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN) and/or a public network, such as the Internet, through a network adapter 713. As shown in fig. 7, the network adapter 713 communicates with other modules of the electronic device 700 via the bus 703. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 700, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 702 executes various functional applications and data processing by running programs stored in the memory 701.
In order to implement the above-described embodiments, the present invention also proposes a computer-readable storage medium storing instructions that, when executed by a processor of an electronic device, enable the electronic device to perform the call voice synthesis method described above. Optionally, the computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and additional implementations are included within the scope of the preferred embodiment of the present invention in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order from that shown or discussed, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the embodiments of the present invention.
Logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions for implementing logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions. For the purposes of this description, a "computer-readable medium" can be any means that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection (electronic device) having one or more wires, a portable computer diskette (magnetic device), a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber device, and a portable compact disc read-only memory (CD-ROM). In addition, the computer readable medium may even be paper or other suitable medium on which the program is printed, as the program may be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory.
It is to be understood that portions of the present invention may be implemented in hardware, software, firmware, or a combination thereof. In the above-described embodiments, the various steps or methods may be implemented in software or firmware stored in a memory and executed by a suitable instruction execution system. As with the other embodiments, if implemented in hardware, may be implemented using any one or combination of the following techniques, as is well known in the art: discrete logic circuits having logic gates for implementing logic functions on data signals, application specific integrated circuits having suitable combinational logic gates, programmable Gate Arrays (PGAs), field Programmable Gate Arrays (FPGAs), and the like.
Those of ordinary skill in the art will appreciate that all or a portion of the steps carried out in the method of the above-described embodiments may be implemented by a program to instruct related hardware, where the program may be stored in a computer readable storage medium, and where the program, when executed, includes one or a combination of the steps of the method embodiments.
In addition, each functional unit in the embodiments of the present invention may be integrated in one processing module, or each unit may exist alone physically, or two or more units may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules may also be stored in a computer readable storage medium if implemented in the form of software functional modules and sold or used as a stand-alone product.
The above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, or the like. While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims (10)

1. A call voice synthesis method is characterized by comprising the following steps:
taking the collected one-way call recording as a first channel audio file;
performing audio transcoding on the first channel audio file to obtain a first channel audio stream conforming to a target format;
copying a pre-stored target silence frame conforming to the target format to obtain a second channel audio stream matched with the length of the first channel audio stream;
and synthesizing the first channel audio stream and the second channel audio stream into a call voice file.
2. The method of claim 1, wherein copying the pre-stored target silence frames conforming to the target format to obtain a second channel audio stream that matches the first channel audio stream length comprises:
and copying the target mute frame once every time audio transcoding is carried out to obtain a frame in the first channel audio stream, so as to be used as a corresponding frame in the second channel audio stream.
3. The method of claim 1, wherein copying the pre-stored target silence frames conforming to the target format to obtain a second channel audio stream that matches the first channel audio stream length comprises:
taking a pre-stored mono blank file as a second channel audio file, and reading a format context of the second channel audio file, wherein the format context comprises a sampling rate and a bit rate;
determining the number of silence frames to be repeatedly copied according to the audio duration of a single target silence frame at the sampling rate and the bit rate and according to the duration of the first channel audio stream;
and copying target mute frames conforming to the mute frame number from the buffer memory to obtain a second channel audio stream matched with the duration of the first channel audio stream.
4. A method according to any of claims 1-3, characterized in that pre-stored silence frames are in a plurality of formats; the method further comprises the steps of:
determining a mute frame conforming to a target format from prestored mute frames in a plurality of formats based on the target format adopted by the first channel audio stream; or,
reading a format context for a pre-stored monaural blank file; and determining a mute frame conforming to the target format from a plurality of pre-stored mute frames according to the target format specified in the format context.
5. A method according to any one of claims 1-3, wherein said audio transcoding of said first channel audio file to obtain a first channel audio stream conforming to a target format comprises:
inputting the first channel audio file into a filter pipeline to perform audio transcoding processing on the first channel audio file through a transcoding filter in the filter pipeline so as to obtain the first channel audio stream;
the copying the pre-stored target silence frame conforming to the target format to obtain a second channel audio stream matched with the length of the first channel audio stream, including:
and inputting a prestored single-channel blank file into the filter pipeline as a second channel audio file, so as to copy the target mute frame based on the length of the first channel audio stream through the transcoding filter in the filter pipeline, and obtaining a second channel audio stream corresponding to the second channel audio file.
6. The method of claim 5, wherein synthesizing the first channel audio stream and the second channel audio stream into a call voice file comprises:
and synthesizing the first channel audio stream and the second channel audio stream into a call voice file through a sound combining filter in the filter pipeline.
7. A method according to any of claims 1-3, wherein said synthesizing the first channel audio stream and the second channel audio stream into a call voice file comprises:
carrying out channel layout according to the conversation direction of the first channel audio stream and the second channel audio stream so as to determine channels corresponding to the first channel audio stream and the second channel audio stream;
synthesizing a double-channel single data stream based on the channels corresponding to the first channel audio stream and the second channel audio stream;
and writing the double-channel single data stream into the voice combining file to obtain the call voice file.
8. A call speech synthesis apparatus, comprising:
the acquisition module is used for taking the collected one-way call recording as a first channel audio file;
the transcoding module is used for carrying out audio transcoding processing on the first channel audio file so as to obtain a first channel audio stream conforming to a target format;
the copying module is used for copying a pre-stored target mute frame conforming to the target format so as to obtain a second channel audio stream matched with the length of the first channel audio stream;
and the synthesis module is used for synthesizing the first channel audio stream and the second channel audio stream into a call voice file.
9. An electronic device, comprising:
a processor;
a memory for storing executable instructions of the processor;
wherein the processor is configured to execute the instructions to implement the telephony speech synthesis method of any of claims 1 to 7.
10. A computer readable storage medium, characterized in that instructions in the computer readable storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the telephony voice synthesis method of any of claims 1 to 7.
CN202310946545.0A 2023-07-28 2023-07-28 Call voice synthesis method and device, electronic equipment and storage medium Pending CN117082291A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310946545.0A CN117082291A (en) 2023-07-28 2023-07-28 Call voice synthesis method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310946545.0A CN117082291A (en) 2023-07-28 2023-07-28 Call voice synthesis method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117082291A true CN117082291A (en) 2023-11-17

Family

ID=88716212

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310946545.0A Pending CN117082291A (en) 2023-07-28 2023-07-28 Call voice synthesis method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117082291A (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5054073A (en) * 1986-12-04 1991-10-01 Oki Electric Industry Co., Ltd. Voice analysis and synthesis dependent upon a silence decision
WO2018213401A1 (en) * 2017-05-16 2018-11-22 Apple Inc. Methods and interfaces for home media control
CN110650250A (en) * 2019-09-25 2020-01-03 携程旅游信息技术(上海)有限公司 Method, system, device and storage medium for processing voice conversation
CN113572898A (en) * 2021-01-18 2021-10-29 腾讯科技(深圳)有限公司 Method for detecting silence abnormity in voice call and corresponding device
CN115243087A (en) * 2022-07-04 2022-10-25 北京小糖科技有限责任公司 Audio and video co-shooting processing method and device, terminal equipment and storage medium
CN115460186A (en) * 2022-09-01 2022-12-09 中电信数智科技有限公司 AMR-WB (adaptive multi-rate-wideband) coding-based capability platform sound recording file generation method and device
CN116170632A (en) * 2022-12-29 2023-05-26 深圳市鸿合创新信息技术有限责任公司 Sound compensation method and device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination